Environment and Versions
2 Linux servers
- hadoop-3.2.1
- apache-hive-3.1.2
- apache-tez-0.9.2
- sqoop-1.4.7
- spark-2.4.7-bin-hadoop2.7
1. Basic environment
- User account: work as a regular (non-root) user throughout
2. Install OpenJDK 8: newer JDK versions cause errors in Hive, so install version 8 (an install example follows the .bashrc step below).
- Environment settings in .bashrc
$ vi ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
$ source ~/.bashrc
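For reference, installing OpenJDK 8 itself; a hedged example assuming a yum/dnf-based distribution (the package name differs on Debian/Ubuntu):
$ sudo yum install -y java-1.8.0-openjdk-devel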
3. Set up passwordless SSH access
$ ssh-keygen
- Append ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on each target host (or use the shortcut below)
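A hedged shortcut for the append step, assuming the hadoop account and the host names used later in this guide:
$ ssh-copy-id hadoop@lab1
$ ssh-copy-id hadoop@lab2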
4. Install and configure Hadoop
- Extract the downloaded tar.gz to /opt/lib/hadoop-3.2.1 and symlink it to /opt/hadoop (extraction example below)
$ ln -s /opt/lib/hadoop-3.2.1 /opt/hadoop
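The extraction step itself, as a hedged example assuming the archive sits in the current directory:
$ tar -xzf hadoop-3.2.1.tar.gz -C /opt/lib/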
- Environment settings in .bashrc
$ vi ~/.bashrc
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_COMMON_LIB_NATIVE_DIR
$ source ~/.bashrc
- Edit each of the configuration files below in /opt/hadoop/etc/hadoop
$ cd /opt/hadoop/etc/hadoop
$ vi hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
$ vi core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://lab1:9000</value>
</property>
</configuration>
$ vi hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///mnt/data/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///mnt/data/hadoop/hdfs/datanode</value>
</property>
<property> <!-- set due to a tez / hadoop3 compatibility issue -->
<name>dfs.client.datanode-restart.timeout</name>
<value>30</value>
</property>
</configuration>
$ vi mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/common/*,$HADOOP_MAPRED_HOME/share/hadoop/common/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/*,$HADOOP_MAPRED_HOME/share/hadoop/yarn/lib/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/*,$HADOOP_MAPRED_HOME/share/hadoop/hdfs/lib/*</value>
</property>
</configuration>
$ vi yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
- Format the namenode
$ hdfs namenode -format
- Start DFS and YARN
hadoop@lab1 /opt/hadoop/etc/hadoop $ start-dfs.sh
Starting namenodes on [lab1]
Starting datanodes
Starting secondary namenodes [lab1]
hadoop@lab1 /opt/hadoop/etc/hadoop $ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
Shut down in the reverse order of startup.
At this point the single-node Hadoop DFS and YARN installation is complete.
- Verify HDFS is working
$ hdfs dfs -ls /
- NameNode web UI
http://lab1:9870/
- DataNode web UI
http://lab1:9864/
- YARN web UI
http://lab1:8088
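- Process check (PIDs will vary; on a single-node setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager)
$ jps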
5. Add host lab2 as an additional datanode (skip if you only have one server)
- Edit /opt/hadoop/etc/hadoop/yarn-site.xml on lab1
$ vi yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>lab1:9010</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>lab1:9011</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>lab1:9012</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
</configuration>
- Add the hosts to /opt/hadoop/etc/hadoop/workers on lab1
$ vi workers
lab1
lab2
- Copy the entire configured hadoop directory to lab2, then format the node
lab1 $ rsync -avxP /opt/lib/hadoop-3.2.1 lab2:/opt/lib/
lab1 $ rsync -avxP /home/hadoop/.bashrc lab2:/home/hadoop
lab1 $ ssh lab2
lab2 $ ln -s /opt/lib/hadoop-3.2.1 /opt/hadoop
lab2 $ hdfs namenode -format
- Start DFS and YARN from lab1
hadoop@lab1 /opt/hadoop/etc/hadoop $ start-dfs.sh
Starting namenodes on [lab1]
Starting datanodes
Starting secondary namenodes [lab1]
hadoop@lab1 /opt/hadoop/etc/hadoop $ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
- Verify that the datanode was started automatically on lab2
hadoop@lab2 /opt/lib/hadoop-3.2.1 $ jps
4064 Jps
3874 DataNode
3998 NodeManager
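From lab1, a quick way to confirm that both datanodes registered with the namenode:
$ hdfs dfsadmin -report | grep 'Live datanodes'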
6. Install Hive
- Extract the downloaded tar.gz to /opt/lib/apache-hive-3.1.2-bin and symlink it to /opt/hive
$ ln -s /opt/lib/apache-hive-3.1.2-bin /opt/hive
- Environment variables in .bashrc
$ vi ~/.bashrc
export HIVE_HOME=/opt/hive
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HIVE_HOME/bin
$ source ~/.bashrc
- hive-site.xml in $HIVE_HOME/conf (using MySQL as the Hive metastore)
<configuration>
<property>
<name>hive.metastore.db.type</name>
<value>mysql</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>${fs.defaultFS}/apps/hive/warehouse</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://dbserver:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;serverTimezone=UTC</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_password</value>
</property>
<property>
<name>datanucleus.autoStartMechanismMode</name>
<value>ignored</value>
</property>
<property>
<name>hive.server2.transport.mode</name>
<value>binary</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
</configuration>
- Metastore Thrift setting (used later by Spark)
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://lab1:9083</value>
<description>Hive metastore Thrift server</description>
</property>
</configuration>
- Create the Hive directory in HDFS
$ hdfs dfs -mkdir -p /apps/hive/warehouse
- Replace Hive's old bundled guava with Hadoop's newer version
$ cp /opt/hadoop/share/hadoop/common/lib/guava-27.0-jre.jar /opt/hive/lib
$ rm /opt/hive/lib/guava-19*.jar
- Download the MySQL JDBC driver and link it into the Hive lib path
$ ln -s /usr/lib/java/mysql-connector-java-8.0.22.jar /opt/hive/lib/
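- schematool in the next step assumes the hive account already exists in MySQL; a hedged example of creating it (the host pattern and privileges are assumptions, adjust to your policy):
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'your_password';
mysql> GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
mysql> FLUSH PRIVILEGES;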
- Initialize the Hive metastore
hadoop@lab1 /opt/hive/lib $ schematool -initSchema -dbType mysql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/lib/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lib/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://tornado:3306/hive?createDatabaseIfNotExist=true&characterEncoding=UTF-8&serverTimezone=UTC
Metastore Connection Driver : com.mysql.cj.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 3.1.0
Initialization script hive-schema-3.1.0.mysql.sql
Initialization script completed
schemaTool completed
- Verify the Hive installation
$ hive
hive> create table stock ( date_ymd string, code_name string, close_price int ) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;
hive> insert into stock values ('20200101', 'SAM', 61000);
hive> insert into stock values ('20200102', 'SAM', 60000);
hive> insert into stock values ('20200103', 'SAM', 59900);
hive> insert into stock values ('20200104', 'SAM', 61500);
hive> insert into stock values ('20200105', 'SAM', 62000);
hive> select code_name, count(*), max(close_price), min(close_price) from stock group by code_name;
7. Run the Hive metastore and HiveServer2, and connect remotely
- Start the servers
$ hive --service metastore &
$ hive --service hiveserver2 &
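To keep both services alive after the shell exits, a hedged variant with logs redirected to files (the log paths are arbitrary):
$ nohup hive --service metastore > /tmp/metastore.log 2>&1 &
$ nohup hive --service hiveserver2 > /tmp/hiveserver2.log 2>&1 &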
- Remote connection (remote access via tools such as DataGrip also works)
$ beeline -u jdbc:hive2://lab1:10000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/lib/apache-hive-3.1.2-bin/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lib/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://lab1:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 3.1.2 by Apache Hive
0: jdbc:hive2://lab1:10000> select count(*) from stock;
8. Install the Tez engine
- Extract the downloaded tar.gz to /opt/lib/apache-tez-0.9.2-bin and symlink it to /opt/tez
$ ln -s /opt/lib/apache-tez-0.9.2-bin /opt/tez
- Environment settings in .bashrc
$ vi ~/.bashrc
export TEZ_CONF_DIR=/opt/tez/conf
export TEZ_JARS=/opt/tez
export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*:${HADOOP_CLASSPATH}
export CLASSPATH=$CLASSPATH:${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*
$ source ~/.bashrc
- Configure tez-site.xml
$ vi /opt/tez/conf/tez-site.xml
<configuration>
<property>
<name>tez.lib.uris</name>
<value>${fs.defaultFS}/apps/tez/tez.tar.gz</value>
</property>
<property>
<name>tez.use.cluster.hadoop-libs</name>
<value>true</value>
</property>
<property>
<name>hive.tez.container.size</name>
<value>3020</value>
</property>
</configuration>
- Upload the tez.tar.gz referenced by tez.lib.uris to HDFS
$ hdfs dfs -mkdir -p /apps/tez
$ hdfs dfs -put /opt/tez/share/tez.tar.gz /apps/tez
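A quick check that the archive ended up where tez.lib.uris expects it:
$ hdfs dfs -ls /apps/tez/tez.tar.gz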
- Tez / Hadoop 3 compatibility issue
In Hadoop 3 the default value of this timeout includes a unit ("30s"),
but Tez casts the value to an Integer, which throws an error.
Forcing the Hadoop 3 setting from 30s to plain 30 avoids the error.
$ vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<property> <!-- set due to a tez / hadoop3 compatibility issue -->
<name>dfs.client.datanode-restart.timeout</name>
<value>30</value>
</property>
- Hive query speed comparison, Tez vs MR engine (sample data: ~180 million rows)
$ beeline -u jdbc:hive2://lab1:10000
0: jdbc:hive2://lab1:10000> set hive.execution.engine=mr;
0: jdbc:hive2://lab1:10000> select count(*) from stock;
+------------+
| _c0 |
+------------+
| 182826946 |
+------------+
1 row selected (196.108 seconds)
0: jdbc:hive2://lab1:10000> set hive.execution.engine=tez;
No rows affected (0.14 seconds)
0: jdbc:hive2://lab1:10000> select count(*) from stock;
+------------+
| _c0 |
+------------+
| 182826946 |
+------------+
1 row selected (77.739 seconds)
0: jdbc:hive2://lab1:10000>
9. Install and configure Spark
- Extract the downloaded tar.gz to /opt/lib/spark-2.4.7-bin-hadoop2.7 and symlink it to /opt/spark
$ ln -s /opt/lib/spark-2.4.7-bin-hadoop2.7 /opt/spark
- Environment settings in .bashrc
$ vi ~/.bashrc
export SPARK_HOME=/opt/spark
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$SQOOP_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
$ source ~/.bashrc
- Share libraries between Hive and Spark (via symlinks)
$ cd /opt/hive/lib
$ ln -s /opt/spark/jars/scala-* .
$ ln -s /opt/spark/jars/spark-* .
$ cd /opt/spark/jars
$ rm hive-*.jar
$ ln -s /opt/hive/lib/hive-* .
- Run a query with the Hive on Spark engine
hive> set hive.execution.engine=spark;
hive> select count(*) from stock;
Query ID = baeksj_20201103221329_b84091c0-9c13-43c3-b396-cac53746cb95
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Running with YARN Application = application_1604206121444_0033
Kill Command = /opt/hadoop/bin/yarn application -kill application_1604206121444_0033
Hive on Spark Session Web UI URL: http://lab2:34991
Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
STAGES ATTEMPT STATUS TOTAL COMPLETED RUNNING PENDING FAILED
--------------------------------------------------------------------------------------
Stage-0 ........ 0 FINISHED 23 23 0 0 0
Stage-1 ........ 0 FINISHED 1 1 0 0 0
--------------------------------------------------------------------------------------
STAGES: 02/02 [==========================>>] 100% ELAPSED TIME: 69.89 s
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 69.89 second(s)
OK
182826946
Time taken: 123.727 seconds, Fetched: 1 row(s)
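- The Thrift metastore configured in step 6 can also be queried from Spark itself. A hedged sketch: expose hive-site.xml to Spark and run spark-sql (Spark 2.4's built-in Hive client talking to a Hive 3.1 metastore may need additional compatibility settings):
$ cp /opt/hive/conf/hive-site.xml /opt/spark/conf/
$ spark-sql -e "select count(*) from stock"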
10. Install and configure Sqoop
- Install the current stable Sqoop release, sqoop-1.4.7.bin__hadoop-2.6.0
$ ln -s /opt/lib/sqoop-1.4.7.bin__hadoop-2.6.0 /opt/sqoop
- Environment variables in .bashrc
$ vi ~/.bashrc
export HIVE_CONF_DIR=$HIVE_HOME/conf
export SQOOP_HOME=/opt/sqoop
export CLASSPATH=$CLASSPATH:${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*:${SQOOP_HOME}/lib/*
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$SQOOP_HOME/bin
$ source ~/.bashrc
- Add the required libraries
$ ln -s /usr/lib/java/mysql-connector-java-8.0.22.jar /opt/sqoop/lib/
$ cp /opt/lib/apache-tez-0.9.2-bin/lib/commons-lang-2.6.jar /opt/sqoop/lib/
$ ln -s /opt/hive/lib/hive-common-3.1.2.jar /opt/sqoop/lib/
- Test importing a MySQL table into Hive (a verification example follows the log below)
$ sqoop import --connect jdbc:mysql://dbserver:3306/dm?serverTimezone=UTC --username stockdb -P --table crawling_naver --warehouse-dir /apps/hive/warehouse -m 1 --hive-import --create-hive-table
.........
2020-11-01 17:50:02,384 INFO hive.HiveImport: Hive Session ID = 8e53543c-26ca-4977-9f0c-c458591b4a67
2020-11-01 17:50:04,446 INFO hive.HiveImport: OK
2020-11-01 17:50:04,447 INFO hive.HiveImport: Time taken: 2.013 seconds
2020-11-01 17:50:04,795 INFO hive.HiveImport: Loading data to table default.crawling_naver
2020-11-01 17:50:05,199 INFO hive.HiveImport: OK
2020-11-01 17:50:05,199 INFO hive.HiveImport: Time taken: 0.75 seconds
2020-11-01 17:50:05,683 INFO hive.HiveImport: Hive import complete.
2020-11-01 17:50:05,705 INFO hive.HiveImport: Export directory is not empty, keeping it.
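A quick way to verify the imported table from beeline (a hedged example; the table name comes from the import above):
$ beeline -u jdbc:hive2://lab1:10000 -e "select count(*) from crawling_naver"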