This article builds on the earlier posts Hadoop2.7.6_01_部署 (Hadoop 2.7.6 deployment) and Hive-1.2.1_01_安装部署 (Hive 1.2.1 installation and deployment).
1. Preface
In a complete big-data processing system, the core analysis stack of HDFS, MapReduce, and Hive has to be complemented by auxiliary systems for data collection, result export, task scheduling, and so on. The Hadoop ecosystem provides convenient open-source frameworks for each of these, as illustrated in the usual Hadoop ecosystem diagram.
1.1. Overview
Sqoop is an Apache tool for "transferring data between Hadoop and relational database servers".
Import: load data from MySQL, Oracle, and other relational databases into Hadoop storage systems such as HDFS, Hive, and HBase.
Export: move data from the Hadoop file system back out into a relational database.
1.3. How it works
Sqoop implements each import or export command by translating it into a MapReduce program.
In the generated MapReduce job, the main customization is of the InputFormat and OutputFormat.
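One visible piece of this machinery is the codegen tool (it appears in the help output in section 2.5), which generates the Java record class that Sqoop compiles into its MapReduce job. A minimal, hedged sketch, assuming the sqoop_test database and emp table that are set up later in this article:

```bash
# Generate (without running an import) the record class Sqoop would use for emp.
# Connection details match the test database configured in section 3 below.
bin/sqoop codegen \
  --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table emp
# Sqoop writes emp.java (and a compiled jar) under a temporary compile directory.
```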
2. Sqoop installation and deployment
2.1. Software deployment
[yun@mini01 software]$ pwd
/app/software
[yun@mini01 software]$ tar xf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[yun@mini01 software]$ mv sqoop-1.4.7.bin__hadoop-2.6.0 /app/sqoop-1.4.7
[yun@mini01 software]$ cd /app/
[yun@mini01 ~]$ ln -s sqoop-1.4.7/ sqoop
[yun@mini01 ~]$ ll
total 28
…………
lrwxrwxrwx 1 yun yun  12 Aug  2 11:40 sqoop -> sqoop-1.4.7/
drwxr-xr-x 9 yun yun 318 Dec 19  2017 sqoop-1.4.7
2.2. Configuration changes
[yun@mini01 conf]$ pwd
/app/sqoop/conf
[yun@mini01 conf]$ ll
total 32
-rw-rw-r-- 1 yun yun 3895 Dec 19  2017 oraoop-site-template.xml
-rw-rw-r-- 1 yun yun 1404 Dec 19  2017 sqoop-env-template.cmd
-rwxr-xr-x 1 yun yun 1345 Dec 19  2017 sqoop-env-template.sh
-rw-rw-r-- 1 yun yun 6044 Dec 19  2017 sqoop-site-template.xml
-rw-rw-r-- 1 yun yun 6044 Dec 19  2017 sqoop-site.xml
[yun@mini01 conf]$ cp -a sqoop-env-template.sh sqoop-env.sh
[yun@mini01 conf]$ cat sqoop-env.sh
# Licensed to the Apache Software Foundation (ASF) under one or more
………………

# Set Hadoop-specific environment variables here.

#Set path to where bin/hadoop is available          # modified setting
export HADOOP_COMMON_HOME=${HADOOP_HOME}

#Set path to where hadoop-*-core.jar is available   # modified setting
export HADOOP_MAPRED_HOME=${HADOOP_HOME}

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available        # modified setting
export HIVE_HOME=${HIVE_HOME}

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=
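Note that sqoop-env.sh only references ${HADOOP_HOME} and ${HIVE_HOME}, so those variables must already be exported in the shell environment. A minimal, hedged sketch; /app/hive matches the layout used in section 2.4, while /app/hadoop is an assumption, so adjust both to your own deployment:

```bash
# Assumed install locations -- verify against your own environment
export HADOOP_HOME=/app/hadoop
export HIVE_HOME=/app/hive
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin:/app/sqoop/bin
```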
2.3. Add the MySQL JDBC driver
[yun@mini01 software]$ pwd
/app/software
[yun@mini01 software]$ cp -a mysql-connector-java-5.1.46.jar /app/sqoop/lib
2.4. Add the Hive exec jar
[yun@mini01 lib]$ pwd
/app/hive/lib
[yun@mini01 lib]$ cp -a hive-exec-1.2.1.jar /app/sqoop/lib/
Copying this jar avoids missing Hive class errors when Sqoop later performs Hive imports.
2.5. Verify the installation
[yun@mini01 bin]$ pwd
/app/sqoop/bin
[yun@mini01 bin]$ ./sqoop-version
18/08/02 14:56:54 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
[yun@mini01 bin]$ ./sqoop help   # show the available commands
18/08/02 15:30:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
usage: sqoop COMMAND [ARGS]

Available commands:
  codegen            Generate code to interact with database records
  create-hive-table  Import a table definition into Hive
  eval               Evaluate a SQL statement and display the results
  export             Export an HDFS directory to a database table
  help               List available commands
  import             Import a table from a database to HDFS
  import-all-tables  Import tables from a database to HDFS
  import-mainframe   Import datasets from a mainframe server to HDFS
  job                Work with saved jobs
  list-databases     List available databases on a server
  list-tables        List available tables in a database
  merge              Merge results of incremental imports
  metastore          Run a standalone Sqoop metastore
  version            Display version information

See 'sqoop help COMMAND' for information on a specific command.
3. Database setup
# Create the database
CREATE DATABASE sqoop_test DEFAULT CHARACTER SET utf8;
# Create the account (the MySQL server runs on mini03)
grant all on sqoop_test.* to sqoop_test@'%' identified by 'sqoop_test';
grant all on sqoop_test.* to sqoop_test@'mini03' identified by 'sqoop_test';
# Reload the privilege tables
flush privileges;
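The new account and the JDBC driver can be sanity-checked before any import by using the list-tables command shown in the help output above. A hedged sketch:

```bash
# Confirm Sqoop can reach MySQL on mini03 with the sqoop_test account;
# this should list the emp, emp_add and emp_conn tables described below.
bin/sqoop list-tables \
  --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test
```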
3.1. Table contents
Table emp:
| id | name | deg | salary | dept |
|---|---|---|---|---|
| 1201 | gopal | manager | 50000 | TP |
| 1202 | manisha | Proof reader | 50000 | TP |
| 1203 | khalil | php dev | 30000 | AC |
| 1204 | prasanth | php dev | 30000 | AC |
| 1205 | kranthi | admin | 20000 | TP |
Table emp_add:
| id | hno | street | city |
|---|---|---|---|
| 1201 | 288A | vgiri | jublee |
| 1202 | 108I | aoc | sec-bad |
| 1203 | 144Z | pgutta | hyd |
| 1204 | 78B | old city | sec-bad |
| 1205 | 720X | hitec | sec-bad |
Table emp_conn:
| id | phno | email |
|---|---|---|
| 1201 | 2356742 | gopal@tp.com |
| 1202 | 1661663 | manisha@tp.com |
| 1203 | 8887776 | khalil@ac.com |
| 1204 | 9988774 | prasanth@ac.com |
| 1205 | 1231231 | kranthi@tp.com |
4. Importing data with Sqoop
The import tool imports a single table from an RDBMS into HDFS. Each row of the table becomes a record in HDFS. All records are stored as text in text files, or as binary data in Avro or SequenceFiles.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
4.1. Import a table into HDFS
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop import \
  --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test \
  --password sqoop_test \
  --table emp \
  --m 1
18/08/02 15:57:55 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
18/08/02 15:57:55 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
………………
18/08/02 15:58:47 INFO mapreduce.Job:  map 100% reduce 0%
18/08/02 15:58:47 INFO mapreduce.Job: Job job_1533196573365_0001 completed successfully
………………
18/08/02 15:58:47 INFO mapreduce.ImportJobBase: Transferred 151 bytes in 47.2865 seconds (3.1933 bytes/sec)
18/08/02 15:58:47 INFO mapreduce.ImportJobBase: Retrieved 5 records.
View the imported data:
[yun@mini02 ~]$ hadoop fs -ls /user/yun/emp
Found 2 items
-rw-r--r--   2 yun supergroup          0 2018-08-02 15:58 /user/yun/emp/_SUCCESS
-rw-r--r--   2 yun supergroup        151 2018-08-02 15:58 /user/yun/emp/part-m-00000
[yun@mini02 ~]$ hadoop fs -cat /user/yun/emp/part-m-00000
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
4.2. Import a table into a specified HDFS directory
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop import --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --target-dir /sqoop_test/table_emp/queryresult \
  --table emp --num-mappers 1
Note: if the target directory does not exist, it is created automatically.
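The converse does not hold: if the target directory is left over from an earlier run, the MapReduce job aborts because its output directory already exists. A hedged sketch of a re-runnable variant that adds --delete-target-dir (the same flag used in section 4.4) to clear the old output first:

```bash
# Re-runnable variant of the import above; --delete-target-dir removes the
# existing queryresult directory before the new job writes its output.
bin/sqoop import --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table emp --num-mappers 1 \
  --delete-target-dir \
  --target-dir /sqoop_test/table_emp/queryresult
```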
View the imported data:
[yun@mini02 ~]$ hadoop fs -ls /sqoop_test/table_emp/queryresult
Found 2 items
-rw-r--r--   2 yun supergroup          0 2018-08-02 17:57 /sqoop_test/table_emp/queryresult/_SUCCESS
-rw-r--r--   2 yun supergroup        151 2018-08-02 17:57 /sqoop_test/table_emp/queryresult/part-m-00000
[yun@mini02 ~]$ hadoop fs -cat /sqoop_test/table_emp/queryresult/part-m-00000
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
4.3. Import a relational table into Hive
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop import --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table emp --hive-import \
  --num-mappers 1
18/08/02 17:40:45 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
18/08/02 17:40:45 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
………………
18/08/02 17:41:09 INFO mapreduce.ImportJobBase: Transferred 151 bytes in 20.6744 seconds (7.3037 bytes/sec)
18/08/02 17:41:09 INFO mapreduce.ImportJobBase: Retrieved 5 records.
18/08/02 17:41:09 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table emp
18/08/02 17:41:10 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
18/08/02 17:41:10 WARN hive.TableDefWriter: Column salary had to be cast to a less precise type in Hive
18/08/02 17:41:10 INFO hive.HiveImport: Loading uploaded data into Hive
18/08/02 17:41:12 INFO hive.HiveImport:
18/08/02 17:41:12 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/app/sqoop-1.4.7/lib/hive-exec-1.2.1.jar!/hive-log4j.properties
18/08/02 17:41:20 INFO hive.HiveImport: OK
18/08/02 17:41:20 INFO hive.HiveImport: Time taken: 1.677 seconds
18/08/02 17:41:20 INFO hive.HiveImport: Loading data to table default.emp
18/08/02 17:41:21 INFO hive.HiveImport: Table default.emp stats: [numFiles=1, totalSize=151]
18/08/02 17:41:21 INFO hive.HiveImport: OK
18/08/02 17:41:21 INFO hive.HiveImport: Time taken: 0.629 seconds
18/08/02 17:41:21 INFO hive.HiveImport: Hive import complete.
18/08/02 17:41:21 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
View the imported data:
hive (default)> show tables;
OK
emp
Time taken: 0.031 seconds, Fetched: 1 row(s)
hive (default)> select * from emp;
OK
1201    gopal       manager        50000.0    TP
1202    manisha     Proof reader   50000.0    TP
1203    khalil      php dev        30000.0    AC
1204    prasanth    php dev        30000.0    AC
1205    kranthi     admin          20000.0    TP
Time taken: 0.489 seconds, Fetched: 5 row(s)
4.4. Import into a specified Hive database and table
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop import --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table emp \
  --delete-target-dir \
  --fields-terminated-by '\t' \
  --hive-import \
  --hive-database sqoop_test \
  --hive-table hive_emp \
  --num-mappers 1
Note: the Hive database sqoop_test must be created first; otherwise the import fails with: FAILED: SemanticException [Error 10072]: Database does not exist: sqoop_test
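A minimal way to create that database up front is through the same HiveServer2 endpoint used below (hive -e works equally well; add -n <user> if your HiveServer2 requires a login):

```bash
# Create the Hive database once before running the import
beeline -u jdbc:hive2://mini01:10000 -e "CREATE DATABASE IF NOT EXISTS sqoop_test;"
```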
View the imported data:
0: jdbc:hive2://mini01:10000> use sqoop_test;
No rows affected (0.049 seconds)
0: jdbc:hive2://mini01:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
| hive_emp  |
+-----------+--+
1 row selected (0.076 seconds)
0: jdbc:hive2://mini01:10000> select * from hive_emp;
+--------------+----------------+---------------+------------------+----------------+--+
| hive_emp.id  | hive_emp.name  | hive_emp.deg  | hive_emp.salary  | hive_emp.dept  |
+--------------+----------------+---------------+------------------+----------------+--+
| 1201         | gopal          | manager       | 50000.0          | TP             |
| 1202         | manisha        | Proof reader  | 50000.0          | TP             |
| 1203         | khalil         | php dev       | 30000.0          | AC             |
| 1204         | prasanth       | php dev       | 30000.0          | AC             |
| 1205         | kranthi        | admin         | 20000.0          | TP             |
+--------------+----------------+---------------+------------------+----------------+--+
5 rows selected (0.162 seconds)
4.5. Import a subset of a table
4.5.1. Import with a WHERE clause
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop import --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table emp_add \
  --where "city ='sec-bad'" \
  --target-dir /sqoop_test/table_emp/queryresult2 \
  --num-mappers 1
View the imported data:
[yun@mini02 ~]$ hadoop fs -ls /sqoop_test/table_emp/queryresult2
Found 2 items
-rw-r--r--   2 yun supergroup          0 2018-08-02 19:33 /sqoop_test/table_emp/queryresult2/_SUCCESS
-rw-r--r--   2 yun supergroup         72 2018-08-02 19:33 /sqoop_test/table_emp/queryresult2/part-m-00000
[yun@mini02 ~]$ hadoop fs -cat /sqoop_test/table_emp/queryresult2/part-m-00000
1202,108I,aoc,sec-bad
1204,78B,old city,sec-bad
1205,720X,hitec,sec-bad
4.5.2. Import with a free-form query
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop import --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --query 'select id,name,deg from emp WHERE id>1203 and $CONDITIONS' \
  --split-by id \
  --fields-terminated-by '\t' \
  --target-dir /sqoop_test/table_emp/queryresult4 \
  --num-mappers 1
View the imported data:
[yun@mini02 ~]$ hadoop fs -cat /sqoop_test/table_emp/queryresult4/part-m-00000
1204    prasanth    php dev
1205    kranthi     admin
4.6. Incremental import
Required parameters:
--check-column (col)
--incremental (mode)    # mode is either append or lastmodified
--last-value (value)
Example:
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop import --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table emp \
  --incremental append \
  --check-column id \
  --last-value 1202 \
  --fields-terminated-by '\t' \
  --target-dir /sqoop_test/table_emp/queryresult4 \
  --num-mappers 1
View the imported data:
[yun@mini02 ~]$ hadoop fs -ls /sqoop_test/table_emp/queryresult4/
Found 3 items
-rw-r--r--   2 yun supergroup          0 2018-08-02 19:47 /sqoop_test/table_emp/queryresult4/_SUCCESS
-rw-r--r--   2 yun supergroup         41 2018-08-02 19:47 /sqoop_test/table_emp/queryresult4/part-m-00000
-rw-r--r--   2 yun supergroup         88 2018-08-02 20:16 /sqoop_test/table_emp/queryresult4/part-m-00001
[yun@mini02 ~]$
[yun@mini02 ~]$ hadoop fs -cat /sqoop_test/table_emp/queryresult4/part-m-00000
1204    prasanth    php dev
1205    kranthi     admin
[yun@mini02 ~]$ hadoop fs -cat /sqoop_test/table_emp/queryresult4/part-m-00001
1203    khalil      php dev    30000    AC
1204    prasanth    php dev    30000    AC
1205    kranthi     admin      20000    TP
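Tracking --last-value by hand quickly gets tedious. Sqoop's saved-job facility (the job command in the help output above) records the last imported value in its metastore and reuses it on every run. A hedged sketch, with the job name emp_append chosen here purely for illustration:

```bash
# Create a saved incremental-import job (note the bare "--" before "import")
bin/sqoop job --create emp_append -- import \
  --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table emp \
  --incremental append --check-column id --last-value 0 \
  --fields-terminated-by '\t' \
  --target-dir /sqoop_test/table_emp/queryresult4 \
  --num-mappers 1

# Each execution imports only rows with id greater than the stored last-value;
# Sqoop may prompt for the password at exec time unless password storage is enabled.
bin/sqoop job --exec emp_append
bin/sqoop job --list
```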
5. Exporting data with Sqoop
Export data from HDFS into an RDBMS.
Before exporting, the target table must already exist in the target database.
- The default operation is to insert the data from the files into the table with INSERT statements.
- In update mode, UPDATE statements are generated to refresh existing rows (a hedged sketch of this mode appears at the end of section 5.1).
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
5.1. Example
Source data:
[yun@mini02 ~]$ hadoop fs -ls /sqoop_test/table_emp/queryresult
Found 2 items
-rw-r--r--   2 yun supergroup          0 2018-08-02 17:57 /sqoop_test/table_emp/queryresult/_SUCCESS
-rw-r--r--   2 yun supergroup        151 2018-08-02 17:57 /sqoop_test/table_emp/queryresult/part-m-00000
[yun@mini02 ~]$ hadoop fs -cat /sqoop_test/table_emp/queryresult/part-m-00000
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
1. First, manually create the target table in MySQL.
MariaDB [(none)]> use sqoop_test;
Database changed

MariaDB [sqoop_test]> CREATE TABLE employee (
   id INT NOT NULL PRIMARY KEY,
   name VARCHAR(20),
   deg VARCHAR(20),
   salary INT,
   dept VARCHAR(10));
Query OK, 0 rows affected (0.00 sec)

MariaDB [sqoop_test]> show tables;
+----------------------+
| Tables_in_sqoop_test |
+----------------------+
| emp                  |
| emp_add              |
| emp_conn             |
| employee             |
+----------------------+
4 rows in set (0.00 sec)
2. Then run the export command.
[yun@mini01 sqoop]$ pwd
/app/sqoop
[yun@mini01 sqoop]$ bin/sqoop export \
  --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table employee \
  --export-dir /sqoop_test/table_emp/queryresult/
3. Verify the table from the MySQL command line.
MariaDB [sqoop_test]> select * from employee;
+------+----------+--------------+--------+------+
| id   | name     | deg          | salary | dept |
+------+----------+--------------+--------+------+
| 1201 | gopal    | manager      |  50000 | TP   |
| 1202 | manisha  | Proof reader |  50000 | TP   |
| 1203 | khalil   | php dev      |  30000 | AC   |
| 1204 | prasanth | php dev      |  30000 | AC   |
| 1205 | kranthi  | admin        |  20000 | TP   |
+------+----------+--------------+--------+------+
5 rows in set (0.00 sec)
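For the update mode mentioned at the start of this chapter, the export can be keyed on an existing column. A hedged sketch using --update-key and --update-mode: allowinsert turns the job into an upsert where the connector supports it, while the default updateonly silently skips rows that do not already exist in MySQL.

```bash
# Re-export the same directory, updating existing MySQL rows matched by id
bin/sqoop export \
  --connect jdbc:mysql://mini03:3306/sqoop_test \
  --username sqoop_test --password sqoop_test \
  --table employee \
  --export-dir /sqoop_test/table_emp/queryresult/ \
  --update-key id \
  --update-mode allowinsert
```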