Hive动态分区之多分区
在多分区的情况下,我们可以实现半自动分区和完全自动分区,可以根据需求进行选择。
1、 半自动分区
在半自动分区中,本人将分别展示英文和中文的分区
1.1 英文分区
(1)准备数据
代码语言:javascript复制hive> create table user_3g_age(id int ,name string ,city string ,age int)
row format delimited
fields terminated by 't';
OK
Time taken: 0.463 seconds
hive> load data local inpath '/root/book/3guserage.txt' overwrite into table user_3g_age;
Loading data to table test.user_3g_age
Table test.user_3g_age stats: [numFiles=1, numRows=0, totalSize=153, rawDataSize=0]
OK
Time taken: 0.678 seconds
hive> select * from user_3g_age;
OK
1 刘备 shu 30
2 关羽 shu 29
3 张飞 shu 28
4 曹操 wei 32
5 曹丕 wei 5
6 孙权 wu 30
7 孙策 wu 28
8 孙尚香 wu 20
9 夏侯? wei 30
Time taken: 0.427 seconds, Fetched: 9 row(s)
(2)创建分区表
代码语言:javascript复制hive> create table user_3g_age_area(id int, name string)
partitioned by(city string, age int)
row format delimited
fields terminated by 't';
OK
Time taken: 0.079 seconds
(3)设置动态分区参数
代码语言:javascript复制hive>set hive.exec.dynamici.partition=true;
hive>set hive.exec.dynamic.partition.mode=nonstrict; #半自动分区这个参数可以不设置。
(4)开始半自动分区
代码语言:javascript复制hive> insert into table user_3g_age_area
partition(city='wu', age)
select id, name, age #注意查询语句中只写普通字段和动态分区字段。
from user_3g_age
where city='wu';
Query ID = root_20200617140809_f71a64ab-8a96-4d30-b933-aecdc9c08b5f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0006, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0006/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 14:08:18,789 Stage-1 map = 0%, reduce = 0%
2020-06-17 14:08:27,359 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.82 sec
MapReduce Total cumulative CPU time: 1 seconds 820 msec
Ended Job = job_1592152196890_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_age_area/city=wu/.hive-staging_hive_2020-06-17_14-08-09_773_4025041538196608852-1/-ext-10000
Loading data to table test.user_3g_age_area partition (city=wu, age=null)
Time taken for load dynamic partitions : 424
Loading partition {city=wu, age=20}
Loading partition {city=wu, age=28}
Loading partition {city=wu, age=30}
Time taken for adding to write entity : 2
Partition test.user_3g_age_area{city=wu, age=20} stats: [numFiles=2, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_3g_age_area{city=wu, age=28} stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
Partition test.user_3g_age_area{city=wu, age=30} stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.82 sec HDFS Read: 4620 HDFS Write: 232 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 820 msec
OK
导入数据的查询语句中,最后的where子句,根据数据具体情况使用。
如下有几种错误的导入写法,仅供参考:
- 错误写法一:
hive> insert into table user_3g_age_area
partition(city='wu', age)
select id, name, city #将查询字段写成了静态分区字段
from user_3g_age where city='wu';
Query ID = root_20200617140142_fa4672a3-3d7d-43f6-901e-88b5392c119b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0003, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0003/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 14:01:51,192 Stage-1 map = 0%, reduce = 0%
2020-06-17 14:02:00,899 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1592152196890_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_age_area/city=wu/.hive-staging_hive_2020-06-17_14-01-42_632_4964455617566151732-1/-ext-10000
Loading data to table test.user_3g_age_area partition (city=wu, age=null)
Time taken for load dynamic partitions : 120
Loading partition {city=wu, age=wu}
Time taken for adding to write entity : 1
Partition test.user_3g_age_area{city=wu, age=wu} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.97 sec HDFS Read: 4613 HDFS Write: 122 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 970 msec
OK
Time taken: 19.886 seconds
hive> select * from user_3g_age_area;
OK
6 孙权 wu NULL
7 孙策 wu NULL
8 孙尚香 wu NULL
Time taken: 0.107 seconds, Fetched: 3 row(s)
- 错误写法二:
hive> insert into table user_3g_age_area
partition(city='wu', age)
select id,name,city,age #将所有字段都写到了查询语句中
from user_3g_age;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number/types are different 'age': Table insclause-0 has 3 columns, but query has 4 columns.
(5)查看分区
代码语言:javascript复制hive> show partitions user_3g_age_area;
OK
city=wu/age=20
city=wu/age=28
city=wu/age=30
Time taken: 0.064 seconds, Fetched: 7 row(s)
1.2 中文分区
想实现中文分区,如果遇到问题,请查看:Hive动态分区
创建中文分区步骤都一样,就不做详细介绍了,直接演示过程:
代码语言:javascript复制#创建表
hive> create table user_xy_race(id int, name string, city string , race string)
row format delimited
fields terminated by 't';
OK
Time taken: 0.04 seconds
#加载数据
hive> load data local inpath '/root/book/xyuserrace.txt' overwrite into table user_xy_race;
Loading data to table test.user_xy_race
Table test.user_xy_race stats: [numFiles=1, numRows=0, totalSize=272, rawDataSize=0]
OK
Time taken: 0.558 seconds
#查看原始数据
hive> select * from user_xy_race;
OK
1 孙悟空 花果山 妖
2 猪八戒 高老庄 妖
3 沙和尚 流沙河 妖
4 唐玄奘 西安 人
5 小白龙 清潭涧 仙
6 唐玄宗 西安 人
7 观世音 南海 佛
8 玉皇大帝 凌霄宝殿 仙
9 黄风怪 小雷音寺 妖
10 如来佛祖 雷音寺 佛
Time taken: 0.059 seconds, Fetched: 10 row(s)
#创建分区表
hive> create table user_xy_race_area(id int, name string)
partitioned by(city string,race string)
row format delimited
fields terminated by 't';
OK
Time taken: 0.059 seconds
#设置动态分区参数
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
#半自动动态分区
hive> insert into table user_xy_race_area
partition(city='西安',race)
select id,name,race
from user_xy_race
where city='西安';
Query ID = root_20200617174813_5d9c2185-926d-4948-9894-96e4e1b30749
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0009, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0009/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:48:22,325 Stage-1 map = 0%, reduce = 0%
2020-06-17 17:48:31,876 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1592152196890_0009
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_race_area/city=西安/.hive-staging_hive_2020-06-17_17-48-13_156_5420308159658982499-1/-ext-10000
Loading data to table test.user_xy_race_area partition (city=西安, race=null)
Time taken for load dynamic partitions : 93
Loading partition {city=西安, race=人}
Time taken for adding to write entity : 0
Partition test.user_xy_race_area{city=西安, race=人} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.78 sec HDFS Read: 4654 HDFS Write: 124 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
Time taken: 20.209 seconds
#查看数据
hive> select * from user_xy_race_area;
OK
4 唐玄奘 西安 人
6 唐玄宗 西安 人
Time taken: 0.063 seconds, Fetched: 2 row(s)
#查看分区
hive> show partitions user_xy_race_area;
OK
city�/race=�
Time taken: 0.098 seconds, Fetched: 1 row(s)
中文分区在hive中显示为乱码,但是不影响正常使用。Mysql中也会有显示乱码的情况,但是MySQL有一个names可以设置编码格式。
2、全自动分区
2.1英文
全自动还是用半自动的数据进行,操作如下:
代码语言:javascript复制#建表
hive> create table user_3g_area_all (id int,name string)
partitioned by (city string,age string) #此处将类型写错了
row format delimited
fields terminated by 't';
OK
Time taken: 0.17 seconds
#修改字段类型
hive> alter table user_3g_area_all partition column (age int);
OK
Time taken: 0.097 seconds
#设置参数
set hive.exec.dynamici.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
#开始自动分区导入数据
hive> insert into table user_3g_area_all
partition(city,age)
select id, name, city, age #这里要把字段写完整。
from user_3g_age;
Query ID = root_20200617171344_b3dae4d6-e248-440b-b977-2db4e191e298
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0008, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0008/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:13:54,523 Stage-1 map = 0%, reduce = 0%
2020-06-17 17:14:03,066 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.4 sec
MapReduce Total cumulative CPU time: 1 seconds 400 msec
Ended Job = job_1592152196890_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_area_all/.hive-staging_hive_2020-06-17_17-13-44_647_9180685517392249162-1/-ext-10000
Loading data to table test.user_3g_area_all partition (city=null, age=null)
Time taken for load dynamic partitions : 632
Loading partition {city=wu, age=28}
Loading partition {city=shu, age=28}
Loading partition {city=shu, age=29}
Loading partition {city=wei, age=32}
Loading partition {city=wu, age=20}
Loading partition {city=wei, age=30}
Loading partition {city=wu, age=30}
Loading partition {city=shu, age=30}
Loading partition {city=wei, age=5}
Time taken for adding to write entity : 3
Partition test.user_3g_area_all{city=shu, age=28} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=shu, age=29} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=shu, age=30} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wei, age=30} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_3g_area_all{city=wei, age=32} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wei, age=5} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wu, age=20} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_3g_area_all{city=wu, age=28} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wu, age=30} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.4 sec HDFS Read: 4247 HDFS Write: 624 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 400 msec
OK
Time taken: 22.086 seconds
#查看数据
hive> select * from user_3g_area_all ;
OK
3 张飞 shu 28
2 关羽 shu 29
1 刘备 shu 30
9 夏侯惇 wei 30
4 曹操 wei 32
5 曹丕 wei 5
8 孙尚香 wu 20
7 孙策 wu 28
6 孙权 wu 30
Time taken: 0.125 seconds, Fetched: 9 row(s)
如果少写了字段会报以下错误:
代码语言:javascript复制hive> insert into table user_3g_area_all partition(city,age) select id, name, age from user_3g_age;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number/types are different 'age': Table insclause-0 has 4 columns, but query has 3 columns.
查看分区:
代码语言:javascript复制hive> show partitions user_3g_area_all;
OK
city=shu/age=28
city=shu/age=29
city=shu/age=30
city=wei/age=30
city=wei/age=32
city=wei/age=5
city=wu/age=20
city=wu/age=28
city=wu/age=30
Time taken: 0.097 seconds, Fetched: 9 row(s)
2.2中文
过程和英文一样,不做具体介绍,只演示命令:
代码语言:javascript复制#创建分区表
hive> create table user_xy_area_all (id int,name string)
partitioned by(city string,race string)
row format delimited
fields terminated by 't';
OK
Time taken: 0.059 seconds
#设置参数
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
#开始导入数据
hive> insert into table user_xy_area_all
partition(city,race)
select id,name,city,race
from user_xy_race;
Query ID = root_20200617175222_16bf2b69-361f-4df3-939f-78c3bb8f7ef5
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0010, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0010/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job -kill job_1592152196890_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:52:30,712 Stage-1 map = 0%, reduce = 0%
2020-06-17 17:52:39,123 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.39 sec
MapReduce Total cumulative CPU time: 1 seconds 390 msec
Ended Job = job_1592152196890_0010
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_area_all/.hive-staging_hive_2020-06-17_17-52-22_348_8759507579611250211-1/-ext-10000
Loading data to table test.user_xy_area_all partition (city=null, race=null)
Time taken for load dynamic partitions : 592
Loading partition {city=花果山, race=妖}
Loading partition {city=西安, race=人}
Loading partition {city=雷音寺, race=佛}
Loading partition {city=凌霄宝殿, race=仙}
Loading partition {city=高老庄, race=妖}
Loading partition {city=流沙河, race=妖}
Loading partition {city=清潭涧, race=仙}
Loading partition {city=南海, race=佛}
Loading partition {city=小雷音寺, race=妖}
Time taken for adding to write entity : 4
Partition test.user_xy_area_all{city=凌霄宝殿, race=仙} stats: [numFiles=1, numRows=1, totalSize=15, rawDataSize=14]
Partition test.user_xy_area_all{city=南海, race=佛} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=小雷音寺, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=流沙河, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=清潭涧, race=仙} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=花果山, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=西安, race=人} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_xy_area_all{city=雷音寺, race=佛} stats: [numFiles=1, numRows=1, totalSize=16, rawDataSize=15]
Partition test.user_xy_area_all{city=高老庄, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.39 sec HDFS Read: 4321 HDFS Write: 749 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 390 msec
OK
Time taken: 19.139 seconds
#查看数据
hive> select * from user_xy_area_all;
OK
8 玉皇大帝 凌霄宝殿 仙
7 观世音 南海 佛
9 黄风怪 小雷音寺 妖
3 沙和尚 流沙河 妖
5 小白龙 清潭涧 仙
1 孙悟空 花果山 妖
4 唐玄奘 西安 人
6 唐玄宗 西安 人
10 如来佛祖 雷音寺 佛
2 猪八戒 高老庄 妖
Time taken: 0.063 seconds, Fetched: 10 row(s)
#查看分区
hive> show partitions user_xy_area_all;
OK
city=� ��/race=�
city=Ww/race=[
city=���/race=�
city=A��/race=�
city=m�/race=�
city=��q/race=�
city�/race=�
city=���/race=[
city=� �/race=�
Time taken: 0.062 seconds, Fetched: 9 row(s)
以上为全过程!