Hive Dynamic Partitioning with Multiple Partition Columns

2021-12-31 16:25:21

With multiple partition columns, you can do either semi-automatic or fully automatic partitioning; choose whichever fits your needs.
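The two styles differ only in the partition clause of the insert. As a minimal sketch (with hypothetical table names t and src, mirroring the real examples below):

```sql
-- Semi-automatic: at least one partition column is given a fixed (static)
-- value; the remaining partition columns are filled from the query.
insert into table t partition(city='wu', age)
select id, name, age from src where city='wu';

-- Fully automatic: every partition column is dynamic; the dynamic columns
-- must come last in the select list, in the order of the partition clause.
insert into table t partition(city, age)
select id, name, city, age from src;
```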

1. Semi-Automatic Partitioning

For semi-automatic partitioning, the demonstrations below cover English and Chinese partition values separately.

1.1 English Partition Values

(1) Prepare the data
hive> create table user_3g_age(id int, name string, city string, age int) 
row format delimited 
fields terminated by '\t';

OK
Time taken: 0.463 seconds
hive> load data local inpath '/root/book/3guserage.txt' overwrite into table user_3g_age;
Loading data to table test.user_3g_age
Table test.user_3g_age stats: [numFiles=1, numRows=0, totalSize=153, rawDataSize=0]
OK
Time taken: 0.678 seconds

hive> select * from user_3g_age;
OK
1	刘备	shu	30
2	关羽	shu	29
3	张飞	shu	28
4	曹操	wei	32
5	曹丕	wei	5
6	孙权	wu	30
7	孙策	wu	28
8	孙尚香	wu	20
9	夏侯惇	wei	30
Time taken: 0.427 seconds, Fetched: 9 row(s)
(2) Create the partition table
hive> create table user_3g_age_area(id int, name string) 
partitioned by(city string, age int) 
row format delimited 
fields terminated by '\t';
OK
Time taken: 0.079 seconds
(3) Set the dynamic-partition parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;   -- optional for semi-automatic partitioning, since one partition column is static
(4) Run the semi-automatic insert
hive> insert into table user_3g_age_area 
partition(city='wu', age) 
select id, name, age -- list only the regular columns and the dynamic partition columns
from user_3g_age 
where city='wu';

Query ID = root_20200617140809_f71a64ab-8a96-4d30-b933-aecdc9c08b5f
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0006, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0006/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592152196890_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 14:08:18,789 Stage-1 map = 0%,  reduce = 0%
2020-06-17 14:08:27,359 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.82 sec
MapReduce Total cumulative CPU time: 1 seconds 820 msec
Ended Job = job_1592152196890_0006
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_age_area/city=wu/.hive-staging_hive_2020-06-17_14-08-09_773_4025041538196608852-1/-ext-10000
Loading data to table test.user_3g_age_area partition (city=wu, age=null)
	 Time taken for load dynamic partitions : 424
	Loading partition {city=wu, age=20}
	Loading partition {city=wu, age=28}
	Loading partition {city=wu, age=30}
	 Time taken for adding to write entity : 2
Partition test.user_3g_age_area{city=wu, age=20} stats: [numFiles=2, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_3g_age_area{city=wu, age=28} stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
Partition test.user_3g_age_area{city=wu, age=30} stats: [numFiles=2, numRows=2, totalSize=18, rawDataSize=16]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.82 sec   HDFS Read: 4620 HDFS Write: 232 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 820 msec
OK

In the query that feeds the insert, adapt the final where clause to your data: here city is pinned to 'wu', so the filter keeps rows from other cities out of that static partition.
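For instance, if the filter were dropped while city stayed pinned to 'wu', rows from shu and wei would also be written under city='wu'. A sketch of that mistake (not from the original run):

```sql
-- All 9 rows, including the shu and wei ones, would land in
-- partitions under city='wu', silently mislabeling them:
insert into table user_3g_age_area
partition(city='wu', age)
select id, name, age
from user_3g_age;   -- missing: where city='wu'
```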

A few incorrect versions of the insert follow, for reference:

  • Incorrect version 1:
hive> insert into table user_3g_age_area 
partition(city='wu', age) 
select id, name, city -- the static partition column was selected instead of the dynamic one
from user_3g_age where city='wu';

Query ID = root_20200617140142_fa4672a3-3d7d-43f6-901e-88b5392c119b
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0003, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0003/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592152196890_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 14:01:51,192 Stage-1 map = 0%,  reduce = 0%
2020-06-17 14:02:00,899 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1592152196890_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_age_area/city=wu/.hive-staging_hive_2020-06-17_14-01-42_632_4964455617566151732-1/-ext-10000
Loading data to table test.user_3g_age_area partition (city=wu, age=null)
	 Time taken for load dynamic partitions : 120
	Loading partition {city=wu, age=wu}
	 Time taken for adding to write entity : 1
Partition test.user_3g_age_area{city=wu, age=wu} stats: [numFiles=1, numRows=3, totalSize=30, rawDataSize=27]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.97 sec   HDFS Read: 4613 HDFS Write: 122 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 970 msec
OK
Time taken: 19.886 seconds
hive> select * from user_3g_age_area;
OK
6	孙权	wu	NULL
7	孙策	wu	NULL
8	孙尚香	wu	NULL
Time taken: 0.107 seconds, Fetched: 3 row(s)
Note the NULL values above: the partition directory is created as age=wu, but since age is an int column the string 'wu' cannot be parsed and reads back as NULL.

  • Incorrect version 2:
hive> insert into table user_3g_age_area 
partition(city='wu', age) 
select id, name, city, age -- every column, including the static partition column, was selected
from user_3g_age;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number/types are different 'age': Table insclause-0 has 3 columns, but query has 4 columns.
(5) View the partitions
hive> show partitions user_3g_age_area;
OK
city=wu/age=20
city=wu/age=28
city=wu/age=30
Time taken: 0.064 seconds, Fetched: 7 row(s)

1.2 Chinese Partition Values

If you run into problems with Chinese partition values, see the earlier post: Hive动态分区 (Hive Dynamic Partitioning).

The steps for Chinese partition values are identical, so rather than repeat the explanation, here is the full run:

# create the table
hive> create table user_xy_race(id int, name string, city string, race string) 
row format delimited 
fields terminated by '\t';
OK
Time taken: 0.04 seconds
# load the data
hive> load data local inpath '/root/book/xyuserrace.txt' overwrite into table user_xy_race;
Loading data to table test.user_xy_race
Table test.user_xy_race stats: [numFiles=1, numRows=0, totalSize=272, rawDataSize=0]
OK
Time taken: 0.558 seconds
# inspect the raw data
hive> select * from user_xy_race;
OK
1	孙悟空	花果山	妖
2	猪八戒	高老庄	妖
3	沙和尚	流沙河	妖
4	唐玄奘	西安	人
5	小白龙	清潭涧	仙
6	唐玄宗	西安	人
7	观世音	南海	佛
8	玉皇大帝	凌霄宝殿	仙
9	黄风怪	小雷音寺	妖
10	如来佛祖	雷音寺	佛
Time taken: 0.059 seconds, Fetched: 10 row(s)
# create the partition table
hive> create table user_xy_race_area(id int, name string) 
partitioned by(city string, race string) 
row format delimited 
fields terminated by '\t';
OK
Time taken: 0.059 seconds
# set the dynamic-partition parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
# semi-automatic dynamic-partition insert
hive> insert into table user_xy_race_area 
partition(city='西安',race) 
select id,name,race 
from user_xy_race 
where city='西安';

Query ID = root_20200617174813_5d9c2185-926d-4948-9894-96e4e1b30749
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0009, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0009/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592152196890_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:48:22,325 Stage-1 map = 0%,  reduce = 0%
2020-06-17 17:48:31,876 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.78 sec
MapReduce Total cumulative CPU time: 1 seconds 780 msec
Ended Job = job_1592152196890_0009
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_race_area/city=西安/.hive-staging_hive_2020-06-17_17-48-13_156_5420308159658982499-1/-ext-10000
Loading data to table test.user_xy_race_area partition (city=西安, race=null)
	 Time taken for load dynamic partitions : 93
	Loading partition {city=西安, race=人}
	 Time taken for adding to write entity : 0
Partition test.user_xy_race_area{city=西安, race=人} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.78 sec   HDFS Read: 4654 HDFS Write: 124 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 780 msec
OK
Time taken: 20.209 seconds
# view the data
hive> select * from user_xy_race_area;
OK
4	唐玄奘	西安	人
6	唐玄宗	西安	人
Time taken: 0.063 seconds, Fetched: 2 row(s)
# view the partitions
hive> show partitions user_xy_race_area;
OK
city�/race=�
Time taken: 0.098 seconds, Fetched: 1 row(s)

Chinese partition names show up garbled in Hive's show partitions output, but the partitions themselves still work normally. MySQL can display the same kind of mojibake, but there the client encoding can be fixed with set names.
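If the garbled display itself matters, the commonly cited fix is to switch the relevant Hive metastore columns in MySQL to utf8. A sketch, assuming a MySQL-backed metastore with the default schema (verify table and column names against your metastore version before running):

```sql
-- Run against the metastore database in MySQL (often named hive or metastore).
alter table PARTITIONS modify column PART_NAME varchar(767) character set utf8;
alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
alter table PARTITION_KEY_VALS modify column PART_KEY_VAL varchar(256) character set utf8;
```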

2. Fully Automatic Partitioning

2.1 English Partition Values

Fully automatic partitioning reuses the semi-automatic data set. The steps:

# create the table
hive> create table user_3g_area_all (id int, name string) 
partitioned by (city string, age string) -- the age type is deliberately wrong here, to demonstrate fixing it
row format delimited 
fields terminated by '\t';
OK
Time taken: 0.17 seconds

# fix the partition column type
hive> alter table user_3g_area_all partition column (age int);
OK
Time taken: 0.097 seconds

# set the parameters
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

# run the fully automatic insert
hive> insert into table user_3g_area_all 
partition(city,age) 
select id, name, city, age -- every column must be listed here
from user_3g_age;

Query ID = root_20200617171344_b3dae4d6-e248-440b-b977-2db4e191e298
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0008, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0008/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592152196890_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:13:54,523 Stage-1 map = 0%,  reduce = 0%
2020-06-17 17:14:03,066 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.4 sec
MapReduce Total cumulative CPU time: 1 seconds 400 msec
Ended Job = job_1592152196890_0008
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_3g_area_all/.hive-staging_hive_2020-06-17_17-13-44_647_9180685517392249162-1/-ext-10000
Loading data to table test.user_3g_area_all partition (city=null, age=null)
	 Time taken for load dynamic partitions : 632
	Loading partition {city=wu, age=28}
	Loading partition {city=shu, age=28}
	Loading partition {city=shu, age=29}
	Loading partition {city=wei, age=32}
	Loading partition {city=wu, age=20}
	Loading partition {city=wei, age=30}
	Loading partition {city=wu, age=30}
	Loading partition {city=shu, age=30}
	Loading partition {city=wei, age=5}
	 Time taken for adding to write entity : 3
Partition test.user_3g_area_all{city=shu, age=28} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=shu, age=29} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=shu, age=30} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wei, age=30} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_3g_area_all{city=wei, age=32} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wei, age=5} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wu, age=20} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_3g_area_all{city=wu, age=28} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
Partition test.user_3g_area_all{city=wu, age=30} stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.4 sec   HDFS Read: 4247 HDFS Write: 624 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 400 msec
OK
Time taken: 22.086 seconds

# view the data
hive> select * from user_3g_area_all ;
OK
3	张飞	shu	28
2	关羽	shu	29
1	刘备	shu	30
9	夏侯惇	wei	30
4	曹操	wei	32
5	曹丕	wei	5
8	孙尚香	wu	20
7	孙策	wu	28
6	孙权	wu	30
Time taken: 0.125 seconds, Fetched: 9 row(s)

If a column is missing from the select list, the insert fails:

hive> insert into table user_3g_area_all partition(city,age) select id, name, age from user_3g_age;
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target table because column number/types are different 'age': Table insclause-0 has 4 columns, but query has 3 columns.

View the partitions:

hive> show partitions user_3g_area_all;
OK
city=shu/age=28
city=shu/age=29
city=shu/age=30
city=wei/age=30
city=wei/age=32
city=wei/age=5
city=wu/age=20
city=wu/age=28
city=wu/age=30
Time taken: 0.097 seconds, Fetched: 9 row(s)
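A fully automatic insert can create many partitions at once, so Hive guards it with limits. If a large insert fails with a "too many dynamic partitions" error, these are the relevant settings (the values shown are the usual defaults; check your version):

```sql
set hive.exec.max.dynamic.partitions=1000;        -- max dynamic partitions per statement
set hive.exec.max.dynamic.partitions.pernode=100; -- max per mapper/reducer task
set hive.exec.max.created.files=100000;           -- max total files the job may create
```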

2.2 Chinese Partition Values

The process matches the English case, so here are just the commands:

# create the partition table
hive> create table user_xy_area_all (id int, name string) 
partitioned by(city string, race string) 
row format delimited 
fields terminated by '\t';
OK
Time taken: 0.059 seconds

# set the parameters
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;

# run the insert
hive> insert into table user_xy_area_all 
partition(city,race) 
select id,name,city,race 
from user_xy_race;

Query ID = root_20200617175222_16bf2b69-361f-4df3-939f-78c3bb8f7ef5
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1592152196890_0010, Tracking URL = http://hadoop:8088/proxy/application_1592152196890_0010/
Kill Command = /home/software/hadoop-2.7.1/bin/hadoop job  -kill job_1592152196890_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-06-17 17:52:30,712 Stage-1 map = 0%,  reduce = 0%
2020-06-17 17:52:39,123 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.39 sec
MapReduce Total cumulative CPU time: 1 seconds 390 msec
Ended Job = job_1592152196890_0010
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop:9000/user/hive/warehouse/test.db/user_xy_area_all/.hive-staging_hive_2020-06-17_17-52-22_348_8759507579611250211-1/-ext-10000
Loading data to table test.user_xy_area_all partition (city=null, race=null)
	 Time taken for load dynamic partitions : 592
	Loading partition {city=花果山, race=妖}
	Loading partition {city=西安, race=人}
	Loading partition {city=雷音寺, race=佛}
	Loading partition {city=凌霄宝殿, race=仙}
	Loading partition {city=高老庄, race=妖}
	Loading partition {city=流沙河, race=妖}
	Loading partition {city=清潭涧, race=仙}
	Loading partition {city=南海, race=佛}
	Loading partition {city=小雷音寺, race=妖}
	 Time taken for adding to write entity : 4
Partition test.user_xy_area_all{city=凌霄宝殿, race=仙} stats: [numFiles=1, numRows=1, totalSize=15, rawDataSize=14]
Partition test.user_xy_area_all{city=南海, race=佛} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=小雷音寺, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=流沙河, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=清潭涧, race=仙} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=花果山, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
Partition test.user_xy_area_all{city=西安, race=人} stats: [numFiles=1, numRows=2, totalSize=24, rawDataSize=22]
Partition test.user_xy_area_all{city=雷音寺, race=佛} stats: [numFiles=1, numRows=1, totalSize=16, rawDataSize=15]
Partition test.user_xy_area_all{city=高老庄, race=妖} stats: [numFiles=1, numRows=1, totalSize=12, rawDataSize=11]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.39 sec   HDFS Read: 4321 HDFS Write: 749 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 390 msec
OK
Time taken: 19.139 seconds

# view the data
hive> select * from user_xy_area_all;
OK
8	玉皇大帝	凌霄宝殿	仙
7	观世音	南海	佛
9	黄风怪	小雷音寺	妖
3	沙和尚	流沙河	妖
5	小白龙	清潭涧	仙
1	孙悟空	花果山	妖
4	唐玄奘	西安	人
6	唐玄宗	西安	人
10	如来佛祖	雷音寺	佛
2	猪八戒	高老庄	妖
Time taken: 0.063 seconds, Fetched: 10 row(s)

# view the partitions
hive> show partitions user_xy_area_all;
OK
city=� ��/race=�
city=Ww/race=[
city=���/race=�
city=A��/race=�
city=m�/race=�
city=��q/race=�
city�/race=�
city=���/race=[
city=� �/race=�
Time taken: 0.062 seconds, Fetched: 9 row(s)

That is the complete walkthrough.
