Hive根据用户自定义函数、reflect函数和窗口分析函数

2024-09-10 16:19:54 浏览数 (1)

Hive根据用户自定义函数类别分为以下三种:

(1)UDF(User-Defined-Function)

一进一出

(2)UDAF(User-Defined Aggregation Function)

聚集函数,多进一出

类似于:count/max/min

(3)UDTF(User-Defined Table-Generating Functions)

一进多出

如lateral view explore()

1. lateral view 、explode、reflect

1) 使用explode函数将hive表中的Map和Array字段数据进行拆分

需求

现在有数据格式如下

代码语言:javascript复制
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4

字段之间使用t分割,需求将所有的child进行拆开成为一列。

代码语言:javascript复制
 ---------- -- 
| mychild|
 ---------- -- 
| child1|
| child2|
| child3|
| child4|
| child5|
| child6|
| child7|
| child8|
 ---------- -- 

将map的key和value也进行拆开,成为如下结果

代码语言:javascript复制
 ---------------- ------------------- --  
| mymapkey| mymapvalue  |
 ---------------- ------------------- --  
| k1 | v1                     |
| k2 | v2                     |
| k3 | v3                     |
| k4 | v4                     |
 ---------------- ------------------- --  

实现

创建hive数据库

代码语言:javascript复制
hive (demo)> CREATE table demo1(
name STRING,
children array <STRING>,
address Map <STRING,STRING>)
row format delimited fields terminated by 't'
collection items terminated by ','
map keys terminated by ':'
stored as textFile;

加载数据

执行以下命令创建表数据文件

代码语言:javascript复制
#mkdir -p /export/servers/hivedatas/
#cd /export/servers/hivedatas/
#gedit maparray
代码语言:javascript复制
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4

hive表当中加载数据

代码语言:javascript复制
hive (demo)> LOAD DATA LOCAL INPATH '/home/jerry/hive/demo1' INTO TABLE demo1;

使用explode将hive当中数据拆开

将array当中的数据拆分开

代码语言:javascript复制
hive (demo)> SELECT explode(children) AS myChild FROM demo1;
mychild
child1
child2
child3
child4
child5
child6
child7
child8
Time taken: 1.187 seconds, Fetched: 8
row(s)

将map当中的数据拆分开

代码语言:javascript复制
hive(demo)> SELECT explode(address) AS (myMapKey, myMapValue) FROM demo1;
mymapkey mymapvalue
k1 v1
k2 v2
k3 v3
k4 v4
Time taken: 0.117 seconds, Fetched: 4
row(s)
hive (demo)>

2)使用explode拆分json字符串

需求

现在有一些数据格式如下:

代码语言:javascript复制
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

其中字段与字段之间的分隔符是 |

我们要解析得到所有的monthSales对应的值为以下这一列(行转列)

代码语言:javascript复制
7fresh 4900      1900      9.9
jd        2090      78981    9.8
jdmart 6987      1600      9.0

实现

创建hive表

代码语言:javascript复制
hive(demo)> CREATE TABLEdemo2(
area STRING,
goods_id STRING,
sale_info STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS textfile;

准备数据并加载数据

代码语言:javascript复制
#cd /export/servers/hivedatas
#gedit explode_json
代码语言:javascript复制
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

加载数据到hive表当中

代码语言:javascript复制
hive(demo)> LOAD DATA LOCAL INPATH '/home/jerry/hive/demo2' OVERWRITE INTO TABLE demo2;

使用explode拆分Array

代码语言:javascript复制
hive(demo)> SELECT explode(split(goods_id,',')) as goods_id FROM demo2;
OK
goods_id
1
2
3
4
5
6
7
8
9
Time taken: 0.087 seconds, Fetched: 9
row(s)

使用explode拆解Map

代码语言:javascript复制
hive (demo)> SELECT explode(split(area,',')) as area FROM demo2;
OK
area
a:shandong
b:beijing
c:hebei
Time taken: 0.077 seconds, Fetched: 3
row(s)

拆解json字段

代码语言:javascript复制
hive (demo)> SELECT explode(split(regexp_replace(regexp_replace(sale_info,'\[\{',''),'}]',''),'},\{')) as sale_info FROM demo2;
sale_info
"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"
"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"
"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"
Time taken: 0.082 seconds, Fetched: 3
row(s)

然后用get_json_object来获取key为monthSales的数据

代码语言:javascript复制
hive (demo)> SELECT get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\[\{',''),'}]',''),'},\{')),'$.monthSales') as sale_info FROM demo2;
FAILED: SemanticException [Error 10081]:
UDTF's are not supported outside the SELECT clause, nor nested in expressions

然后挂了FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions

UDTF(表生成函数(UDTF)) explode不能写在别的函数内

如果你这么写,想查两个字段

代码语言:javascript复制
SELECT explode(split(area,',')) as area,good_id FROM demo2;

会报错FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'

使用UDTF的时候,只支持一个字段,这时候就需要LATERAL VIEW出场了。

3) 配合LATERAL VIEW使用

配合lateral view查询多个字段

代码语言:javascript复制
hive (demo)> SELECT goods_id2,sale_info FROM demo2 LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2; 
goods_id2 sale_info
1[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
2[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
3[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
4[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
5[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
6[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
7[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
8[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
9[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]
Time taken: 0.075 seconds, Fetched: 9
row(s)

其中LATERAL VIEW explode(split(goods_id,','))goods相当于一个虚拟表,与原表explode_lateral_view笛卡尔积关联。

也可以多重使用

代码语言:javascript复制
hive (demo)> SELECT goods_id2,sale_info,area2 FROM demo2
LATERAL VIEW explode(split(goods_id,','))goods as goods_id2
LATERAL VIEW explode(split(area,',')) areaas area2;
goods_id2 sale_info area2
1[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}] a:shandong
1[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}] b:beijing
1[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}] c:hebei
9[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}] c:hebei
Time taken: 0.052 seconds, Fetched: 27row(s)
hive (demo)> SELECT
get_json_object(concat('{',sale_info_1,'}'),'$.source') as source,
get_json_object(concat('{',sale_info_1,'}'),'$.monthSales')
as monthSales,
get_json_object(concat('{',sale_info_1,'}'),'$.userCount')
as monthSales,
get_json_object(concat('{',sale_info_1,'}'),'$.score')
as monthSales
FROM demo2
LATERAL VIEW
explode(split(regexp_replace(regexp_replace(sale_info,'\[\{',''),'}]',''),'},\{'))sale_info
as sale_info_1;
OK
source monthsales    monthsales    monthsales
7fresh 4900      1900      9.9
jd 2090      78981    9.8
jdmart 6987      1600      9.0
Time taken: 0.05 seconds, Fetched: 3 row(s)

最终,我们可以通过下面的句子,把这个json格式的一行数据,完全转换成二维表的方式展现。

代码语言:javascript复制
hive (hive_explode)> SELECT get_json_object(concat('{',sale_info_1,'}'),'$.source') as source,
get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales,
get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as monthSales,
get_json_object(concat('{',sale_info_1,'}'),'$.score') as monthSales 
FROM demo2 
LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\[\{',''),'}]',''),'},\{'))sale_info as sale_info_1;
OK
source  monthsales  monthsales  monthsales
7fresh  4900  1900  9.9
jd  2090  78981  9.8
jdmart  6987  1600  9.0
Time taken: 0.05 seconds, Fetched: 3 row(s)

3)行转列

相关参数说明

  • CONCAT(string A/col, string B/col…):返回输入字符串连接后的结果,支持任意个输入字符串;
  • CONCAT_WS(separator, str1, str2,...):它是一个特殊形式的 CONCAT()。第一个参数剩余参数间的分隔符。分隔符可以是与剩余参数一样的字符串。如果分隔符是 NULL,返回值也将为 NULL。这个函数会跳过分隔符参数后的任何 NULL 和空字符串。分隔符将被加到被连接的字符串之间;
  • COLLECT_SET(col):函数只接受基本数据类型,它的主要作用是将某字段的值进行去重汇总,产生array类型字段。

需求

name

constellation

blood_type

孙悟空

白羊座

A

老王

射手座

A

宋宋

白羊座

B

猪八戒

白羊座

A

凤姐

射手座

A

把星座和血型一样的人归类到一起。结果如下:

代码语言:javascript复制
射手座,A            老王|凤姐
白羊座,A            孙悟空|猪八戒
白羊座,B            宋宋

创建本地constellation.txt,导入数据

执行以下命令创建文件,注意数据使用t进行分割

代码语言:javascript复制
#cd /export/servers/hivedatas
#gedit demo3
代码语言:javascript复制
孙悟空 白羊座 A
老王 射手座 A
宋宋 白羊座 B       
猪八戒 白羊座 A
凤姐 射手座 A

创建hive表并加载数据

代码语言:javascript复制
hive (demo)> CREATE TABLE demo3(name STRING,constellation STRING,blood_type STRING)row format delimited fields terminated by ",";

加载数据hive (demo)> LOAD DATA LOCAL INPATH '/home/jerry/hive/demo3' INTO TABLE demo3;

按需求查询数据

代码语言:javascript复制
hive (demo)> SELECT t1.base,concat_ws('|', collect_set(t1.name)) name FROM (SELECT
name,concat(constellation, "," , blood_type) base FROM demo3) t1
GROUP BY t1.base;
OK
t1.base name
射手座,A       老王|凤姐
白羊座,A       孙悟空|猪八戒
白羊座,B       宋宋
Time taken: 2.179 seconds, Fetched: 3
row(s)

4)列转行

所需函数:

  • EPLODE(col):将文件中的一列中复杂的array或者map结构拆分成多行。

LATERAL VIEW

用法:

代码语言:javascript复制
hive> LATERAL VIEW udtf(expression) tableAlias AS columnAlias

解释:用于和split, explode等UDTF一起使用,它能够将一列数据拆成多行数据,在此基础上可以对拆分后的数据进行聚合。

需求

代码语言:javascript复制
#cd /export/servers/hivedatas
#gedit movie.txt
代码语言:javascript复制
《疑犯追踪》 悬疑,动作,科幻,剧情
《Lie to me》 悬疑,警匪,动作,心理,剧情
《战狼2》 战争,动作,灾难
数据字段之间使用t进行分割
将电影分类中的数组数据展开。结果如下:
《疑犯追踪》 悬疑
《疑犯追踪》 动作
《疑犯追踪》 科幻
《疑犯追踪》 剧情
《Lie to me》 悬疑
《Lie to me》 警匪
《Lie to me》 动作
《Lie to me》 心理
《Lie to me》 剧情
《战狼2》 战争
《战狼2》 动作
《战狼2》 灾难

建立数据表

代码语言:javascript复制
hive (demo)> create table movie_info(
movie STRING,
category array<STRING>)
row format delimited fields terminated by
"t"
collection items terminated by
",";

加载数据

代码语言:javascript复制
hive (demo)> LOAD DATA LOCALINPATH"/home/jerry/hive/movie" INTO TABLE movie_info;

按需求查询数据

代码语言:javascript复制
hive (demo)> SELECT movie,category_name
FROM movie_info lateral view explode(category) table_tmp as category_name;
movie category_name
《疑犯追踪》      悬疑
《疑犯追踪》      动作
《疑犯追踪》      科幻
《疑犯追踪》      剧情
《Lie to me》       悬疑
《Lie to me》       警匪
《Lie to me》       动作
《Lie to me》       心理
《Lie to me》       剧情
《战狼2》    战争
《战狼2》    动作
《战狼2》    灾难
Time taken: 0.05 seconds, Fetched: 12
row(s)

5) reflect函数

reflect函数可以支持在SQL中调用java中的自带函数,秒杀一切udf函数。

需求1: 使用java.lang.Math当中的Max求两列中最大值。

创建hive表

代码语言:javascript复制
hive (demo)> CREATE TABLE test_udf(col1 int,col2 int) row format delimited fields terminated by ',';

准备数据并加载数据

代码语言:javascript复制
#cd /home/jerry/hive/
#gedit test_udf
代码语言:javascript复制
1,2
4,3
6,4
7,5
5,6

加载数据

代码语言:javascript复制
hive (demo)> LOAD DATA LOCAL INPATH '/home/jerry/hive/test_udf' OVERWRITE INTO TABLE test_udf;

执行查询

代码语言:javascript复制
hive (demo)> SELECT reflect("java.lang.Math","max",col1,col2) FROM test_udf;
_c0
2
4
6
7
6
Time taken: 0.075 seconds, Fetched: 5
row(s)

需求2: 文件中不同的记录来执行不同的java的内置函数

实现步骤:

创建hive表

代码语言:javascript复制
hive (demo)> CREATE TABLE test_udf2(class_name STRING,method_name STRING,col1 int , col2 int) row format delimited fields terminated by ',';

准备数据

代码语言:javascript复制
#cd /home/jerry/hive
#gedit test_udf2
代码语言:javascript复制
java.lang.Math,min,1,2
java.lang.Math,max,2,3

加载数据

代码语言:javascript复制
hive (demo)> LOAD DATA LOCAL INPAT '/home/jerry/hive/test_udf2' OVERWRITE INTO TABLE test_udf2;

执行查询

代码语言:javascript复制
hive (demo)> SELECT reflect(class_name,method_name,col1,col2) FROM test_udf2;
OK
_c0
1
3
Time taken: 0.072 seconds, Fetched: 2row(s)

2窗口函数

1)窗口函数SUM() 、AVG() 、MIN() 、MAX()

建立数据表语句

代码语言:javascript复制
hive>CREATE table demo4 (
cookieid string,
creatrtime string, --day
pv int
) row format delimited
fields terminated by ',';

准备数据

代码语言:javascript复制
cookie1,2024-04-10,1
cookie1,2024-04-11,5
cookie1,2024-04-12,7
cookie1,2024-04-13,3
cookie1,2024-04-14,2
cookie1,2024-04-15,4
cookie1,2024-04-16,4

加载数据

代码语言:javascript复制
hive>LOAD DATA LOCAL INPATH'/home/jerry/hive/demo4' into table demo4;

开启智能本地模式

代码语言:javascript复制
hive>set hive.exec.mode.local.auto=true;

SUM()函数和窗口函数的配合使用:结果和ORDER BY相关,默认为升序。

代码语言:javascript复制
hive>SELECT
cookieid,createtime,pv,sum(pv) over(PARTITION BY cookieid ORDER BY createtime)
as pv1 FROM demo4;
OK
cookieid createtime pv   pv1
cookie1 2024-04-10  1     1
cookie1 2024-04-11  5     6
cookie1 2024-04-12  7     13
cookie1 2024-04-13  3     16
cookie1 2024-04-14  2     18
cookie1 2024-04-15  4     22
cookie1 2024-04-16  4     26
Time taken: 1.444 seconds, Fetched: 7row(s)

pv1: 分组内从起点到当前行的pv累积,如,11日的pv1=10日的pv 11日的pv, 12日=10日 11日 12日。

代码语言:javascript复制
hive>SELECT cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid ORDER BY
createtime ROWS BETWEEN unbounded preceding and current row) as pv2
FROM demo4;
OK
cookieid createtime pv   pv2
cookie1 2024-04-10  1     1
cookie1 2024-04-11  5     6
cookie1 2024-04-12  7     13
cookie1 2024-04-13  3     16
cookie1 2024-04-14  2     18
cookie1 2024-04-15  4     22
cookie1 2024-04-16  4     26
Time taken: 3.307 seconds, Fetched: 7row(s)

pv2:同pv1

代码语言:javascript复制
hive>SELECT
cookieid,createtime,pv,sum(pv) over(PARTITION BY cookieid) as pv3 FROM demo4;
OK
cookieid createtime pv   pv3
cookie1 2024-04-16  4     26
cookie1 2024-04-15  4     26
cookie1 2024-04-14  2     26
cookie1 2024-04-13  3     26
cookie1 2024-04-12  7     26
cookie1 2024-04-11  5     26
cookie1 2024-04-10  1     26
Time taken: 1.333 seconds, Fetched: 7row(s)

pv3: 分组内(cookie1)所有的pv累加

代码语言:javascript复制
hive>SELECT cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid ORDER BY
createtime ROWS BETWEEN 3 preceding and current row) as pv4
FROM demo4;
OK
cookieid createtime pv   pv4
cookie1 2024-04-10  1     1
cookie1 2024-04-11  5     6
cookie1 2024-04-12  7     13
cookie1 2024-04-13  3     16
cookie1 2024-04-14  2     17
cookie1 2024-04-15  4     16
cookie1 2024-04-16  4     13
Time taken: 1.409 seconds, Fetched: 7row(s)

pv4: 分组内当前行 往前3行,如,11日=10日 11日, 12日=10日 11日 12日,13日=10日 11日 12日 13日, 14日=11日 12日 13日 14日

代码语言:javascript复制
hive>SELECT cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid ORDER BY
createtime ROWS BETWEEN 3 preceding and 1 following) as pv5
FROM demo4;
OK
cookieid createtime pv   pv5
cookie1 2024-04-10  1     6
cookie1 2024-04-11  5     13
cookie1 2024-04-12  7     16
cookie1 2024-04-13  3     18
cookie1 2024-04-14  2     21
cookie1 2024-04-15  4     20
cookie1 2024-04-16  4     13
Time taken: 1.476 seconds, Fetched: 7row(s)

pv5: 分组内当前行 往前3行 往后1行,如,14日=11日 12日 13日 14日 15日=5 7 3 2 4=21

代码语言:javascript复制
hive>SELECT cookieid,createtime,pv,
sum(pv) over(PARTITION BY cookieid ORDER BY
createtime ROWS BETWEEN current row and unbounded following) as pv6
FROM demo4;
cookieid createtime pv   pv6
cookie1 2024-04-10  1     26
cookie1 2024-04-11  5     25
cookie1 2024-04-12  7     20
cookie1 2024-04-13  3     13
cookie1 2024-04-14  2     10
cookie1 2024-04-15  4     8
cookie1 2024-04-16  4     4
Time taken: 1.408 seconds, Fetched: 7
row(s)

pv6: 分组内当前行 往后所有行,如,13日=13日 14日 15日 16日=3 2 4 4=13,14日=14日 15日 16日=2 4 4=10。

说明

  • 如果不指定ROWS BETWEEN,默认为从起点到当前行;
  • 如果不指定ORDER BY,则将分组内所有值累加。

关键是理解ROWS BETWEEN含义,也叫做window子句:

  • preceding:往前
  • following:往后
  • current row:当前行
  • unbounded:起点
  • unbounded preceding 表示从前面的起点
  • unbounded following:表示到后面的终点

AVG(),MIN(),MAX(),和SUM()用法一样。

2)窗口函数ROW_NUMBER()、RANK() 、DENSE_RANK() 、NTILE()

准备数据

代码语言:javascript复制
cookie1,2024-04-10,1
cookie1,2024-04-11,5
cookie1,2024-04-12,7
cookie1,2024-04-13,3
cookie1,2024-04-14,2
cookie1,2024-04-15,4
cookie1,2024-04-16,4
cookie2,2024-04-10,2
cookie2,2024-04-11,3
cookie2,2024-04-12,5
cookie2,2024-04-13,6
cookie2,2024-04-14,3
cookie2,2024-04-15,9
cookie2,2024-04-16,7

建立数据表

代码语言:javascript复制
hive>CREATE TABLE demo5 (
cookieid STRING,
createtime STRING, --day
pv INT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

加载数据:

代码语言:javascript复制
hive>LOAD DATA LOCAL INPATH '/home/jerry/hive/demo5' INTO TABLE demo5; 

ROW_NUMBER()使用

ROW_NUMBER()从1开始,按照顺序,生成分组内记录的序列。

代码语言:javascript复制
hive>SELECT cookieid,createtime,pv,ROW_NUMBER() OVER(PARTITON BYcookieid ORDER BYpv desc) AS rn FROM demo5;
cookieid createtime pv   rn
cookie1 2024-04-12  7     1
cookie1 2024-04-11  5     2
cookie1 2024-04-16  4     3
cookie1 2024-04-15  4     4
cookie1 2024-04-13  3     5
cookie1 2024-04-14  2     6
cookie1 2024-04-10  1     7
cookie2 2024-04-15  9     1
cookie2 2024-04-16  7     2
cookie2 2024-04-13  6     3
cookie2 2024-04-12  5     4
cookie2 2024-04-11  3     5
cookie2 2024-04-14  3     6
cookie2 2024-04-10  2     7
Time taken: 2.968 seconds, Fetched: 14row(s)

RANK() 和DENSE_RANK()使用

RANK() 生成数据项在分组中的排名,排名相等会在名次中留下空位 。

DENSE_RANK()生成数据项在分组中的排名,排名相等会在名次中不会留下空位。

代码语言:javascript复制
hive>SELECT
cookieid,createtime,pv,RANK() OVER(PARTITON BYcookieid ORDER BYpv desc) AS
rn1,DENSE_RANK() OVER(PARTITON BYcookieid ORDER BYpv desc) AS rn2,ROW_NUMBER()
OVER(PARTITON BYcookieid ORDER BYpv DESC) AS rn3 FROM demo5 WHERE cookieid =
'cookie1';
OK
cookieid createtime pv   rn1  rn2  rn3
cookie1 2024-04-12  7     1     1     1
cookie1 2024-04-11  5     2     2     2
cookie1 2024-04-16  4     3     3     3
cookie1 2024-04-15  4     3     3     4
cookie1 2024-04-13  3     5     4     5
cookie1 2024-04-14  2     6     5     6
cookie1 2024-04-10  1     7     6     7
Time taken: 3.388 seconds, Fetched: 7
row(s)

NTILE使用

有时会有这样的需求:如果数据排序后分为三部分,业务人员只关心其中的一部分,如何将这中间的三分之一数据拿出来呢?NTILE函数即可以满足。

NTILE可以看成是:把有序的数据集合平均分配到指定的数量(num)个桶中, 将桶号分配给每一行。如果不能平均分配,则优先分配较小编号的桶,并且各个桶中能放的行数最多相差1。

然后可以根据桶号,选取前或后n分之几的数据。数据会完整展示出来,只是给相应的数据打标签;具体要取几分之几的数据,需要再嵌套一层根据标签取出。

代码语言:javascript复制
hive>SELECT cookieid, hive>SELECT cookieid, createtime,pv,
NTILE(2) OVER(PARTITON BY cookieid ORDER BY
createtime) AS rn1,
NTILE(3) OVER(PARTITON BY cookieid ORDER BY
createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3
FROM demo5
ORDER BY cookieid,createtime;
OK
cookieid createtime pv   rn1  rn2  rn3
cookie1 2024-04-10  1     1     1     1
cookie1 2024-04-11  5     1     1     1
cookie1 2024-04-12  7     1     1     2
cookie1 2024-04-13  3     1     2     2
cookie1 2024-04-14  2     2     2     3
cookie1 2024-04-15  4     2     3     4
cookie1 2024-04-16  4     2     3     4
cookie2 2024-04-10  2     1     1     1
cookie2 2024-04-11  3     1     1     1
cookie2 2024-04-12  5     1     1     2
cookie2 2024-04-13  6     1     2     2
cookie2 2024-04-14  3     2     2     3
cookie2 2024-04-15  9     2     3     3
cookie2 2024-04-16  7     2     3     4
Time taken: 3.825 seconds, Fetched: 14row(s)

LAG使用

LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

代码语言:javascript复制
hive>SELECT cookieid,createtime,
ROW_NUMBER() OVER(PARTITON BYcookieid ORDER
BYcreatetime) AS rn,
LAG(createtime,1,'1970-01-01 00:00:00')
OVER(PARTITON BYcookieid ORDER BYcreatetime) AS last_1_time,
LAG(createtime,2) OVER(PARTITON BYcookieid
ORDER BYcreatetime) AS last_2_time
FROM demo5;
cookieid createtime rn    last_1_time           last_2_time
cookie1 2024-04-10  1     1970-01-0100:00:00  NULL
cookie1 2024-04-11  2     2024-04-10         NULL
cookie1 2024-04-12  3     2024-04-11         2024-04-10
cookie1 2024-04-13  4     2024-04-12         2024-04-11
cookie1 2024-04-14  5     2024-04-13         2024-04-12
cookie1 2024-04-15  6     2024-04-14         2024-04-13
cookie1 2024-04-16  7     2024-04-15         2024-04-14
cookie2 2024-04-10  1     1970-01-0100:00:00  NULL
cookie2 2024-04-11  2     2024-04-10         NULL
cookie2 2024-04-12  3     2024-04-11         2024-04-10
cookie2 2024-04-13  4     2024-04-12         2024-04-11
cookie2 2024-04-14  5     2024-04-13         2024-04-12
cookie2 2024-04-15  6     2024-04-14         2024-04-13
cookie2 2024-04-16  7     2024-04-15         2024-04-14
Time taken: 1.497 seconds, Fetched: 14row(s)

3)窗口函数: LAG(),LEAD(),FIRST_VALUE(),LAST_VALUE()

last_1_time: 指定了往上第1行的值,default为'1970-01-01 00:00:00'

cookie1第一行,往上1行为NULL,因此取默认值 1970-01-01 00:00:00

cookie1第三行,往上1行值为第二行值,2015-04-10 10:00:02

cookie1第六行,往上1行值为第五行值,2015-04-10 10:50:01

last_2_time: 指定了往上第2行的值,为指定默认值

cookie1第一行,往上2行为NULL

cookie1第二行,往上2行为NULL

cookie1第四行,往上2行为第二行值,2015-04-10 10:00:02

cookie1第七行,往上2行为第五行值,2015-04-10 10:50:0

LEAD

与LAG相反LEAD(col,n,DEFAULT)用于统计窗口内往下第n行值第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

代码语言:javascript复制
hive>SELECT cookieid,createtime,
ROW_NUMBER() OVER(PARTITON BY cookieid ORDER BY createtime) AS rn,
LEAD(createtime,1,'1970-01-01 00:00:00')
OVER(PARTITON BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITON BY
cookieid ORDER BY createtime) AS next_2_time
FROM demo5;
cookieid createtime rn    next_1_time   next_2_time
cookie1 2024-04-10  1     2024-04-11  2024-04-12
cookie1 2024-04-11  2     2024-04-12  2024-04-13
cookie1 2024-04-12  3     2024-04-13  2024-04-14
cookie1 2024-04-13  4     2024-04-14  2024-04-15
cookie1 2024-04-14  5     2024-04-15  2024-04-16
cookie1 2024-04-15  6     2024-04-16  NULL
cookie1 2024-04-16  7     1970-01-0100:00:00  NULL
cookie2 2024-04-10  1     2024-04-11  2024-04-12
cookie2 2024-04-11  2     2024-04-12  2024-04-13
cookie2 2024-04-12  3     2024-04-13  2024-04-14
cookie2 2024-04-13  4     2024-04-14  2024-04-15
cookie2 2024-04-14  5     2024-04-15  2024-04-16
cookie2 2024-04-15  6     2024-04-16  NULL
cookie2 2024-04-16  7     1970-01-0100:00:00  NULL
Time taken: 1.459 seconds, Fetched: 14row(s)

FIRST_VALUE

取分组内排序后,截止到当前行,第一个值

代码语言:javascript复制
hive>SELECT cookieid,createtime,pv,
ROW_NUMBER() OVER(PARTITON BY cookieid
ORDER BY createtime) AS rn,
FIRST_VALUE(pv) OVER(PARTITON BY cookieid
ORDER BY createtime) AS first1
FROM demo5;
OK
cookieid createtime rn    first1
cookie1 2024-04-10  1     1
cookie1 2024-04-11  2     1
cookie1 2024-04-12  3     1
cookie1 2024-04-13  4     1
cookie1 2024-04-14  5     1
cookie1 2024-04-15  6     1
cookie1 2024-04-16  7     1
cookie2 2024-04-10  1     2
cookie2 2024-04-11  2     2
cookie2 2024-04-12  3     2
cookie2 2024-04-13  4     2
cookie2 2024-04-14  5     2
cookie2 2024-04-15  6     2
cookie2 2024-04-16  7     2
Time taken: 1.407 seconds, Fetched: 14row(s)

LAST_VALUE

取分组内排序后,截止到当前行,最后一个值

代码语言:javascript复制
hive>SELECT cookieid,createtime,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER
BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER
BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid
ORDER BY createtime DESC) AS last2
FROM demo5
ORDER BY cookieid,createtime;
OK
cookieid createtime rn    last1 last2
cookie1 2024-04-10  1     1     4
cookie1 2024-04-11  2     5     4
cookie1 2024-04-12  3     7     4
cookie1 2024-04-13  4     3     4
cookie1 2024-04-14  5     2     4
cookie1 2024-04-15  6     4     4
cookie1 2024-04-16  7     4     4
cookie2 2024-04-10  1     2     7
cookie2 2024-04-11  2     3     7
cookie2 2024-04-12  3     5     7
cookie2 2024-04-13  4     6     7
cookie2 2024-04-14  5     3     7
cookie2 2024-04-15  6     9     7
cookie2 2024-04-16  7     7     7
Time taken: 3.947 seconds, Fetched: 14row(s)

如果想要取分组内排序后最后一个值,则需要变通一下:

代码语言:javascript复制
hive>SELECT cookieid,createtime,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER
BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER
BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid
ORDER BY createtime DESC) AS last2
FROM demo5
ORDER BY cookieid,createtime;

特别注意ORDER BY

如果不指定ORDER BY,则进行排序混乱,会出现错误的结果

代码语言:javascript复制
hive>SELECT cookieid,createtime,
FIRST_VALUE(url) OVER(PARTITION BY
cookieid) AS first2
FROM demo5;
OK
cookieid createtime first2
cookie1 2024-04-10  1
cookie1 2024-04-16  1
cookie1 2024-04-15  1
cookie1 2024-04-14  1
cookie1 2024-04-13  1
cookie1 2024-04-12  1
cookie1 2024-04-11  1
cookie2 2024-04-16  7
cookie2 2024-04-15  7
cookie2 2024-04-14  7
cookie2 2024-04-13  7
cookie2 2024-04-12  7
cookie2 2024-04-11  7
cookie2 2024-04-10  7
Time taken: 1.405 seconds, Fetched: 14row(s)

4)窗口函数: CUME_DIST(), PERCENT_RANK()

这两个序列分析函数不是很常用,注意:序列函数不支持WHERE子句

数据准备

代码语言:javascript复制
d1,user1,1000
d1,user2,2000
d1,user3,3000
d2,user4,4000
d2,user5,5000
代码语言:javascript复制
hive>CREATE EXTERNAL TABLE demo6(dept STRING,userid STRING,sal INT) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

加载数据:

代码语言:javascript复制
hive>LOAD DATA LOCAL INPATH '/home/jerry/hive/demo6' INTO TABLE demo6;

CUME_DIST

CUME_DIST和ORDER BY的排序顺序有关系

CUME_DIST 小于等于当前值的行数/分组内总行数 ORDER默认顺序 正序 升序,比如:统计小于等于当前薪水的人数,所占总人数的比例。

代码语言:javascript复制
hive>SELECT dept,userid,sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM
demo6;
OK
dept userid sal   rn1  rn2
d1 user1      1000      0.2  0.3333333333333333
d1 user2      2000      0.4  0.6666666666666666
d1 user3      3000      0.6  1.0
d2 user4      4000      0.8  0.5
d2 user5      5000      1.0  1.0
Time taken: 2.571 seconds, Fetched: 5row(s)

r rn1: 没有partition,所有数据均为1组,总行数为5,

第一行:小于等于1000的行数为1,因此,1/5=0.2

第二行:小于等于2000的行数为2,因此,2/5=0.4

第三行:小于等于3000的行数为3,因此,3/5=0.6

….

rn2:按照部门分组,dpet=d1的行数为3,

第二行:小于等于1000的行数为1,因此,1/3=0.33

第二行:小于等于2000的行数为2,因此,2/3=0.66

5)分析函数: GROUPING SETS(), GROUPING__ID, CUBE,ROLLUP

数据准备

代码语言:javascript复制
2024-03,2024-03-10,cookie1
2024-03,2024-03-10,cookie5
2024-03,2024-03-12,cookie7
2024-04,2024-04-12,cookie3
2024-04,2024-04-13,cookie2
2024-04,2024-04-13,cookie4
2024-04,2024-04-16,cookie4
2024-03,2024-03-10,cookie2
2024-03,2024-03-10,cookie3
2024-04,2024-04-12,cookie5
2024-04,2024-04-13,cookie6
2024-04,2024-04-15,cookie3
2024-04,2024-04-15,cookie2
2024-04,2024-04-16,cookie1

建立表

代码语言:javascript复制
hive>CREATE TABLE demo7 (
month STRING,
day STRING,
cookieid STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
stored as textfile;

加载数据:

代码语言:javascript复制
hive>LOAD DATA LOCAL INPATH'/home/jerry/hive/demo7'
into table demo7;

GROUPING SETS

grouping sets是一种将多个GROUP BY逻辑写在一个SQL语句中的便利写法。

等价于将不同维度的GROUP BY结果集进行UNION ALL。

代码语言:javascript复制
hive>SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING_ID
FROM demo7
GROUP BY month,day
GROUPING SETS (month,day)
ORDER BY GROUPING_ID;

grouping_id表示这一组结果属于哪个分组集合,

根据GROUPING SETS中的分组条件month,day,1是代表month,2是代表day

等价于

代码语言:javascript复制
hive>SELECT month,NULL,COUNT(DISTINCT
cookieid) AS uv,1 AS GROUPING_ID
FROM demo7
GROUP BY month
UNION ALL
SELECT NULL as month,day,COUNT(DISTINCT
cookieid) AS uv,2 AS GROUPING_ID
FROM demo7
GROUP BY day;
代码语言:javascript复制
hive>SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING_ID
FROM demo7
GROUP BY month,day
GROUPING SETS (month,day,(month,day))
ORDER BY GROUPING_ID;

等价于

代码语言:javascript复制
hive>SELECT month,NULL,COUNT(DISTINCT
cookieid) AS uv,1 AS GROUPING_ID FROM test_t5 GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT cookieid) AS
uv,2 AS GROUPING_ID
FROM demo7 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid)
AS uv,3 AS GROUPING_ID
FROM demo7 GROUP BY month,day;

CUBE

根据GROUP BY的维度的所有组合进行聚合。

代码语言:javascript复制
hive>SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING_ID
FROM demo7
GROUP BY month,day
WITH CUBE
ORDER BY GROUPING_ID;
month day         uv   grouping__id
2024-03 2024-03-10 4     0
2024-04 2024-04-16 2     0
2024-04 2024-04-13 3     0
2024-04 2024-04-12 2     0
2024-04 2024-04-15 2     0
2024-03 2024-03-12 1     0
2024-03 NULL 5     1
2024-04 NULL 6     1
NULL 2024-04-16  2     2
NULL 2024-04-15  2     2
NULL 2024-04-13  3     2
NULL 2024-04-12  2     2
NULL 2024-03-12  1     2
NULL 2024-03-10  4     2
NULL NULL             7     3

等价于

代码语言:javascript复制
hive>SELECT NULL,NULL,COUNT(DISTINCT
cookieid) AS uv,0 AS GROUPING_ID
FROM demo7
UNION ALL
SELECT month,NULL,COUNT(DISTINCT cookieid) AS uv,1 AS GROUPING_ID
FROM demo7
GROUP BY month
UNION ALL 
SELECT NULL,day,COUNT(DISTINCT cookieid) AS
uv,2 AS GROUPING_ID
FROM demo7 GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT cookieid)
AS uv,3 AS GROUPING_ID
FROM demo7 GROUP BY month,day;

ROLLUP

是CUBE的子集,以最左侧的维度为主,从该维度进行层级聚合。

比如,以month维度进行层级聚合:

代码语言:javascript复制
hive>SELECT month,day,
COUNT(DISTINCT cookieid) AS uv,
GROUPING_ID
FROM test_t5
GROUP BY month,day
WITH ROLLUP
ORDER BY GROUPING_ID;
month day  uv   grouping__id
2024-04 2024-04-16 2     0
2024-04 2024-04-15 2     0
2024-04 2024-04-13 3     0
2024-04 2024-04-12 2     0
2024-03 2024-03-12 1     0
2024-03 2024-03-10 4     0
2024-04 NULL 6     1
2024-03 NULL 5     1
NULL NULL      7     3
Time taken: 2.652 seconds, Fetched: 9
row(s)

把month和day调换顺序,则以day维度进行层级聚合:

代码语言:javascript复制
hive>SELECT month,day,COUNT(DISTINCT
cookieid) AS uv,GROUPING__ID FROM demo7
GROUP BY month,day WITH ROLLUP  ORDER
BYGROUPING__ID;
month day  uv   grouping__id
2024-04 2024-04-16 2     0
2024-04 2024-04-15 2     0
2024-04 2024-04-13 3     0
2024-04 2024-04-12 2     0
2024-03 2024-03-12 1     0
2024-03 2024-03-10 4     0
2024-04 NULL 6     1
2024-03 NULL 5     1
NULL NULL      7     3
Time taken: 2.652 seconds, Fetched: 9
row(s)

等价于

代码语言:javascript复制
hive>SELECT day,month,
COUNT(DISTINCT cookieid) AS uv,
GROUPING_ID
FROM test_t5
GROUP BY day,month
WITH ROLLUP
ORDER BY GROUPING_ID;

这里,根据天和月进行聚合,和根据天聚合结果一样,因为有父子关系,如果是其他维度组合的话,就会不一样.

把month和day调换顺序,则以day维度进行层级聚合.

0 人点赞