深入理解Hive【Hive架构介绍、安装配置、Hive语法介绍】

一、Hive的产生背景

1、MapReduce编程十分繁琐

2、传统RDBMS人员的需要

Hive由Facebook开源项目：

1、用于解决海量结构化日志的数据统计问题

2、构建在Hadoop之上的数据仓库

3、Hive提供SQL查询语言：HQL

4、底层支持多种不同的执行引擎【MR/Tez/Spark】1.x默认为MR 2.x默认为Spark 当然也可以设置为Tez

5、Hive提供了统一的元数据管理，因Hive数据是存放在HDFS中的，而HDFS不存在schema【HDFS只是普通的文本文件，无法确定字段的含义】，元数据信息一般存放在MySQL中

二、Hive的体系架构

最上面的一层shell、Thrift/JDBC(server/jdbc)都是client，另外还包括WebUI(HUE/Zeppelin)等

Metastore(元数据): ———— 在生产环境中通常存在MySQL中

对于database: name、location、owner等

对于table: name 、 location 、owner 、column name/type ...

实际上HQL语句经过Driver驱动，SQL Parser(SQL 分析/解析器) 生成语法树，Query Optimizer(查询优化器)进行优化，选择最优的执行计划，最后生成物理计划(Physical Plan)、经过序列化与反序列化，UDF(用户定义函数)，虽Hive提供了很多内置函数，但在实际工作中可能不足以应付，那么用户就可以自定义函数，最终执行(Execution)，Execution过程转换为MapReduce作业。【整个过程Hive就是将HQL生成MapReduce】，将作业提交到HDFS/HBase中运行。

三、Hive部署架构

四、Hive安装配置简单介绍

(1)从官网下载hive安装包，推荐使用Hive-1.2.1【因为Hive1.x底层是MapReduce，自Hive2.x后改为Spark】

(2)将Hive-1.2.1导入到服务器，进入/hive-1.2.1/conf文件夹中，里面有个hive-default.xml.template文件，里面是hive的默认配置信息。

(3)由于hive的metastore存储在MySQL中，那么hive所在的服务器怎么知道你连接哪个MySQL服务器呢？那么就需要手动配置一下MySQL相关信息，所以在hive-1.2.1/conf下创建一个hive-site.xml，用于配置数据库MySQL相关信息，该文件会覆盖hive-default.xml.template中的相关配置。

hive-site.xml：

代码语言：javascript复制

<configuration>
  <property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:mysql://hdp-03:3306/hive?createDatabaseIfNotExist=true</value>
	<description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>com.mysql.jdbc.Driver</value>
	<description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
	<name>javax.jdo.option.ConnectionUserName</name>
	<value>root</value>
	<description>username to use against metastore database</description>
  </property>

  <property>
	<name>javax.jdo.option.ConnectionPassword</name>
	<value>root</value>
	<description>password to use against metastore database</description>
  </property>
</configuration>

(4)hive服务器默认不带mysql驱动包，所以将mysql-connector-java-5.1.39.jar 上传到hive-1.2.1/lib包下

(5)配置HADOOP_HOME和HIVE_HOME到环境变量中 : vi /etc/profile ------> source /etc/profile

(6)hive启动测试直接输入hive即可出现 hive> (交互性界面)

设置一些基本参数，让hive使用起来更便捷，比如：

让提示符显示当前库：

代码语言：javascript复制

hive>set hive.cli.print.current.db=true;

显示查询结果时显示字段名称：

代码语言：javascript复制

hive>set hive.cli.print.header=true;

但是这样设置只对当前会话有效，重启hive会话后就失效，解决办法：

在linux的当前用户目录【root用户为/root下】中，编辑一个.hiverc文件，将参数写入其中：

代码语言：javascript复制

vi .hiverc

代码语言：javascript复制

set hive.cli.print.header=true;
set hive.cli.print.current.db=true;

五、启动hive服务/客户端连接hive

5.1 启动hive服务

启动hive的服务：

代码语言：javascript复制

[root@hdp-02 hive-1.2.1]# bin/hiveserver2 -hiveconf hive.root.logger=DEBUG,console

上述启动，会将这个服务启动在前台，如果要启动在后台，则命令如下：

代码语言：javascript复制

[root@hdp-02 hive-1.2.1]# nohup bin/hiveserver2 1>/dev/null 2>&1 &

含义：不挂断后台运行 bin/hiveserver2程序标准输出到/dev/null (LInux下的黑洞，指不输出) 错误输出重定向到标准输出

科普：

nohup 是 no hang up 的缩写，就是不挂断的意思。

nohup命令：如果你正在运行一个进程，而且你觉得在退出帐户时该进程还不会结束，那么可以使用nohup命令。该命令可以在你退出帐户/关闭终端之后继续运行相应的进程。

& ：指在后台运行

例子：

代码语言：javascript复制

nohup command > myout.file 2>&1 &

在上面的例子中最后有一个& 指后台运行，0 – stdin (standard input)，1 – stdout (standard output)，2 – stderr (standard error)

2>&1是将标准错误（2）重定向到标准输出（&1），标准输出（&1）再被重定向输入到myout.file文件中。

5.2 hive客户端连接

启动成功后，可以在别的节点上用beeline去连接

方式1:

代码语言：javascript复制

[root@hdp-02 hive-1.2.1]# bin/beeline

回车，进入beeline的命令界面，输入命令连接hiveserver2

代码语言：javascript复制

beeline> !connect  jdbc:hive2//hdp-02:10000

（hdp-02是hiveserver2所启动的那台主机名，端口默认是10000）

方式2:

启动时直接连接：

代码语言：javascript复制

bin/beeline -u jdbc:hive2://hdp-02:10000 -n root

接下来就可以做正常sql查询了

六、脚本化运行【生产环境】

大量的hive查询任务，如果用交互式shell来进行输入的话，显然效率及其低下，因此，生产中更多的是使用脚本化运行机制：

该机制的核心点是：hive可以用一次性命令的方式来执行给定的hql语句

代码语言：javascript复制

[root@hdp-02 ~]#  hive -e "insert into table t_dest select * from t_src;"

然后，进一步，可以将上述命令写入shell脚本中，以便于脚本化运行hive任务，并控制、调度众多hive任务，示例如下：

书写shell脚本， vi t_order_etl.sh

代码语言：javascript复制

#!/bin/bash
hive -e "select * from db_order.t_order"
hive -e "select * from default.t_user"
hql="create table  default.t_bash as select * from db_order.t_order"
hive -e "$hql"

直接执行sh文件即可如 ./ t_order_etl.sh

【常见做法】

如果要执行的hql语句特别复杂，那么，可以把hql语句写入一个文件： vi x.hql

代码语言：javascript复制

select * from db_order.t_order;
select count(1) from db_order.t_user;

然后，用hive -f /root/x.hql 来执行

七、hive建库建表与数据导入

7.1 建库

hive中有一个默认的库：

库名：default

库目录：hdfs://hdp-02:9000/user/hive/warehouse

新建库：

代码语言：javascript复制

create database db_order;

库建好后，在hdfs中会生成一个库目录：

hdfs://hdp-02:9000/user/hive/warehouse/db_order.db

7.2 建表

7.2.1 基本建表语句

代码语言：javascript复制

use db_order;

代码语言：javascript复制

create table t_order(id string,create_time string,amount float,uid string);

表建好后，会在所属的库目录中生成一个表目录

hdfs://hdp-02:9000/user/hive/warehouse/db_order.db/t_order

只是，这样建表的话，hive会认为表数据文件中的字段分隔符为^A (对应键盘control V control A)

正确的建表语句为：

代码语言：javascript复制

create table t_order(id string,create_time string,amount float,uid string)

row format delimited

fields terminated by ',';

这样就指定了，我们的表数据文件中的字段分隔符为","

7.2.2 删除表

代码语言：javascript复制

drop table t_order;

删除表的效果是：

hive会从元数据库中清除关于这个表的信息；

hive还会从hdfs中删除这个表的表目录；

7.2.3 内部表与外部表

内部表(MANAGED_TABLE)：表目录按照hive的规范来部署，位于hive的仓库目录/user/hive/warehouse中

外部表(EXTERNAL_TABLE)：表目录由建表用户自己指定[如我们采集到的日志在/log/2019-04-09中]，想实现该文件与hive的映射关系，则无需将日志文件移动到/user/hive/warehouse中，一是为了方便，二是担心因为移动文件而对外部程序造成影响。

create external table t_access(ip string,url string,access_time string) row format delimited fields terminated by ',' location '/log/2019-04-09';

外部表和内部表的特性差别：

内部表的目录在hive的仓库目录中，外部表的目录由用户指定
drop一个内部表时：hive会清除相关元数据，并删除表数据目录
drop一个外部表时：hive只会清除相关元数据；

一个hive的数据仓库，最底层的表，一定是来自于外部系统，为了不影响外部系统的工作逻辑，在hive中可建external表来映射这些外部系统产生的数据目录；然后，后续的ETL操作，产生的各种表建议用managed_table

7.2.4 分区表

分区表的实质是：在表目录中为数据文件创建分区子目录，以便于在查询时，MR程序可以针对分区子目录中的数据进行处理，缩减读取数据的范围。

比如，网站每天产生的浏览记录，浏览记录应该建一个表来存放，但是，有时候，我们可能只需要对某一天的浏览记录进行分析

这时，就可以将这个表建为分区表，每天的数据导入其中的一个分区

当然，每日的分区目录，应该有一个目录名（分区字段）

代码语言：javascript复制

/user/hive/warehouse/t_pv_log/day=2019-04-08/
			    /day=2019-04-09/

/user/hive/warehouse/t_buyer_log/city=beijing/
			    /city=shanghai/

这样的话，day=2019-04-08和day=2019-04-09都属于t_pv_log，在查询的时候可以按日期查，也可以根据t_pv_log一起查出来，十分方便。

7.2.4.1 一个分区字段的实例：

1、创建带分区的表

create table t_access(ip string,url string,access_time string) partitioned by(day string) row format delimited fields terminated by ',';

将来把数据向表中insert的时候，就需要指定一个day了，如day=2017-09-16，指定之后就插入到该目录。（plus：这个子目录day=2017-09-16并不是在建表时候就有的，而是在插入/导入数据时候才在HDFS中生成该目录的）

注意：分区字段不能是表定义中的已存在字段，否则会冲突，实际上分区字段是伪字段，在select查询时也会显示出来。

2、向分区中导入数据

代码语言：javascript复制

load data local inpath '/root/access.log.2019-04-08.log' into table t_access partition(day='20190408');

load data local inpath '/root/access.log.2019-04-09.log' into table t_access partition(day='20190409');

【注意点：local inpath是指hive服务端所在的机器的本地目录】，导入后发现/user/hive/warehouse/access.db/t_access下生成了对应的文件夹day='20190408和day='20190409，而day='20190408文件内部是我们上传的log日志文件

3、针对分区数据进行查询

a、统计4月8号的总PV：

代码语言：javascript复制

select count(*) from t_access where day='20190408';

实质：就是将分区字段当成表字段来用【实际上是伪字段】，就可以使用where子句指定分区了

b、统计表中所有数据总的PV：

代码语言：javascript复制

select count(*) from t_access;

实质：不指定分区条件即可

7.3 数据导入导出

方式1：导入数据的一种方式： 手动用hdfs命令，将文件放入表目录；

方式2：在hive的交互式shell中用hive命令来导入本地数据到表目录

代码语言：javascript复制

hive>load data local inpath '/root/order.data.2' into table t_order;

方式3：用hive命令导入hdfs中的数据文件到表目录

代码语言：javascript复制

hive>load data inpath '/access.log.2019-04-09.log' into table t_access partition(dt='20190409');

注意：导本地文件和导HDFS文件的区别： 本地文件导入表：复制 hdfs文件导入表：移动(实际上是移动到表所在文件夹内部)

将hive表中的数据导出到指定路径的文件

(1)将hive表中的数据导入HDFS的文件

insert overwrite directory '/root/access-data' row format delimited fields terminated by ',' select * from t_access;

(2)将hive表中的数据导入本地磁盘文件

insert overwrite local directory '/root/access-data' row format delimited fields terminated by ',' select * from t_access limit 100000;

7.4 hive文件格式

HIVE支持很多种文件格式：SEQUENCE FILE | TEXT FILE | PARQUET FILE | RC FILE

《Hive文件格式之textfile,sequencefile和rcfile的使用与区别详解》

create table t_pq(movie string,rate int) stored as textfile;

create table t_pq(movie string,rate int) stored as sequencefile;

create table t_pq(movie string,rate int) stored as parquetfile;

演示： 1、先建一个存储文本文件的表

代码语言：javascript复制

create table t_access_text(ip string, url string,access_time string)
row format delimited fields terminated by ','
stored as textfile;

导入文本数据到表中：

代码语言：javascript复制

load data local inpath '/root/access-data/000000_0' into table t_access_text;

2、建一个存储sequence file文件的表：

代码语言：javascript复制

create table t_access_seq(ip string,url string,access_time string)
stored as sequencefile;

从文本表中查询数据插入sequencefile表中，生成数据文件就是sequencefile格式的了：

代码语言：javascript复制

insert into t_access_seq
select * from t_access_text;

3、建一个存储parquet file文件的表：

代码语言：javascript复制

create table t_access_parq(ip string,url string,access_time string)
stored as parquetfile;

八 Hive数据类型

8.1 数字类型

TINYINT (1-byte signed integer, from -128 to 127) SMALLINT (2-byte signed integer, from -32,768 to 32,767) INT/INTEGER (4-byte signed integer, from -2,147,483,648 to 2,147,483,647) BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807) FLOAT (4-byte single precision floating point number) DOUBLE (8-byte double precision floating point number)

示例：

create table t_test(a string ,b int,c bigint,d float,e double,f tinyint,g smallint)

8.2 日期时间类型

TIMESTAMP (Note: Only available starting with Hive 0.8.0) DATE (Note: Only available starting with Hive 0.12.0)

示例，假如有以下数据文件：

代码语言：javascript复制

1,zhangsan,1985-06-30
2,lisi,1986-07-10
3,wangwu,1985-08-09

那么，就可以建一个表来对数据进行映射

create table t_customer(id int,name string,birthday date)

row format delimited fields terminated by ',';

然后导入数据

load data local inpath '/root/customer.dat' into table t_customer;

然后，就可以正确查询

8.3 字符串类型

STRING VARCHAR (Note: Only available starting with Hive 0.12.0) CHAR (Note: Only available starting with Hive 0.13.0)

8.4 混杂类型

BOOLEAN BINARY (Note: Only available starting with Hive 0.8.0)

8.5 复合类型

8.5.1 array数组类型

arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14)

示例：array类型的应用

假如有如下数据需要用hive的表去映射：

代码语言：javascript复制

战狼2,吴京:吴刚:龙母,2017-08-16
三生三世十里桃花,刘亦菲:痒痒,2017-08-20

设想：如果主演信息用一个数组来映射比较方便

建表：

代码语言：javascript复制

create table t_movie(moive_name string , actors array<string> , first_show date)

row format delimited 

fields terminated by ','

collection items terminated by ':';

导入数据：

代码语言：javascript复制

load data local inpath '/root/movie.dat' into table t_movie;

查询：

代码语言：javascript复制

select * from t_movie;

select moive_name,actors[0] from t_movie;

-- 使用array_containns(field,'keyword')看某字段是否包含keyword
select moive_name,actors from t_movie where array_contains(actors,'吴刚');

-- 求每部电影中包含多少位主演(actors是array类型,  size(field) 数组的长度函数)
select moive_name,size(actors) from t_movie;

8.5.2 map类型

maps: MAP<primitive_type, data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)

1、假如有以下数据：

代码语言：javascript复制

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:xiaolin#mother:ruhua#sister:xiaoniu,29
4,mayun,father:mababa#mother:xiaoqiang,26

可以用一个map类型来对上述数据中key-value的家庭成员进行描述

2、建表语句：

create table t_person(id int,name string,family_members map<string,string>,age int)

row format delimited fields terminated by ','

collection items terminated by '#'

map keys terminated by ':';

3、查询

select * from t_person;

## 取map字段的指定key的值(查出每个人的爸爸)

select id,name,family_members['father'] as father from t_person;

## 取map字段的所有key(查出每个人的亲属关系)

select id,name,map_keys(family_members) as relation from t_person;

## 取map字段的所有value(查出每个人亲人的名字)

select id,name,map_values(family_members) from t_person;

##查出每个人中亲人的数量(使用size() 函数)

select id,name,size(family_members) as relations,age from t_person;

## 综合：查询有brother的用户信息[谁有兄弟、兄弟是谁]

代码语言：javascript复制

-- 方式1
select id,name,father 
from 
(select id,name,family_members['brother'] as brother_name from t_person) tmp
where brother_name is not null;

--方式2
select id,name,age, famaily_members['brother'] 
from t_person where array_contains(map_key(famaily_members),'brother');

8.5.3 struct类型

structs: STRUCT<col_name : data_type, ...>

1、假如有如下数据：

代码语言：javascript复制

1,zhangsan,18:male:beijing
2,lisi,28:female:shanghai

其中的用户信息包含：年龄：整数，性别：字符串，地址：字符串

设想用一个字段来描述整个用户信息，可以采用struct

2、建表：

代码语言：javascript复制

create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)

row format delimited fields terminated by ','

collection items terminated by ':';

3、查询

代码语言：javascript复制

select * from t_person_struct;

-- 查询 id , name , 年龄
select id,name,info.age from t_person_struct;

8.6 修改表定义

仅修改Hive元数据，不会触动表中的数据，用户需要确定实际的数据布局符合元数据的定义。

修改表名：

代码语言：javascript复制

ALTER TABLE table_name RENAME TO new_table_name

示例：alter table t_1 rename to t_x;

修改分区名：

代码语言：javascript复制

alter table t_partition partition(department='xiangsheng',sex='male',howold=20) rename to partition(department='1',sex='1',howold=20);

添加分区：

代码语言：javascript复制

alter table t_partition add partition (department='2',sex='0',howold=40);

删除分区：

代码语言：javascript复制

alter table t_partition drop partition (department='2',sex='2',howold=24);

修改表的文件格式定义：

代码语言：javascript复制

ALTER TABLE table_name [PARTITION partitionSpec] SET FILEFORMAT file_format

ALTER TABLE t_partition partition(department='2',sex='0',howold=40 ) set fileformat sequencefile;

修改列名定义：

代码语言：javascript复制

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENTcol_comment] [FIRST|(AFTER column_name)]  

alter table t_user change price jiage float first;

增加/替换列：

代码语言：javascript复制

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type[COMMENT col_comment], ...)  

alter table t_user add columns (sex string,addr string);
alter table t_user replace columns (id string,age int,price float);

九、Hive查询语法

9.1 基本查询示例

代码语言：javascript复制

select * from t_access;

select count(*) from t_access;

select max(ip) from t_access;

9.2 条件查询

代码语言：javascript复制

select * from t_access where access_time < '2017-08-06 15:30:20'

select * from t_access where access_time < '2017-08-06 16:30:20' and ip>'192.168.33.3';

9.3 join关联查询示例

假如有a.txt文件

代码语言：javascript复制

a,1
b,2
c,3
d,4

假如有b.txt文件

代码语言：javascript复制

a,xx
b,yy
d,zz
e,pp

进行各种join查询：

1、inner join（join）

代码语言：javascript复制

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
join t_b b
on a.name=b.name

结果：

2、left outer join（left join）

代码语言：javascript复制

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
left outer join t_b b
on a.name=b.name

3、right outer join（right join）

代码语言：javascript复制

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
right outer join t_b b
on a.name=b.name

4、full outer join（full join）

代码语言：javascript复制

select 
a.name as aname,
a.numb as anumb,
b.name as bname,
b.nick as bnick
from t_a a
full join t_b b
on a.name=b.name;

5、left semi join(左半连接)

hive中不支持exist/IN子查询，可以用left semi join来实现同样的效果：

代码语言：javascript复制

select 
a.name as aname,
a.numb as anumb
from t_a a
left semi join t_b b
on a.name=b.name;

注意：left semi join的select子句中，不能有右表的字段，所以select b.*也没有用。。。

left semi join中，右表的引用不能出现在where条件中

9.4 group by分组聚合

1、建表映射上述数据【使用分区partition】

代码语言：javascript复制

create table t_access(ip string,url string,access_time string)
partitioned by (day string)
row format delimited fields terminated by ',';

2、书写测试数据

代码语言：javascript复制

//1、新建/root/hivetest/access.log.0804
192.168.33.3,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.3,http://www.edu360.cn/teach,2017-08-04 15:35:20
192.168.33.4,http://www.edu360.cn/stu,2017-08-04 15:30:20
192.168.33.4,http://www.edu360.cn/job,2017-08-04 16:30:20
192.168.33.5,http://www.edu360.cn/job,2017-08-04 15:40:20

//2、新建/root/hivetest/access.log.0805
192.168.33.3,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.44.3,http://www.edu360.cn/teach,2017-08-05 15:35:20
192.168.33.44,http://www.edu360.cn/stu,2017-08-05 15:30:20
192.168.33.46,http://www.edu360.cn/job,2017-08-05 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-05 15:40:20

//3、新建/root/hivetest/access.log.0806
192.168.133.3,http://www.edu360.cn/register,2017-08-06 15:30:20s
192.168.111.3,http://www.edu360.cn/register,2017-08-06 15:35:20
192.168.34.44,http://www.edu360.cn/pay,2017-08-06 15:30:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.46,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.25,http://www.edu360.cn/job,2017-08-06 15:40:20
192.168.33.36,http://www.edu360.cn/excersize,2017-08-06 16:30:20
192.168.33.55,http://www.edu360.cn/job,2017-08-06 15:40:20

3、导入数据

代码语言：javascript复制

-- 导入数据
load data local inpath '/root/hivetest/access.log.0804' into table t_access partition(day = '2017-08-04');
load data local inpath '/root/hivetest/access.log.0805' into table t_access partition(day = '2017-08-05');
load data local inpath '/root/hivetest/access.log.0806' into table t_access partition(day = '2017-08-06');

4、查看表的分区信息

代码语言：javascript复制

show partitions t_access;

5、执行SQL

代码语言：javascript复制

-- 题1:求8月4号以后,每天http://www.edu360.cn/job的总访问次数,以及访问者中ip地址最大的。(3种写法一样)
select day,max(url),count(1),max(ip)
from t_access
where url = 'http://www.edu360.cn/job'
group by day
having day > '2017-08-04';

select day,'http://www.edu360.cn/job',count(1),max(ip)
from t_access
where url = 'http://www.edu360.cn/job'
group by day
having day > '2017-08-04';

select day,url,count(1),max(ip)
from t_access
where url = 'http://www.edu360.cn/job'
group by day,url
having day > '2017-08-04';


-- 题2:求8月4号以后,每天每个页面的总访问次数,以及访问者中ip地址最大的
select day,url,count(1),max(ip)
from t_access where day > '2017-08-04'
group by day,url;

题1：

题2：

注意：一旦有group by子句，那么，在select子句中就不能有（分组字段，聚合函数）以外的字段

## 为什么where必须写在group by的前面，为什么group by后面的条件只能用having？

因为，where是用于在真正执行查询逻辑之前过滤数据用的

having是对group by分组聚合之后的结果进行再过滤；

语句的执行逻辑：

1、where过滤不满足条件的数据 2、用聚合函数和group by进行数据运算聚合，得到聚合结果 3、用having条件过滤掉聚合结果中不满足条件的数据

9.5 子查询

代码语言：javascript复制

-- 题3:求8月4号以后,每天每个页面的总访问次数,以及访问者中ip地址最大的,且总访问次数大于2的
select day,url,count(1) as count,max(ip)
from t_access where day > '2017-08-04'
group by day,url 
having count > 2;

-- 方式2 : 使用子查询(子查询出来的结果实际上就是一张"中间表",再通过条件过滤中间表的数据)
select day,url,count,max_ip
from
(select day,url,count(1) as count,max(ip) as max_ip
from t_access where day > '2017-08-04'
group by day,url) temp
where temp.count > 2;

十、hive函数使用

《hive函数》

10.1 类型转换函数

代码语言：javascript复制

-- 将字符串转int
select cast("5" as int); 
--将字符串转date  
select cast("2017-08-03" as date) ;
--将timestamp转date
select cast(current_timestamp as date);

10.2 数学运算函数

代码语言：javascript复制

select round(5.4);   -- 5.0
select round(5.1345,3);  -- 5.135
select ceil(5.4);  -- select ceiling(5.4);   ## 6
select floor(5.4)  -- 5
select abs(-5.4)  -- 5.4
select greatest(3,5,6)  -- 6 
select least(3,5,6) --3


select max(age) from t_person;   -- 聚合函数
select min(age) from t_person;   -- 聚合函数

10.3 字符串函数

substr(string, int start) ## 截取子串

substring(string, int start)

示例：select substr("abcdefg",2); //输出 bcdefg

substr(string, int start, int len)

substring(string, int start, int len)

示例：select substr("abcdefg",2,3); //输出bcd

concat(string A, string B...) ## 拼接字符串

concat_ws(string SEP, string A, string B...) ##SEP分隔符

示例：select concat("ab","xy");

select concat_ws(".","192","168","33","44"); //输出192.168.33.44

length(string A) #字符串长度

示例：select length("192.168.33.44");

split(string str, string pat) #字符串分割

示例：~~select split("192.168.33.44" , "." );~~ //错误，因为.号是正则语法中的特定字符，不报错，但无法出正确结果

select split("192.168.33.44","\."); //正确输出， ["192","168","2","1"]

select split("192.168.33.44","\.")[1]; //正确输出， 168

upper(string str) ##转大写

lower(string str) ##转小写

10.4 时间函数

select current_timestamp; ##查看当前时间戳

select current_date; ##查看当前日期

## 取当前时间的毫秒数时间戳

select unix_timestamp();

from_unixtime (bigint unixtime [, string format] )

示例：select from_unixtime(unix_timestamp());

select from_unixtime(unix_timestamp(),"yyyy/MM/dd HH:mm:ss");

## 字符串转unix 时间戳

unix_timestamp(string date, string pattern)

示例：select unix_timestamp("2019-04-10 02:50:30");

select unix_timestamp("2019/04/10 02:50:30","yyyy/MM/dd HH:mm:ss");

## 将字符串转成日期date

select to_date("2017-09-17 16:58:32");

10.5 表生成函数

10.5.1 行转列函数：explode(field)

假如有以下数据：

代码语言：javascript复制

1,zhangsan,化学:物理:数学:语文
2,lisi,化学:数学:生物:生理:卫生
3,wangwu,化学:语文:英语:体育:生物

映射成一张表：

代码语言：javascript复制

create table t_stu_subject(id int,name string,subjects array<string>)

row format delimited fields terminated by ','

collection items terminated by ':';

使用explode()对数组字段转为列

然后，我们利用这个explode的结果，来求去重的课程：

代码语言：javascript复制

select distinct tmp.sub
from 
(select explode(subjects) as sub from t_stu_subject) tmp;

10.5.2 表生成函数lateral view

代码语言：javascript复制

select id,name,tmp.sub 
from t_stu_subject lateral view explode(subjects) tmp as sub;

理解： lateral view 相当于两个表在join 左表：是原表 右表：是explode(某个集合字段)之后产生的表 而且：这个join只在同一行的数据间进行

那样，可以方便做更多的查询：比如，查询选修了生物课的同学

代码语言：javascript复制

select a.id,a.name,a.sub from 
(select id,name,tmp.sub as sub from t_stu_subject lateral view explode(subjects) tmp as sub) a
where sub='生物';

10.6 条件控制函数

10.6.1 case when

示例：

代码语言：javascript复制

select id,name,
case
when age<28 then 'youngth'
when age>27 and age<40 then 'zhongnian'
else 'old'
end
from t_user;

10.6.2 IF

select id,if(age>25,'working','worked') from t_user;

select moive_name,if(array_contains(actors,'吴刚'),'好电影','rom t_movie;

10.7 json解析函数：表生成函数

json_tuple函数

示例：

代码语言：javascript复制

-- movie、rate、timeStamp,uid都为json的key值,as表示生成的列命名
select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_rating_json;

产生结果：

代码语言：javascript复制

select * 
from t_rating_json
limit 10;

10.8 分析函数：row_number() over()——分组TOPN

有如下数据：

代码语言：javascript复制

1,18,a,male
2,19,b,male
3,22,c,female
4,16,d,female
5,30,e,male
6,26,f,female

代码语言：javascript复制

-- 创建表
create table t_rn(id int , age int , name string ,sec string)
row format delimited fields terminated by ',';

--导入数据
load data local inpath '/root/hivetest/t_rn.data' into table t_rn;

需要查询出每种性别中年龄最大的2条数据

思考：使用分组聚合函数group by只能产生一个结果，如最大、最小等

那么实现思路可以为：先分组 --> 排序 --> 标记序号 --> 书写过滤条件 where rn < 3 (组内序号小于3的，就是前两条)。

代码语言：javascript复制

-- 根据sex进行分组 order by age,根据年龄进行降序操作
select * from 
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rn
from t_rn) tmp
where rn < 3;

10.9 级联报表查询(窗口分析函数累计报表——sum-over() )

窗口分析函数 sum() over() ：可以实现在窗口内逐行累加报表

有如下数据：

代码语言：javascript复制

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
C,2015-01,10
C,2015-01,20
A,2015-02,4
A,2015-02,6
C,2015-02,30
C,2015-02,10
B,2015-02,10
B,2015-02,5
A,2015-03,14
A,2015-03,6
B,2015-03,20
B,2015-03,25
C,2015-03,10
C,2015-03,20

建表映射：

代码语言：javascript复制

create table t_access_times(username string,month string,counts int)
row format delimited fields terminated by ',';

需要要开发hql脚本，来统计出如下累计报表：

用户	月份	月总额	累计到当月的总额
A	2015-01	33	33
A	2015-02	10	43
A	2015-03	30	73
B	2015-01	30	30
B	2015-02	15	45

假设已存在前3列数据(用户uuid、月份month、月总额amount)在t_access_amount表中，那么累计到当月总额accumulate就为分组(A)，排序后(按月份升序排列)的前月累加总和，如1月总和 = 1月、2月总和 = 1月 2月、 3月综合 = 1月 2月 3月......

求每个人累计到当月的总额

书写sql语句：

代码语言：javascript复制

select * 
from 
(select id,age,name,sex,
row_number() over(partition by sex order by age desc) as rn
from t_rn) tmp
where rn < 3;

10.10 hive自定义函数

有如下json数据：rating.json

代码语言：javascript复制

{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}

建表映射上述数据

代码语言：javascript复制

create table t_ratingjson(json string);

load data local inpath '/root/hivetest/rating.json' into table t_ratingjson;

想把上面的原始数据变成如下形式：

代码语言：javascript复制

1193,5,978300760,1
661,3,978302109,1
914,3,978301968,1
3408,4,978300275,1

思路：如果能够定义一个json解析函数，则很方便了

代码语言：javascript复制

create table t_rate
as
select myjson(json,1) as movie,cast(myjson(json,2) as int) as rate,myjson(json,3) as ts,myjson(json,4) as uid from t_ratingjson;

解决： hive中如何定义自己的函数： 1、先写一个java类（extends UDF,重载方法public C evaluate(A a,B b)），实现你所想要的函数的功能（传入一个json字符串和一个脚标，返回一个值） C为hive中的返回值 , A、B为hive中的两个入参

代码语言：javascript复制

public class ParseJson extends UDF{
	
	// 重载 ：返回值类型 和参数类型及个数，完全由用户自己决定
	// 本处需求是：给一个字符串，返回一个数组
	public String[] evaluate(String json) {
		
		String[] split = json.split(""");
		String[] res = new String[]{split[3],split[7],split[11],split[15]};
		return res;
	}
}

2、将java程序打成jar包，上传到hive所在的机器 3、在hive命令行中将jar包添加到classpath ： hive> add jar /root/hivetest/myjson.jar; 4、在hive命令中用命令创建一个函数叫做myjson，关联你所写的这个java类 hive> create temporary function myjson as 'cn.itcats.hive.udf.MyJsonParser';

参考官方文档UDF: https://cwiki.apache.org/confluence/display/Hive/HivePlugins

十一、练习

11.1 用hql来做wordcount

有以下文本文件：

代码语言：javascript复制

hello tom hello jim
hello rose hello tom
tom love rose rose love jim
jim love tom love is what
what is love

需要用hive做wordcount

代码语言：javascript复制

-- 建表映射
create table t_wc(sentence string);

代码语言：javascript复制

-- 导入数据
load data local inpath '/root/hivetest/xx.txt' into table t_wc;

hql答案：

先使用split(sentence,' ') 按空格分割，返回数组

将数组expolode(array) 后将返回结果当做临时表，使用聚合分组得到结果

代码语言：javascript复制

SELECT word,count(1) as cnts
FROM (
    SELECT explode(split(sentence, ' ')) AS word
    FROM t_wc
    ) tmp
GROUP BY word
order by cnts desc;

javascript hive 大数据 node.js 数据库

0 人点赞