欢迎访问我的GitHub
这里分类和汇总了欣宸的全部原创(含配套源码):https://github.com/zq2599/blog_demos
关于hive
Hive是种基于Hadoop的数据仓库工具,将结构化的数据文件映射为一张数据库表,并提供类SQL查询功能。
环境信息
本文对应的环境信息如下:
- CentOS Linux release 7.5.1804
- JDK:1.8.0_191
- hadoop:2.7.7
- hive:1.2.2
hadoop的部署和启动
- hadoop环境的部署和启动请参考《Linux部署hadoop2.7.7集群》;
- 注意:确保环境变量中有HADOOP_HOME的配置;
安装和配置MySQL(5.7.27版本)
- MySQL用来存储元数据,我这里为了简化操作是在docker环境下部署的,一行命令即可:
docker run --name mysql -p 3306:3306 -e MYSQL_ROOT_PASSWORD=888888 -idt mysql:5.7.27
- 进入mysql容器:
docker exec -it mysql /bin/bash
- 进入容器后连接mysql,密码是888888:
mysql -h127.0.0.1 -uroot -p
- 新建名为hive的mysql账号:
CREATE USER 'hive' IDENTIFIED BY '888888';
- 给hive账号授权访问(并且hvie账号还有权给其他账号授权):
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' WITH GRANT OPTION;
- 刷新权限数据:
flush privileges;
- 在宿主机的终端执行以下命令重启mysql服务:
docker exec mysql service mysql restart
- 再次进入mysql容器,以hive账号的身份登录mysql:
mysql -uhive -p
- 创建名为hive的数据库:
CREATE DATABASE hive;
安装hive
- 去hive官网下载,地址是:http://mirror.bit.edu.cn/apache/hive/ ,选择合适的版本,如下图:
注意:接下来的操作用的账号都不是root,而是hadoop
- 在hadoop账号的家目录下解压刚刚下载的apache-hive-1.2.2-bin.tar.gz文件,是个名为apache-hive-1.2.2-bin的目录;
- 编辑hadoop账号的.bash_profile文件,增加一个环境变量,就是将刚刚解压出来的apache-hive-1.2.2-bin文件夹的完整路径:
export HIVE_HOME=/home/hadoop/apache-hive-1.2.2-bin
- 修改完毕后,重新打开一个ssh连接,或者执行source ~/.bash_profile让环境变量立即生效;
- 进入目录apache-hive-1.2.2-bin/conf/,用模板文件复制一份配置文件:
cp hive-default.xml.template hive-default.xml
- 在此目录创建名为hive-site.xml的文件,内容如下:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>888888</value>
</property>
</configuration>
- 将mysql的JDBC包放在此目录:/home/hadoop/apache-hive-1.2.2-bin/lib/,我这里用的是mysql-connector-java-5.1.47.jar,您可以在此下载:https://download.csdn.net/download/boling_cavalry/11834367
设置工作已经完成了,接下来是启动和初始化;
初始化和启动hive
- 进入目录apache-hive-1.2.2-bin/bin,执行以下命令初始化:
./schematool -initSchema -dbType mysql
操作成功后,控制台提示:
代码语言:txt复制[hadoop@node0 bin]$ ./schematool -initSchema -dbType mysql
Metastore connection URL: jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: hive
Starting metastore schema initialization to 1.2.0
Initialization script hive-schema-1.2.0.mysql.sql
Initialization script completed
schemaTool completed
- 在mysql上看一下,数据库hvie下建了多个表:
mysql> show tables;
---------------------------
| Tables_in_hive |
---------------------------
| BUCKETING_COLS |
| CDS |
| COLUMNS_V2 |
| COMPACTION_QUEUE |
| COMPLETED_TXN_COMPONENTS |
| DATABASE_PARAMS |
| DBS |
| DB_PRIVS |
| DELEGATION_TOKENS |
| FUNCS |
| FUNC_RU |
| GLOBAL_PRIVS |
| HIVE_LOCKS |
| IDXS |
| INDEX_PARAMS |
| MASTER_KEYS |
| NEXT_COMPACTION_QUEUE_ID |
| NEXT_LOCK_ID |
| NEXT_TXN_ID |
| NOTIFICATION_LOG |
| NOTIFICATION_SEQUENCE |
| NUCLEUS_TABLES |
| PARTITIONS |
| PARTITION_EVENTS |
| PARTITION_KEYS |
| PARTITION_KEY_VALS |
| PARTITION_PARAMS |
| PART_COL_PRIVS |
| PART_COL_STATS |
| PART_PRIVS |
| ROLES |
| ROLE_MAP |
| SDS |
| SD_PARAMS |
| SEQUENCE_TABLE |
| SERDES |
| SERDE_PARAMS |
| SKEWED_COL_NAMES |
| SKEWED_COL_VALUE_LOC_MAP |
| SKEWED_STRING_LIST |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES |
| SORT_COLS |
| TABLE_PARAMS |
| TAB_COL_STATS |
| TBLS |
| TBL_COL_PRIVS |
| TBL_PRIVS |
| TXNS |
| TXN_COMPONENTS |
| TYPES |
| TYPE_FIELDS |
| VERSION |
---------------------------
53 rows in set (0.00 sec)
- 在目录/home/hadoop/apache-hive-1.2.2-bin/bin执行命令./hive即可启动;
初始化和启动已经完成,接下来验证hive;
验证
- 前面执行./hive之后,已进入了对话模式,输入以下命令创建名为test001的数据库:
CREATE database test001;
- 选择该数据库:
use test001;
- 创建一个名为test_table的表:
create table test_table(
id INT,
word STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 't'
STORED AS TEXTFILE;
- 新建一个ssh连接,创建名为hive_test.txt的文本文件,内容如下:
1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff
- 回到和hive对话模式的控制台,输入以下命令,将上述文本文件的内容导入到test001.test_table表中:
LOAD DATA LOCAL INPATH '/home/hadoop/hive_test.txt' INTO TABLE test001.test_table;
控制台提示如下:
代码语言:txt复制hive> LOAD DATA LOCAL INPATH '/home/hadoop/hive_test.txt' INTO TABLE test001.test_table;
Loading data to table test001.test_table
Table test001.test_table stats: [numFiles=1, totalSize=36]
OK
Time taken: 0.264 seconds
- 执行select操作,可以看到数据已经全部入库:
hive> select * from test_table;
OK
1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff
Time taken: 0.453 seconds, Fetched: 6 row(s)
- 执行group by查询:
select word,count(word) from test_table GROUP BY word;
此时会启动一个job来完成上述查询,控制台输出如下:
代码语言:txt复制hive> select word,count(word) from test_table GROUP BY word;
Query ID = hadoop_20191007190528_3bd50401-267b-4d75-8b08-17ead5f0d790
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1570427946161_0002, Tracking URL = http://node0:8088/proxy/application_1570427946161_0002/
Kill Command = /home/hadoop/hadoop-2.7.7/bin/hadoop job -kill job_1570427946161_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-10-07 19:05:34,812 Stage-1 map = 0%, reduce = 0%
2019-10-07 19:05:39,991 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.29 sec
2019-10-07 19:05:46,201 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.23 sec
MapReduce Total cumulative CPU time: 3 seconds 230 msec
Ended Job = job_1570427946161_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.23 sec HDFS Read: 7000 HDFS Write: 36 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 230 msec
OK
aaa 1
bbb 1
ccc 1
ddd 1
eee 1
fff 1
Time taken: 18.614 seconds, Fetched: 6 row(s)
至此,hive的安装和体验实战就完成了,希望本文能给一起学习hive的读者们一些参考。
关于容器和镜像的环境
如果您不想自己搭建kubernetes环境,推荐使用腾讯云容器服务TKE:无需自建,即可在腾讯云上使用稳定, 安全,高效,灵活扩展的 Kubernetes 容器平台;
如果您希望自己的镜像可以通过外网上传和下载,推荐腾讯云容器镜像服务TCR:像数据加密存储,大镜像多节点快速分发,跨地域镜像同步
你不孤单,欣宸原创一路相伴
- Java系列
- Spring系列
- Docker系列
- kubernetes系列
- 数据库 中间件系列
- DevOps系列
欢迎关注公众号:程序员欣宸
微信搜索「程序员欣宸」,我是欣宸,期待与您一同畅游Java世界...