Preface
Hive is a data warehouse tool built on top of Hadoop. It maps structured data files to database tables and provides a simple SQL-like query capability, translating SQL statements into MapReduce jobs for execution.
Its main advantage is a low learning curve: simple MapReduce-based statistics can be produced quickly with SQL-like statements.
Installing MySQL
https://www.psvmc.cn/article/2018-06-25-mysql-tips.html
Installation
curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum makecache
yum install wget
wget https://dev.mysql.com/get/mysql57-community-release-el7-10.noarch.rpm
yum localinstall mysql57-community-release-el7-10.noarch.rpm
yum repolist enabled | grep mysql
Install the MySQL server:
yum install -y mysql-community-server --nogpgcheck
Start MySQL:
service mysqld start
Check whether MySQL starts on boot, and enable it:
chkconfig --list | grep mysqld
chkconfig mysqld on
Configuration changes
Table names / character encoding / connection limit / packet size
Make table names case-insensitive, set the server character set, and raise the connection limit and packet size.
Edit /etc/my.cnf
vi /etc/my.cnf
Add the following settings under the [mysqld] section:
[mysqld]
lower_case_table_names=1
character_set_server = utf8
max_connections = 1000
max_allowed_packet = 100M
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
Restart
service mysqld restart
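After the restart, the new settings can be verified from the mysql client (a quick check using standard server variables):
SHOW VARIABLES LIKE 'lower_case_table_names';
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'max_connections';
SHOW VARIABLES LIKE 'max_allowed_packet';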
Changing the password
Default password
Open the log file
vi /var/log/mysqld.log
Find the temporary password, e.g.:
[Note] A temporary password is generated for root@localhost: tr&:DeYtt8 k
Change the password:
MySQL 5.7 recommends changing passwords with ALTER USER. The main ways to change a user's password in MySQL are ALTER USER and updating the user table directly.
ALTER USER
This is the recommended approach.
mysql -uroot -p
After connecting, run:
set global validate_password_policy=0;
set global validate_password_length=1;
alter user user() identified by "psvmc123";
flush privileges;
quit
Allow remote login
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'psvmc123' WITH GRANT OPTION;
FLUSH PRIVILEGES;
quit
Set the password to never expire
ALTER USER 'root'@'%' PASSWORD EXPIRE NEVER;
flush privileges;
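To confirm both the remote-login grant and the expiry setting, a quick check against the privilege tables (MySQL 5.7 column names):
SELECT user, host, password_expired, password_lifetime FROM mysql.user WHERE user = 'root';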
Hive
Download and configuration
https://dlcdn.apache.org/hive/hive-2.3.9/
Extract
tar -zxvf apache-hive-2.3.9-bin.tar.gz -C /data/tools/bigdata/
Copy the MySQL connector JAR
cp mysql-connector-java-8.0.28.jar /data/tools/bigdata/apache-hive-2.3.9-bin/lib/
Enter the directory
cd /data/tools/bigdata/apache-hive-2.3.9-bin
Copy and rename the config templates
cd conf
cp hive-default.xml.template hive-site.xml
cp hive-env.sh.template hive-env.sh
cp hive-log4j2.properties.template hive-log4j.properties
Modify the following settings in hive-site.xml:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1:3306/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>psvmc123</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
<description>Whether to include the current database in the Hive prompt.</description>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
<description>Whether to print the names of the columns in query output.</description>
</property>
<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
</description>
</property>
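If you want to compare behaviour, the same property can also be changed per session in the Hive CLI; a quick illustration (none is also accepted in Hive 2.x to disable the optimization):
SET hive.fetch.task.conversion=none;   -- force MapReduce even for simple SELECTs
SET hive.fetch.task.conversion=more;   -- restore the configured behaviour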
HiveServer2
<property>
<name>hive.server2.long.polling.timeout</name>
<value>5000ms</value>
<description>Time in milliseconds that HiveServer2 will wait, before responding to asynchronous calls that use long polling</description>
</property>
<property>
<name>hive.server2.thrift.bind.host</name>
<value>localhost</value>
<description>Bind host on which to run the HiveServer2 Thrift interface.
Can be overridden by setting $HIVE_SERVER2_THRIFT_BIND_HOST</description>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
<description>
Setting this property to true will have HiveServer2 execute
Hive operations as the user making the calls to it.
</description>
</property>
Search for every occurrence of
${system:java.io.tmpdir}
and replace it with
/data/tools/bigdata/apache-hive-2.3.9-bin/tmp
Modify the hive-log4j.properties configuration:
property.hive.log.dir = /data/tools/bigdata/apache-hive-2.3.9-bin/logs/
Initialize the metastore database
The Hive metastore database in MySQL is not initialized when Hive is installed, so it has to be initialized manually.
In MySQL:
# Drop any existing metastore database
drop database metastore;
# Create a fresh metastore database
create database metastore;
Run the following from the command line:
# Re-initialize the metastore schema
cd /data/tools/bigdata/apache-hive-2.3.9-bin
./bin/schematool -dbType mysql -initSchema
Otherwise you will get an error like
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.Session
and you will not be able to run any commands in Hive.
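As a sanity check, you can confirm in MySQL that the schema was created (a hedged example, assuming the default metastore schema layout):
-- run in the mysql client
use metastore;
show tables;
select SCHEMA_VERSION from VERSION;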
Start Hive
cd /data/tools/bigdata/apache-hive-2.3.9-bin
./bin/hive
List the tables
show tables;
Directory permissions
Adjust the permissions of the Hive-related directories in HDFS:
/data/tools/bigdata/hadoop-2.7.7/bin/hadoop fs -chmod 777 /tmp/
/data/tools/bigdata/hadoop-2.7.7/bin/hadoop fs -chmod 777 /user/hive/warehouse
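A quick end-to-end smoke test in the Hive CLI (a minimal sketch; the table name is arbitrary):
-- create a throwaway table, write one row (runs as a MapReduce job), read it back, drop it
CREATE TABLE IF NOT EXISTS smoke_test(id INT, name STRING);
INSERT INTO TABLE smoke_test VALUES (1, 'hello');
SELECT * FROM smoke_test;
DROP TABLE smoke_test;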
Hive client tool: DBeaver
https://dbeaver.io/
Link: https://pan.baidu.com/s/1XjhohO-JV7_PTPaD85sEtg  Extraction code: psvm
Start the Hive service (HiveServer2)
cd /data/tools/bigdata/apache-hive-2.3.9-bin
./bin/hiveserver2
Connect to Hive
Set the address.
By default DBeaver downloads the driver from GitHub, but the download times out.
It is better to add a local driver instead.
Download address:
Link: https://pan.baidu.com/s/1XjhohO-JV7_PTPaD85sEtg  Extraction code: psvm
Download the two JARs from it.
Add the local driver.
After that, we can connect.
HQL
The following introduces commonly used SQL-like (HQL) statements in Hive.
Create a table:
create table t_user(id int,name string,password string);
Specify the field delimiter when creating a table
create table tablename(name string,password string) row format delimited fields terminated by ',';
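For reference, a file matching this delimiter and a local load might look like the following (the path and contents are hypothetical):
-- /home/hadoop/input/users.csv contains lines such as:
--   tom,123456
--   jerry,654321
load data local inpath '/home/hadoop/input/users.csv' into table tablename;
select * from tablename;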
Load data into a table
load data inpath '/user/hadoop/output7/part-r-00000' into table t_table01;
Create a new table with the same structure as an existing one
create table table02 like table01;
Create a partitioned table
create table tablename(id int,line string) partitioned by (dt string,country string);
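Partitions can then be added and queried by the partition columns; a small sketch with example partition values:
-- add a partition explicitly, then filter on the partition columns
alter table tablename add partition (dt='2018-11-30', country='CN');
select * from tablename where dt='2018-11-30' and country='CN';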
Count the records in a table (here: how many records have count greater than 50)
select count(*) from tablename where count>50;
Sorting with ORDER BY (query records with count greater than 50 and sort by count)
select * from tablename where count > 50 order by count;
Show the partitions of a table
show partitions tablename;
Show all tables
show tables;
Show all tables whose names start with t
show tables 't*';
Show the structure of a table
describe tablename;
Rename a table
alter table table01 rename to table02;
Add new columns to an existing table
alter table tablename add columns(new_col2 int comment 'a comment');
alter table tablename add columns(new_col3 int);
Drop a table
drop table tablename;
Load data from a local file:
LOAD DATA LOCAL INPATH '/home/hadoop/input/sample.txt' OVERWRITE INTO TABLE records;
Load data into a partition of a table
load data inpath '/user/hive/warehouse/part-r-00000' overwrite into table clickstream_log PARTITION(dt = '2018-11-30');
Show all functions
show functions;
Show how to use a function
describe function substr;
Access array, map, and struct columns
select col1[0],col2['b'],col3.c from complex;
Inner join:
SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
See how many MapReduce jobs Hive will use for a query
EXPLAIN SELECT sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
Outer joins:
SELECT sales.*, things.* FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
SELECT sales.*, things.* FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
SELECT sales.*, things.* FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
IN queries: older Hive versions do not support IN subqueries, but LEFT SEMI JOIN achieves the same result
SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
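For comparison, the same intent written with an IN subquery (accepted by newer Hive releases); like the semi join, it returns only columns from things:
SELECT * FROM things WHERE things.id IN (SELECT id FROM sales);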
Map join:
Hive can place the smaller table in each mapper's memory to perform the join.
SELECT /*+ MAPJOIN(things) */ sales.*, things.* FROM sales JOIN things ON (sales.id = things.id);
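Hive can also decide this automatically based on table size; a hedged example of the relevant session settings (the threshold shown is the usual default):
-- let Hive convert joins to map joins when one side is small enough
SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;  -- threshold in bytes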
INSERT OVERWRITE TABLE … SELECT:
The target tables must already exist (a sketch of matching tables follows the example).
FROM records2
INSERT OVERWRITE TABLE stations_by_year SELECT year, COUNT(DISTINCT station) GROUP BY year
INSERT OVERWRITE TABLE records_by_year SELECT year, COUNT(1) GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year SELECT year, COUNT(1) WHERE temperature != 9999 AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) GROUP BY year;
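For context, a sketch of target tables that such a multi-insert expects to exist beforehand (column names and types are assumptions inferred from the SELECT list):
-- hypothetical targets for the multi-table insert above
CREATE TABLE stations_by_year (year STRING, num_stations BIGINT);
CREATE TABLE records_by_year (year STRING, num_records BIGINT);
CREATE TABLE good_records_by_year (year STRING, num_good_records BIGINT);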
CREATE TABLE … AS SELECT:
The new table must not exist beforehand.
CREATE TABLE target AS SELECT col1,col2 FROM source;
Create a view:
CREATE VIEW valid_records AS SELECT * FROM records2 WHERE temperature !=9999;
Show detailed information about a view:
DESCRIBE EXTENDED valid_records;
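A view is then queried like an ordinary table, for example:
SELECT * FROM valid_records LIMIT 10;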