大数据环境搭建-Hive和Mysql

2022-04-19 08:56:11 浏览数 (1)

前言

Hive是基于Hadoop的一个数据仓库工具,可以将结构化的数据文件映射为一张数据库表,并提供简单的sql查询功能,可以将sql语句转换为MapReduce任务进行运行。

其优点是学习成本低,可以通过类SQL语句快速实现简单的MapReduce统计。

安装Mysql

https://www.psvmc.cn/article/2018-06-25-mysql-tips.html

安装

代码语言:javascript复制
curl -o /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
yum makecache
yum install wget

wget -i https://dev.mysql.com/get/mysql57-community-release-el7-10.noarch.rpm
yum localinstall mysql57-community-release-el7-10.noarch.rpm
yum repolist enabled | grep mysql

安装mysql 服务器 命令:

代码语言:javascript复制
yum install -y mysql-community-server --nogpgcheck

启动mysql 命令:

代码语言:javascript复制
service mysqld start

查看mysql是否自启动,并且设置开启自启动 命令:

代码语言:javascript复制
chkconfig --list | grep mysqld
chkconfig mysqld on

配置修改

表名/编码/连接数/数据包大小

设置表名不区分大小写/字符编码/连接数

修改 /etc/my.cnf

代码语言:javascript复制
vi /etc/my.cnf

添加以下的5行

代码语言:javascript复制
[mysqld]
lower_case_table_names=1
character_set_server = utf8
max_connections = 1000
max_allowed_packet = 100M
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

重启

代码语言:javascript复制
service mysqld restart

修改密码

默认密码

打开日志文件

代码语言:javascript复制
vi /var/log/mysqld.log

找到密码

代码语言:javascript复制
[Note] A temporary password is generated for root@localhost: tr&:DeYtt8 k

修改密码:

MySQL 5.7推荐使用ALTER USER修改密码 MySQL修改用户的密码主要有:ALTER USER 和UPDATE用户表

ALTER USER

推荐用此方式

代码语言:javascript复制
mysql -uroot -p

连接上以后

代码语言:javascript复制
set global validate_password_policy=0;
set global validate_password_length=1;
alter user user() identified by "psvmc123";
flush privileges;
quit

允许远程登录

代码语言:javascript复制
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'psvmc123' WITH GRANT OPTION;
FLUSH PRIVILEGES; 
quit

设置密码永不过期

代码语言:javascript复制
ALTER USER 'root'@'%' PASSWORD EXPIRE NEVER;
flush privileges;

Hive

下载与配置

https://dlcdn.apache.org/hive/hive-2.3.9/

解压

代码语言:javascript复制
tar -zxvf apache-hive-2.3.9-bin.tar.gz -C /data/tools/bigdata/

复制Mysql连接的JAR

代码语言:javascript复制
cp mysql-connector-java-8.0.28.jar /data/tools/bigdata/apache-hive-2.3.9-bin/lib/

进入目录

代码语言:javascript复制
cd /data/tools/bigdata/apache-hive-2.3.9-bin

配置重命名

代码语言:javascript复制
cd conf
cp hive-default.xml.template hive-site.xml
cp hive-env.sh.template hive-env.sh
cp hive-log4j2.properties.template hive-log4j.properties

hive-site.xml中修改下面的配置

代码语言:javascript复制
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1:3306/metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>psvmc123</value>
  <description>password to use against metastore database</description>
</property>

<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
  <description>Whether to include the current database in the Hive prompt.</description>
</property>

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
  <description>Whether to print the names of the columns in query output.</description>
</property>

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have
    any aggregations or distincts (which incurs RS), lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
  </description>
</property>

HiveServer2

代码语言:javascript复制
<property>
  <name>hive.server2.long.polling.timeout</name>
  <value>5000ms</value>
  <description>Time in milliseconds that HiveServer2 will wait, before responding to asynchronous calls that use long polling</description>
</property>
<property>
  <name>hive.server2.thrift.bind.host</name>
  <value>localhost</value>
  <description>Bind host on which to run the HiveServer2 Thrift interface.
  Can be overridden by setting $HIVE_SERVER2_THRIFT_BIND_HOST</description>
</property>
<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
  <description>
   Setting this property to true will have HiveServer2 execute
    Hive operations as the user making the calls to it.
  </description>
</property>

搜索所有的

代码语言:javascript复制
${system:java.io.tmpdir}

替换为

代码语言:javascript复制
/data/tools/bigdata/apache-hive-2.3.9-bin/tmp

修改hive-log4j.properties配置

代码语言:javascript复制
property.hive.log.dir = /data/tools/bigdata/apache-hive-2.3.9-bin/logs/

初始化库

Hive的数据库MySQL在安装的时候没有初始化

在MySQL中

代码语言:javascript复制
# 删除mysql中的元数据库
drop database metastore;
# 新建一个元数据库
create database metastore;

命令行中执行

代码语言:javascript复制
# 在命令行中,重新初始化
cd /data/tools/bigdata/apache-hive-2.3.9-bin
./bin/schematool -dbType mysql -initSchema

否则会报错

java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.Session

在Hive中无法执行任何命令

启动Hive

代码语言:javascript复制
cd /data/tools/bigdata/apache-hive-2.3.9-bin
./bin/hive

查询表

代码语言:javascript复制
show tables;

目录权限

修改HDFS系统中关于Hive的一些目录权限

代码语言:javascript复制
/data/tools/bigdata/hadoop-2.7.7/bin/hadoop fs -chmod 777 /tmp/
/data/tools/bigdata/hadoop-2.7.7/bin/hadoop fs -chmod 777 /user/hive/warehouse

Hive连接工具DBever

https://dbeaver.io/

链接:https://pan.baidu.com/s/1XjhohO-JV7_PTPaD85sEtg 提取码:psvm

启动Hive服务

代码语言:javascript复制
cd /data/tools/bigdata/apache-hive-2.3.9-bin
./bin/hiveserver2

连接Hive

设置地址

默认驱动会从Github上下载,但是下载超时

这里建议添加本地驱动

下载地址:

链接:https://pan.baidu.com/s/1XjhohO-JV7_PTPaD85sEtg 提取码:psvm

下载其中的两个Jar

添加本地驱动

这样我们就可以连接了。

HSQL

以下介绍常用的Hive的类SQL语句。

创建表:

代码语言:javascript复制
create table t_user(id int,name string,password string);

创建表时指定分隔符

代码语言:javascript复制
create table tablename(name string,password string)row format delimited fields terminated by ',';

加载表

代码语言:javascript复制
load data inpath '/user/hadoop/output7/part-r-00000' into table t_table01;

创建一个新表,结构与某表一样

代码语言:javascript复制
create table table02 like table01;

创建分区表

代码语言:javascript复制
create table tablename(id int,line string) partitioned by (dt string,country string);

显示表里有多少条记录(count 数大于50的有多少条记录)

代码语言:javascript复制
select count(*) from tablename where count>50;

排序用法order by (查询count 数大于50并排序)

代码语言:javascript复制
select * from tablename where count > 50 order by count;

显示表中有多少分区

代码语言:javascript复制
show partitions tablename;

显示所有表

代码语言:javascript复制
show tables;

显示所有与t开头的表

代码语言:javascript复制
show tables 't*';

显示表的结构信息

代码语言:javascript复制
describe tablename;

修改表名字

代码语言:javascript复制
alter table table01 rename to table02;

在原表上新添加一列

代码语言:javascript复制
alter table tablename add columns(new_col2 int comment 'a commment');

alter table tablename add columns(new_col3 int);

删除表

代码语言:javascript复制
drop table tablename;

从本地文件加载数据:

代码语言:javascript复制
LOAD DATA LOCAL INPATH '/home/hadoop/input/sample.txt' OVERWRITE INTO TABLE records;

加载分区表

代码语言:javascript复制
load data inpath '/user/hive/warehouse/part-r-00000' overwrite into table clickstream_log PARTITION(dt = '2018-11-30');

显示所有函数

代码语言:javascript复制
show functions;

查看函数的用法

代码语言:javascript复制
describe function substr;

查看数组、map、结构

代码语言:javascript复制
select col1[0],col2['b'],col3.c from complex;

查看数组、map、结构

代码语言:javascript复制
select col1[0],col2['b'],col3.c from complex;

内连接:

代码语言:javascript复制
SELECT sales., things. FROM sales JOIN things ON (sales.id = things.id);

查看hive为某个查询使用多少个MapReduce作业

代码语言:javascript复制
Explain SELECT sales., things. FROM sales JOIN things ON (sales.id = things.id);

外连接:

代码语言:javascript复制
SELECT sales., things. FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
SELECT sales., things. FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
SELECT sales., things. FROM sales FULL OUTER JOIN things ON (sales.id = things.id);

in查询:Hive不支持,但可以使用LEFT SEMI JOIN

代码语言:javascript复制
SELECT * FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);

Map连接:

Hive可以把较小的表放入每个Mapper的内存来执行连接操作

代码语言:javascript复制
SELECT /  MAPJOIN(things) / sales., things. FROM sales JOIN things ON (sales.id = things.id);

INSERT OVERWRITE TABLE …SELECT:

新表预先存在

代码语言:javascript复制
FROM records2
INSERT OVERWRITE TABLE stations_by_year SELECT year, COUNT(DISTINCT station) GROUP BY year
INSERT OVERWRITE TABLE records_by_year SELECT year, COUNT(1) GROUP BY year
INSERT OVERWRITE TABLE good_records_by_year SELECT year, COUNT(1) WHERE temperature != 9999 AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) GROUP BY year;

CREATE TABLE … AS SELECT:

新表表预先不存在

代码语言:javascript复制
CREATE TABLE target AS SELECT col1,col2 FROM source;

创建视图:

代码语言:javascript复制
CREATE VIEW valid_records AS SELECT * FROM records2 WHERE temperature !=9999;

查看视图详细信息:

代码语言:javascript复制
DESCRIBE EXTENDED valid_records;

0 人点赞