数据分析工具篇——HQL中DDL操作

建表语句

1）表的创建方法：

建表：

代码语言：javascript复制

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
  [(col_name data_type [COMMENT col_comment], ...)] 
  [COMMENT table_comment] 
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
  [ROW FORMAT row_format] 
  [STORED AS file_format] 
  [LOCATION hdfs_path]

说明：

1.1） CREATE TABLE 创建一个指定名字的表。如果相同名字的表已经存在，则抛出异常；用户可以用 IF NOT EXISTS 选项来忽略这个异常。

1.2） EXTERNAL关键字可以让用户创建一个外部表，在建表的同时指定一个指向实际数据的路径（LOCATION），Hive 创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。

LOCATOION：在建外部表时指定的数据存储目录。

代码语言：javascript复制

create external table fz_external_table(
id int,
name string
age int,
tel string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/external/fz_external_table';

1.3）LIKE 允许用户复制现有的表结构，但是不复制数据：

代码语言：javascript复制

CREATE TABLE t4 like t2;

1.4）ROW FORMAT：创建表时指定的数据切分格式

代码语言：javascript复制

DELIMITED [FIELDS TERMINATED BY char]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
|SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value,property_name=property_value, ...)]

用户在建表的时候可以自定义 SerDe 或者使用自带的 SerDe。如果没有指定 ROW FORMAT 或者 ROW FORMAT DELIMITED，将会使用自带的 SerDe。在建表的时候，用户还需要为表指定列，用户在指定表的列的同时也会指定自定义的SerDe，Hive通过 SerDe 确定表的具体的列的数据。

SerDe是Serialize/Deserilize的简称，目的是用于序列化和反序列化

例如：

代码语言：javascript复制

CREATE TABLE psn (
    id int,
    name string,
    hobbies ARRAY <string>,
    address MAP <string, string>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '-'
MAP KEYS TERMINATED BY ':';

1.5）STORED AS:

代码语言：javascript复制

SEQUENCEFILE|TEXTFILE|RCFILE

如果文件数据是纯文本，可以使用 STORED AS TEXTFILE：默认格式，数据不做压缩，磁盘开销大，数据解析开销大。

如果数据需要压缩，使用 STORED AS SEQUENCEFILE：支持三种压缩选择：NONE，RECORD，BLOCK。Record压缩率低，一般建议使用BLOCK压缩。

RCFILE是一种行列存储相结合的存储方式。首先，其将数据按行分块，保证同一个record在一个块上，避免读一个记录需要读取多个block。其次，块数据列式存储，有利于数据压缩和快速的列存取。

代码语言：javascript复制

CREATE TABLE if not exists testfile_table(
    site string,
    url  string,
    pv   bigint,
    label string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'
STORED AS textfile;
LOAD DATA LOCAL INPATH '/app/weibo.txt'
OVERWRITE INTO TABLE textfile_table;

1.6）CLUSTERED BY:

对于每一个表（table）或者分区， Hive可以进一步组织成桶，也就是说桶是更为细粒度的数据范围划分。Hive也是针对某一列进行桶的组织。Hive采用对列值哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。

SORTED BY：对桶中的一个或多个列另外排序。

把表（或者分区）组织成桶（Bucket）有两个理由：

1.6.1）获得更高的查询处理效率。桶为表加上了额外的结构，Hive 在处理有些查询时能利用这个结构。具体而言，连接两个在（包含连接列的）相同列上划分了桶的表，可以使用 Map 端连接（Map-side join）高效的实现。比如JOIN操作。对于JOIN操作两个表有一个相同的列，如果对这两个表都进行了桶操作。那么将保存相同列值的桶进行JOIN操作就可以，可以大大较少JOIN的数据量。

1.6.2）使取样（sampling）更高效。在处理大规模数据集时，在开发和修改查询的阶段，如果能在数据集的一小部分数据上试运行查询，会带来很多方便。

代码语言：javascript复制

create table stu_buck(
sno int,
sname string,
sex string,
sage int,
sdept string)
clustered by(sno) sorted by(sno DESC) into 4 buckets
row format delimited fields terminated by ',';

1.7）COMMENT：可以为表与字段增加描述

1.8）PARTITIONED BY ：根据分区来查看表中的内容，每个分区以文件夹的形式单独存在表文件夹的目录下。

2）具体实例：

2.1）创建内部表的三种方式：

2.1.1）第一种方式：

代码语言：javascript复制

Create table if not exists default.weblog(
Ip string,
Time string,
Req_url string,
Status string,
Size string)
Row format delimited fields terminated by ‘t’;

数据存储：

代码语言：javascript复制

Load data local inpath’/opt/hive/weblog.txt’ into table default.weblog;

查看数据：

代码语言：javascript复制

Select * from default.weblog;

2.1.2）第二种方式：

代码语言：javascript复制

Create table default.weblog
As select ip, time, req_url, from default.weblog;

查看表结构：

代码语言：javascript复制

Desc default.weblog;

2.1.3）第三种方式：

代码语言：javascript复制

Create table if not exists default.weblog_s like default.weblog;
Select * from weblog;
Desc weblog;

2.2）创建外部表pageview：

代码语言：javascript复制

Create external table if not exists pageview(
Pageid int,
Page_url string comment ‘the page url’)
Row format delimited fields terminated by ‘,’
Location ‘hdfs://192.168.11.191:9000/user/hive/warehouse/’;

2.3）创建分区表invites：

分区是指将数据放在多个单独的文件夹中，减少运行的时间；

代码语言：javascript复制

create table student_p(Sno int,Sname string,Sex string,Sage int,Sdept string)
partitioned by(part string)
row format delimited fields terminated by ','
stored as textfile;

2.3.1）内部分区表：

代码语言：javascript复制

create table student(id int,namestring) 
partitioned by (cls string) 
row format delimited fields terminated by 't';

2.3.2）分区表装载数据：

代码语言：javascript复制

load data local inpath '/home/airib/work/hadoop/book' into table student partition (cls='class2');

2.3.3）外部分区表：可以单独的往某个分区中导入数据；

代码语言：javascript复制

create external table teacher (id int,name string)
partitioned by (cls string)
row format delimited fields terminated by 't' location '/user/teacher';
location '/user/teacher'

2.4）创建带桶的表：

如何理解分桶：

对于某一个表或者分区，hive可以进一步构建成桶，是更为细粒度的划分；一开始数据都是在一起的，建造表的时候会按照id将表分在四个文件中，分别命名为1,2,3,4；数据会对应的存入到这四个文件中，数据的存取方式为将数据按照id进行hash散列，然后按照hash散列分到四个文件中。

分桶建表的基本语法为：

代码语言：javascript复制

Create table student(
Id int,
Age int,
Name string)
Partitioned by (stat_Date string)
Clustered by (id) sorted by (age) into 2 buckets
Row format delimited fields terminated by ‘,’;

注意：

2.4.1）对于已有的数据，将数据导入到分桶的表中的时候是不主动分桶的，他只是记录了数据是分过桶的，文件没有变，因此在load数据的时候一般先分桶，再导入数据；

2.4.2）分桶处理时一般采用insert语句，分布进行，步骤如下：

首先：将数据load进入到一个普通的表中；

代码语言：javascript复制

Create table t_p(id string, name string)
Row format delimited fields terminated by ‘,’;

其次：有一个分桶的开关需要打开：

代码语言：javascript复制

set hive.enforce.bucketing = true;
set mapreduce.job.reduces = 4;

其三：然后将数据按照分桶原则从表中插入到分桶的表中，经历以下语句，数据就会被完整的分到四个桶中，四个桶分为四个不同的文件：

代码语言：javascript复制

Insert into table t_buck
Select id, name from t_p distribute by (id) sort by (id);

reduce的数量需要与分桶的数量一致。

修改表

1)增加/删除分区

1.1)语法结构：

代码语言：javascript复制

ALTER TABLE table_name
ADD [IF NOT EXISTS]
partition_spec [ LOCATION'location1' ]
partition_spec [ LOCATION 'location2' ] ...

其中partition_spec为：

代码语言：javascript复制

PARTITION (partition_col = partition_col_value, partition_col =partiton_col_value, ...)
ALTER TABLE table_name DROP partition_spec, partition_spec,...

1.1.1）增加分区：

代码语言：javascript复制

alter table student_p add partition(part='a') partition(part='b')；

1.1.2）删除分区：

代码语言：javascript复制

show partitions student;
alter table student_p drop partition(stat_date='20160105');
alter table student_p add partition(stat_date=’20140101’)
location ‘/user/hive/warehouse/student’ partition(stat_Date=’ 20140102’)

1.1.3)重命名表:

语法结构:

代码语言：javascript复制

ALTER TABLE table_name RENAME TO new_table_name

具体实例:

代码语言：javascript复制

alter table students renameto students1;
show tables;

2)增加/更新列:

2.1)语法结构:

代码语言：javascript复制

ALTER TABLE table_name
ADD|REPLACE COLUMNS (
col_name data_type[COMMENT col_comment], ...
)

与：

代码语言：javascript复制

ALTER TABLE table_name
CHANGE [COLUMN] col_old_name col_new_namecolumn_type
[COMMENT col_comment]
[FIRST|AFTER column_name]

注：ADD是代表新增一字段，字段位置在所有列后面(partition列前)，REPLACE则是表示替换表中所有字段，Change是修改列及属性。

2.1.1）增加列：

代码语言：javascript复制

Alter table students add columns(name1 string);

2.1.2）更新列：

代码语言：javascript复制

Alter table students replace columns(id int, age int, name string);

4）修改列：

代码语言：javascript复制

alter table students change columns salary salary double;

5）修改桶：

代码语言：javascript复制

alter table btest3 clustered by(name, age) sorted by(age) into10 buckets;

删除表

1）删除库

代码语言：javascript复制

drop database if exists db_name;

2）强制删除库

代码语言：javascript复制

drop database if exists db_name cascade;

3）删除表

代码语言：javascript复制

drop table if exists employee;

4）清空表

代码语言：javascript复制

truncate table employee;

5）清空表，第二种方式

代码语言：javascript复制

insert overwrite table employee select * from employee where 1=0;

6）删除分区

代码语言：javascript复制

alter table employee_table drop partition (stat_year_month>='2018-01');

7）按条件删除数据

代码语言：javascript复制

insert overwrite table employee_table select * from employee_table where id>'180203a15f';

8）删除列：

代码语言：javascript复制

alter table students drop columns id;

显示命令

常见的显示命令如下：

代码语言：javascript复制

show tables
show databases
show partitions
show functions
desc extended t_name;
desc formatted table_name;

代码语言：javascript复制

hive python 编程算法文件存储

0 人点赞