Hive Table Sampling – Concept and Example

2021-12-16 11:01:12 浏览数 (1)

Hive Table Sampling - Concept, Methods and Example, Hive Table Sampling Concept

The Relational databases like SQL server supports writing queries on a relatively small number of rows from the very large table. In this article, we will check Hive table sampling concept, methods and some examples.

The Hive TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table. The sampling comes handy when you are working on the large tables and it takes time to return results. The TABLESAMPLE clause can be added to any table in the FROM clause.

Type of Hive Sampling

There are two type of Hive tables sampling

  • Sampling Bucketized Table
  • Hive Block Sampling

Hive Table Sampling Syntax

Bucketized Sampling

Following is the syntax of the Bucketized Sampling

代码语言:javascript复制
SELECT *
FROM source TABLESAMPLE (BUCKET x OUT OF y [ON colname]) s;

Block Sampling

代码语言:javascript复制
SELECT *
FROM source TABLESAMPLE (n PERCENT) s;

Where, the BUCKET is numbered starting from 1. colname indicates the column on which to sample each row in the table. Instead of colname, use rand() indicating sampling on the entire row instead of an individual column.

And n is the percent of data size in case of block sampling.

Hive Sampling Bucketized Table

The sampling Bucketized table allows you to get sample records using the number of buckets. The Bucketized sampling method can be used when your tables are bucketed.

You can provide the bucket number starting from 1 along with colname on which to sample each row in the Hive table. You can also use rand() indicating sampling on the entire row instead of an individual column.

For example, following example provides random sample rows from the bucket 1.

代码语言:javascript复制
SELECT *
FROM SAMPLE_DEMO TABLESAMPLE(BUCKET 1 OUT OF 3 ON rand()) s;
代码语言:javascript复制
 ------- --------- -------- -- 
| s.id  | s.name  | s.mnt  |
 ------- --------- -------- -- 
| 3     | ccc     | 1      |
| 11    | kkk     | 4      |
 ------- --------- -------- -- 

Hive Block Sampling

This sampling method will allow Hive to pick up at least n% data size. Note that, PERCENT doesn’t necessarily mean the number of rows, it is the percentage of table size. If your table is small then it may return all rows.

For example, in the following example the input size 0.1% or more will be used for the query.

代码语言:javascript复制
SELECT *
FROM SAMPLE_DEMO TABLESAMPLE(0.01 PERCENT) s;
代码语言:javascript复制
 ------- --------- -------- -- 
| s.id  | s.name  | s.mnt  |
 ------- --------- -------- -- 
| 1     | aaa     | 1      |
| 2     | bbb     | 1      |
| 3     | ccc     | 1      |
 ------- --------- -------- -- 

Hope this helps

0 人点赞