Filtering Data in HBase

2022-08-22 10:42:20

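HBase supports tables with millions of columns and billions of rows, which makes it well suited to storing massive datasets. Sometimes you need to pull one specific record out of that data to verify it, and that is where HBase filters come in. This article briefly introduces several commonly used filtering methods, all run from the HBase shell.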

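When you first log in to HBase, it already contains a default namespace (the HBase counterpart of a schema). Here we create a new namespace named test: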

create_namespace 'test'

List the namespaces:

list_namespace

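Create the student table in the test namespace with a single column family, infomation (the spelling used throughout this article):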

create 'test:student', 'infomation'

List all tables:

list

List the tables in a specific namespace:

list_namespace_tables 'test'

Insert data (the column qualifiers are padded with underscores to a fixed width):

put 'test:student', '001','infomation:name_','Alex'
put 'test:student', '001','infomation:age__','13'
put 'test:student', '001','infomation:sex__','Male'
put 'test:student', '001','infomation:class','3.1'

put 'test:student', '002','infomation:name_','Bob'
put 'test:student', '002','infomation:age__','13'
put 'test:student', '002','infomation:sex__','Male'
put 'test:student', '002','infomation:class','3.2'

put 'test:student', '003','infomation:name_','Cindy'
put 'test:student', '003','infomation:age__','13'
put 'test:student', '003','infomation:sex__','Female'
put 'test:student', '003','infomation:class','3.3'

put 'test:student', '004','infomation:name_','Dama'
put 'test:student', '004','infomation:age__','13'
put 'test:student', '004','infomation:sex__','Female'
put 'test:student', '004','infomation:class','3.4'

put 'test:student', '005','infomation:name_','Ella'
put 'test:student', '005','infomation:age__','13'
put 'test:student', '005','infomation:sex__','Female'
put 'test:student', '005','infomation:class','3.5'
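
Because the HBase shell is a JRuby REPL, repetitive put commands like the ones above can be generated from a small table of records. The sketch below only builds the command text and never talks to HBase, so it runs in any Ruby interpreter. The helper name put_commands is hypothetical and not part of HBase.

```ruby
# Build the text of one `put` command per cell.
# put_commands is a hypothetical helper; it only formats strings.
def put_commands(table, family, records)
  records.flat_map do |row_key, columns|
    columns.map do |qualifier, value|
      "put '#{table}', '#{row_key}','#{family}:#{qualifier}','#{value}'"
    end
  end
end

students = {
  '001' => { 'name_' => 'Alex',  'age__' => '13', 'sex__' => 'Male',   'class' => '3.1' },
  '002' => { 'name_' => 'Bob',   'age__' => '13', 'sex__' => 'Male',   'class' => '3.2' },
  '003' => { 'name_' => 'Cindy', 'age__' => '13', 'sex__' => 'Female', 'class' => '3.3' }
}

# Print the commands so they can be pasted into the HBase shell.
puts put_commands('test:student', 'infomation', students)
```

Keeping the records in one hash makes it harder for a copy-and-paste slip to put the wrong value in a row.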

Filter by row key (RowFilter). The substring comparator matches any row key that contains 003:

hbase:231:0> scan 'test:student', FILTER => "RowFilter(=,'substring:003')"
ROW                                                COLUMN CELL
 003                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
 003                                               column=infomation:class, timestamp=2022-03-13T14:45:00.259, value=3.3
 003                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.227, value=Cindy
 003                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.249, value=Female
1 row(s)
Took 0.0105 seconds
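
The comparator prefix inside the filter string decides how the row key is tested. The following is a rough plain-Ruby model of the four comparator prefixes in the HBase filter language (binary, binaryprefix, substring, regexstring) under the = operator. It is an illustration only, not HBase code: comparator_match? and row_filter_equal are hypothetical helper names, and Ruby's regex engine stands in for Java's.

```ruby
# Plain-Ruby model of RowFilter(=, '<type>:<arg>'); not HBase code.
# comparator_match? and row_filter_equal are hypothetical helpers.
def comparator_match?(spec, row_key)
  type, _sep, arg = spec.partition(':')
  case type
  when 'binary'       then row_key == arg                           # exact match
  when 'binaryprefix' then row_key.start_with?(arg)                 # leading bytes match
  when 'substring'    then row_key.downcase.include?(arg.downcase)  # case-insensitive contains
  when 'regexstring'  then !Regexp.new(arg).match(row_key).nil?     # pattern found anywhere
  else raise ArgumentError, "unknown comparator type: #{type}"
  end
end

def row_filter_equal(row_keys, spec)
  row_keys.select { |key| comparator_match?(spec, key) }
end

row_keys = %w[001 002 003 004 005]
p row_filter_equal(row_keys, 'substring:003')       # => ["003"]
p row_filter_equal(row_keys, 'binary:003')          # => ["003"]
p row_filter_equal(row_keys, 'regexstring:00[12]')  # => ["001", "002"]
```

For an exact row key lookup, binary is the stricter choice, since substring would also match a longer key such as 10030.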

Filter by row key prefix:

hbase:233:0> scan 'test:student',{FILTER=>"PrefixFilter('00')"}
ROW                                                COLUMN CELL
 001                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
 001                                               column=infomation:class, timestamp=2022-03-13T14:45:00.149, value=3.1
 001                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.106, value=Alex
 001                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.132, value=Male
 002                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
 002                                               column=infomation:class, timestamp=2022-03-13T14:45:00.210, value=3.2
 002                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.171, value=Bob
 002                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.197, value=Male
 003                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
 003                                               column=infomation:class, timestamp=2022-03-13T14:45:00.259, value=3.3
 003                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.227, value=Cindy
 003                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.249, value=Female
 004                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.285, value=13
 004                                               column=infomation:class, timestamp=2022-03-13T14:45:00.309, value=3.4
 004                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.275, value=Dama
 004                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.298, value=Female
 005                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.336, value=13
 005                                               column=infomation:class, timestamp=2022-03-13T14:45:01.882, value=3.5
 005                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.324, value=Ella
 005                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.349, value=Female
5 row(s)
Took 0.0110 seconds

Filter by column prefix. ColumnPrefixFilter matches against the column qualifier, not the column family:

hbase:235:0> scan 'test:student',{FILTER=>"ColumnPrefixFilter('a')"}
ROW                                                COLUMN CELL
 001                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
 002                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
 003                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
 004                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.285, value=13
 005                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.336, value=13
5 row(s)
Took 0.0100 seconds
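
PrefixFilter and ColumnPrefixFilter look alike but test different parts of a cell. The sketch below contrasts them in plain Ruby on a few of the sample cells. It is a simplified model rather than HBase code, and sample_cells, prefix_filter and column_prefix_filter are hypothetical names.

```ruby
# Plain-Ruby model; a cell is [row_key, family, qualifier, value].
# All helper names here are hypothetical.
def sample_cells
  [
    ['001', 'infomation', 'name_', 'Alex'],
    ['001', 'infomation', 'age__', '13'],
    ['002', 'infomation', 'name_', 'Bob'],
    ['002', 'infomation', 'age__', '13']
  ]
end

# PrefixFilter: keep cells whose row key starts with the prefix.
def prefix_filter(cells, prefix)
  cells.select { |row_key, _family, _qualifier, _value| row_key.start_with?(prefix) }
end

# ColumnPrefixFilter: keep cells whose qualifier starts with the prefix.
def column_prefix_filter(cells, prefix)
  cells.select { |_row_key, _family, qualifier, _value| qualifier.start_with?(prefix) }
end

p prefix_filter(sample_cells, '001').size          # => 2
p column_prefix_filter(sample_cells, 'a').size     # => 2 (the age__ cells)
p column_prefix_filter(sample_cells, 'info').size  # => 0, the family name is not tested
```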

Filter by row key range. STARTROW is inclusive and STOPROW is exclusive, so row 003 is not returned:

hbase:236:0> scan 'test:student',{STARTROW=>'001',STOPROW=>'003'}
ROW                                                COLUMN CELL
 001                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
 001                                               column=infomation:class, timestamp=2022-03-13T14:45:00.149, value=3.1
 001                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.106, value=Alex
 001                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.132, value=Male
 002                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
 002                                               column=infomation:class, timestamp=2022-03-13T14:45:00.210, value=3.2
 002                                               column=infomation:name_, timestamp=2022-03-13T14:45:00.171, value=Bob
 002                                               column=infomation:sex__, timestamp=2022-03-13T14:45:00.197, value=Male
2 row(s)
Took 0.0082 seconds
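
Row keys are byte strings that HBase keeps in lexicographic order, which is why a range scan behaves like a half-open interval over sorted strings. A minimal plain-Ruby sketch of that behavior follows. The helper range_scan is hypothetical, and the sketch models only the ordering, not HBase itself.

```ruby
# Plain-Ruby model of STARTROW/STOPROW; range_scan is a hypothetical helper.
# Keys are compared as strings: start is inclusive, stop is exclusive.
def range_scan(row_keys, start_row, stop_row)
  row_keys.sort.select { |key| key >= start_row && key < stop_row }
end

p range_scan(%w[001 002 003 004 005], '001', '003')  # => ["001", "002"]

# Lexicographic order is not numeric order:
p range_scan(%w[001 0021 003], '001', '003')         # => ["001", "0021"]
p %w[9 10 100].sort                                  # => ["10", "100", "9"]
```

This is one reason the sample data uses zero-padded keys such as 001: with a fixed width, lexicographic order and numeric order agree.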

Filter by row key range plus column prefix:

hbase:237:0> scan 'test:student',{STARTROW=>'001',STOPROW=>'003',FILTER=>"ColumnPrefixFilter('a')"}
ROW                                                COLUMN CELL
 001                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
 002                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
2 row(s)
Took 0.0075 seconds

Filter by row key plus column, combining two filters with AND:

hbase:253:0> scan 'test:student', {FILTER => "(RowFilter(=,'substring:003')) AND (ColumnPrefixFilter('age'))" }
ROW                                                COLUMN CELL
 003                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
1 row(s)
Took 0.0090 seconds
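
The filter language also accepts OR between parenthesized filters. The sketch below is a simplified per-cell model in plain Ruby of how AND and OR combine the two filters used above. The names scan_all and scan_any and the predicates are hypothetical, and real HBase evaluates filters with more machinery than this.

```ruby
# Plain-Ruby model of combining filters; a cell is [row_key, qualifier, value].
# scan_all models AND, scan_any models OR. Both names are hypothetical.
def scan_all(cells, *predicates)
  cells.select { |cell| predicates.all? { |pred| pred.call(cell) } }
end

def scan_any(cells, *predicates)
  cells.select { |cell| predicates.any? { |pred| pred.call(cell) } }
end

cells = [
  ['002', 'age__', '13'], ['002', 'name_', 'Bob'],
  ['003', 'age__', '13'], ['003', 'name_', 'Cindy']
]

row_contains_003 = ->(cell) { cell[0].include?('003') }      # RowFilter(=,'substring:003')
qualifier_is_age = ->(cell) { cell[1].start_with?('age') }   # ColumnPrefixFilter('age')

p scan_all(cells, row_contains_003, qualifier_is_age)        # => [["003", "age__", "13"]]
p scan_any(cells, row_contains_003, qualifier_is_age).size   # => 3
```

With OR, the result holds every cell of row 003 plus the age__ cell of every other row.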

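Filter by row key range plus column value. ValueFilter is evaluated per cell, so only the cells whose value equals 13 are returned: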

hbase:254:0> scan 'test:student',{STARTROW=>'001',STOPROW=>'003',FILTER=>"ValueFilter(=,'binary:13')"}
ROW                                                COLUMN CELL
 001                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
 002                                               column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
2 row(s)
Took 0.0433 seconds
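
With the binary comparator, values are compared as byte strings rather than numbers, which matters as soon as an operator other than = is used. A minimal plain-Ruby sketch of that behavior follows. The helper value_filter is hypothetical, and Ruby string comparison stands in for HBase's byte-wise comparison.

```ruby
# Plain-Ruby model of ValueFilter(<op>, 'binary:<arg>'); value_filter is hypothetical.
# A cell is [row_key, qualifier, value]; values are compared as strings.
def value_filter(cells, operator, arg)
  cells.select { |_row_key, _qualifier, value| value.public_send(operator, arg) }
end

cells = [
  ['001', 'age__', '13'],
  ['001', 'class', '3.1'],
  ['002', 'age__', '13'],
  ['002', 'class', '3.2']
]

# Equality keeps only the matching cells, not whole rows.
p value_filter(cells, :==, '13').map { |cell| cell[0, 2] }  # => [["001", "age__"], ["002", "age__"]]

# "13" sorts before "9" as a string, so a greater-than test against 9 finds nothing.
p value_filter(cells, :>, '9')                              # => []
```

If numeric comparisons are needed, one common approach is to store values in a fixed-width, zero-padded form so that string order and numeric order coincide.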

The methods above cover most HBase data-filtering needs. If there is a case not covered here, feel free to leave a comment.
