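HBase supports millions of columns and billions of rows, which makes it well suited to storing very large data sets. Sometimes you need to pull a single record out of that data to verify it, and that is where HBase filters come in. This article briefly introduces several commonly used filtering methods.
A fresh HBase installation already contains the default namespaces (a namespace is comparable to a schema). Here we create a new namespace named test: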
create_namespace 'test'
List the namespaces:
list_namespace
Create the student table:
create 'test:student', 'infomation'
List the tables:
list
List the tables in a given namespace:
list_namespace_tables 'test'
Insert data:
put 'test:student', '001','infomation:name_','Alex'
put 'test:student', '001','infomation:age__','13'
put 'test:student', '001','infomation:sex__','Male'
put 'test:student', '001','infomation:class','3.1'
put 'test:student', '002','infomation:name_','Bob'
put 'test:student', '002','infomation:age__','13'
put 'test:student', '002','infomation:sex__','Male'
put 'test:student', '002','infomation:class','3.2'
put 'test:student', '003','infomation:name_','Cindy'
put 'test:student', '003','infomation:age__','13'
put 'test:student', '003','infomation:sex__','Female'
put 'test:student', '003','infomation:class','3.3'
put 'test:student', '004','infomation:name_','Dama'
put 'test:student', '004','infomation:age__','13'
put 'test:student', '004','infomation:sex__','Female'
put 'test:student', '004','infomation:class','3.4'
put 'test:student', '005','infomation:name_','Ella'
put 'test:student', '005','infomation:age__','13'
put 'test:student', '005','infomation:sex__','Female'
put 'test:student', '005','infomation:class','3.5'
Filter by row key (row filter)
hbase:231:0> scan 'test:student', FILTER => "RowFilter(=,'substring:003')"
ROW COLUMN CELL
003 column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
003 column=infomation:class, timestamp=2022-03-13T14:45:00.259, value=3.3
003 column=infomation:name_, timestamp=2022-03-13T14:45:00.227, value=Cindy
003 column=infomation:sex__, timestamp=2022-03-13T14:45:00.249, value=Female
1 row(s)
Took 0.0105 seconds
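The substring comparator used above keeps every row whose key contains 003 anywhere, while a binary comparator (RowFilter(=,'binary:003')) keeps only the exact key. The sketch below models that difference in plain Ruby. It does not call HBase, and the row key 1003 is a hypothetical value added only to make the difference visible.

```ruby
# Plain-Ruby model of the two comparators; this is not the HBase API.
# '1003' is a hypothetical extra row key, not part of the sample table.
row_keys = %w[001 002 003 1003]

# Models RowFilter(=,'substring:003'): keep keys containing "003" anywhere.
substring_hits = row_keys.select { |key| key.include?('003') }

# Models RowFilter(=,'binary:003'): keep keys equal to "003" exactly.
binary_hits = row_keys.select { |key| key == '003' }

puts "substring: #{substring_hits.inspect}"
puts "binary:    #{binary_hits.inspect}"
```

With the five-row sample table both comparators return the same single row, so the choice only matters once longer keys that embed the same digits exist.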
Filter by row key prefix
hbase:233:0> scan 'test:student',{FILTER=>"PrefixFilter('00')"}
ROW COLUMN CELL
001 column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
001 column=infomation:class, timestamp=2022-03-13T14:45:00.149, value=3.1
001 column=infomation:name_, timestamp=2022-03-13T14:45:00.106, value=Alex
001 column=infomation:sex__, timestamp=2022-03-13T14:45:00.132, value=Male
002 column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
002 column=infomation:class, timestamp=2022-03-13T14:45:00.210, value=3.2
002 column=infomation:name_, timestamp=2022-03-13T14:45:00.171, value=Bob
002 column=infomation:sex__, timestamp=2022-03-13T14:45:00.197, value=Male
003 column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
003 column=infomation:class, timestamp=2022-03-13T14:45:00.259, value=3.3
003 column=infomation:name_, timestamp=2022-03-13T14:45:00.227, value=Cindy
003 column=infomation:sex__, timestamp=2022-03-13T14:45:00.249, value=Female
004 column=infomation:age__, timestamp=2022-03-13T14:45:00.285, value=13
004 column=infomation:class, timestamp=2022-03-13T14:45:00.309, value=3.5
004 column=infomation:name_, timestamp=2022-03-13T14:45:00.275, value=Dama
004 column=infomation:sex__, timestamp=2022-03-13T14:45:00.298, value=Female
005 column=infomation:age__, timestamp=2022-03-13T14:45:00.336, value=13
005 column=infomation:class, timestamp=2022-03-13T14:45:01.882, value=3.5
005 column=infomation:name_, timestamp=2022-03-13T14:45:00.324, value=Ella
005 column=infomation:sex__, timestamp=2022-03-13T14:45:00.349, value=Female
5 row(s)
Took 0.0110 seconds
Filter by column prefix
hbase:235:0> scan 'test:student',{FILTER=>"ColumnPrefixFilter('a')"}
ROW COLUMN CELL
001 column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
002 column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
003 column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
004 column=infomation:age__, timestamp=2022-03-13T14:45:00.285, value=13
005 column=infomation:age__, timestamp=2022-03-13T14:45:00.336, value=13
5 row(s)
Took 0.0100 seconds
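ColumnPrefixFilter compares its argument with the column qualifier (the part after the colon), not with the column family name. That is why the prefix a returns only the age__ cells. A minimal plain-Ruby model of this behavior, with the helper name column_prefix_filter chosen here purely for illustration:

```ruby
# Plain-Ruby model of ColumnPrefixFilter; this is not the HBase API.
columns = %w[infomation:age__ infomation:class infomation:name_ infomation:sex__]

# Hypothetical helper: test the prefix against the qualifier only,
# i.e. the text after the first colon, never the family name.
def column_prefix_filter(columns, prefix)
  columns.select { |col| col.split(':', 2).last.start_with?(prefix) }
end

a_hits = column_prefix_filter(columns, 'a') # qualifier "age__" matches
i_hits = column_prefix_filter(columns, 'i') # family "infomation" is ignored

puts "prefix 'a': #{a_hits.inspect}"
puts "prefix 'i': #{i_hits.inspect}"
```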
Filter by row key range
hbase:236:0> scan 'test:student',{STARTROW=>'001',STOPROW=>'003'}
ROW COLUMN CELL
001 column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
001 column=infomation:class, timestamp=2022-03-13T14:45:00.149, value=3.1
001 column=infomation:name_, timestamp=2022-03-13T14:45:00.106, value=Alex
001 column=infomation:sex__, timestamp=2022-03-13T14:45:00.132, value=Male
002 column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
002 column=infomation:class, timestamp=2022-03-13T14:45:00.210, value=3.2
002 column=infomation:name_, timestamp=2022-03-13T14:45:00.171, value=Bob
002 column=infomation:sex__, timestamp=2022-03-13T14:45:00.197, value=Male
2 row(s)
Took 0.0082 seconds
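The scan above returns only rows 001 and 002 because STARTROW is inclusive and STOPROW is exclusive, and because row keys are compared byte by byte rather than numerically. The following plain-Ruby sketch models that half-open range. The helper name row_range and the unpadded keys are assumptions made for illustration only.

```ruby
# Plain-Ruby model of a STARTROW/STOPROW scan; this is not the HBase API.
row_keys = %w[001 002 003 004 005]

# Hypothetical helper: keep keys in the half-open range [start_row, stop_row).
def row_range(keys, start_row, stop_row)
  keys.sort.select { |key| key >= start_row && key < stop_row }
end

in_range = row_range(row_keys, '001', '003')
puts "range 001..003: #{in_range.inspect}"

# Byte-wise ordering is also why the sample keys are zero-padded:
# without padding, "10" sorts between "1" and "2".
unpadded = %w[1 2 10].sort
puts "unpadded order: #{unpadded.inspect}"
```

To include row 003 in the result, the stop row has to be a key that sorts after it, for example 004.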
Filter by row key range and column prefix
hbase:237:0> scan 'test:student',{STARTROW=>'001',STOPROW=>'003',FILTER=>"ColumnPrefixFilter('a')"}
ROW COLUMN CELL
001 column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
002 column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
2 row(s)
Took 0.0075 seconds
Filter by row key and column
hbase:253:0> scan 'test:student', {FILTER => "(RowFilter(=,'substring:003')) AND (ColumnPrefixFilter('age'))" }
ROW COLUMN CELL
003 column=infomation:age__, timestamp=2022-03-13T14:45:00.238, value=13
1 row(s)
Took 0.0090 seconds
Filter by row key range and column value
hbase:254:0> scan 'test:student',{STARTROW=>'001',STOPROW=>'003',FILTER=>"ValueFilter(=,'binary:13')"}
ROW COLUMN CELL
001 column=infomation:age__, timestamp=2022-03-13T14:45:00.118, value=13
002 column=infomation:age__, timestamp=2022-03-13T14:45:00.186, value=13
2 row(s)
Took 0.0433 seconds
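ValueFilter tests the value of every cell, whatever its column. The scan above returns only age__ cells because no other column in the sample data happens to hold 13. The plain-Ruby model below shows what would change if another column held the same value. The rank_ column is hypothetical and does not exist in the sample table.

```ruby
# Plain-Ruby model of ValueFilter; this is not the HBase API.
Cell = Struct.new(:row, :column, :value)

cells = [
  Cell.new('001', 'infomation:age__', '13'),
  Cell.new('001', 'infomation:class', '3.1'),
  Cell.new('001', 'infomation:rank_', '13'), # hypothetical extra column
  Cell.new('002', 'infomation:age__', '13'),
  Cell.new('002', 'infomation:class', '3.2')
]

# Models ValueFilter(=,'binary:13'): any cell whose value is exactly "13".
value_hits = cells.select { |c| c.value == '13' }

# Models adding a column condition so that only age__ cells are considered.
age_hits = cells.select { |c| c.column == 'infomation:age__' && c.value == '13' }

puts "value only:     #{value_hits.map(&:column).inspect}"
puts "value + column: #{age_hits.map(&:column).inspect}"
```

In the shell, one way to express the narrower condition would be to combine the two filters with AND in the same way as the earlier example, for instance (ColumnPrefixFilter('age')) AND (ValueFilter(=,'binary:13')). That combination was not run for this article.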
The methods above cover most HBase data-filtering needs. If you have a case that is not covered here, feel free to leave a comment.