1. Background
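Crawlers and business pipelines often lose data while running. A gap may last one minute or several tens of minutes, with no regular pattern. Filling such gaps with the last valid value cannot be done with the lag function alone, because lag looks back a fixed number of rows while a gap can span any number of rows.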
2. Test data preparation
create table test(
    group_id string,
    times bigint,
    cnt bigint
)comment 'test data'
stored as textfile;
insert into test values('a',1,null);
insert into test values('a',2,10);
insert into test values('a',3,20);
insert into test values('a',4,null);
insert into test values('a',5,null);
insert into test values('a',6,30);
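With this data loaded, the limitation of lag described in the background can be checked directly. The query below is a minimal sketch and not part of the solution itself; the alias lag_filled is made up for illustration. It tries to fill each null with the value exactly one row back.

```sql
-- Sketch: fill nulls with a single lag step (alias lag_filled is hypothetical).
select
    group_id
    ,times
    ,cnt
    ,nvl(cnt, lag(cnt,1) over(distribute by group_id sort by times)) as lag_filled
from test;
-- times=4 picks up 20 from times=3, but times=5 stays null,
-- because the row before it (times=4) is itself null.
```

Increasing the lag offset does not help in general, since the length of a gap is not known in advance.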
3. Implementation
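select
    t1.group_id
    ,t1.times
    ,t1.cnt as ori_cnt        -- original value
    ,nvl(t2.cnt,0) as cnt     -- filled value; 0 when there is no earlier valid value
from (
    select
        group_id
        ,times
        ,cnt
        -- position of the row inside its run; rows of one null run share the same (data_rank-col_rank)
        ,row_number() over(distribute by group_id,(data_rank-col_rank) sort by times) as rank1
    from (
        select
            group_id
            ,times
            ,cnt
            -- position among all rows of the group
            ,row_number() over(distribute by group_id sort by times) as data_rank
            -- position when null rows are sorted first
            ,row_number() over(distribute by group_id sort by if(cnt is null,0,1),times) as col_rank
        from test
    ) t
) t1
left join test t2
    on t1.group_id=t2.group_id
    -- a null row looks up the row just before its run; this relies on times increasing by 1 per row
    and if(t1.cnt is null,(t1.times-t1.rank1),t1.times)=t2.times;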
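To see why the join condition lands on the right row, it can help to print the helper columns on their own. The query below is a sketch that reuses the two inner subqueries unchanged; the aliases diff and lookup_times are added here for illustration only. The values in the trailing comments were worked out by hand for the six test rows.

```sql
-- Sketch: expose the intermediate ranks (diff and lookup_times are hypothetical aliases).
select
    group_id
    ,times
    ,cnt
    ,data_rank
    ,col_rank
    ,diff
    ,rank1
    ,if(cnt is null,(times-rank1),times) as lookup_times
from (
    select
        group_id
        ,times
        ,cnt
        ,data_rank
        ,col_rank
        ,(data_rank-col_rank) as diff
        ,row_number() over(distribute by group_id,(data_rank-col_rank) sort by times) as rank1
    from (
        select
            group_id
            ,times
            ,cnt
            ,row_number() over(distribute by group_id sort by times) as data_rank
            ,row_number() over(distribute by group_id sort by if(cnt is null,0,1),times) as col_rank
        from test
    ) t
) t1
order by group_id, times;
-- times  cnt   data_rank  col_rank  diff  rank1  lookup_times
-- 1      null  1          1          0    1      0   (no such row, so the join finds nothing)
-- 2      10    2          4         -2    1      2
-- 3      20    3          5         -2    2      3
-- 4      null  4          2          2    1      3
-- 5      null  5          3          2    2      3
-- 6      30    6          6          0    2      6
```

Both rows of the null run (times 4 and 5) resolve to times 3, the last row that carries a value. This depends on times increasing by exactly 1 from row to row; with irregular timestamps, times-rank1 would generally not point at the intended row.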
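In the result, every null cnt has been filled with the previous valid value (times 4 and 5 both get 20), and the first record, which has no earlier value, is set to 0.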
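As a point of comparison, and not the method used in this post, Hive's last_value window function accepts an optional boolean second argument that skips nulls, which expresses the same fill in one pass without the self-join. The sketch below assumes a Hive version where that argument is available; it does not rely on times being consecutive.

```sql
-- Alternative sketch: carry the last non-null cnt forward with last_value(expr, true).
select
    group_id
    ,times
    ,cnt as ori_cnt    -- original value
    ,nvl(
        last_value(cnt, true) over(
            partition by group_id
            order by times
            rows between unbounded preceding and current row
        )
        ,0
    ) as cnt           -- filled value; 0 when no earlier valid value exists
from test;
```

For the six test rows this should produce the same filled column as the join-based query: 0, 10, 20, 20, 20, 30.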