Hive SQL: filling consecutive or non-consecutive null values

2023-10-17 08:49:13

1. Background

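Crawlers and business pipelines often lose data while running: a single minute may go missing, or several dozen minutes in a row, with no pattern at all. If you want to fill each gap with the most recent valid value, the lag function alone cannot do it, because lag looks back a fixed number of rows and returns null again whenever the row it lands on is also null.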

2. Preparing the test data

create table test(
group_id string,
times bigint,
cnt bigint
)comment 'test data'
stored as textfile;



insert into test values
    ('a',1,null),
    ('a',2,10),
    ('a',3,20),
    ('a',4,null),
    ('a',5,null),
    ('a',6,30);
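
With these six rows loaded, a quick check shows where lag alone falls short. The query below is only an illustrative sketch (the column name cnt_lag1 is made up for this example): it falls back to the previous row's value, which repairs an isolated gap but not the second row of a two-row gap.

```sql
-- Sketch: fill nulls with the value of the immediately preceding row only.
select
     group_id
    ,times
    ,cnt as ori_cnt
    ,nvl(cnt, lag(cnt, 1) over(partition by group_id order by times)) as cnt_lag1
from test
order by times;
-- times | ori_cnt | cnt_lag1
--   1   |  NULL   |  NULL    (no earlier row)
--   2   |   10    |   10
--   3   |   20    |   20
--   4   |  NULL   |   20     (previous row is valid)
--   5   |  NULL   |  NULL    (previous row is itself null)
--   6   |   30    |   30
```

Row 5 stays null because lag(cnt, 1) only ever sees row 4, and increasing the offset does not help when the gap length is not known in advance.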

3. Implementation

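select
     t1.group_id
    ,t1.times
    ,t1.cnt        as ori_cnt  -- original value
    ,nvl(t2.cnt,0) as cnt      -- filled value; 0 when no earlier valid value exists
from (
    select
         group_id
        ,times
        ,cnt
        -- for a null row: its position inside its run of consecutive nulls
        ,row_number() over(distribute by group_id,(data_rank-col_rank) sort by times) as rank1
    from (
        select
             group_id
            ,times
            ,cnt
            -- position within the group, ordered by times
            ,row_number() over(distribute by group_id sort by times) as data_rank
            -- position when null rows are sorted ahead of non-null rows;
            -- data_rank-col_rank is the same for every row of one consecutive null run
            ,row_number() over(distribute by group_id sort by if(cnt is null,0,1),times) as col_rank
        from test
    ) t
) t1
left join test t2
    on  t1.group_id=t2.group_id
    -- a null row looks up the row just before its run (assumes times increases by 1 per row)
    and if(t1.cnt is null,(t1.times-t1.rank1),t1.times)=t2.times;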
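
To follow how the query above arrives at its result, it can help to print the helper columns for the six test rows. The sketch below reuses the same window expressions; the alias run_key is a hypothetical name introduced here for data_rank-col_rank and does not appear in the original query.

```sql
-- Diagnostic sketch: expose the intermediate columns used by the fill query.
select
     group_id
    ,times
    ,cnt
    ,data_rank
    ,col_rank
    ,data_rank-col_rank as run_key
    ,row_number() over(distribute by group_id,(data_rank-col_rank) sort by times) as rank1
from (
    select
         group_id
        ,times
        ,cnt
        ,row_number() over(distribute by group_id sort by times) as data_rank
        ,row_number() over(distribute by group_id sort by if(cnt is null,0,1),times) as col_rank
    from test
) t
order by times;
-- times | cnt  | data_rank | col_rank | run_key | rank1
--   1   | NULL |     1     |    1     |    0    |   1
--   2   |  10  |     2     |    4     |   -2    |   1
--   3   |  20  |     3     |    5     |   -2    |   2
--   4   | NULL |     4     |    2     |    2    |   1
--   5   | NULL |     5     |    3     |    2    |   2
--   6   |  30  |     6     |    6     |    0    |   2
```

For the null rows, times-rank1 gives the lookup key used in the join: 1-1=0 (no such row, so the value falls back to 0), 4-1=3 and 5-2=3 (both pick up cnt=20 from times=3). Rows 1 and 6 share run_key 0, but this is harmless because rank1 is only used for null rows. The subtraction only lands on the right row when times increases by exactly 1 from one row to the next within a group.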

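As the result shows, every null value has been filled in; the first record, which has no earlier valid value to copy, is set to 0.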
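
As a point of comparison, and not the method used in this article, Hive's last_value window function accepts an optional second boolean argument that skips nulls. Assuming a Hive version that supports that argument, the following sketch should give the same filled values for the test table without the self-join and without relying on times being consecutive.

```sql
-- Alternative sketch: carry the last non-null cnt forward with last_value(cnt, true).
select
     group_id
    ,times
    ,cnt as ori_cnt
    ,nvl(
        last_value(cnt, true) over(
            partition by group_id
            order by times
            rows between unbounded preceding and current row
        )
       ,0
     ) as cnt
from test
order by times;
-- times | ori_cnt | cnt
--   1   |  NULL   |  0
--   2   |   10    | 10
--   3   |   20    | 20
--   4   |  NULL   | 20
--   5   |  NULL   | 20
--   6   |   30    | 30
```

The explicit rows frame keeps the window ending at the current row, so a later valid value is never pulled backwards into an earlier gap.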
