clickhouse实现留存数和留存率计算

2022-04-25 09:08:13 浏览数 (1)

关于留存的计算是我们分享业务走向一种非常重要的方式,所谓留存用户,是用户在某一段时间内开始使用某个功能,到后面某一段时间又使用这个应用的用户,我们称这批用户叫留存用户。后面我们定义留存其实更多是基于两个事件来,比如某一天完成a事件,然后又完成b事件,这个我们叫b的留存,当然我们这个可以基于用户也可以基于设备。

下面我们来看clickhouse实现留存,关于clickhouse有个专用的留存函数,来处理留存用户数。

代码语言:javascript复制
语法

    retention(cond1, cond2, ..., cond32);

参数

    cond — 返回 UInt8结果(1或0)的表达式。

返回值

    数组为1或0。

        1 — 条件满足。

        0 — 条件不满足。
 

下面我们来看例子: 创建表:

代码语言:javascript复制
CREATE TABLE reg(date Date, uid Int32,channel Int32) ENGINE = Memory;



CREATE TABLE charge(date Date, uid Int32, money Int32) ENGINE = Memory;



# 插入数据:

INSERT INTO reg SELECT '2020-01-01', 1,10000;

INSERT INTO reg SELECT '2020-01-01', 2,10001;

INSERT INTO reg SELECT '2020-01-01', 3,10001;

INSERT INTO reg SELECT '2020-01-02', 4,10000;

INSERT INTO reg SELECT '2020-01-03', 5,10000;

INSERT INTO charge SELECT '2020-01-01', 1,10;

INSERT INTO charge SELECT '2020-01-01', 1,10;

INSERT INTO charge SELECT '2020-01-01', 2,200;

INSERT INTO charge SELECT '2020-01-01', 3,500;

INSERT INTO charge SELECT '2020-01-02', 4,10000;

INSERT INTO charge SELECT '2020-01-02', 2,300;

INSERT INTO charge SELECT '2020-01-02', 3,400;

INSERT INTO charge SELECT '2020-01-02', 3,400;

INSERT INTO charge SELECT '2020-01-03', 5,10000;

INSERT INTO charge SELECT '2020-01-03', 3,400;
 

假如查询2020-01-01日的0日,1日,2日注册到充值留存数,我们可以先看下uid的每个条件的实现结果:

代码语言:javascript复制
SELECT

    uid,

    retention((date = '2020-01-01') AND (type = 1), (date = '2020-01-01') AND (type = 2), (date = '2020-01-02') AND (type = 2), (date = '2020-01-03') AND (type = 2)) AS r

FROM

(

    SELECT

        uid,

        date,

        1 AS type

    FROM reg

    WHERE (date >= '2020-01-01') AND (date <= '2020-01-03')

    UNION ALL

    SELECT

        uid,

        date,

        2 AS type

    FROM charge

    WHERE (date >= '2020-01-01') AND (date <= '2020-01-03')

)

WHERE (date >= '2020-01-01') AND (date <= '2020-01-03')

GROUP BY uid

ORDER BY uid ASC



┌─uid─┬─r─────────┐

│   1 │ [1,1,0,0] │

│   2 │ [1,1,1,0] │

│   3 │ [1,1,1,1] │

│   4 │ [0,0,0,0] │

│   5 │ [0,0,0,0] │

└─────┴───────────┘
 

对一下上面我们的写入数据,符合我们的结果。 下面我们写个完整sql实现这个留存功能:

代码语言:javascript复制
SELECT

    sum(r[1]) AS a,

    sum(r[2]) AS b,

    sum(r[3]) AS c,

    sum(r[4]) AS d

FROM

(

    SELECT

        uid,

        retention((date = '2020-01-01') AND (type = 1), (date = '2020-01-01') AND (type = 2), (date = '2020-01-02') AND (type = 2), (date = '2020-01-03') AND (type = 2)) AS r

    FROM

    (

        SELECT

            uid,

            date,

            1 AS type

        FROM reg

        WHERE (date >= '2020-01-01') AND (date <= '2020-01-03')

        UNION ALL

        SELECT

            uid,

            date,

            2 AS type

        FROM charge

        WHERE (date >= '2020-01-01') AND (date <= '2020-01-03')

    )

    WHERE (date >= '2020-01-01') AND (date <= '2020-01-03')

    GROUP BY uid

    ORDER BY uid ASC

)



┌─a─┬─b─┬─c─┬─d─┐

│ 3 │ 3 │ 2 │ 1 │

└───┴───┴───┴───┘
 

上面分别是 a:2020-01-01 注册数 b:2020-01-01 注册并充值人数(新用户充值数) c:2020-01-02 注册并充值的1日留存数 d:2020-01-03 注册并充值的2日留存数

下面我们看下如何计算活跃,上面的例子需要多插入几条数据,我们需要计算3日活跃。

代码语言:javascript复制
INSERT INTO charge SELECT '2020-01-04', 3,400;

.
 

计算活跃sql:

代码语言:javascript复制
SELECT

    sum(r[1]) AS a,

    sum(r[2]) AS b,

    sum(r[3]) AS c,

    sum(r[4]) AS d

FROM

(

    SELECT

        uid,

        retention((date = '2020-01-01') AND (type = 1), (date = '2020-01-01') AND (type = 2), (date >= '2020-01-01') AND (date <= '2020-01-02') AND (type = 2), (date >= '2020-01-02') AND (date <= '2020-01-04') AND (type = 2)) AS r

    FROM

    (

        SELECT

            uid,

            date,

            1 AS type

        FROM reg

        WHERE (date >= '2020-01-01') AND (date <= '2020-01-03')

        UNION ALL

        SELECT

            uid,

            date,

            2 AS type

        FROM charge

        WHERE (date >= '2020-01-01') AND (date <= '2020-01-04')

    )

    WHERE (date >= '2020-01-01') AND (date <= '2020-01-04')

    GROUP BY uid

    ORDER BY uid ASC

)



┌─a─┬─b─┬─c─┬─d─┐

│ 3 │ 3 │ 3 │ 2 │

└───┴───┴───┴───┘
 

上面分别是 a:2020-01-01 注册数 b:2020-01-01 注册并充值人数活跃数 c:2020-01-02 注册并充值的1日活跃数 d:2020-01-03 注册并充值的2日活跃数

关于留存率或者活跃率我没有写例子,其实就是上面的数据b,c,d除以a,a是基准数据。

留存计算一般其实是针对于一天来计算 活跃是算一段时间的累计数

0 人点赞