一、题目
已知有用户登录记录表包含登录日期和登录用户ID,请查询出截止到当前日期累积登录用户数及登陆用户列表
样例数据
代码语言:javascript复制 ------------- ----------
| log_date | user_id |
------------- ----------
| 2024-01-01 | a |
| 2024-01-02 | a |
| 2024-01-02 | b |
| 2024-01-03 | b |
| 2024-01-04 | c |
| 2024-01-05 | b |
| 2024-01-05 | c |
| 2024-01-05 | d |
| 2024-01-05 | e |
------------- ----------
期望结果
代码语言:javascript复制 ------------- ----------- ------------------------
| log_date | user_cnt | user_list |
------------- ----------- ------------------------
| 2024-01-01 | 1 | ["a"] |
| 2024-01-02 | 2 | ["a","b"] |
| 2024-01-03 | 2 | ["a","b"] |
| 2024-01-04 | 3 | ["a","b","c"] |
| 2024-01-05 | 5 | ["a","b","c","d","e"] |
------------- ----------- ------------------------
二、分析
统计截止到当前行的登录用户数,考察的是聚合函数开窗函数;查询用户列表考察的数据对数据的聚合、数组去重、数组排序等操作。属于深度考察开窗函数、数组操作等知识内容。
维度 | 评分 |
---|---|
题目难度 | ⭐️⭐️⭐️⭐️ |
题目清晰度 | ⭐️⭐️⭐️⭐️⭐ |
业务常见度 | ⭐️⭐️⭐️⭐️ |
三、SQL
1.数据聚合
- 通过使用count(distinct )over(order by )的方式,聚合函数开窗,带有排序则统计到当前行的方式 ,完成对截止到当前行的数据统计;
- collect_set 也是对数据的聚合,所以也可以使用相同的方式完成截止到当前行的聚合;
执行SQL
代码语言:javascript复制select log_date,
user_id,
count(user_id) over (order by log_date asc) as user_cnt,
collect_list(user_id) over (order by log_date asc) as user_list
from t_user_login
查询结果
代码语言:javascript复制 ------------- ---------- ----------- ----------------------------------------
| log_date | user_id | user_cnt | user_list |
------------- ---------- ----------- ----------------------------------------
| 2024-01-01 | a | 1 | ["a"] |
| 2024-01-02 | b | 3 | ["a","b","a"] |
| 2024-01-02 | a | 3 | ["a","b","a"] |
| 2024-01-03 | b | 4 | ["a","b","a","b"] |
| 2024-01-04 | c | 5 | ["a","b","a","b","c"] |
| 2024-01-05 | e | 9 | ["a","b","a","b","c","e","d","c","b"] |
| 2024-01-05 | d | 9 | ["a","b","a","b","c","e","d","c","b"] |
| 2024-01-05 | c | 9 | ["a","b","a","b","c","e","d","c","b"] |
| 2024-01-05 | b | 9 | ["a","b","a","b","c","e","d","c","b"] |
------------- ---------- ----------- ----------------------------------------
注意
- 这里count()函数中特意没有使用去重函数,主要是为了方便大家观察第2行、第3行数据,可以看到user_cnt都是3。这里说明order by 之后 有相同的数据,则取较大值,并且不区分先后顺序;
- collect_list()函数进行聚合处理之后的结果,user_list也是一样的,包含到截止相同排序的最后一行数据。
2.数据去重聚合
这里我们加上去重,使用collect_set替换掉collect_list
执行SQL
代码语言:javascript复制select log_date,
user_id,
count(distinct user_id) over (order by log_date asc) as user_cnt,
collect_set(user_id) over (order by log_date asc) as user_list
from t_user_login
查询结果
代码语言:javascript复制 ------------- ---------- ----------- ------------------------
| log_date | user_id | user_cnt | user_list |
------------- ---------- ----------- ------------------------
| 2024-01-01 | a | 1 | ["a"] |
| 2024-01-02 | b | 2 | ["a","b"] |
| 2024-01-02 | a | 2 | ["a","b"] |
| 2024-01-03 | b | 2 | ["a","b"] |
| 2024-01-04 | c | 3 | ["a","b","c"] |
| 2024-01-05 | e | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | d | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | c | 5 | ["a","b","c","e","d"] |
| 2024-01-05 | b | 5 | ["a","b","c","e","d"] |
------------- ---------- ----------- ------------------------
3.对用户列表排序,然后去重得到最后结果
上面数据观察可知相同日期的结果相同我们只要对用户列表进行排序,然后使用group by 进行去重即可得到最终结果
执行SQL
代码语言:javascript复制select log_date,
user_cnt,
user_list
from (select log_date,
user_id,
count(distinct user_id) over (order by log_date asc) as user_cnt,
sort_array(collect_set(user_id) over (order by log_date asc)) as user_list
from t_user_login) t
group by log_date, user_cnt, user_list
查询结果
代码语言:javascript复制 ------------- ----------- ------------------------
| log_date | user_cnt | user_list |
------------- ----------- ------------------------
| 2024-01-01 | 1 | ["a"] |
| 2024-01-02 | 2 | ["a","b"] |
| 2024-01-03 | 2 | ["a","b"] |
| 2024-01-04 | 3 | ["a","b","c"] |
| 2024-01-05 | 5 | ["a","b","c","d","e"] |
------------- ----------- ------------------------
四、建表语句和数据插入
代码语言:javascript复制-- 建表语句
CREATE TABLE t_user_login
(
log_date STRING,
user_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
-- 数据插入语句
INSERT INTO t_user_login
(log_date, user_id) VALUES
('2024-01-01','a'),
('2024-01-02','a'),
('2024-01-02','b'),
('2024-01-03','b'),
('2024-01-04','c'),
('2024-01-05','b'),
('2024-01-05','c'),
('2024-01-05','d'),
('2024-01-05','e');