需求背景:
在理财 APP 中,素材、广告位、产品、策略有时候是多对多的关系。比如,在内容中台,一个素材可能关联理财、基金、存款某些产品,那我们统计该素材的好不好,转化率好不好,该归属于哪些业务?再进而计算某些业务的贡献,就可能需要用到数组。
还是不怎么看文档,因为文档的例子不够直观。
在https://community.cloud.databricks.com/ 上创建表的方法,可以参考文档,https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#examples
创表
代码语言:javascript复制CREATE TABLE student (name STRING, courses STRING)
插入数据
代码语言:javascript复制INSERT INTO student (name, courses) VALUES ('Alice', 'Math'), ('Alice', 'Science'), ('Bob', 'English'), ('Bob', 'History'), ('Bob', 'Art'), ('Charlie', 'Math'), ('Charlie', 'Art'), ('David', 'Science'), ('Emma', 'Math'), ('Emma', 'English'), ('Emma', 'Science');
数据如下:
name | courses |
---|---|
Alice | Math |
Alice | Science |
Bob | English |
Bob | History |
Bob | Art |
Charlie | Math |
Charlie | Art |
David | Science |
Emma | Math |
Emma | English |
Emma | Science |
需求1
现在需要将 student 的数据变成一个学生一行数据,分别将课程拼接成字符串和组成数组类型。代码如下:
代码语言:javascript复制select
name
,concat_ws(', ', collect_list(courses)) as courses
from
student
group by
name
name | courses |
---|---|
Charlie | Math, Art |
Bob | English, History, Art |
Alice | Math, Science |
Emma | Math, English, Science |
David | Science |
select
name,array_agg(courses) as courses
from
student
group by
name;
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
select name, collect_list(courses) as courses
from student
group by name
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
-- chat GPT 说这样也可以,但是我选择的版本不支持。
-- STRING_AGG 函数是 SQL:2016 标准中新增的函数,不是所有的数据库管理系统都支持该函数。
-- Spark 3.0 中,STRING_AGG 函数被引入作为 SQL:2016 标准的一部分。你可以使用 STRING_AGG 函数将每个分组的数据拼接成一个字符串。
select name, string_agg(courses, ',') as courses
from student
group by name;
踩坑1
其实我先是在 Excel 中自己弄成了 ,结果没有注意,courses2是字符串类型。而修改字段类型比较麻烦。表名是temp。
name | courses2 | courses |
---|---|---|
Alice | ['Math', 'Science'] | Math, Science |
Bob | ['English', 'History', 'Art'] | English, History, Art |
Charlie | ['Math', 'Art'] | Math, Art |
David | ['Science'] | Science |
Emma | ['Math', 'English', 'Science'] | Math, English,Science |
踩坑2
代码语言:javascript复制select
name
,array(collect_list(courses)) as courses
from
student
group by
name
name | courses |
---|---|
Charlie | [["Math","Art"]] |
Bob | [["English","History","Art"]] |
Alice | [["Math","Science"]] |
Emma | [["Math","English","Science"]] |
David | [["Science"]] |
需求2:
如果本身表就是踩坑1的,表名是 temp,对字符串类型的courses展开,变成一行数据是每一个同学和一个科目。
代码语言:javascript复制select
name
,explode(split(courses, ', ')) as course
from
temp
name | course |
---|---|
Alice | Math |
Alice | Science |
Bob | English |
Bob | History |
Bob | Art |
Charlie | Math |
Charlie | Art |
David | Science |
Emma | Math |
Emma | English,Science |
需求3:
如果对数组类型的courses展开(需求1中第二段代码),变成一行数据是每一个同学和一个科目。
代码语言:javascript复制select name,course
from
(
select
name
,array_agg(courses) as courses
from
student
group by
name
)
lateral view explode(courses) exploded_courses as course;
name | course |
---|---|
Charlie | Math |
Charlie | Art |
Bob | English |
Bob | History |
Bob | Art |
Alice | Math |
Alice | Science |
Emma | Math |
Emma | English |
Emma | Science |
David | Science |
需求4:
需要统计每门课有多少同学选修?数据是学生和课程组合起来的。
- courses 是字符串类型
select
course
,count(distinct name) as student_count
from
(
-- 踩坑1中的 temp 表,数据如需求2。
select
name,explode(split(courses, ', ')) as course
from
temp
) subquery
-- where course in ('math', 'english')
group by course;
course | student_count |
---|---|
Science | 2 |
Art | 2 |
Math | 3 |
English | 1 |
History | 1 |
English,Science | 1 |
- courses 是数组类型,再进行统计。
select course,count(distinct name) as student_count
from
(
select
name
,explode(courses) as course
from
(
select
name
,array_agg(courses) as courses
from
student
group by
name
)
) as temp
group by course;
course | student_count |
---|---|
Science | 3 |
Art | 2 |
Math | 3 |
English | 2 |
History | 1 |
需求5
直接在数组类型的courses,查询选修数据的同学的所有选修课程,结果中的选修课程是数组类型
代码语言:javascript复制-- 创建表的第二种形式,student_copy 是
create table student_copy as
select name, collect_list(courses) as courses
from student
group by name
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
select * from student_copy where array_contains(courses, 'math')
name | courses |
---|---|
Charlie | ["Math","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
需求6
直接在数组类型的courses,查询有哪些不同的课程?
代码语言:javascript复制select array_distinct(courses) from student_copy
-- array_distinct(x )→ x:去重:删除数组x中重复元素。 仅对一行的数组去重,不是对整个数组字段去重
-- https://docs.ucloud.cn/usql/common/func/array
array_distinct(courses) |
---|
["Math","Art"] |
["English","History","Art"] |
["Math","Science"] |
["Math","English","Science"] |
["Science"] |
逐步靠近答案。
代码语言:javascript复制select collect_set(course) as courses
from (
select explode(courses) as course
from student_copy
) subquery;
courses |
---|
["Math","History","Art","Science","English"] |
select concat_ws('n', collect_set(course)) as courses
from (
select explode(courses) as course
from student_copy
) subquery;
courses |
---|
Math History Art Science English |
select course
from (
select explode(courses) as course
from student_copy
) subquery
group by course;
course |
---|
Science |
Art |
Math |
English |
History |
需求7
一张表是同学上学期所学课程的数据,另外一种表是下学期的课程,来看下所有同学一个学期共学习了什么课程。
代码语言:javascript复制select
t1.name,array_append(t1.courses,t2.courses) as courses
from
student_copy as t1
left join
(SELECT * FROM student_copy WHERE array_contains(courses, 'Math')) as t2
on t1.name = t2.name
代码语言:javascript复制Error in SQL statement: AnalysisException: [DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPES] Cannot resolve "array_append(courses, courses)" due to data type mismatch: 错误在SQL语句:分析异常: [DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPESJ由于数据类型不匹配,无法解析array_append(课程、课程)
select
t1.name,array_append(t1.courses,t2.courses) as courses
from
student_copy as t1
left join
( SELECT name, courses FROM temp) as t2
on t1.name = t2.name
name | courses |
---|---|
Charlie | ["Math","Art","Math, Art"] |
Bob | ["English","History","Art","English, History, Art"] |
Alice | ["Math","Science","Math, Science"] |
Emma | ["Math","English","Science","Math, English,Science"] |
David | ["Science","Science"] |
但是没有去重,发现 "Math" ,"Art" ,"Math, Art" 不同。
代码语言:javascript复制select
a.name,array_distinct(a.courses) as courses
from
(
select
t1.name,array_append(t1.courses,t2.courses) as courses
from
student_copy as t1
left join
( SELECT name, courses FROM temp) as t2
on t1.name = t2.name
) as a
name | array_distinct(courses) |
---|---|
Charlie | ["Math","Art","Math, Art"] |
Bob | ["English","History","Art","English, History, Art"] |
Alice | ["Math","Science","Math, Science"] |
Emma | ["Math","English","Science","Math, English,Science"] |
David | ["Science"] |
需求8:
如何排序呢?
代码语言:javascript复制select
t1.name,array_sort(t1.courses) as courses
from
(
select
name,array_agg(courses) as courses
from
students
group by
name
) as t1
t1的数据是:
name | courses |
---|---|
Charlie | ["Math","Art"] |
Bob | ["English","History","Art"] |
Alice | ["Math","Science"] |
Emma | ["Math","English","Science"] |
David | ["Science"] |
查出来的数据:
name | courses |
---|---|
Charlie | ["Art","Math"] |
Bob | ["Art","English","History"] |
Alice | ["Math","Science"] |
Emma | ["English","Math","Science"] |
David | ["Science"] |
总结:
基本总结了数组的合并展开,分组统计。