Spark SQL 中的array类的函数例子

2023-05-17 22:00:21 浏览数 (1)

需求背景:

在理财 APP 中,素材、广告位、产品、策略有时候是多对多的关系。比如,在内容中台,一个素材可能关联理财、基金、存款某些产品,那我们统计该素材的好不好,转化率好不好,该归属于哪些业务?再进而计算某些业务的贡献,就可能需要用到数组。

还是不怎么看文档,因为文档的例子不够直观。

在https://community.cloud.databricks.com/ 上创建表的方法,可以参考文档,https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#examples

创表

代码语言:javascript复制
CREATE TABLE student (name STRING, courses STRING)

插入数据

代码语言:javascript复制
INSERT INTO student (name, courses) VALUES ('Alice', 'Math'), ('Alice', 'Science'), ('Bob', 'English'), ('Bob', 'History'), ('Bob', 'Art'), ('Charlie', 'Math'), ('Charlie', 'Art'), ('David', 'Science'), ('Emma', 'Math'), ('Emma', 'English'), ('Emma', 'Science');

数据如下:

name

courses

Alice

Math

Alice

Science

Bob

English

Bob

History

Bob

Art

Charlie

Math

Charlie

Art

David

Science

Emma

Math

Emma

English

Emma

Science

需求1

现在需要将 student 的数据变成一个学生一行数据,分别将课程拼接成字符串和组成数组类型。代码如下:

代码语言:javascript复制
select 
    name
   ,concat_ws(', ', collect_list(courses)) as courses
from 
    student
group by 
    name

name

courses

Charlie

Math, Art

Bob

English, History, Art

Alice

Math, Science

Emma

Math, English, Science

David

Science

代码语言:javascript复制
select 
   name,array_agg(courses) as courses
from 
   student
group by 
    name;

name

courses

Charlie

["Math","Art"]

Bob

["English","History","Art"]

Alice

["Math","Science"]

Emma

["Math","English","Science"]

David

["Science"]

代码语言:javascript复制
select name, collect_list(courses) as courses
from student
group by name

name

courses

Charlie

["Math","Art"]

Bob

["English","History","Art"]

Alice

["Math","Science"]

Emma

["Math","English","Science"]

David

["Science"]

代码语言:javascript复制
-- chat GPT 说这样也可以,但是我选择的版本不支持。

-- STRING_AGG 函数是 SQL:2016 标准中新增的函数,不是所有的数据库管理系统都支持该函数。
-- Spark 3.0 中,STRING_AGG 函数被引入作为 SQL:2016 标准的一部分。你可以使用 STRING_AGG 函数将每个分组的数据拼接成一个字符串。
select name, string_agg(courses, ',') as courses
from student
group by name;

踩坑1

其实我先是在 Excel 中自己弄成了 ,结果没有注意,courses2是字符串类型。而修改字段类型比较麻烦。表名是temp。

name

courses2

courses

Alice

['Math', 'Science']

Math, Science

Bob

['English', 'History', 'Art']

English, History, Art

Charlie

['Math', 'Art']

Math, Art

David

['Science']

Science

Emma

['Math', 'English', 'Science']

Math, English,Science

踩坑2
代码语言:javascript复制
select 
   name
  ,array(collect_list(courses)) as courses
from 
  student
group by 
   name

name

courses

Charlie

[["Math","Art"]]

Bob

[["English","History","Art"]]

Alice

[["Math","Science"]]

Emma

[["Math","English","Science"]]

David

[["Science"]]

需求2:

如果本身表就是踩坑1的,表名是 temp,对字符串类型的courses展开,变成一行数据是每一个同学和一个科目。

代码语言:javascript复制
select 
     name
    ,explode(split(courses, ', ')) as course
from 
    temp

name

course

Alice

Math

Alice

Science

Bob

English

Bob

History

Bob

Art

Charlie

Math

Charlie

Art

David

Science

Emma

Math

Emma

English,Science

需求3:

如果对数组类型的courses展开(需求1中第二段代码),变成一行数据是每一个同学和一个科目。

代码语言:javascript复制
select name,course
from 
(
  select 
      name
     ,array_agg(courses) as courses
  from 
     student
  group by 
     name
 ) 
lateral view explode(courses) exploded_courses as course;

name

course

Charlie

Math

Charlie

Art

Bob

English

Bob

History

Bob

Art

Alice

Math

Alice

Science

Emma

Math

Emma

English

Emma

Science

David

Science

需求4:

需要统计每门课有多少同学选修?数据是学生和课程组合起来的。

  • courses 是字符串类型
代码语言:javascript复制
select
    course
   ,count(distinct name) as student_count
from 
(
  -- 踩坑1中的 temp 表,数据如需求2。
  select 
    name,explode(split(courses, ', ')) as course
  from 
    temp
) subquery
-- where course in ('math', 'english')
group by course;

course

student_count

Science

2

Art

2

Math

3

English

1

History

1

English,Science

1

  • courses 是数组类型,再进行统计。
代码语言:javascript复制
select course,count(distinct name) as student_count
from 
(
  select 
     name
    ,explode(courses) as course
  from 
  (
    select 
        name
       ,array_agg(courses) as courses
    from 
       student
    group by 
        name
  )
) as temp
group by course;

course

student_count

Science

3

Art

2

Math

3

English

2

History

1

需求5

直接在数组类型的courses,查询选修数据的同学的所有选修课程,结果中的选修课程是数组类型

代码语言:javascript复制
-- 创建表的第二种形式,student_copy 是
create table student_copy as 
select name, collect_list(courses) as courses
from student
group by name

name

courses

Charlie

["Math","Art"]

Bob

["English","History","Art"]

Alice

["Math","Science"]

Emma

["Math","English","Science"]

David

["Science"]

代码语言:javascript复制
select * from student_copy where array_contains(courses, 'math')

name

courses

Charlie

["Math","Art"]

Alice

["Math","Science"]

Emma

["Math","English","Science"]

需求6

直接在数组类型的courses,查询有哪些不同的课程?

代码语言:javascript复制
select array_distinct(courses) from student_copy 
-- array_distinct(x )→ x:去重:删除数组x中重复元素。 仅对一行的数组去重,不是对整个数组字段去重
-- https://docs.ucloud.cn/usql/common/func/array 

array_distinct(courses)

["Math","Art"]

["English","History","Art"]

["Math","Science"]

["Math","English","Science"]

["Science"]

逐步靠近答案。

代码语言:javascript复制
select collect_set(course) as courses
from (
  select explode(courses) as course
  from student_copy
) subquery;

courses

["Math","History","Art","Science","English"]

代码语言:javascript复制
select concat_ws('n', collect_set(course)) as courses
from (
  select explode(courses) as course
  from student_copy
) subquery;

courses

Math History Art Science English

代码语言:javascript复制
select course
from (
  select explode(courses) as course
  from student_copy
) subquery
group by course;

course

Science

Art

Math

English

History

需求7

一张表是同学上学期所学课程的数据,另外一种表是下学期的课程,来看下所有同学一个学期共学习了什么课程。

代码语言:javascript复制
select 
  t1.name,array_append(t1.courses,t2.courses) as courses
from 
   student_copy as t1 
 left join 
   (SELECT * FROM student_copy WHERE array_contains(courses, 'Math')) as t2 
 on t1.name = t2.name

Error in SQL statement: AnalysisException: [DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPES] Cannot resolve "array_append(courses, courses)" due to data type mismatch: 错误在SQL语句:分析异常: [DATATYPE_MISMATCH.ARRAY_FUNCTION_DIFF_TYPESJ由于数据类型不匹配,无法解析array_append(课程、课程)

代码语言:javascript复制
select 
  t1.name,array_append(t1.courses,t2.courses) as courses
from 
   student_copy as t1 
 left join 
   ( SELECT name, courses FROM temp) as t2 
 on t1.name = t2.name

name

courses

Charlie

["Math","Art","Math, Art"]

Bob

["English","History","Art","English, History, Art"]

Alice

["Math","Science","Math, Science"]

Emma

["Math","English","Science","Math, English,Science"]

David

["Science","Science"]

但是没有去重,发现 "Math" ,"Art" ,"Math, Art" 不同。

代码语言:javascript复制
select 
 a.name,array_distinct(a.courses) as courses
from 
(
  select 
    t1.name,array_append(t1.courses,t2.courses) as courses
  from 
     student_copy as t1 
   left join 
     ( SELECT name, courses FROM temp) as t2 
   on t1.name = t2.name
) as a

name

array_distinct(courses)

Charlie

["Math","Art","Math, Art"]

Bob

["English","History","Art","English, History, Art"]

Alice

["Math","Science","Math, Science"]

Emma

["Math","English","Science","Math, English,Science"]

David

["Science"]

需求8:

如何排序呢?

代码语言:javascript复制
select 
   t1.name,array_sort(t1.courses) as courses
from
(
  select 
   name,array_agg(courses) as courses
  from 
     students
  group by 
      name
) as t1 

t1的数据是:

name

courses

Charlie

["Math","Art"]

Bob

["English","History","Art"]

Alice

["Math","Science"]

Emma

["Math","English","Science"]

David

["Science"]

查出来的数据:

name

courses

Charlie

["Art","Math"]

Bob

["Art","English","History"]

Alice

["Math","Science"]

Emma

["English","Math","Science"]

David

["Science"]

总结:

基本总结了数组的合并展开,分组统计。

sql

0 人点赞