(五)进阶技术 5. 快照 前面实验说明了处理维度的扩展。本篇讨论两种事实表的扩展技术。 有些用户,尤其是管理者,经常会要看某个特定时间点的数据。也就是说,他们需要数据的快照。周期快照和累积快照是两种处理事实表扩展的技术。 周期快照是在一个给定的时间对事实表进行一段时期的总计。例如,一个月销售订单周期快照是每个月底时总的销售订单金额。 累积快照用于跟踪事实表的变化。例如,数据仓库可能需要累积(存储)销售订单从下订单的时间开始,到订单中的商品被出库、运输和到达的各阶段的时间点数据来跟踪订单生命周期的进展情况。用户可能要取得在某个给定时间点,销售订单处理状态的累积快照。 下面说明周期快照和累积快照的细节问题。 周期快照 本节以销售订单的月底汇总为例说明如何实现一个周期快照。 首先需要添加一个新的事实表。图(五)- 5-1中的模式显示了一个名为month_end_sales_order_fact的新事实表。该表中有两个度量值,month_order_amount和month_order_quantity,这两个值是不能加到sales_order_fact表中的。不能加到sales_order_fact表中的原因是,sales_order_fact表和新的度量值有不同的时间属性(数据的粒度不同)。sales_order_fact表包含的是每天一条记录。新的度量值要的是每月的数据。使用清单(五)- 5-1里的脚本建立month_end_sales_order_fact表
图(五)- 5-1
代码语言:javascript复制USE dw;
CREATE TABLE month_end_sales_order_fact (
order_month_sk INT,
product_sk INT,
month_order_amount DEC(10 , 2 ),
month_order_quantity INT,
foreign key (order_month_sk)
references month_dim (month_sk)
on delete cascade on update cascade,
foreign key (product_sk)
references product_dim (product_sk)
on delete cascade on update cascade
);
清单(五)- 5-1
建立了month_end_sales_order_fact表后,现在需要向表中装载数据。月底销售订单事实表的数据源是已有的销售订单事实表。清单(五)- 5-2里的脚本装载月底销售订单事实表。每个月第一天,在每天销售订单定期装载执行完后,执行此脚本,装载上个月的销售订单数据。
代码语言:javascript复制USE dw;
SET @pre_date = SUBDATE(CURRENT_DATE,1) ;
INSERT INTO month_end_sales_order_fact
SELECT
b.month_sk
, a.product_sk
, SUM(order_amount)
, SUM(order_quantity)
FROM
sales_order_fact a
, month_dim b
, order_date_dim d
WHERE
a.order_date_sk = d.order_date_sk
AND b.month = d.month
AND b.year = d.year
AND b.month = MONTH(@pre_date)
AND b.year = YEAR(@pre_date)
GROUP BY b.month_sk, a.product_sk ;
COMMIT ;
清单(五)- 5-2
使用Kettle转换装载月底销售订单事实表的步骤如图(五)- 5-2到(五)- 5-5所示。
图(五)- 5-2
图(五)- 5-3
图(五)- 5-4
图(五)- 5-5
执行清单(五)- 5-2里的脚本或相应的Kettle转换之前,设置系统日期为2015年3月1日(装载2015年2月最后一天的数据库),因为该脚本只能在每月1日执行。 执行完清单(五)- 5-2里的脚本或相应的Kettle转换后,使用下面的命令查询month_end_sales_order_fact表,表中只有两条记录,都来自于2015年2月的销售订单,显示如下: mysql> select * from month_end_sales_order_fact; ---------------- ------------ -------------------- ---------------------- | order_month_sk | product_sk | month_order_amount | month_order_quantity | ---------------- ------------ -------------------- ---------------------- | 182 | 2 | 4000.00 | NULL | | 182 | 3 | 4000.00 | NULL | ---------------- ------------ -------------------- ---------------------- 2 rows in set (0.00 sec) mysql> select * from month_dim where month_sk = 182; ---------- ------------ ------- --------- ------ ---------------- ------------- | month_sk | month_name | month | quarter | year | effective_date | expiry_date | ---------- ------------ ------- --------- ------ ---------------- ------------- | 182 | February | 2 | 1 | 2015 | 0000-00-00 | 9999-12-31 | ---------- ------------ ------- --------- ------ ---------------- ------------- 1 row in set (0.00 sec) 累积快照 本节说明如何在销售订单上实现累积快照。 该累加快照跟踪五个销售订单的里程碑:下订单、分配库房、出库、配送和收货。这五个里程碑的日期及其各自的数量来自源数据库的销售订单表。一个订单完整的生命周期由五行描述:下订单的时间一行,订单商品被分配到库房的时间一行,产品出库的时间一行,订单配送的时间一行,订单客户收货的时间一行。每个里程碑各有一个状态:N为新订单,A为已分配库房,P为已出库,S为已配送,R为已收货。sales_order表的结构必须做相应的改变,以处理五种不同的状态。执行清单(五)- 5-3里的脚本修改数据库模式。对源数据库的修改如下:把order_date列改为status_date,添加了名为order_status的列,并把order_quantity列改为quantity。正如名字所表示的,order_status列用于存储N,A,P,S或R之一。它描述了status_date列对应的状态值。如果一条记录的状态为N,则status_date列是下订单的日期。如果状态是R,status_date列是收货日期。对数据仓库的修改如下:给现有的sales_order_fact表添加四个数量和四个日期代理键,要加的新列是allocate_date_sk、allocate_quantity、packing_date_sk、packing_quantity、ship_date_sk、ship_quantity、receive_date_sk、receive_quantity。还要在日期维度上使用数据库视图角色扮演生成四个新的日期代理键。
代码语言:javascript复制USE source;
-- 修改销售订单事务表
ALTER TABLE sales_order
CHANGE order_date status_date DATE
, ADD order_status VARCHAR(1) AFTER status_date
, CHANGE order_quantity quantity INT;
-- 删除sales_order表的主键
alter table sales_order change order_number order_number int(10) not null;
alter table sales_order drop primary key;
USE dw;
-- 事实表增加八列
ALTER TABLE sales_order_fact
ADD allocate_date_sk INT AFTER order_date_sk
, ADD allocate_quantity INT
, ADD packing_date_sk INT AFTER allocate_date_sk
, ADD packing_quantity INT
, ADD ship_date_sk INT AFTER packing_date_sk
, ADD ship_quantity INT
, ADD receive_date_sk INT AFTER ship_date_sk
, ADD receive_quantity INT;
-- 建立四个日期维度视图
CREATE VIEW allocate_date_dim (allocate_date_sk , allocate_date , month_name , month , quarter , year , promo_ind , effective_date , expiry_date) AS
SELECT
date_sk,
date,
month_name,
month,
quarter,
year,
promo_ind,
effective_date,
expiry_date
FROM
date_dim;
CREATE VIEW packing_date_dim (packing_date_sk , packing_date , month_name , month , quarter , year , promo_ind , effective_date , expiry_date) AS
SELECT
date_sk,
date,
month_name,
month,
quarter,
year,
promo_ind,
effective_date,
expiry_date
FROM
date_dim;
CREATE VIEW ship_date_dim (ship_date_sk , ship_date , month_name , month , quarter , year , promo_ind , effective_date , expiry_date) AS
SELECT
date_sk,
date,
month_name,
month,
quarter,
year,
promo_ind,
effective_date,
expiry_date
FROM
date_dim;
CREATE VIEW receive_date_dim (receive_date_sk , receive_date , month_name , month , quarter , year , promo_ind , effective_date , expiry_date) AS
SELECT
date_sk,
date,
month_name,
month,
quarter,
year,
promo_ind,
effective_date,
expiry_date
FROM
date_dim;
清单(五)- 5-3
修改后的数据仓库模式如图(五)- 5-6所示。
图(五)- 5-6
因为事实表的结构已经改变了,所以需要修改定期装载脚本。清单清单(五)- 5-4里是新的定期装载脚本。处理五个里程碑状态和日期的五条语句加了注释。同一个日期可能有销售订单的多个事务被记录,在这种情况下对应的里程碑日期同时修改。图(五)- 5-7到图(五)- 5-24显示了相应的Kettle转换所修改的步骤(“装载订单维度”、“装载事实表”转换)。
代码语言:javascript复制USE dw;
-- 设置SCD的截止时间和生效时间
SET @pre_date = SUBDATE(CURRENT_DATE,1) ;
-- 设置CDC的上限时间
UPDATE cdc_time SET current_load = CURRENT_DATE ;
-- 装载客户维度
TRUNCATE TABLE customer_stg;
INSERT INTO customer_stg
SELECT
customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
FROM source.customer ;
/* 在所有地址列上 SCD2 */
/* 置过期 */
UPDATE customer_dim a,
customer_stg b
SET
expiry_date = @pre_date
WHERE
a.customer_number = b.customer_number
AND (a.customer_street_address <> b.customer_street_address
OR a.customer_city <> b.customer_city
OR a.customer_zip_code <> b.customer_zip_code
OR a.customer_state <> b.customer_state
OR a.shipping_address <> b.shipping_address
OR a.shipping_city <> b.shipping_city
OR a.shipping_zip_code <> b.shipping_zip_code
OR a.shipping_state <> b.shipping_state
OR a.shipping_address IS NULL
OR a.shipping_city IS NULL
OR a.shipping_zip_code IS NULL
OR a.shipping_state IS NULL)
AND expiry_date = '2200-01-01';
/* 加新行 */
INSERT INTO customer_dim
SELECT
NULL
, b.customer_number
, b.customer_name
, b.customer_street_address
, b.customer_zip_code
, b.customer_city
, b.customer_state
, b.shipping_address
, b.shipping_zip_code
, b.shipping_city
, b.shipping_state
, a.version 1
, @pre_date
, '2200-01-01'
FROM
customer_dim a
, customer_stg b
WHERE
a.customer_number = b.customer_number
AND ( a.customer_street_address <> b.customer_street_address
OR a.customer_city <> b.customer_city
OR a.customer_zip_code <> b.customer_zip_code
OR a.customer_state <> b.customer_state
OR a.shipping_address <> b.shipping_address
OR a.shipping_city <> b.shipping_city
OR a.shipping_zip_code <> b.shipping_zip_code
OR a.shipping_state <> b.shipping_state
OR a.shipping_address IS NULL
OR a.shipping_city IS NULL
OR a.shipping_zip_code IS NULL
OR a.shipping_state IS NULL)
AND EXISTS(
SELECT *
FROM customer_dim x
WHERE
b.customer_number=x.customer_number
AND a.expiry_date = @pre_date )
AND NOT EXISTS (
SELECT *
FROM customer_dim y
WHERE
b.customer_number = y.customer_number
AND y.expiry_date = '2200-01-01') ;
/* 在 customer_name 列上 SCD1 */
UPDATE customer_dim a, customer_stg b
SET a.customer_name = b.customer_name
WHERE a.customer_number = b.customer_number
AND a.customer_name <> b.customer_name ;
/* 新增的客户 */
INSERT INTO customer_dim
SELECT
NULL
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, 1
, @pre_date
,'2200-01-01'
FROM customer_stg
WHERE customer_number NOT IN(
SELECT y.customer_number
FROM customer_dim x, customer_stg y
WHERE x.customer_number = y.customer_number) ;
/* 重建PA客户维度 */
TRUNCATE pa_customer_dim;
INSERT INTO pa_customer_dim
SELECT
customer_sk
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, version
, effective_date
, expiry_date
FROM customer_dim
WHERE customer_state = 'PA' ;
/* 装载产品维度 */
TRUNCATE TABLE product_stg ;
INSERT INTO product_stg
SELECT
product_code
, product_name
, product_category
FROM source.product ;
/* 在 product_name 和 product_category 列上 SCD2 */
/* 置过期 */
UPDATE
product_dim a
, product_stg b
SET
expiry_date = @pre_date
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND expiry_date = '2200-01-01';
/* 加新行 */
INSERT INTO product_dim
SELECT
NULL
, b.product_code
, b.product_name
, b.product_category
, a.version 1
, @pre_date
,'2200-01-01'
FROM
product_dim a
, product_stg b
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND EXISTS(
SELECT *
FROM product_dim x
WHERE b.product_code = x.product_code
AND a.expiry_date = @pre_date)
AND NOT EXISTS (
SELECT *
FROM product_dim y
WHERE b.product_code = y.product_code
AND y.expiry_date = '2200-01-01') ;
/* 新增的产品 */
INSERT INTO product_dim
SELECT
NULL
, product_code
, product_name
, product_category
, 1
, @pre_date
, '2200-01-01'
FROM product_stg
WHERE product_code NOT IN(
SELECT y.product_code
FROM product_dim x, product_stg y
WHERE x.product_code = y.product_code) ;
-- 装载订单维度,新增前一天的订单号
INSERT INTO order_dim (
order_number
, effective_date
, expiry_date)
SELECT
order_number
, status_date
, '2200-01-01'
FROM source.sales_order, cdc_time
WHERE order_status = 'N' AND
entry_date >= last_load AND entry_date < current_load ;
-- 装载事实表,新增前一天的订单
INSERT INTO sales_order_fact
SELECT
order_sk
, customer_sk
, product_sk
, e.order_date_sk
, NULL
, NULL
, NULL
, NULL
, f.request_delivery_date_sk
, order_amount
, quantity
, NULL
, NULL
, NULL
, NULL
FROM
source.sales_order a
, order_dim b
, customer_dim c
, product_dim d
, order_date_dim e
, request_delivery_date_dim f
, cdc_time g
WHERE
a.order_status = 'N'
AND a.order_number = b.order_number
AND a.customer_number = c.customer_number
AND a.status_date >= c.effective_date
AND a.status_date < c.expiry_date
AND a.product_code = d.product_code
AND a.status_date >= d.effective_date
AND a.status_date < d.expiry_date
AND a.status_date = e.order_date
AND a.request_delivery_date = f.request_delivery_date
AND a.entry_date >= g.last_load AND a.entry_date < g.current_load ;
/* UPDATING the new sales order to Allocated status */
UPDATE sales_order_fact a,
source.sales_order b,
allocate_date_dim c,
order_dim g, cdc_time h
SET
a.allocate_date_sk = c.allocate_date_sk,
a.allocate_quantity = b.quantity
WHERE
order_status = 'A'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = g.order_number
AND a.order_sk = g.order_sk
AND c.allocate_date = b.status_date ;
/* UPDATING the allocated order to Packed status */
UPDATE sales_order_fact a,
source.sales_order b,
packing_date_dim d,
order_dim g, cdc_time h
SET
a.packing_date_sk = d.packing_date_sk,
a.packing_quantity = b.quantity
WHERE
order_status = 'P'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = g.order_number
AND a.order_sk = g.order_sk
AND d.packing_date = b.status_date ;
/* UPDATING the packed order to Shipped status */
UPDATE sales_order_fact a,
source.sales_order b,
ship_date_dim e,
order_dim g, cdc_time h
SET
a.ship_date_sk = e.ship_date_sk,
a.ship_quantity = b.quantity
WHERE
order_status = 'S'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = g.order_number
AND a.order_sk = g.order_sk
AND e.ship_date = b.status_date ;
/* UPDATING the shipped order to Received status */
UPDATE sales_order_fact a,
source.sales_order b,
receive_date_dim f,
order_dim g, cdc_time h
SET
a.receive_date_sk = f.receive_date_sk,
a.receive_quantity = b.quantity
WHERE
order_status = 'R'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = g.order_number
AND a.order_sk = g.order_sk
AND f.receive_date = b.status_date ;
-- 更新时间戳表的last_load字段
UPDATE cdc_time SET last_load = current_load ;
COMMIT ;
清单(五)- 5-4
图(五)- 5-7
图(五)- 5-8
图(五)- 5-9
图(五)- 5-10
图(五)- 5-11
图(五)- 5-12
图(五)- 5-13
图(五)- 5-14
图(五)- 5-15
图(五)- 5-16
图(五)- 5-17
图(五)- 5-18
图(五)- 5-19
图(五)- 5-20
图(五)- 5-21
图(五)- 5-22
图(五)- 5-23
图(五)- 5-24
测试步骤 在执行清单(五)- 5-3里的定期装载脚本或相应的Kettle作业之前,需要准备一些数据。本示例用六步来跟踪两个销售订单的生命周期: 1. 使用下面的命令新增两个销售订单 USE source; INSERT INTO sales_order VALUES (50, 1, 1, '2015-03-05', 'N', '2015-03-10', '2015-03-05', 7500, 75) , (51, 2, 2, '2015-03-05', 'N', '2015-03-10', '2015-03-05', 1000, 10) ; COMMIT ; 2. 设置系统日期为2015年3月6日,执行定期装载脚本或Kettle作业并确认数据正确装载。可以使用下面的语句查询sales_order_fact表里的两个销售订单,确认定期装载成功。 mysql> select -> order_number, -> a.order_date_sk, -> allocate_date_sk, -> packing_date_sk, -> ship_date_sk, -> receive_date_sk -> from -> sales_order_fact a, -> order_dim b, -> order_date_dim c -> where -> order_number IN (50 , 51) -> and a.order_sk = b.order_sk -> and a.order_date_sk = c.order_date_sk; -------------- --------------- ------------------ ----------------- -------------- ----------------- | order_number | order_date_sk | allocate_date_sk | packing_date_sk | ship_date_sk | receive_date_sk | -------------- --------------- ------------------ ----------------- -------------- ----------------- | 50 | 5543 | NULL | NULL | NULL | NULL | | 51 | 5543 | NULL | NULL | NULL | NULL | -------------- --------------- ------------------ ----------------- -------------- ----------------- 2 rows in set (0.00 sec) 注意 只有order_date_sk列有值,其它日期都是NULL,因为这两个订单是新增的,并且还没有分配库房、出库、配送或收货。 3. 使用下面的命令添加销售订单作为这两个订单的分配库房和/或出库的里程碑 USE source; INSERT INTO sales_order VALUES (50, 1, 1, '2015-03-06', 'A', '2015-03-10', '2015-03-06', 7500, 75) , (50, 1, 1, '2015-03-06', 'P', '2015-03-10', '2015-03-06', 7500, 75) , (51, 2, 2, '2015-03-06', 'A', '2015-03-10', '2015-03-06', 1000, 10) ; COMMIT ; 4. 设置系统日期为2015年3月7日,执行定期装载脚本或Kettle作业并确认数据正确装载。使用下面的SQL语句查询sales_order_fact表里的两个销售订单,确认正确装载了数据。 mysql> select -> order_number, -> a.order_date_sk, -> allocate_date_sk, -> packing_date_sk, -> ship_date_sk, -> receive_date_sk -> from -> sales_order_fact a, -> order_dim b, -> order_date_dim c -> where -> order_number IN (50 , 51) -> and a.order_sk = b.order_sk -> and a.order_date_sk = c.order_date_sk; -------------- --------------- ------------------ ----------------- -------------- ----------------- | order_number | order_date_sk | allocate_date_sk | packing_date_sk | ship_date_sk | receive_date_sk | -------------- --------------- ------------------ ----------------- -------------- ----------------- | 50 | 5543 | 5544 | 5544 | NULL | NULL | | 51 | 5543 | 5544 | NULL | NULL | NULL | -------------- --------------- ------------------ ----------------- -------------- ----------------- 2 rows in set (0.00 sec) 注意 第一个订单具有了allocate_date_sk和packing_date_sk,第二个只具有allocate_date_sk。 5. 使用下面的命令添加销售订单作为这两个订单后面的里程碑:出库、配送和/或收货。注意四个日期可能相同。 USE source; INSERT INTO sales_order VALUES (50, 1, 1, '2015-03-07', 'S', '2015-03-10', '2015-03-07', 7500, 75) , (50, 1, 1, '2015-03-07', 'R', '2015-03-10', '2015-03-07', 7500, 75) , (51, 2, 2, '2015-03-07', 'P', '2015-03-10', '2015-03-07', 1000, 10) ; COMMIT ; 6. 设置系统日期为2015年3月8日,执行定期装载脚本或Kettle作业并确认数据正确装载。使用下面的SQL语句查询sales_order_fact表里的两个销售订单,确认正确装载了数据。 mysql> select -> order_number, -> a.order_date_sk, -> allocate_date_sk, -> packing_date_sk, -> ship_date_sk, -> receive_date_sk -> from -> sales_order_fact a, -> order_dim b, -> order_date_dim c -> where -> order_number IN (50 , 51) -> and a.order_sk = b.order_sk -> and a.order_date_sk = c.order_date_sk; -------------- --------------- ------------------ ----------------- -------------- ----------------- | order_number | order_date_sk | allocate_date_sk | packing_date_sk | ship_date_sk | receive_date_sk | -------------- --------------- ------------------ ----------------- -------------- ----------------- | 50 | 5543 | 5544 | 5544 | 5545 | 5545 | | 51 | 5543 | 5544 | 5545 | NULL | NULL | -------------- --------------- ------------------ ----------------- -------------- ----------------- 2 rows in set (0.00 sec) 注意 第一个订单号为50的订单,具有了全部日期代理键,这意味着订单已完成(客户已经收货)。第二个订单已经出库,但是还没有配送。