(五)进阶技术 13. 迟到的事实 装载日期在生效日期后的事实就是迟到的事实。晚于订单日期进入源数据的销售订单可以看做是一个迟到事实的例子。销售订单被装载进其事实表时,装载的日期晚于销售订单的订单日期,因此是一个迟到的事实。(因为定期装载的是前一天的数据,所以这里的晚于指的是晚2天及其以上。) 迟到事实影响周期快照事实表的装载,如(五)进阶技术5. “快照”中讨论的month_end_sales_order_fact表。比方说,2015年3月的销售订单金额月底快照已经计算并存储在month_end_sales_order_fact表中,这时一个迟到的订单在3月10日被装载,那么2015年3月的快照金额必须因迟到事实而重新计算。 处理迟到事实 本节说明当导入month_end_sales_order_fact表时如何处理迟到的销售订单。 为了知道一个销售订单是否是迟到的,需要把销售订单数据源的登记日期装载进sales_order_fact表。由于现在还没有登记日期列,你需要在事实表上添加此列。使用维度角色扮演技术添加登记日期。因此,在销售订单事实表里添加名为entry_date_sk的日期代理键列,并且从日期维度表创建一个叫做entry_date_dim的数据库视图。清单(五)-13-1里的脚本创建entry_date_dim视图和销售订单事实表里的entry_date_sk代理键列。
代码语言:javascript复制USE dw;
CREATE VIEW entry_date_dim (entry_date_sk , entry_date , month_name , month , quarter , year , promo_ind , effective_date , expiry_date) AS
SELECT
date_sk,
date,
month_name,
month,
quarter,
year,
promo_ind,
effective_date,
expiry_date
FROM
date_dim;
ALTER TABLE sales_order_fact ADD entry_date_sk INT AFTER receive_date_sk;
ALTER TABLE sales_order_fact ADD CONSTRAINT FOREIGN KEY (entry_date_sk) REFERENCES date_dim(date_sk);
清单(五)-13-1
创建完entry_date_dim视图,并给sales_order_fact表添加了entry_date_sk列以后,需要修改数据仓库定期装载脚本来包含登记日期。清单(五)-13-2显示了修改后的定期装载脚本。注意sales_order数据源已经含有登记日期,只是以前没有将其装载进数据仓库。
代码语言:javascript复制USE dw;
-- 设置SCD的截止时间和生效时间
SET @pre_date = SUBDATE(CURRENT_DATE,1) ;
-- 设置CDC的上限时间
UPDATE cdc_time SET current_load = CURRENT_DATE ;
-- 装载客户维度
TRUNCATE TABLE customer_stg;
INSERT INTO customer_stg
SELECT
customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
FROM source.customer ;
/* 在所有地址列上 SCD2 */
/* 置过期 */
UPDATE customer_dim a,
customer_stg b
SET
expiry_date = @pre_date
WHERE
a.customer_number = b.customer_number
AND (a.customer_street_address <> b.customer_street_address
OR a.customer_city <> b.customer_city
OR a.customer_zip_code <> b.customer_zip_code
OR a.customer_state <> b.customer_state
OR a.shipping_address <> b.shipping_address
OR a.shipping_city <> b.shipping_city
OR a.shipping_zip_code <> b.shipping_zip_code
OR a.shipping_state <> b.shipping_state
OR a.shipping_address IS NULL
OR a.shipping_city IS NULL
OR a.shipping_zip_code IS NULL
OR a.shipping_state IS NULL)
AND expiry_date = '2200-01-01';
/* 加新行 */
INSERT INTO customer_dim
SELECT
NULL
, b.customer_number
, b.customer_name
, b.customer_street_address
, b.customer_zip_code
, b.customer_city
, b.customer_state
, b.shipping_address
, b.shipping_zip_code
, b.shipping_city
, b.shipping_state
, a.version 1
, @pre_date
, '2200-01-01'
FROM
customer_dim a
, customer_stg b
WHERE
a.customer_number = b.customer_number
AND ( a.customer_street_address <> b.customer_street_address
OR a.customer_city <> b.customer_city
OR a.customer_zip_code <> b.customer_zip_code
OR a.customer_state <> b.customer_state
OR a.shipping_address <> b.shipping_address
OR a.shipping_city <> b.shipping_city
OR a.shipping_zip_code <> b.shipping_zip_code
OR a.shipping_state <> b.shipping_state
OR a.shipping_address IS NULL
OR a.shipping_city IS NULL
OR a.shipping_zip_code IS NULL
OR a.shipping_state IS NULL)
AND EXISTS(
SELECT *
FROM customer_dim x
WHERE
b.customer_number=x.customer_number
AND a.expiry_date = @pre_date )
AND NOT EXISTS (
SELECT *
FROM customer_dim y
WHERE
b.customer_number = y.customer_number
AND y.expiry_date = '2200-01-01') ;
/* 在 customer_name 列上 SCD1 */
UPDATE customer_dim a, customer_stg b
SET a.customer_name = b.customer_name
WHERE a.customer_number = b.customer_number
AND a.customer_name <> b.customer_name ;
/* 新增的客户 */
INSERT INTO customer_dim
SELECT
NULL
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, 1
, @pre_date
,'2200-01-01'
FROM customer_stg
WHERE customer_number NOT IN(
SELECT y.customer_number
FROM customer_dim x, customer_stg y
WHERE x.customer_number = y.customer_number) ;
/* 重建PA客户维度 */
TRUNCATE pa_customer_dim;
INSERT INTO pa_customer_dim
SELECT
customer_sk
, customer_number
, customer_name
, customer_street_address
, customer_zip_code
, customer_city
, customer_state
, shipping_address
, shipping_zip_code
, shipping_city
, shipping_state
, version
, effective_date
, expiry_date
FROM customer_dim
WHERE customer_state = 'PA' ;
/* 装载产品维度 */
TRUNCATE TABLE product_stg ;
INSERT INTO product_stg
SELECT
product_code
, product_name
, product_category
FROM source.product ;
/* 在 product_name 和 product_category 列上 SCD2 */
/* 置过期 */
UPDATE
product_dim a
, product_stg b
SET
expiry_date = @pre_date
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND expiry_date = '2200-01-01';
/* 加新行 */
INSERT INTO product_dim
SELECT
NULL
, b.product_code
, b.product_name
, b.product_category
, a.version 1
, @pre_date
,'2200-01-01'
FROM
product_dim a
, product_stg b
WHERE
a.product_code = b.product_code
AND ( a.product_name <> b.product_name
OR a.product_category <> b.product_category)
AND EXISTS(
SELECT *
FROM product_dim x
WHERE b.product_code = x.product_code
AND a.expiry_date = @pre_date)
AND NOT EXISTS (
SELECT *
FROM product_dim y
WHERE b.product_code = y.product_code
AND y.expiry_date = '2200-01-01') ;
/* 新增的产品 */
INSERT INTO product_dim
SELECT
NULL
, product_code
, product_name
, product_category
, 1
, @pre_date
, '2200-01-01'
FROM product_stg
WHERE product_code NOT IN(
SELECT y.product_code
FROM product_dim x, product_stg y
WHERE x.product_code = y.product_code) ;
/* PRODUCT_COUNT_FACT POPULATION */
TRUNCATE product_count_fact;
INSERT INTO product_count_fact(product_sk, product_launch_date_sk)
SELECT
a.product_sk
, b.date_sk
FROM
product_dim a
, date_dim b
WHERE
a.effective_date = b.date
GROUP BY product_code;
/* END OF PRODUCT_COUNT_FACT POPULATION */
-- 装载事实表,新增前一天的订单
INSERT INTO sales_order_fact
SELECT
customer_sk
, product_sk
, g.sales_order_attribute_sk
, e.order_date_sk
, NULL
, NULL
, NULL
, NULL
, h.entry_date_sk
, a.order_number
, f.request_delivery_date_sk
, order_amount
, quantity
, NULL
, NULL
, NULL
, NULL
FROM
source.sales_order a
, customer_dim c
, product_dim d
, order_date_dim e
, request_delivery_date_dim f
, sales_order_attribute_dim g
, entry_date_dim h
, cdc_time i
WHERE
a.order_status = 'N'
AND a.customer_number = c.customer_number
AND a.status_date >= c.effective_date
AND a.status_date < c.expiry_date
AND a.product_code = d.product_code
AND a.status_date >= d.effective_date
AND a.status_date < d.expiry_date
AND a.status_date = e.order_date
AND a.entry_date = h.entry_date
AND a.request_delivery_date = f.request_delivery_date
AND a.verification_ind = g.verification_ind
AND a.credit_check_flag = g.credit_check_flag
AND a.new_customer_ind = g.new_customer_ind
AND a.web_order_flag = g.web_order_flag
AND a.entry_date >= i.last_load AND a.entry_date < i.current_load ;
/* UPDATING the new sales order to Allocated status */
UPDATE sales_order_fact a,
source.sales_order b,
allocate_date_dim c,
cdc_time h
SET
a.allocate_date_sk = c.allocate_date_sk,
a.allocate_quantity = b.quantity
WHERE
order_status = 'A'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND c.allocate_date = b.status_date ;
/* UPDATING the allocated order to Packed status */
UPDATE sales_order_fact a,
source.sales_order b,
packing_date_dim d,
cdc_time h
SET
a.packing_date_sk = d.packing_date_sk,
a.packing_quantity = b.quantity
WHERE
order_status = 'P'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND d.packing_date = b.status_date ;
/* UPDATING the packed order to Shipped status */
UPDATE sales_order_fact a,
source.sales_order b,
ship_date_dim e,
cdc_time h
SET
a.ship_date_sk = e.ship_date_sk,
a.ship_quantity = b.quantity
WHERE
order_status = 'S'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND e.ship_date = b.status_date ;
/* UPDATING the shipped order to Received status */
UPDATE sales_order_fact a,
source.sales_order b,
receive_date_dim f,
cdc_time h
SET
a.receive_date_sk = f.receive_date_sk,
a.receive_quantity = b.quantity
WHERE
order_status = 'R'
AND b.entry_date >= h.last_load AND b.entry_date < h.current_load
AND b.order_number = a.order_number
AND f.receive_date = b.status_date ;
-- 更新时间戳表的last_load字段
UPDATE cdc_time SET last_load = current_load ;
COMMIT ;
清单(五)-13-2
图(五)- 13-1、图(五)- 13-3显示了对Kettle定期装载销售订单事实表做的修改:只是在转换中增加了一个“获取登记日期代理键”的步骤,并对装载事实表进行了相应的修改。
图(五)- 13-1
图(五)- 13-2
图(五)- 13-3
最后还要修改装载月底销售订单事实表脚本。清单(五)-13-3里显示了修改后的脚本,它有两部分。第一部分处理没迟到的并且月底订单事实表中不存在销售订单或新增的非迟到的销售订单。第二部分在具有相同产品和月份的现有销售订单行上增加新增的销售金额。
代码语言:javascript复制USE dw;
SET @pre_date = SUBDATE(CURRENT_DATE,1) ;
/* late arrival, amount & quantity already exist for the past months*/
UPDATE month_end_sales_order_fact a,
(SELECT
d.month_sk month_sk,
a.product_sk product_sk,
SUM(order_amount) order_amount,
SUM(order_quantity) order_quantity
FROM
sales_order_fact a, order_date_dim b, entry_date_dim c, month_dim d
WHERE
a.order_date_sk = b.order_date_sk
AND a.entry_date_sk = c.entry_date_sk
AND c.month = MONTH(@pre_date)
AND c.year = YEAR(@pre_date)
AND b.month = d.month
AND b.year = d.year
AND b.order_date <> c.entry_date
GROUP BY d.month_sk , a.product_sk) b
SET
month_order_amount = month_order_amount b.order_amount,
month_order_quantity = month_order_quantity b.order_quantity
where
a.product_sk = b.product_sk
and a.order_month_sk = b.month_sk;
INSERT INTO month_end_sales_order_fact
/* normal (order_date = entry_date) */
SELECT
d.month_sk,
a.product_sk,
SUM(order_amount),
SUM(order_quantity)
FROM
sales_order_fact a,
order_date_dim b,
entry_date_dim c,
month_dim d
WHERE
a.order_date_sk = b.order_date_sk
AND a.entry_date_sk = c.entry_date_sk
AND c.month = MONTH(@pre_date)
AND c.year = YEAR(@pre_date)
AND b.month = d.month
AND b.year = d.year
AND b.order_date = c.entry_date
GROUP BY d.month_sk , a.product_sk
union all
/* late arrival but amount & quantity not exist for the past months*/
SELECT
d.month_sk,
a.product_sk,
SUM(order_amount),
SUM(order_quantity)
FROM
sales_order_fact a,
order_date_dim b,
entry_date_dim c,
month_dim d
WHERE
a.order_date_sk = b.order_date_sk
AND a.entry_date_sk = c.entry_date_sk
AND b.order_date <> c.entry_date
AND c.month = MONTH(@pre_date)
AND c.year = YEAR(@pre_date)
AND b.month = d.month
AND b.year = d.year
AND NOT EXISTS( SELECT
1
FROM
month_end_sales_order_fact p,
month_dim s
WHERE
p.order_month_sk = s.month_sk
AND s.month = d.month
AND s.year = d.year
AND p.product_sk = a.product_sk)
GROUP BY d.month_sk , a.product_sk;
COMMIT ;
清单(五)-13-3
图(五)- 13-4、图(五)- 13-18显示了使用Kettle转换定期装载月底销售订单事实表的相关步骤。
图(五)- 13-4
图(五)- 13-5
图(五)- 13-6
图(五)- 13-7
图(五)- 13-8
图(五)- 13-9
图(五)- 13-10
图(五)- 13-11
图(五)- 13-12
图(五)- 13-13
图(五)- 13-14
图(五)- 13-15
图(五)- 13-16
图(五)- 13-17
图(五)- 13-18
测试 本节说明在执行清单(五)-13-2和清单(五)-13-3里的脚本前必须的准备步骤。 第一步是执行下面的SQL语句装载销售订单的登记日期。此SQL语句把销售订单的entry_date_sk修改为order_date_sk值。这些登记日期是后面测试三月月底快照导入所需要的。 UPDATE sales_order_fact SET entry_date_sk = order_date_sk; 可以在执行定期装载脚本前查询month_end_sales_order_fact表。之后可以对比‘前’(不包含迟到事实)‘后’(包含了迟到事实)的数据,以确认装载的正确性。可以查询month_end_sales_order_fact表,查询语句和结果如下所示。 mysql> select -> year, -> month, -> product_name, -> month_order_amount amt, -> month_order_quantity qty -> from -> month_end_sales_order_fact a, -> month_dim b, -> product_dim c -> where -> a.order_month_sk = b.month_sk -> and a.product_sk = c.product_sk -> order by year , month , product_name; ------ ------- ----------------- --------- ------ | year | month | product_name | amt | qty | ------ ------- ----------------- --------- ------ | 2014 | 1 | LCD Panel | 1000.00 | NULL | | 2014 | 2 | Hard Disk Drive | 1000.00 | NULL | | 2014 | 3 | Floppy Drive | 2000.00 | NULL | | 2014 | 4 | LCD Panel | 2500.00 | NULL | | 2014 | 5 | Hard Disk Drive | 3000.00 | NULL | | 2014 | 6 | Floppy Drive | 3500.00 | NULL | | 2014 | 7 | LCD Panel | 4000.00 | NULL | | 2014 | 8 | Hard Disk Drive | 4500.00 | NULL | | 2014 | 9 | Floppy Drive | 1000.00 | NULL | | 2014 | 10 | LCD Panel | 1000.00 | NULL | | 2015 | 2 | Floppy Drive | 4000.00 | NULL | | 2015 | 2 | LCD Panel | 4000.00 | NULL | ------ ------- ----------------- --------- ------ 12 rows in set (0.00 sec) 为了对比‘前’‘后’日期,查询sales_order_fact表,查询语句和结果如下所示。 mysql> SELECT -> product_name, SUM(order_amount) -> FROM -> sales_order_fact a, -> product_dim b -> WHERE -> order_date_sk > 5538 -> AND a.product_sk = b.product_sk -> GROUP BY a.product_sk -> ORDER BY product_name; ----------------- ------------------- | product_name | SUM(order_amount) | ----------------- ------------------- | Flat Panel | 47000.00 | | Floppy Drive | 25000.00 | | Hard Disk Drive | 46500.00 | | Keyboard | 27000.00 | ----------------- ------------------- 4 rows in set (0.00 sec) 下一步执行清单(五)-13-4里的脚本准备销售订单测试数据。此脚本将三个销售订单装载进销售订单源数据,一个是迟到的在month_end_sales_order_fact中已存在的产品,一个是迟到的在month_end_sales_order_fact中不存在的产品,另一个是非迟到的正常产品。
代码语言:javascript复制USE source;
INSERT INTO sales_order VALUES
/* late arrival */
(62, 6, 2, 'Y', 'Y', 'Y', 'N', '2015-02-25', 'N', '2015-03-30',
'2015-03-26', 1000, 10)
, (63, 6, 1, 'Y', 'Y', 'Y', 'N', '2015-02-26', 'N', '2015-03-30',
'2015-03-26', 1000, 10)
/* normal */
, (64, 12, 5, 'Y', 'N', 'Y', 'N', '2015-03-26', 'N', '2015-03-30',
'2015-03-26', 2000, 20)
;
COMMIT;
清单(五)-13-4
在执行新的月底销售订单定期装载脚本前,必须先把两条新的销售订装载进sales_order_fact表。设置系统日期为2015年3月27日,也就是销售订单测试数据的登记日期的后一天,然后执行清单(五)-13-2里的脚本。 现在已经准备好运行修改后的月底快照装载。设置系统日期为2015年4月1日并且执行清单(五)-13-4里的脚本导入2015年3月的快照。 最后,执行相同的查询获取包含了迟到事实月底销售订单数据,查询语句和结果显示如下。 mysql> select -> year, -> month, -> product_name, -> month_order_amount amt, -> month_order_quantity qty -> from -> month_end_sales_order_fact a, -> month_dim b, -> product_dim c -> where -> a.order_month_sk = b.month_sk -> and a.product_sk = c.product_sk -> order by year , month , product_name; ------ ------- -------------------------- ---------- ------ | year | month | product_name | amt | qty | ------ ------- -------------------------- ---------- ------ | 2014 | 1 | LCD Panel | 1000.00 | NULL | | 2014 | 2 | Hard Disk Drive | 1000.00 | NULL | | 2014 | 3 | Floppy Drive | 2000.00 | NULL | | 2014 | 4 | LCD Panel | 2500.00 | NULL | | 2014 | 5 | Hard Disk Drive | 3000.00 | NULL | | 2014 | 6 | Floppy Drive | 3500.00 | NULL | | 2014 | 7 | LCD Panel | 4000.00 | NULL | | 2014 | 8 | Hard Disk Drive | 4500.00 | NULL | | 2014 | 9 | Floppy Drive | 1000.00 | NULL | | 2014 | 10 | LCD Panel | 1000.00 | NULL | | 2015 | 2 | Floppy Drive | 5000.00 | NULL | | 2015 | 2 | Hard Disk Drive | 1000.00 | 10 | | 2015 | 2 | LCD Panel | 4000.00 | NULL | | 2015 | 3 | Flat Panel | 47000.00 | 275 | | 2015 | 3 | Floppy Drive | 25000.00 | 120 | | 2015 | 3 | Hard Disk Drive | 46500.00 | 420 | | 2015 | 3 | High End Hard Disk Drive | 2000.00 | 20 | | 2015 | 3 | Keyboard | 27000.00 | 90 | ------ ------- -------------------------- ---------- ------ 18 rows in set (0.00 sec) 对比‘前’‘后’查询的结果可以看到:
- 2015年2月Floppy Drive的销售金额已经从4000改正为5000,这是由于迟到的产品销售订单增加了1000的销售金额。
- 2015年2月的Hard Disk Drive(也是迟到的产品)被添加
- 所有三月的销售订单被累加。包括在执行前面的清单(五)-13-4脚本时刚添加的High End Hard Disk Drive销售订单。