存储过程实现上亿级图数据分块ETL

2022-07-04 14:24:49 浏览数 (1)

图数据分块ETL

图数据ETL的一个场景是需要将上亿条上百G的原始数据构建为图数据,在内存不够用的情况下保证数据构建过程可以平稳顺利运行,需要使用数据分块的方式进行构建。如下通过存储过程实现数据分块方案。该解决方案依赖于原始数据库的自增ID【上百G超大CSV文件的构建可以导入MySQL之后构建】,经过测试可以在生产环境正常运行并且避免过多的内存消耗。

函数与过程功能介绍

  • 从关系数据库加载数据
代码语言:javascript复制
apoc.load.jdbc
  • 函数实现数据块ID拆分
代码语言:javascript复制
 olab.ids.batch
  • 迭代处理数据块
代码语言:javascript复制
apoc.periodic.iterate
  • 对包含特殊字符的变量进行转义操作
代码语言:javascript复制
olab.escape
  • 数据分块-从数据库获取最大最小自增ID
代码语言:javascript复制
WITH 'jdbc:mysql://datalab-contentdb-dev.crkldnwly6ki.rds.cn-north-1.amazonaws.com.cn:3306/database?user=dev&password=datalabgogo&useUnicode=true&characterEncoding=utf8&serverTimezone=UTC' AS url,'SELECT MIN(puid) AS min,MAX(puid) AS max FROM table' AS sql,10000000 AS batch
CALL apoc.load.jdbc(url,sql) YIELD row WITH row.min AS min,row.max AS max,batch
WITH olab.ids.batch(min,max,batch) AS value
UNWIND value AS list
RETURN list[0] AS min,list[1] AS max

过程与函数插件ongdb-lab-apoc

代码语言:javascript复制
https://github.com/ongdb-contrib/ongdb-lab-apoc

过程与函数插件neo4j-apoc-procedures

代码语言:javascript复制
https://github.com/neo4j-contrib/neo4j-apoc-procedures

组件版本信息

代码语言:javascript复制
ongdb-3.5.x 
apoc-3.5.x 
olab-apoc-3.5.x

完整实现案例

代码语言:javascript复制
WITH 'jdbc:mysql://datalab-contentdb-dev.crkldnwly6ki.rds.cn-north-1.amazonaws.com.cn:3306/database?user=dev&password=datalabgogo&useUnicode=true&characterEncoding=utf8&serverTimezone=UTC' AS url,'SELECT MIN(puid) AS min,MAX(puid) AS max FROM table' AS sql,1000000 AS batch,'SELECT parent_pcode AS `name`,CONVERT(DATE_FORMAT(hupdatetime,'%Y%m%d%H%i%S'),UNSIGNED INTEGER) AS hupdatetime FROM table WHERE hisvalid=1 AND parent_pcode IS NOT NULL AND puid>=? AND puid<=?' AS loadSql
CALL apoc.load.jdbc(url,sql) YIELD row WITH row.min AS min,row.max AS max,url,batch,loadSql
WITH olab.ids.batch(min,max,batch) AS value,url,batch,loadSql
UNWIND value AS list
WITH list[0] AS min,list[1] AS max,url,loadSql,'CALL apoc.load.jdbc({url},{loadSql},[{min},{max}])' AS jdbc
CALL apoc.periodic.iterate(olab.replace(jdbc,[{raw:'{url}',rep:''' url '''},{raw:'{loadSql}',rep:''' olab.escape(loadSql) '''},{raw:'{min}',rep:min},{raw:'{max}',rep:max}]),'MERGE (n:PREPCODE {name:row.name}) SET n =row',{parallel:false,batchSize:1000,iterateList: false}) YIELD batches,total,timeTaken,committedOperations,failedOperations,failedBatches,retries,errorMessages,batch,operations RETURN batches,total,timeTaken,committedOperations,failedOperations,failedBatches,retries,errorMessages,batch,operations;

0 人点赞