图数据分块ETL
图数据ETL的一个场景是需要将上亿条上百G的原始数据构建为图数据,在内存不够用的情况下保证数据构建过程可以平稳顺利运行,需要使用数据分块的方式进行构建。如下通过存储过程实现数据分块方案。该解决方案依赖于原始数据库的自增ID【上百G超大CSV文件的构建可以导入MySQL之后构建】,经过测试可以在生产环境正常运行并且避免过多的内存消耗。
函数与过程功能介绍
- 从关系数据库加载数据
apoc.load.jdbc
- 函数实现数据块ID拆分
olab.ids.batch
- 迭代处理数据块
apoc.periodic.iterate
- 对包含特殊字符的变量进行转义操作
olab.escape
- 数据分块-从数据库获取最大最小自增ID
WITH 'jdbc:mysql://datalab-contentdb-dev.crkldnwly6ki.rds.cn-north-1.amazonaws.com.cn:3306/database?user=dev&password=datalabgogo&useUnicode=true&characterEncoding=utf8&serverTimezone=UTC' AS url,'SELECT MIN(puid) AS min,MAX(puid) AS max FROM table' AS sql,10000000 AS batch
CALL apoc.load.jdbc(url,sql) YIELD row WITH row.min AS min,row.max AS max,batch
WITH olab.ids.batch(min,max,batch) AS value
UNWIND value AS list
RETURN list[0] AS min,list[1] AS max
过程与函数插件ongdb-lab-apoc
代码语言:javascript复制https://github.com/ongdb-contrib/ongdb-lab-apoc
过程与函数插件neo4j-apoc-procedures
代码语言:javascript复制https://github.com/neo4j-contrib/neo4j-apoc-procedures
组件版本信息
代码语言:javascript复制ongdb-3.5.x
apoc-3.5.x
olab-apoc-3.5.x
完整实现案例
代码语言:javascript复制WITH 'jdbc:mysql://datalab-contentdb-dev.crkldnwly6ki.rds.cn-north-1.amazonaws.com.cn:3306/database?user=dev&password=datalabgogo&useUnicode=true&characterEncoding=utf8&serverTimezone=UTC' AS url,'SELECT MIN(puid) AS min,MAX(puid) AS max FROM table' AS sql,1000000 AS batch,'SELECT parent_pcode AS `name`,CONVERT(DATE_FORMAT(hupdatetime,'%Y%m%d%H%i%S'),UNSIGNED INTEGER) AS hupdatetime FROM table WHERE hisvalid=1 AND parent_pcode IS NOT NULL AND puid>=? AND puid<=?' AS loadSql
CALL apoc.load.jdbc(url,sql) YIELD row WITH row.min AS min,row.max AS max,url,batch,loadSql
WITH olab.ids.batch(min,max,batch) AS value,url,batch,loadSql
UNWIND value AS list
WITH list[0] AS min,list[1] AS max,url,loadSql,'CALL apoc.load.jdbc({url},{loadSql},[{min},{max}])' AS jdbc
CALL apoc.periodic.iterate(olab.replace(jdbc,[{raw:'{url}',rep:''' url '''},{raw:'{loadSql}',rep:''' olab.escape(loadSql) '''},{raw:'{min}',rep:min},{raw:'{max}',rep:max}]),'MERGE (n:PREPCODE {name:row.name}) SET n =row',{parallel:false,batchSize:1000,iterateList: false}) YIELD batches,total,timeTaken,committedOperations,failedOperations,failedBatches,retries,errorMessages,batch,operations RETURN batches,total,timeTaken,committedOperations,failedOperations,failedBatches,retries,errorMessages,batch,operations;