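While tidying up some old code recently, I came across this processing script and decided to share it. It took me quite a while to get working at the time.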
# Import the required libraries
import pandas as pd
import csv

# Load the data to process
h_r_t_name = ["index", "name", "rel", "name2"]
# on_bad_lines="skip" needs pandas >= 1.3; older versions used error_bad_lines=False
reader = pd.read_csv("data.txt", iterator=True, on_bad_lines="skip", na_values='NULL', names=h_r_t_name)

# Read 1,000,000 rows at a time
loop = True
chunkSize = 1000000
chunks = []
chunks_ = []

# Set the format and names of the CSV files to write
csvf_entity = open("entity.csv", "w", newline='', encoding='utf-8')
w_entity = csv.writer(csvf_entity)
w_entity.writerow(("entity:ID", "name", ":LABEL"))
csvf_rel = open("rel.csv", "w", newline='', encoding='utf-8')
w_rel = csv.writer(csvf_rel)
w_rel.writerow((':START_ID', 'name', ':END_ID', ':TYPE'))

i = 0  # next entity ID to assign
j = 0  # ID of the head entity of the current group
k = 0  # number of chunks read so far
while loop:
    try:
        print(k)
        k = k + 1
        chunk = reader.get_chunk(chunkSize)
        # Drop rows that contain missing values
        no_na = chunk.dropna()
        df = no_na.groupby('name')
        for group_name, group_data in df:
            # One head entity per group
            w_entity.writerow(("e" + str(i), group_name, "ENTITY"))
            j = i
            i = i + 1
            for index, row in group_data.iterrows():
                # One relationship and one tail entity per row
                w_rel.writerow(("e" + str(j), row['rel'], "e" + str(i), "REL"))
                w_entity.writerow(("e" + str(i), row['name2'], "ENTITY1"))
                i = i + 1
    except StopIteration:
        loop = False
        print("Iteration is stopped.")
csvf_entity.close()
csvf_rel.close()
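For reference, the same chunked conversion can be sketched as a reusable function. This is only a minimal sketch under some assumptions: `triples_to_csv` and its parameters are made-up names, it passes `chunksize=` to `pd.read_csv` and iterates over the result instead of calling `get_chunk` in a `while` loop, it uses `with` blocks so the files are closed even if something fails, and it leaves out the bad-line skipping. Like the script above, it gives every row's tail entity a fresh ID, so the same name can show up more than once in the entity file.

```python
import csv

import pandas as pd

COLUMNS = ["index", "name", "rel", "name2"]  # same column names as the script above


def triples_to_csv(src, entity_path, rel_path, chunk_size=1_000_000):
    """Hypothetical helper: stream triples from src into two import-style CSV files.

    Returns the number of entity IDs that were assigned.
    """
    next_id = 0
    with open(entity_path, "w", newline="", encoding="utf-8") as ef, \
         open(rel_path, "w", newline="", encoding="utf-8") as rf:
        w_entity = csv.writer(ef)
        w_rel = csv.writer(rf)
        w_entity.writerow(("entity:ID", "name", ":LABEL"))
        w_rel.writerow((":START_ID", "name", ":END_ID", ":TYPE"))

        # With chunksize set, read_csv returns an iterator of DataFrames
        reader = pd.read_csv(src, names=COLUMNS, na_values="NULL", chunksize=chunk_size)
        for chunk in reader:
            # Drop incomplete rows, then handle one head entity at a time
            for head, rows in chunk.dropna().groupby("name"):
                head_id = next_id
                w_entity.writerow((f"e{head_id}", head, "ENTITY"))
                next_id += 1
                # One relationship and one tail entity per row of the group
                for rel, tail in zip(rows["rel"], rows["name2"]):
                    w_rel.writerow((f"e{head_id}", rel, f"e{next_id}", "REL"))
                    w_entity.writerow((f"e{next_id}", tail, "ENTITY1"))
                    next_id += 1
    return next_id
```

Calling `triples_to_csv("data.txt", "entity.csv", "rel.csv")` would produce the same two files as the script. Iterating over columns with `zip` instead of `iterrows()` avoids building a Series for each row, which tends to matter at this data size.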
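Then, when I searched on Baidu, I found that someone had copied my post... This is my original post.
The text and figures in the third section of his post are all copied from mine.
I don't mind; being copied is a kind of recognition. I think his post is just several people's articles stitched together.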