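The previous post briefly described what Graph RAG is for and how the graph is generated. To actually build knowledge-graph generation into a system, theory alone is not enough; we need to look at how each step is carried out in practice.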
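Before going into detail, here is a similar use case for LLM-generated keywords: a leading domestic "o o" company is not only trying to replace its offline, medium-sized generative models directly with large models for keyword recall, but is also exploring serving large models online, while using the ideas behind CoT and RLHF to optimize and rework its offline generative recall models (to be covered in detail on another page).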
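Graph RAG is implemented mainly by sending different prompts to an LLM (a hosted model API or a privately deployed model) to extract structured information. The tasks include splitting the text into chunks, extracting entities, extracting the hierarchical relationships between entities, clustering synonymous entities, and filtering entity data, so the design of each prompt matters. How should such a prompt be written? Take a few minutes to think about it first. For entity recognition, for example, the prompt should state which pieces of information to extract for each entity (such as name and description) and the format in which entities are stored. And to make sure every entity is picked up, the extraction probably needs to be repeated several times.

Helpfully, Microsoft has written a complete prompt for every task. They are well worth studying, and the writing style can be borrowed and transferred to other concrete tasks. After summarizing each Graph RAG prompt, I found that they largely share the structure below, which is strongly logical. My guess is that writing prompts this rigorously helps the LLM produce reasonably stable output on each call. The structure is broken down here for direct reference and as a starting point for designing future prompts: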
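Prompt section | Description |
---|---|
-Target activity- | Background and context |
-Goal- | A direct summary of what this prompt is supposed to accomplish |
-Steps- | The task, manually broken down into concrete steps |
-Examples- | Worked examples, from input data to result, that show the model the expected output so that it responds in the desired form |
-Real Data- | The real data to be analyzed |
Repeat to ensure we maximize entity count | The prompt is repeated for several rounds so that nothing is missed in the answer |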
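
The structure in the table can be sketched in a few lines of Python. This is only an illustration, not Graph RAG's own code: `build_prompt`, `extract_with_repeats`, the `llm` callable (which takes the message history and returns text), and the wording of `continue_prompt` are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

# Section headers in the order given in the table above.
SECTION_ORDER = ["-Target activity-", "-Goal-", "-Steps-", "-Examples-", "-Real Data-"]


def build_prompt(sections: Dict[str, str]) -> str:
    """Join the sections in a fixed order, skipping any that are missing."""
    parts = []
    for header in SECTION_ORDER:
        body = sections.get(header)
        if body:
            parts.append(f"{header}\n{body.strip()}")
    return "\n\n".join(parts)


def extract_with_repeats(
    llm: Callable[[List[str]], str],  # hypothetical: history in, reply out
    prompt: str,
    continue_prompt: str,
    max_rounds: int = 2,
) -> str:
    """Ask once, then re-ask up to max_rounds times for anything that was missed."""
    history = [prompt]
    first = llm(history)
    results = [first]
    history.append(first)
    for _ in range(max_rounds):
        history.append(continue_prompt)
        extra = llm(history)
        if not extra.strip():
            break  # the model had nothing more to add
        results.append(extra)
        history.append(extra)
    return "\n".join(results)
```

Keeping the section order in one constant means every task prompt is laid out the same way, and the repeat loop stops early once a round returns nothing new.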
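The Graph RAG graph is generated in the following steps. The key phase is Phase 2, graph extraction, which uses different prompts to extract the elements of the graph. Using the prompt structure above, this section breaks down the "claim extraction" prompt. Claim extraction runs as a separate workflow that extracts claims from the source text units. These claims represent positive factual statements with an evaluated status and time bounds, and they are emitted as a primary artifact called covariates. (Here the authors abstract a claim as an action: entity A is the one that performs it, and entity B is the one affected by it. Why abstract it this way? My guess is that some scenarios need to filter data, for example by instantiating claims as profanity, harmful statements, or even adult content.) Besides claim extraction there are three other important prompts; all four are listed here and can be read against the structure above: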
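- Entity/Relationship Extraction (entity extraction and building hierarchical relationships)
- Entity/Relationship Description Summarization (using the LLM to summarize entities and relationships)
- Claim Extraction (data screening and filtering)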
- Community Reports
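Below is the full "claim extraction" prompt. {xxx} denotes a variable, where xxx can be set or replaced with a concrete value. The main variables are {tuple_delimiter}, {record_delimiter}, {completion_delimiter}, {entity_specs}, {claim_description}, and {input_text}. The sections and their content are as follows: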
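
Filling in these variables can be sketched as follows. This is a minimal illustration under stated assumptions: `fill_template` is a hypothetical helper, `TEMPLATE` is a shortened stand-in for the full prompt, and the delimiter values are placeholders chosen for the example.

```python
import re

# Shortened stand-in for the prompt; only the variable placeholders matter here.
TEMPLATE = (
    "Format each claim as (<subject_entity>{tuple_delimiter}<object_entity>)\n"
    "Use {record_delimiter} as the list delimiter.\n"
    "When finished, output {completion_delimiter}\n"
    "Entity specification: {entity_specs}\n"
    "Claim description: {claim_description}\n"
    "Text: {input_text}\n"
    "Output:"
)

_VAR = re.compile(r"\{(\w+)\}")


def fill_template(template: str, variables: dict) -> str:
    """Replace every {name} in the template, failing loudly if a value is missing."""
    missing = sorted({name for name in _VAR.findall(template) if name not in variables})
    if missing:
        raise KeyError(f"missing prompt variables: {missing}")
    # Single pass: braces inside the substituted values are left untouched.
    return _VAR.sub(lambda m: variables[m.group(1)], template)


values = {
    "tuple_delimiter": "<|>",             # illustrative value
    "record_delimiter": "##",             # illustrative value
    "completion_delimiter": "<|COMPLETE|>",  # illustrative value
    "entity_specs": "organization",
    "claim_description": "red flags associated with an entity",
    "input_text": "Company A was fined {in 2022}.",
}
filled = fill_template(TEMPLATE, values)
```

A single-pass regex substitution is used instead of chained `str.replace` calls so that braces appearing in the input text are never mistaken for variables.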
-Target activity-
You are an intelligent assistant that helps a human analyst to analyze claims against certain entities presented in a text document.
-Goal-
Given a text document that is potentially relevant to this activity, an entity specification, and a claim description, extract all entities that match the entity specification and all claims against those entities.
-Steps-
1. Extract all named entities that match the predefined entity specification. Entity specification can either be a list of entity names or a list of entity types. (This step tells the LLM to recognize entities of user-defined types, which are supplied as a list.)
2. For each entity identified in step 1, extract all claims associated with the entity. Claims need to match the specified claim description, and the entity should be the subject of the claim. (Extract the claims related to the entities from step 1 that also match the claim definition; these entities are the subjects of the claims.)
For each claim, extract the following information:
- Subject: name of the entity that is subject of the claim, capitalized. The subject entity is one that committed the action described in the claim. Subject needs to be one of the named entities identified in step 1. (The subject entity is capitalized, is the one that performed the claimed action, and must come from the entities found in step 1.)
- Object: name of the entity that is object of the claim, capitalized. The object entity is one that either reports/handles or is affected by the action described in the claim. If object entity is unknown, use NONE. (The object entity is the one that reports, handles, or is affected by the claimed action; if there is no such entity, use NONE.)
- Claim Type: overall category of the claim, capitalized. Name it in a way that can be repeated across multiple text inputs, so that similar claims share the same claim type. (The category of the claim, named so that it can be reused across multiple text inputs and similar claims share the same type.)
- Claim Status: TRUE, FALSE, or SUSPECTED. TRUE means the claim is confirmed, FALSE means the claim is found to be False, SUSPECTED means the claim is not verified. (TRUE marks a confirmed claim, FALSE a claim found to be false, and SUSPECTED a claim that has not been verified.)
- Claim Description: Detailed description explaining the reasoning behind the claim, together with all the related evidence and references. (A detailed explanation of the claim, with evidence and references.)
- Claim Date: Period (start_date, end_date) when the claim was made. Both start_date and end_date should be in ISO-8601 format. If the claim was made on a single date rather than a date range, set the same date for both start_date and end_date. If date is unknown, return NONE. (The start and end dates of the period in which the claim was made.)
- Claim Source Text: List of all quotes from the original text that are relevant to the claim. (All quotes from the original text that relate to the claim.)
Format each claim as (<subject_entity>{tuple_delimiter}<object_entity>{tuple_delimiter}<claim_type>{tuple_delimiter}<claim_status>{tuple_delimiter}<claim_start_date>{tuple_delimiter}<claim_end_date>{tuple_delimiter}<claim_description>{tuple_delimiter}<claim_source>)
3. Return output in English as a single list of all the claims identified in steps 1 and 2. Use {record_delimiter} as the list delimiter.
4. When finished, output {completion_delimiter}
-Examples- (the claim here is "risky behavior by entity A")
Example 1:
- Entity specification: organization
- Claim description: red flags associated with an entity
- Text: According to an article on 2022/01/10, Company A was fined for bid rigging while participating in multiple public tenders published by Government Agency B. The company is owned by Person C who was suspected of engaging in corruption activities in 2015.
- Output: (COMPANY A{tuple_delimiter}GOVERNMENT AGENCY B{tuple_delimiter}ANTI-COMPETITIVE PRACTICES{tuple_delimiter}TRUE{tuple_delimiter}2022-01-10T00:00:00{tuple_delimiter}2022-01-10T00:00:00{tuple_delimiter}Company A was found to engage in anti-competitive practices because it was fined for bid rigging in multiple public tenders published by Government Agency B according to an article published on 2022/01/10{tuple_delimiter}According to an article published on 2022/01/10, Company A was fined for bid rigging while participating in multiple public tenders published by Government Agency B.) {completion_delimiter}
Example 2:
- Entity specification: Company A, Person C
- Claim description: red flags associated with an entity
- Text: According to an article on 2022/01/10, Company A was fined for bid rigging while participating in multiple public tenders published by Government Agency B. The company is owned by Person C who was suspected of engaging in corruption activities in 2015.
- Output: (COMPANY A{tuple_delimiter}GOVERNMENT AGENCY B{tuple_delimiter}ANTI-COMPETITIVE PRACTICES{tuple_delimiter}TRUE{tuple_delimiter}2022-01-10T00:00:00{tuple_delimiter}2022-01-10T00:00:00{tuple_delimiter}Company A was found to engage in anti-competitive practices because it was fined for bid rigging in multiple public tenders published by Government Agency B according to an article published on 2022/01/10{tuple_delimiter}According to an article published on 2022/01/10, Company A was fined for bid rigging while participating in multiple public tenders published by Government Agency B.) {record_delimiter} (PERSON C{tuple_delimiter}NONE{tuple_delimiter}CORRUPTION{tuple_delimiter}SUSPECTED{tuple_delimiter}2015-01-01T00:00:00{tuple_delimiter}2015-12-30T00:00:00{tuple_delimiter}Person C was suspected of engaging in corruption activities in 2015{tuple_delimiter}The company is owned by Person C who was suspected of engaging in corruption activities in 2015) {completion_delimiter}
-Real Data-
- Use the following input for your answer.
- Entity specification: {entity_specs}
- Claim description: {claim_description}
- Text: {input_text}
- Output:
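
Because the prompt fixes the output format, the model's reply can be turned back into structured records. The following is a minimal sketch, not Graph RAG's own parser: `Claim`, `parse_claims`, and the default delimiter values are assumptions made for this example, and the field order follows the "Format each claim as" line in the prompt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    subject: str
    object_entity: Optional[str]
    claim_type: str
    status: str
    start_date: Optional[str]
    end_date: Optional[str]
    description: str
    source_text: str


def _none_to_null(value: str) -> Optional[str]:
    """Map the literal NONE used by the prompt to Python's None."""
    return None if value.upper() == "NONE" else value


def parse_claims(
    raw: str,
    tuple_delimiter: str = "<|>",             # illustrative value
    record_delimiter: str = "##",             # illustrative value
    completion_delimiter: str = "<|COMPLETE|>",  # illustrative value
) -> List[Claim]:
    """Split the reply into records, then each record into its eight fields."""
    body = raw.replace(completion_delimiter, "")
    claims = []
    for record in body.split(record_delimiter):
        record = record.strip()
        if record.startswith("(") and record.endswith(")"):
            record = record[1:-1]  # drop the surrounding parentheses
        fields = [f.strip() for f in record.split(tuple_delimiter)]
        if len(fields) != 8:
            continue  # skip empty or malformed records
        subject, obj, ctype, status, start, end, desc, source = fields
        claims.append(
            Claim(subject, _none_to_null(obj), ctype, status,
                  _none_to_null(start), _none_to_null(end), desc, source)
        )
    return claims
```

Malformed records are skipped rather than raising, on the assumption that a partially usable reply is better than none; a stricter pipeline might log or retry instead.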
Figure: Graph RAG dataflow