引言
想快速进入人工智能领域的Java程序员?你准备好了吗?
构建智能问答系统之环境篇
为了确保篇幅合适,使得内容更易理解和学习,我将开发前的技术讲解和准备工作分成两篇文章来详细介绍。这样做的目的是为了帮助你更快地掌握如何利用向量数据库建立起自己的知识宝库。
前面我们已经做好了必要的准备工作,包括对相关知识点的了解以及环境的安装。今天我们将重点关注代码方面的内容。如果你已经具备了Java编程基础,那么理解Python语法应该不会成为问题,毕竟只是语法的差异而已。随着时间的推移,你自然会逐渐熟悉和掌握这门语言。现在让我们开始吧!
环境安装命令
在使用之前,我们需要先进行一些必要的准备工作,其中包括执行一些命令。如果你已经仔细阅读了Milvus的官方文档,你应该已经了解到了这一点。下面是需要执行的一些命令示例:
代码语言:shell复制pip3 install langchain
pip3 install openai
pip3 install protobuf==3.20.0
pip3 install grpcio-tools
python3 -m pip install pymilvus==2.3.2
python3 -c "from pymilvus import Collection"
快速入门
现在,我们来尝试使用官方示例,看看在没有集成LangChain的情况下,我们需要编写多少代码才能完成插入、查询等操作。官方示例已经在前面的注释中详细讲解了所有的流程。总体流程如下:
- 连接到数据库
- 创建集合(这里还有分区的概念,我们不深入讨论)
- 插入向量数据(我看官方文档就简单插入了一些数字...)
- 创建索引(根据官方文档的说法,通常在一定数据量下是不会经常创建索引的)
- 查询数据
- 删除数据
- 断开与数据库的连接
通过以上步骤,你会发现与连接MySQL数据库的操作非常相似。
代码语言:python代码运行次数:0复制# hello_milvus.py demonstrates the basic operations of PyMilvus, a Python SDK of Milvus.
# 1. connect to Milvus
# 2. create collection
# 3. insert data
# 4. create index
# 5. search, query, and hybrid search on entities
# 6. delete entities by PK
# 7. drop collection
import time
import numpy as np
from pymilvus import (
connections,
utility,
FieldSchema, CollectionSchema, DataType,
Collection,
)
fmt = "n=== {:30} ===n"
search_latency_fmt = "search latency = {:.4f}s"
num_entities, dim = 3000, 8
#################################################################################
# 1. connect to Milvus
# Add a new connection alias `default` for Milvus server in `localhost:19530`
# Actually the "default" alias is a buildin in PyMilvus.
# If the address of Milvus is the same as `localhost:19530`, you can omit all
# parameters and call the method as: `connections.connect()`.
#
# Note: the `using` parameter of the following methods is default to "default".
print(fmt.format("start connecting to Milvus"))
connections.connect("default", host="localhost", port="19530")
has = utility.has_collection("hello_milvus")
print(f"Does collection hello_milvus exist in Milvus: {has}")
#################################################################################
# 2. create collection
# We're going to create a collection with 3 fields.
# - ------------ ------------ ------------------ ------------------------------
# | | field name | field type | other attributes | field description |
# - ------------ ------------ ------------------ ------------------------------
# |1| "pk" | VarChar | is_primary=True | "primary field" |
# | | | | auto_id=False | |
# - ------------ ------------ ------------------ ------------------------------
# |2| "random" | Double | | "a double field" |
# - ------------ ------------ ------------------ ------------------------------
# |3|"embeddings"| FloatVector| dim=8 | "float vector with dim 8" |
# - ------------ ------------ ------------------ ------------------------------
fields = [
FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
FieldSchema(name="random", dtype=DataType.DOUBLE),
FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim)
]
schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")
print(fmt.format("Create collection `hello_milvus`"))
hello_milvus = Collection("hello_milvus", schema, consistency_level="Strong")
################################################################################
# 3. insert data
# We are going to insert 3000 rows of data into `hello_milvus`
# Data to be inserted must be organized in fields.
#
# The insert() method returns:
# - either automatically generated primary keys by Milvus if auto_id=True in the schema;
# - or the existing primary key field from the entities if auto_id=False in the schema.
print(fmt.format("Start inserting entities"))
rng = np.random.default_rng(seed=19530)
entities = [
# provide the pk field because `auto_id` is set to False
[str(i) for i in range(num_entities)],
rng.random(num_entities).tolist(), # field random, only supports list
rng.random((num_entities, dim)), # field embeddings, supports numpy.ndarray and list
]
insert_result = hello_milvus.insert(entities)
hello_milvus.flush()
print(f"Number of entities in Milvus: {hello_milvus.num_entities}") # check the num_entities
################################################################################
# 4. create index
# We are going to create an IVF_FLAT index for hello_milvus collection.
# create_index() can only be applied to `FloatVector` and `BinaryVector` fields.
print(fmt.format("Start Creating index IVF_FLAT"))
index = {
"index_type": "IVF_FLAT",
"metric_type": "L2",
"params": {"nlist": 128},
}
hello_milvus.create_index("embeddings", index)
################################################################################
# 5. search, query, and hybrid search
# After data were inserted into Milvus and indexed, you can perform:
# - search based on vector similarity
# - query based on scalar filtering(boolean, int, etc.)
# - hybrid search based on vector similarity and scalar filtering.
#
# Before conducting a search or a query, you need to load the data in `hello_milvus` into memory.
print(fmt.format("Start loading"))
hello_milvus.load()
# -----------------------------------------------------------------------------
# search based on vector similarity
print(fmt.format("Start searching based on vector similarity"))
vectors_to_search = entities[-1][-2:]
search_params = {
"metric_type": "L2",
"params": {"nprobe": 10},
}
start_time = time.time()
result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, output_fields=["random"])
end_time = time.time()
for hits in result:
for hit in hits:
print(f"hit: {hit}, random field: {hit.entity.get('random')}")
print(search_latency_fmt.format(end_time - start_time))
# -----------------------------------------------------------------------------
# query based on scalar filtering(boolean, int, etc.)
print(fmt.format("Start querying with `random > 0.5`"))
start_time = time.time()
result = hello_milvus.query(expr="random > 0.5", output_fields=["random", "embeddings"])
end_time = time.time()
print(f"query result:n-{result[0]}")
print(search_latency_fmt.format(end_time - start_time))
# -----------------------------------------------------------------------------
# pagination
r1 = hello_milvus.query(expr="random > 0.5", limit=4, output_fields=["random"])
r2 = hello_milvus.query(expr="random > 0.5", offset=1, limit=3, output_fields=["random"])
print(f"query pagination(limit=4):nt{r1}")
print(f"query pagination(offset=1, limit=3):nt{r2}")
# -----------------------------------------------------------------------------
# hybrid search
print(fmt.format("Start hybrid searching with `random > 0.5`"))
start_time = time.time()
result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, expr="random > 0.5",
output_fields=["random"])
end_time = time.time()
for hits in result:
for hit in hits:
print(f"hit: {hit}, random field: {hit.entity.get('random')}")
print(search_latency_fmt.format(end_time - start_time))
###############################################################################
# 6. delete entities by PK
# You can delete entities by their PK values using boolean expressions.
ids = insert_result.primary_keys
expr = f'pk in ["{ids[0]}" , "{ids[1]}"]'
print(fmt.format(f"Start deleting with expr `{expr}`"))
result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"])
print(f"query before delete by expr=`{expr}` -> result: n-{result[0]}n-{result[1]}n")
hello_milvus.delete(expr)
result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"])
print(f"query after delete by expr=`{expr}` -> result: {result}n")
###############################################################################
# 7. drop collection
# Finally, drop the hello_milvus collection
print(fmt.format("Drop collection `hello_milvus`"))
utility.drop_collection("hello_milvus")
升级版
现在,让我们来看一下使用LangChain版本的代码。由于我们使用的是封装好的Milvus,所以我们需要一个嵌入模型。在这里,我们选择了HuggingFaceEmbeddings中的sensenova/piccolo-base-zh
模型作为示例,当然你也可以选择其他模型,这里没有限制。只要能将其作为一个变量传递给LangChain定义的函数调用即可。
下面是一个简单的示例,包括数据库连接、插入数据、查询以及得分情况的定义:
代码语言:python代码运行次数:0复制from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Milvus
model_name = "sensenova/piccolo-base-zh"
embeddings = HuggingFaceEmbeddings(model_name=model_name)
print("链接数据库")
vector_db = Milvus(
embeddings,
connection_args={"host": "localhost", "port": "19530"},
collection_name="hello_milvus",
)
print("简单传入几个值")
vector_db.add_texts(["12345678","789","努力的小雨是一个知名博主,其名下有公众号【灵墨AI探索室】,博客:稀土掘金、博客园、51CTO及腾讯云等","你好啊","我不好"])
print("查询前3个最相似的结果")
docs = vector_db.similarity_search_with_score("你好啊",3)
print("查看其得分情况,分值越低越接近")
for text in docs:
print('文本:%s,得分:%s'%(text[0].page_content,text[1]))
注意,以上代码只是一个简单示例,具体的实现可能会根据你的具体需求进行调整和优化。
在langchain版本的代码中,如果你想要执行除了自己需要开启docker中的milvus容器之外的操作,还需要确保你拥有网络代理。这里不多赘述,因为HuggingFace社区并不在国内。
个人定制版
接下来,我们将详细了解如何调用openai模型来回答问题!
代码语言:python代码运行次数:0复制from dotenv import load_dotenv
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate;
from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema import BaseOutputParser
# 加载env环境变量里的key值
load_dotenv()
# 格式化输出
class CommaSeparatedListOutputParser(BaseOutputParser):
"""Parse the output of an LLM call to a comma-separated list."""
def parse(self, text: str):
"""Parse the output of an LLM call."""
return text.strip().split(", ")
# 先从数据库查询问题解
docs = vector_db.similarity_search("努力的小雨是谁?")
doc = docs[0].page_content
chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
template = "请根据我提供的资料回答问题,资料: {input_docs}"
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_template = "{text}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])
# chat_prompt.format_messages(input_docs=doc, text="努力的小雨是谁?")
chain = LLMChain(
llm=chat,
prompt=chat_prompt,
output_parser=CommaSeparatedListOutputParser()
)
chain.run(input_docs=doc, text="努力的小雨是谁?")
当你成功运行完代码后,你将会得到你所期望的答案。如下图所示,这些答案将会展示在你的屏幕上。不然,如果系统不知道这些问题的答案,那它又如何能够给出正确的回答呢?
总结
通过本系列文章的学习,我们已经对个人或企业知识库有了一定的了解。尽管OpenAI已经提供了私有知识库的部署选项,但是其高昂的成本对于一些企业来说可能是难以承受的。无论将来国内企业是否会提供个人或企业知识库的解决方案,我们都需要对其原理有一些了解。无论我们的预算多少,都可以找到适合自己的玩法,因为不同预算的玩法也会有所不同。
如果你认为本篇文章对于你来说可能有些复杂,很难上手Milvus向量数据库,那我建议你可以考虑查看腾讯云官方的向量数据库。腾讯云的向量数据库简单、快捷、方便,并且其文档也是国内的,这将为你提供更好的学习和使用体验。
从零到一快速理解腾讯云向量数据库
如果觉得可以继续深入学习,腾讯云向量数据库,我还写了一些其他方面的应用供你参考。
探索向量数据库之图像相似搜索-文字版
这两篇文章是我今年专门为了介绍向量数据库而编写的,主要目的是让自己对这个新兴技术有更深入的了解。之前我对向量数据库一无所知,所以我希望本篇文章能帮助到你,快速学习和掌握向量数据库的知识。希望这些内容能够对你有所帮助!
我正在参与2024腾讯技术创作特训营第五期有奖征文,快来和我瓜分大奖!