Integrating Embedding Models with PyMilvus

2024-07-10 19:17:35

Milvus is an open-source vector database built for AI applications. Whether your project involves machine learning, deep learning, or other AI-related work, Milvus provides a powerful and efficient solution for processing unstructured data at scale.

A model module is now integrated into PyMilvus, the Python SDK for Milvus. It lets you plug in embedding and reranker models directly, greatly simplifying the process of converting data into vectors and reranking search results, which makes it a great fit for retrieval-augmented generation (RAG) applications.

In this article, we will review dense embedding models, sparse embedding models, and rerankers, and demonstrate how to run embedding models directly in Milvus Lite, the lightweight version of Milvus.

01. Dense Vectors vs. Sparse Vectors

Before diving into how to use embedding and reranker models directly in Milvus, let's briefly review the two main categories of embedding vectors:

  • Dense vectors: high-dimensional vectors in which most elements are non-zero, well suited to representing text semantics.
  • Sparse vectors: high-dimensional vectors in which most elements are zero, well suited to exact keyword matching and out-of-domain retrieval.

Milvus supports both types of embedding vectors and offers hybrid search, which lets users search across multiple vector fields within the same collection. Vectors produced by different embedding models, or by different data-processing methods, can represent different characteristics of the same data.
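The difference between the two vector types is easiest to see with a toy illustration. The following sketch uses made-up numbers (not the output of any real model) and a hypothetical 30,522-term vocabulary:

```python
# A dense vector: almost every dimension is non-zero (illustrative values).
dense = [0.12, -0.03, 0.58, 0.44, -0.21, 0.07, 0.33, -0.15]

# A sparse vector over a large vocabulary: store only the non-zero entries
# as {dimension_index: weight}, e.g. term weights from a keyword model.
sparse = {17: 0.82, 4096: 1.33, 30001: 0.47}

# Density = fraction of dimensions that are non-zero.
dense_density = sum(1 for x in dense if x != 0) / len(dense)
sparse_density = len(sparse) / 30522  # assuming a 30,522-term vocabulary

print(f"dense density:  {dense_density:.2f}")   # 1.00
print(f"sparse density: {sparse_density:.5f}")  # 0.00010
```

Storing only the non-zero entries is what makes sparse vectors cheap to keep and fast to intersect, even when the nominal dimensionality is the size of a vocabulary.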

02. Using Embedding and Reranker Models in Milvus

Below, we will walk through three examples of using the integrated embedding models in Milvus to generate vectors and run vector searches.

Example 1: Generating Dense Vectors with the Default Embedding Function

To use the embedding and rerank functions, install the model package along with pymilvus:

```shell
pip install 'pymilvus[model]'
```

This step installs Milvus Lite, which also ships with the model package, so embedding and reranker models can be used directly.

The model package covers a wide range of embedding models, including models from OpenAI, Sentence Transformers, BGE-M3, BM25, SPLADE, and pretrained models from Jina AI.

In this example, we will use DefaultEmbeddingFunction, which wraps the all-MiniLM-L6-v2 Sentence Transformer model. The model is about 70 MB and is downloaded on first use.

```python
from pymilvus import model

# This will download "all-MiniLM-L6-v2", a lightweight model.
ef = model.DefaultEmbeddingFunction()

# Data from which embeddings are to be generated
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
embeddings = ef.encode_documents(docs)

print("Embeddings:", embeddings)
# Print dimension and shape of embeddings
print("Dim:", ef.dim, embeddings[0].shape)
```

The output is as follows:

```
Embeddings: [array([-3.09392996e-02, -1.80662833e-02,  1.34775648e-02,  2.77156215e-02,
       -4.86349640e-03, -3.12581174e-02, -3.55921760e-02,  5.76934684e-03,
        2.80773244e-03,  1.35783911e-01,  3.59678417e-02,  6.17732145e-02,
...
       -4.61330153e-02, -4.85207550e-02,  3.13997865e-02,  7.82178566e-02,
       -4.75336798e-02,  5.21207601e-02,  9.04406682e-02, -5.36676683e-02],
      dtype=float32)]
Dim: 384 (384,)
```
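Once documents are encoded, similarity search over dense vectors comes down to comparing them; in a real deployment Milvus does this at scale with indexes, but the core idea can be sketched in plain Python. The toy 4-dimensional vectors below are stand-ins for the real 384-dimensional embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document and query vectors (stand-ins for model output).
doc_vecs = {
    "doc_ai":     [0.9, 0.1, 0.0, 0.2],
    "doc_turing": [0.1, 0.8, 0.3, 0.0],
    "doc_london": [0.0, 0.2, 0.9, 0.1],
}
query_vec = [0.1, 0.7, 0.4, 0.0]

# Rank documents by similarity to the query, best first.
ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranked[0])  # the document whose vector points closest to the query
```

Cosine similarity only cares about the angle between vectors, which is why embeddings can be compared regardless of their magnitudes.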

Example 2: Generating Sparse Vectors with the BM25 Model

BM25 uses term frequencies to determine the relevance between queries and documents. In this example, we will show how to use BM25EmbeddingFunction to generate sparse vectors for queries and documents.

An important step when using BM25 is computing statistics over the documents to obtain the IDF (inverse document frequency). IDF measures how rare a word is: roughly, the inverse of the fraction of documents in the corpus that contain it.
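The fitting step can be mimicked by hand. The sketch below computes the smoothed IDF used in the standard BM25 formula (this illustrates the statistic itself, not PyMilvus internals, and uses a trivial whitespace tokenizer):

```python
import math

docs = [
    "artificial intelligence was founded as an academic discipline",
    "alan turing was the first person to conduct substantial research in ai",
    "born in london turing was raised in southern england",
]
tokenized = [d.split() for d in docs]  # naive whitespace tokenization
N = len(tokenized)

def idf(term):
    # df = number of documents containing the term.
    df = sum(1 for doc in tokenized if term in doc)
    # BM25's smoothed IDF: rare terms score high, ubiquitous terms near zero.
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

print(f"idf('london'): {idf('london'):.3f}")  # in 1 of 3 docs -> high
print(f"idf('turing'): {idf('turing'):.3f}")  # in 2 of 3 docs -> lower
print(f"idf('was'):    {idf('was'):.3f}")     # in all docs    -> near zero
```

This is why `fit` must see the whole corpus before encoding: the weight of each term in a sparse vector depends on how many documents contain it.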

```python
from pymilvus.model.sparse import BM25EmbeddingFunction

# 1. Prepare a small corpus to search
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
query = "Where was Turing born?"
bm25_ef = BM25EmbeddingFunction()

# 2. Fit the corpus to get BM25 model parameters on your documents.
bm25_ef.fit(docs)

# 3. Store the fitted parameters to expedite future processing.
bm25_ef.save("bm25_params.json")

# 4. Load the saved params
new_bm25_ef = BM25EmbeddingFunction()
new_bm25_ef.load("bm25_params.json")

docs_embeddings = new_bm25_ef.encode_documents(docs)
query_embeddings = new_bm25_ef.encode_queries([query])
print("Dim:", new_bm25_ef.dim, list(docs_embeddings)[0].shape)
```

Example 3: Using a Reranker

A search system needs to find the most relevant results quickly and efficiently. Traditionally, results are ranked by keyword matching with methods such as BM25 or TF-IDF. More recent approaches rerank results by the cosine similarity of embeddings; this is straightforward, but it can miss subtle semantic nuances and ignore the relationship between a document and the query's intent.

That is why we recommend using a reranker. A reranker is an advanced AI model that takes the initial results of a search and re-evaluates them so that they better match the user's intent. Rather than relying on surface-level term matching, a reranker considers the deeper relationship between the search query and the document content.
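The retrieve-then-rerank pattern itself is simple: score each (query, document) pair and re-sort the candidates. The following sketch mimics the shape of that interface with a hypothetical `rerank` function; the word-overlap scorer is a deliberately crude stand-in for a real reranker model, which would score the pair jointly with a neural network:

```python
from dataclasses import dataclass

@dataclass
class RerankResult:
    index: int   # position of the document in the input list
    score: float # relevance of the document to the query
    text: str

def rerank(query, documents, top_k=3):
    # Stand-in scorer: fraction of query words that appear in the document.
    # A real reranker (e.g. a cross-encoder) models the pair jointly instead
    # of counting surface matches.
    q_words = set(query.lower().split())
    def score(doc):
        return len(q_words & set(doc.lower().split())) / len(q_words)
    scored = [RerankResult(i, score(d), d) for i, d in enumerate(documents)]
    return sorted(scored, key=lambda r: r.score, reverse=True)[:top_k]

results = rerank("where was turing born", [
    "turing was born in london",
    "ai was founded in 1956",
])
print(results[0].text)  # → "turing was born in london"
```

Whatever the scoring model, the output contract is the same: results sorted by score, each carrying the original index so the caller can trace it back to the retrieved candidate.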

In this example, we will use the Jina AI Reranker.

```python
from pymilvus.model.reranker import JinaRerankFunction

jina_api_key = "<YOUR_JINA_API_KEY>"
rf = JinaRerankFunction("jina-reranker-v1-base-en", jina_api_key)

query = "What event in 1956 marked the official birth of artificial intelligence as a discipline?"
documents = [
    "In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.",
    "The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.",
    "In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.",
    "The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.",
]
results = rf(query, documents)

for result in results:
    print(f"Index: {result.index}")
    print(f"Score: {result.score:.6f}")
    print(f"Text: {result.text}\n")
```

The output is as follows:

```
Index: 1
Score: 0.937096
Text: The Dartmouth Conference in 1956 is considered the birthplace of artificial intelligence as a field; here, John McCarthy and others coined the term 'artificial intelligence' and laid out its basic goals.

Index: 3
Score: 0.354210
Text: The invention of the Logic Theorist by Allen Newell, Herbert A. Simon, and Cliff Shaw in 1955 marked the creation of the first true AI program, which was capable of solving logic problems, akin to proving mathematical theorems.

Index: 0
Score: 0.349866
Text: In 1950, Alan Turing published his seminal paper, 'Computing Machinery and Intelligence,' proposing the Turing Test as a criterion of intelligence, a foundational concept in the philosophy and development of artificial intelligence.

Index: 2
Score: 0.272896
Text: In 1951, British mathematician and computer scientist Alan Turing also developed the first program designed to play chess, demonstrating an early example of AI in game strategy.
```
