For a Dish of Vinegar, I Made Two Batches of Dumplings

Last Friday happened to be my company's mental health day, so together with the weekend I had three days off.

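That gave me three days to catch up on drafts for my Geek Time (极客时间) column, "Rust 第一课". I figured three days would surely be enough for two articles. Instead I worked all of Friday and squeezed out just one.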

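So did I slack off on Saturday and Sunday? Or put on some outlandish costume and go trick-or-treating with Xiaobao and Xiaobei? Not at all. Early Saturday morning I sat down at the computer to prepare the next article: "Hands-on project: building Python3/nodejs modules with pyo3/neon".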

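This topic should have been effortless to write. But at the start of the column I had added an impromptu "get hands dirty" series and, in my enthusiasm, had already worked through pyo3 and neon there, so this article would have had nothing new to offer.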

So I asked myself: what is complex enough, useful enough, and still missing from the Python or nodejs community?

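After some thought, I settled on an embedded search engine. Search engine servers are everywhere in both the open source and commercial worlds: elasticsearch, with its solid reputation in open source; Algolia, the benchmark among startups; and Rust has its own meilisearch, sonic and quickwit. All of these servers can be reached from clients written in any language. But if you want to embed a search engine in your own system, one that needs no server, runs locally, indexes locally, and can still handle large volumes of data, Python and nodejs offer few good options. Python is the worse off: nodejs at least has flexsearch to show for itself, whereas, as far as my limited knowledge goes, Python has nothing comparable. The lucene-based pylucene works fine, but having to start a Java VM inside Python just to run a Python search engine always sticks in the throat. That is why I chose an embedded search engine as the example for showing how Rust can support Python and nodejs.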

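In Rust we have tantivy, an embedded search engine that performs quite well. It is also the library underneath quickwit (the quickwit team maintains tantivy). See https://tantivy-search.github.io/bench/ for a performance comparison of tantivy, lucene, pisa, bleve and rucene. By and large tantivy is on par with pisa, which is written in C++, while offering a more complete feature set.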

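However, tantivy already has its own tantivy-py, so building something similar would add little. Skimming the tantivy-py code, I found it is essentially a thin wrapper over the Rust library, and because tantivy positions itself as a low-level implementation, its API is not especially friendly. tantivy-py is therefore hardly a foolproof option for search engine beginners. That leaves room for a Python search library with a different focus. I would like its API to feel like this: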


In [1]: from xunmi import *
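# load (or create) the index directly from a configuration file
In [2]: indexer = Indexer("./fixtures/config.yml")
# get a handle for modifying the index
In [3]: updater = indexer.get_updater()

In [4]: f = open("./fixtures/wiki_00.xml")
# read the data to be indexed
In [5]: data = f.read()

In [6]: f.close()
# provide a few simple rules for normalizing the data (data type, field renaming, type conversion)
# xml / json / yml input is supported; the data must match the index schema, otherwise
# mapping and conversion rules are needed to transform it
In [7]: input_config = InputConfig("xml", [("$value", "content")], [("id", ("string", "number"))])
# update the index
In [8]: updater.update(data, input_config)
# search the updated index over the "title" and "content" fields, returning 5 results starting at offset 0
In [9]: result = indexer.search("历史", ["title", "content"], 5, 0)
# each result contains a score and the data stored in the index (content is indexed here but not stored)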
In [10]: result
Out[10]: 
[(13.932347297668457,
  '{"id":[22],"title":["历史"],"url":["https://zh.wikipedia.org/wiki?curid=22"]}'),
 (11.62932014465332,
  '{"id":[399],"title":["非洲历史"],"url":["https://zh.wikipedia.org/wiki?curid=399"]}'),
 (11.526201248168945,
  '{"id":[2239],"title":["美国历史"],"url":["https://zh.wikipedia.org/wiki?curid=2239"]}'),
 (11.521516799926758,
  '{"id":[374],"title":["亚洲历史"],"url":["https://zh.wikipedia.org/wiki?curid=374"]}'),
 (11.342496871948242,
  '{"id":[182],"title":["中国历史"],"url":["https://zh.wikipedia.org/wiki?curid=182"]}')]

The index configuration file looks like this:


---
path: /tmp/searcher_index # index path
schema: # index schema; text fields use CANG_JIE for Chinese word segmentation
  - name: id
    type: u64
    options:
      indexed: true
      fast: single
      stored: true
  - name: url
    type: text
    options:
      indexing: ~
      stored: true
  - name: title
    type: text
    options:
      indexing:
        record: position
        tokenizer: CANG_JIE
      stored: true
  - name: content
    type: text
    options:
      indexing:
        record: position
        tokenizer: CANG_JIE
      stored: false
text_lang:
  chinese: true # if true, traditional Chinese is converted to simplified automatically
writer_memory: 100000000

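So before writing the article, I first had to use pyo3 to wrap Rust code as an extension module that Python can call. Only with that code in hand could I write the article. And before writing that code, I had to write a Rust library that wraps tantivy behind a friendlier API.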

Hence the first batch of dumplings: xunmi (寻觅, "to seek").

I wrote a simple wrapper around tantivy and put it on GitHub at tyrchen/xunmi. Usage:


use std::str::FromStr;
use xunmi::*;

fn main() {
    // you can load a human-readable configuration (e.g. yaml)
    let config = IndexConfig::from_str(include_str!("../fixtures/config.yml")).unwrap();

    // then open or create the index based on the configuration
    let indexer = Indexer::open_or_create(config).unwrap();

    // then you can get the updater for adding / updating index
    let mut updater = indexer.get_updater().unwrap();

    // data to index could come from json / yaml / xml, as long as they're compatible with schema
    let content = include_str!("../fixtures/wiki_00.xml");

    // you could even provide mapping for renaming fields and converting data types
    // e.g. index schema has id as u64, content as text, but xml has id as string
    // and it doesn't have a content field, instead the $value of the doc is the content.
    // so we can use mapping / conversion to normalize these.
    let config = InputConfig::new(
        InputType::Xml,
        vec![("$value".into(), "content".into())],
        vec![("id".into(), (ValueType::String, ValueType::Number))],
    );

    // you could use add() or update() to add data into the search index
    // if you add, it will insert new docs; if you update, and if the doc
    // contains an "id" field, updater will first delete the term matching
    // id (so id shall be unique), then insert new docs.
    // all data added/deleted will be committed.
    updater.update(content, &config).unwrap();

    // by default the indexer will be auto reloaded upon every commit,
    // but that has delays in tens of milliseconds, so for this example,
    // we shall reload immediately.
    indexer.reload().unwrap();

    println!("total: {}", indexer.num_docs());

    // you could provide a query and fields you want to search
    let result = indexer.search("历史", &["title", "content"], 5, 0).unwrap();
    for (score, doc) in result.iter() {
        println!("score: {}, doc: {:?}", score, doc);
    }
}

As you can see, it is almost identical to the Python example.
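The mapping and conversion rules passed to InputConfig can be pictured with a toy model. The sketch below is not xunmi's implementation: `Value`, `normalize` and the field names are invented for illustration, and it only covers the two rules used in the example (rename a field, convert a string to a number).

```rust
use std::collections::HashMap;

/// Toy value type standing in for a JSON value (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Num(u64),
}

/// Renames fields according to `mapping`, then converts the listed
/// string fields to numbers. Fields that fail to parse are left as-is.
fn normalize(
    mut doc: HashMap<String, Value>,
    mapping: &[(&str, &str)],
    to_number: &[&str],
) -> HashMap<String, Value> {
    for (from, to) in mapping {
        if let Some(v) = doc.remove(*from) {
            doc.insert(to.to_string(), v);
        }
    }
    for field in to_number {
        let parsed = match doc.get(*field) {
            Some(Value::Str(s)) => s.parse::<u64>().ok(),
            _ => None,
        };
        if let Some(n) = parsed {
            doc.insert(field.to_string(), Value::Num(n));
        }
    }
    doc
}

fn main() {
    let mut doc = HashMap::new();
    doc.insert("id".to_string(), Value::Str("22".to_string()));
    doc.insert("$value".to_string(), Value::Str("some text".to_string()));

    // "$value" becomes "content", and the string id becomes a number.
    let doc = normalize(doc, &[("$value", "content")], &["id"]);
    println!("{:?}", doc);
}
```

After normalization the document has the shape the schema expects: a numeric `id` and a `content` field.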

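While writing xunmi, I found that the tools for converting traditional Chinese to simplified were not ideal. I first tried character_converter, a crate with decent download numbers that labels itself 1.0, and found that converting a single Wikipedia article was visibly slow. Then I found the impressive-looking opencc, written in C++, and its binding opencc-rust. Unfortunately opencc-rust is not well packaged: it will not compile unless opencc is already installed on the system, and when I ran it in a GitHub Action it still failed to compile even after "apt install opencc". So I got the idea of writing my own. After all, isn't this just a mapping from traditional characters to simplified ones? That should be a hundred or two lines of code: generate a mapping table at compile time, then convert the string character by character at run time.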
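The "mapping table plus character-by-character lookup" idea can be sketched in a few lines. This is only an illustration, not the fast2s code: the four-entry table and the names `sample_table` and `t2s` are made up here, and the table is built at run time for brevity rather than generated at compile time.

```rust
use std::collections::HashMap;

/// Builds a tiny traditional -> simplified table (illustrative subset only).
fn sample_table() -> HashMap<char, char> {
    [('見', '见'), ('體', '体'), ('語', '语'), ('漢', '汉')]
        .iter()
        .copied()
        .collect()
}

/// Converts `input` one character at a time.
/// Characters without a table entry pass through unchanged.
fn t2s(input: &str, table: &HashMap<char, char>) -> String {
    input
        .chars()
        .map(|c| table.get(&c).copied().unwrap_or(c))
        .collect()
}

fn main() {
    let table = sample_table();
    println!("{}", t2s("漢語 Rust", &table));
}
```

A plain per-character lookup like this ignores words whose characters must stay unconverted in context, which is the kind of special case a real converter has to handle separately.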

So I started on the second batch of dumplings: fast2s (tyrchen/fast2s).

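While working on it, I remembered the fst crate, which I had always admired without knowing where to use it. fst uses finite state automata to query ordered sets and maps. It is somewhat slower than a HashMap but extremely economical with memory. Granted, traditional-to-simplified conversion involves only about two thousand characters, so the memory savings are small, but I felt I had found a use case for fst and was itching to try it. fast2s needs a traditional-to-simplified conversion table, and while looking for one I came across simplet2s-rs, so I borrowed its table. I quickly wrote a first version and compared it with several existing libraries: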


| tests | fast2s | simplet2s-rs | opencc-rust | character_converter |
| ----- | ------ | ------------ | ----------- | ------------------- |
| zht   | 446us  | 616us        | 5.08ms      | 1.23s               |
| zhc   | 491us  | 798us        | 6.08ms      | 2.87s               |
| en    | 68us   | 2.82ms       | 12.24ms     | 26.11s              |

Test result (mutate existing string):

| tests | fast2s | simplet2s-rs | opencc-rust | character_converter |
| ----- | ------ | ------------ | ----------- | ------------------- |
| zht   | 438us  | N/A          | N/A         | N/A                 |
| zhc   | 503us  | N/A          | N/A         | N/A                 |
| en    | 34us   | N/A          | N/A         | N/A                 |

The winners were fast2s and simplet2s-rs.

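In theory, fast2s built on fst should be slower than the HashMap-based simplet2s, so the result surprised me. Reading the simplet2s-rs code, I realized there were some special cases I had not handled. So I adapted simplet2s's special-case table, replacing strings with character arrays, which avoids an extra pointer hop when the hash table is accessed (the hash table lecture in my Rust column explains the difference between the two):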


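// fast2s code: keys and values both use chars / char arrays
// (`Word` is not shown in this excerpt; presumably a two-char array such as `[char; 2]`)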
// thanks https://github.com/bosondata/simplet2s-rs/blob/master/src/lib.rs#L8 for this special logic
// Traditional Chinese -> Not convert case
static ref T2S_EXCLUDE: HashMap<char, HashSet<Word>> = {
    hashmap!{
        '兒' => hashset!{['兒','寬']},
        '覆' => hashset!{['答', '覆'], ['批','覆'], ['回','覆']},
        '夥' => hashset!{['甚','夥']},
        '藉' => hashset!{['慰','藉'], ['狼','藉']},
        '瞭' => hashset!{['瞭','望']},
        '麽' => hashset!{['幺','麽']},
        '幺' => hashset!{['幺','麽']},
        '於' => hashset!{['樊','於']}
    }
};

// simplet2s code: keys and values both use strings
// Traditional Chinese -> Not convert case
static ref T2S_EXCLUDE: HashMap<&'static str, HashSet<&'static str>> = {
    hashmap!{
        "兒" => hashset!{"兒寬"},
        "覆" => hashset!{"答覆", "批覆", "回覆"},
        "夥" => hashset!{"甚夥"},
        "藉" => hashset!{"慰藉", "狼藉"},
        "瞭" => hashset!{"瞭望"},
        "麽" => hashset!{"幺麽"},
        "幺" => hashset!{"幺麽"},
        "於" => hashset!{"樊於"}
    }
};
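How such an exclusion table might be consulted during conversion is sketched here. This is a guess at the general shape, not the actual fast2s or simplet2s-rs logic: `Word` is assumed to be `[char; 2]`, the table is trimmed to two keys, and `is_excluded` is a hypothetical helper. Note that the key character can be either the first or the second character of a listed word, so both neighbours are checked.

```rust
use std::collections::{HashMap, HashSet};

/// Assumed definition: a two-character word stored inline as an array.
type Word = [char; 2];

/// A trimmed-down exclusion table (two keys only, for illustration).
fn exclude_table() -> HashMap<char, HashSet<Word>> {
    let mut m: HashMap<char, HashSet<Word>> = HashMap::new();
    m.insert(
        '覆',
        [['答', '覆'], ['批', '覆'], ['回', '覆']].iter().copied().collect(),
    );
    m.insert('瞭', [['瞭', '望']].iter().copied().collect());
    m
}

/// Returns true when the character at index `i` is part of a listed
/// two-character word and should therefore be left unconverted.
fn is_excluded(chars: &[char], i: usize, table: &HashMap<char, HashSet<Word>>) -> bool {
    let words = match table.get(&chars[i]) {
        Some(words) => words,
        None => return false,
    };
    // The key may be the second character of the word ...
    let as_second = i > 0 && words.contains(&[chars[i - 1], chars[i]]);
    // ... or the first one.
    let as_first = i + 1 < chars.len() && words.contains(&[chars[i], chars[i + 1]]);
    as_second || as_first
}

fn main() {
    let table = exclude_table();
    let chars: Vec<char> = "回覆".chars().collect();
    println!("keep 覆 as-is: {}", is_excluded(&chars, 1, &table));
}
```

A `[char; 2]` key is stored inline in the set, whereas a `&'static str` key is a pointer plus a length referring to bytes stored elsewhere, so hashing and comparing it means following that pointer.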

With the special cases handled, fast2s and simplet2s-rs give much the same results. And thanks to a few specific optimizations, fast2s still performs on par with simplet2s even though it uses fst:


| tests | fast2s | simplet2s-rs | opencc-rust | character_converter |
| ----- | ------ | ------------ | ----------- | ------------------- |
| zht   | 596us  | 579us        | 4.93ms      | 1.23s               |
| zhc   | 643us  | 750us        | 5.89ms      | 2.87s               |
| en    | 59us   | 2.68ms       | 11.46ms     | 26.11s              |

Test result (mutate existing string):

| tests | fast2s | simplet2s-rs | opencc-rust | character_converter |
| ----- | ------ | ------------ | ----------- | ------------------- |
| zht   | 524us  | N/A          | N/A         | N/A                 |
| zhc   | 609us  | N/A          | N/A         | N/A                 |
| en    | 48us   | N/A          | N/A         | N/A                 |

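fast2s offers not only direct conversion but also the ability to modify an existing string in place rather than producing a new one. This matters when converting large strings or files (a file can be mmapped), because it saves memory allocation and usage.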
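One way in-place conversion can work in Rust is sketched here. It is an assumption about the technique, not the fast2s implementation: `convert_in_place` is a hypothetical function, it uses a plain HashMap rather than fst, and it only overwrites a character when the replacement has the same UTF-8 byte length, so the buffer is never resized.

```rust
use std::collections::HashMap;

/// Rewrites `text` in place without allocating a new String.
/// A replacement is applied only if it occupies exactly as many UTF-8
/// bytes as the original character; otherwise the character is kept.
fn convert_in_place(text: &mut String, table: &HashMap<char, char>) {
    let mut i = 0;
    while i < text.len() {
        // `i` always sits on a char boundary, so this slice is valid.
        let ch = text[i..].chars().next().unwrap();
        let len = ch.len_utf8();
        if let Some(&to) = table.get(&ch) {
            if to.len_utf8() == len {
                // SAFETY: one complete code point is overwritten with another
                // valid code point of the same byte length, so the buffer
                // remains valid UTF-8.
                let bytes = unsafe { text.as_mut_str().as_bytes_mut() };
                to.encode_utf8(&mut bytes[i..i + len]);
            }
        }
        i += len;
    }
}

fn main() {
    let mut table = HashMap::new();
    table.insert('見', '见');
    table.insert('體', '体');

    let mut s = String::from("再見 abc 體");
    convert_in_place(&mut s, &table);
    println!("{}", s);
}
```

The same loop could in principle run over a memory-mapped file treated as a mutable UTF-8 buffer, which is where avoiding a second copy of the data pays off.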

By the time the second batch of dumplings, fast2s, was wrapped up, most of Saturday was gone.

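On Sunday I turned back to the first batch, xunmi. By the time xunmi was in shape and I had finished xunmi-py, which the article needs, it was midnight on Sunday. For the 96 lines of code in xunmi-py, I paid the price of nearly 700 lines (300 + 377):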


fast2s
❯ tokei .
-------------------------------------------------------------------------------
 Language            Files        Lines         Code     Comments       Blanks
-------------------------------------------------------------------------------
 Markdown                3           56           56            0            0
 Rust                    5          365          300           13           52
 Plain Text              8         6428         6428            0            0
 TOML                    3          259           99          142           18
-------------------------------------------------------------------------------
 Total                  19         7108         6883          155           70
-------------------------------------------------------------------------------

xunmi
❯ tokei .
-------------------------------------------------------------------------------
 Language            Files        Lines         Code     Comments       Blanks
-------------------------------------------------------------------------------
 Markdown                2           62           62            0            0
 Rust                    5          458          377           19           62
 TOML                    2          239           84          142           13
 XML                     1        63054        59462            0         3592
 YAML                    3          348          344            0            4
-------------------------------------------------------------------------------
 Total                  13        64161        60329          161         3671
-------------------------------------------------------------------------------

geek-time-rust-resources/31/xunmi-py
❯ tokei .
-------------------------------------------------------------------------------
 Language            Files        Lines         Code     Comments       Blanks
-------------------------------------------------------------------------------
 Makefile                1           24           18            0            6
 Python                  1            1            1            0            0
 Rust                    2          112           96            0           16
 TOML                    1           19           14            0            5
 XML                     1        63054        59462            0         3592
 YAML                    2          321          318            0            3
-------------------------------------------------------------------------------
 Total                   9        63531        59909            0         3622
-------------------------------------------------------------------------------

Although making the dumplings took far longer than I expected, I learned some wonderful things along the way.

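For example, I had long struggled with how to bring data from multiple sources (json / yaml / xml / ...) into one scheme without defining Rust structs (if you can define a struct, serde can convert into it directly). I even started in the wrong direction, trying to auto-detect the text format and then convert everything to JSON (the detection and conversion code is rather challenging).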

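Later I found that, thanks to serde, I can use the deserializer from serde_xml_rs to turn XML text into a serde_json Value. It is like transplanting a pig's intestine into a cow's belly and finding there is no rejection: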


// from_str returns a Result, so the error has to be propagated (or unwrapped)
let data: serde_json::Value = serde_xml_rs::from_str(&input)?;

Magical, right?

So handling multiple data sources uniformly boils down to the following, simple enough that you can hardly believe your eyes:


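use serde_json::Value as JsonValue; // assumed alias; the excerpt does not show it
use std::borrow::Cow;

// `Result`, `DocParsingError`, `InputConfig` and `InputType` are defined or imported elsewhere in the crate.
pub type JsonObject = serde_json::Map<String, JsonValue>;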
pub struct JsonObjects(Vec<JsonObject>);

impl JsonObjects {
    pub fn new(input: &str, config: &InputConfig, t2s: bool) -> Result<Self> {
        let input = match t2s {
            true => Cow::Owned(fast2s::convert(input)),
            false => Cow::Borrowed(input),
        };
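        // quote the first 20 chars (not bytes), so multi-byte text such as
        // Chinese cannot be sliced in the middle of a code point
        let err_fn = || {
            let head: String = input.chars().take(20).collect();
            DocParsingError::NotJson(format!("Failed to parse: {:?}...", head))
        };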
        // first try to parse the input as a list of objects
        let result: std::result::Result<Vec<JsonObject>, _> = match config.input_type {
            InputType::Json => serde_json::from_str(&input).map_err(|_| err_fn()),
            InputType::Yaml => serde_yaml::from_str(&input).map_err(|_| err_fn()),
            InputType::Xml => serde_xml_rs::from_str(&input).map_err(|_| err_fn()),
        };

        let data = match result {
            Ok(v) => v,
            // otherwise fall back to a single object and wrap it in a vec
            Err(_) => {
                let obj: JsonObject = match config.input_type {
                    InputType::Json => serde_json::from_str(&input).map_err(|_| err_fn())?,
                    InputType::Yaml => serde_yaml::from_str(&input).map_err(|_| err_fn())?,
                    InputType::Xml => serde_xml_rs::from_str(&input).map_err(|_| err_fn())?,
                };
                vec![obj]
            }
        };

        Ok(Self(data))
    }
}

That's all about the dumplings for now. The vinegar will be ready to taste next week :)

A Zen Moment

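You ask why, two years on from "9102" (internet slang for 2019), I still bother to support the seemingly outdated xml? Hmm... because a lot of data sources are still xml, such as the Wikipedia dump.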

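As for xunmi, the current approach is not yet good enough. When adding documents to the index, it should use channels to split the processing into several stages, so that adding to the index does not interfere with queries and the overall experience for Python users improves.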
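A staged pipeline of that kind could look roughly like the following, using only std::sync::mpsc. This is a sketch under assumptions, not xunmi's design: `run_pipeline`, the "id,title" input format and the two stages are invented for illustration, and the second stage merely collects documents where a real one would write to the index.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs a two-stage pipeline: stage 1 parses raw "id,title" lines,
/// stage 2 stands in for the index writer. Returns the documents in
/// the order the second stage received them.
fn run_pipeline(raw: Vec<String>) -> Vec<(u64, String)> {
    let (raw_tx, raw_rx) = mpsc::channel::<String>();
    let (doc_tx, doc_rx) = mpsc::channel::<(u64, String)>();

    // Stage 1: parse lines; malformed lines are skipped.
    let parser = thread::spawn(move || {
        for line in raw_rx {
            let mut parts = line.splitn(2, ',');
            if let (Some(id), Some(title)) = (parts.next(), parts.next()) {
                if let Ok(id) = id.trim().parse::<u64>() {
                    // Stop quietly if the next stage has gone away.
                    if doc_tx.send((id, title.to_string())).is_err() {
                        break;
                    }
                }
            }
        }
    });

    // Stage 2: a real implementation would add each document to the index.
    let indexer = thread::spawn(move || {
        let mut indexed = Vec::new();
        for doc in doc_rx {
            indexed.push(doc);
        }
        indexed
    });

    // The caller only enqueues work, so it is free to serve queries meanwhile.
    for line in raw {
        raw_tx.send(line).unwrap();
    }
    drop(raw_tx); // closing the channel lets both stages drain and finish

    parser.join().unwrap();
    indexer.join().unwrap()
}

fn main() {
    let docs = run_pipeline(vec!["22,历史".to_string(), "182,中国历史".to_string()]);
    println!("{:?}", docs);
}
```

Because each stage owns its end of a channel, the caller's thread never blocks on parsing or index writes. It blocks only in the final `join` calls, which a long-running service would replace with a shutdown signal.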

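When I have time, I will keep working on this batch of dumplings until the wrappers are thin and the filling is generous.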
