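- Many of us, when building Chinese-language search, have picked up the IK Chinese word-segmentation plugin from GitHub.
- Then, when creating the mapping, it feels natural to use parameters like these (following the example in the plugin's official documentation):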
{
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
- Suppose this mapping is already in place on an index named test.
- Let's look at all the data (two documents, titled 打火车 and 火车):
curl 127.0.0.1:9200/test/_search | jq
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_score": 1,
"_source": {
"id": 1,
"title": "打火车"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "Video_2",
"_score": 1,
"_source": {
"id": 2,
"title": "火车"
}
}
]
}
}
- Now search for 打火车:
curl 127.0.0.1:9200/test/_search?q=打火车 | jq
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.21110919,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "Video_2",
"_score": 0.21110919,
"_source": {
"id": 2,
"title": "火车"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_score": 0.160443,
"_source": {
"id": 1,
"title": "打火车"
}
}
]
}
}
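- At this point we are surprised to find that 火车, with a score of 0.21110919, ranks higher than 打火车, with a score of 0.160443.
- Tracking this down took some digging. First, thanks to the https://github.com/mobz/elasticsearch-head plugin, which saved a lot of manual steps while inspecting the data.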
- Looking at how the document was actually tokenized then gave the answer:
curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"title": {
"field_statistics": {
"sum_doc_freq": 3,
"doc_count": 2,
"sum_ttf": 3
},
"terms": {
"打火": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 2
}
]
},
"火车": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 1,
"end_offset": 3
}
]
}
}
}
}
}
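- Surprisingly, 打火车 was split into the two terms 打火 and 火车, so this is clearly where the problem lies (although from the search engine's point of view nothing is wrong). The 火车 term in the 打火车 document does earn a score, but the extra term 打火 makes that field longer and pulls its score down, so the 火车 document ranks first.
- So I decided to set both analyzers to the same value: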
{
"properties": {
"title": {
"type": "text",
"analyzer": "ik_smart",
"search_analyzer": "ik_smart"
}
}
}
- Then look at the tokenized terms again (this time they are what we expected):
curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"title": {
"field_statistics": {
"sum_doc_freq": 3,
"doc_count": 2,
"sum_ttf": 3
},
"terms": {
"打": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 1
}
]
},
"火车": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 1,
"end_offset": 3
}
]
}
}
}
}
}
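- The two term-vector outputs can be compared against the query-side tokens with plain sets. This is only a sketch: the indexed tokens are copied from the outputs in this post, while the query tokens are an assumption (that ik_smart splits the query 打火车 the same way it splits the title). The names INDEXED, QUERY_TOKENS and matched_terms are made up for illustration.

```python
# Index-time tokens for the title "打火车", copied from the two
# _termvectors outputs in this post.
INDEXED = {
    "ik_max_word": {"打火", "火车"},
    "ik_smart": {"打", "火车"},
}

# Assumption: the search analyzer (ik_smart) splits the query "打火车"
# into the same tokens it produced for the title at index time.
QUERY_TOKENS = ["打", "火车"]


def matched_terms(index_analyzer):
    """Return the query tokens that exist in the index for this analyzer."""
    indexed = INDEXED[index_analyzer]
    return [token for token in QUERY_TOKENS if token in indexed]


for name in INDEXED:
    print(name, matched_terms(name))
```

With ik_max_word at index time only 火车 overlaps with the query, while with ik_smart on both sides both query tokens find a match in the 打火车 document.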
- Searching again, the scores and the ranking are now what we wanted:
curl 127.0.0.1:9200/test/_search?q=打火车 | jq
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 0.77041256,
"hits": [
{
"_index": "test",
"_type": "_doc",
"_id": "Video_1",
"_score": 0.77041256,
"_source": {
"id": 1,
"title": "打火车"
}
},
{
"_index": "test",
"_type": "_doc",
"_id": "Video_2",
"_score": 0.21110919,
"_source": {
"id": 2,
"title": "火车"
}
}
]
}
}
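- The scores from both searches can be checked by hand with the BM25 formula. The sketch below assumes the default parameters (k1 = 1.2, b = 0.75), a (k1 + 1) scaling factor, and that the query is split into 打 and 火车. The document count and average field length come from the field_statistics shown earlier (doc_count 2, sum_ttf 3). The function name bm25 and the variable names are made up for illustration.

```python
import math

K1, B = 1.2, 0.75          # assumed BM25 defaults, not read from the index
N_DOCS = 2                 # doc_count from field_statistics
AVG_LEN = 3 / 2            # sum_ttf / doc_count from field_statistics


def bm25(doc_freq, field_len, term_freq=1):
    """Score one query term against one document's title field."""
    idf = math.log(1 + (N_DOCS - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = term_freq / (term_freq + K1 * (1 - B + B * field_len / AVG_LEN))
    return (K1 + 1) * idf * tf


# Original mapping: only 火车 matches, and it occurs in both documents.
before_short = bm25(doc_freq=2, field_len=1)   # title 火车 (1 term)
before_long = bm25(doc_freq=2, field_len=2)    # title 打火车 (打火 + 火车)

# Both analyzers set to ik_smart: 打 (in one document) and 火车 (in both)
# now match the 打火车 document, so their scores add up.
after_long = bm25(doc_freq=1, field_len=2) + bm25(doc_freq=2, field_len=2)

print(round(before_short, 4), round(before_long, 4), round(after_long, 4))
```

Under these assumptions the three values agree with the reported scores 0.21110919, 0.160443 and 0.77041256 to about four decimal places. The first ranking came purely from field length, and the fix works because 打 becomes an additional matching term.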