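Introduction
Elasticsearch official documentation
Characteristics of match_phrase:
- Term matching: every term produced by analyzing the query must match a term produced by analyzing the indexed field, and the terms must keep the same relative positions.
- By default the relative positions after analysis must match exactly; the slop parameter relaxes this by tolerating terms that are a given number of positions out of place.
- When slop is used, documents whose matching terms sit closer together score higher.
- Phrase and proximity queries are more expensive than simple queries. A match query only checks whether the terms exist in the inverted index, whereas a match_phrase query must compute and compare the positions of multiple, possibly repeated, terms.
Summary:
1. For phrase queries, the default Elasticsearch standard analyzer (which splits text at a fine granularity) works best, because it gives the query terms and the indexed terms the greatest chance of matching.
2. Phrase queries are particularly suited to matching combinations of words that are not adjacent within a piece of text (for example articles, descriptions, and other long text).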
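The two matching conditions listed above (the terms must exist, and their relative positions must agree) can be sketched with a toy positional index. This is only an illustration under simplifying assumptions: text is split into one term per character, and the helper names `build_positional_index` and `phrase_match` are hypothetical, not part of any Elasticsearch API.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}, using one term per character."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_match(index, doc_id, query):
    """Return True if the query terms occur in doc_id at consecutive positions."""
    terms = list(query)
    if not terms:
        return False
    # Every candidate match starts at a position of the first query term.
    for start in index.get(terms[0], {}).get(doc_id, []):
        if all(
            start + offset in index.get(term, {}).get(doc_id, [])
            for offset, term in enumerate(terms)
        ):
            return True
    return False


docs = {"2": "我爱北京天安门"}
index = build_positional_index(docs)
print(phrase_match(index, "2", "我爱"))  # True: adjacent in the document
print(phrase_match(index, "2", "我北"))  # False: "爱" sits between them
```

The sketch only models exact phrase matching (slop 0); it does not attempt scoring.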
Preparing the data
Create the index:
PUT test_phrase
Set the index mapping:
PUT /test_phrase/_mapping/_doc
{
"properties": {
"name": {
"type":"text"
}
}
}
Result:
{
"mapping": {
"_doc": {
"properties": {
"name": {
"type": "text"
}
}
}
}
}
Insert data:
PUT test_phrase/_doc/2
{
"name":"我爱北京天安门"
}
Query data:
POST test_phrase/_search
{
"query": {"match_all": {}}
}
Result:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [
{
"_index" : "test_phrase",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "我爱北京天安门"
}
}
]
}
}
Inspect the analyzed terms:
POST test_phrase/_analyze
{
"field": "name",
"text": "我爱北京天安门"
}
Result:
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "爱",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "北",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "京",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "天",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "安",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
},
{
"token" : "门",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
}
]
}
Demonstration
Keyword "我"
POST test_phrase/_search
{
"query": {
"match_phrase": {
"name": {
"query": "我"
}
}
}
}
Result:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "test_phrase",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.2876821,
"_source" : {
"name" : "我爱北京天安门"
}
}
]
}
}
Analysis:
POST test_phrase/_analyze
{
"field": "name",
"text": "我"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
}
]
}
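The query analyzes to the single term "我" at position 0. The indexed terms of the document "我爱北京天安门" also contain "我" at position 0; with only one term there is no relative position to violate, so the phrase query is satisfied and the document is returned.
Keyword "我爱"
POST test_phrase/_search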
{
"query": {
"match_phrase": {
"name": {
"query": "我爱"
}
}
}
}
Result:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.5753642,
"hits" : [
{
"_index" : "test_phrase",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.5753642,
"_source" : {
"name" : "我爱北京天安门"
}
}
]
}
}
Analysis:
POST test_phrase/_analyze
{
"field": "name",
"text": "我爱"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "爱",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}
]
}
The query terms have positions "我"-0 and "爱"-1.
The indexed terms also contain "我" and "爱", at positions 0 and 1, so the relative positions agree and the document is returned.
Keyword "我北"
POST test_phrase/_search
{
"query": {
"match_phrase": {
"name": {
"query": "我北"
}
}
}
}
Result:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Analysis:
POST test_phrase/_analyze
{
"field": "name",
"text": "我北"
}
{
"tokens" : [
{
"token" : "我",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "北",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}
]
}
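In the query, "我" has position 0 and "北" has position 1.
In the index, "我" has position 0 and "北" has position 2.
Both query terms exist among the indexed terms, but their relative positions differ (1 apart in the query, 2 apart in the index), so no document is returned.
Fix: "slop": 1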
POST test_phrase/_search
{
"query": {
"match_phrase": {
"name": {
"query": "我北",
"slop": 1
}
}
}
}
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.37229446,
"hits" : [
{
"_index" : "test_phrase",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.37229446,
"_source" : {
"name" : "我爱北京天安门"
}
}
]
}
}
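Why a slop of 1 is enough for "我北" can be worked out with a small calculation. The sketch below is a simplified model, not the engine's actual code: it assumes one term per character, assumes no term repeats in the document or the query, and measures for each term how far it sits from where the query expects it. The function name `required_slop` is hypothetical.

```python
def required_slop(doc_text, query):
    """Smallest slop at which the query phrase matches doc_text, or None.

    Simplified model: one term per character, and every term occurs at most
    once, so each term has exactly one position in the document.
    """
    if not query:
        return None
    doc_positions = {term: pos for pos, term in enumerate(doc_text)}
    shifts = []
    for query_pos, term in enumerate(query):
        if term not in doc_positions:
            return None  # a missing term can never be fixed by slop
        # How far this term sits from where the query expects it.
        shifts.append(doc_positions[term] - query_pos)
    # If all terms are shifted by the same amount, the phrase is intact.
    return max(shifts) - min(shifts)


doc = "我爱北京天安门"
print(required_slop(doc, "我爱"))    # 0
print(required_slop(doc, "我北"))    # 1
print(required_slop(doc, "爱北京"))  # 0
```

Under this model a query matches when its required slop is less than or equal to the slop given in the request, which agrees with the results shown above: "我北" fails with the default slop of 0 and succeeds with slop 1.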
Keyword "爱北京"
POST test_phrase/_search
{
"query": {
"match_phrase": {
"name": {
"query": "爱北京"
}
}
}
}
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "test_phrase",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.8630463,
"_source" : {
"name" : "我爱北京天安门"
}
}
]
}
}
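In the query, "爱" has position 0, "北" has position 1, and "京" has position 2.
In the index, "爱" has position 1, "北" has position 2, and "京" has position 3.
All query terms match indexed terms, and their relative positions also agree, so the search succeeds.
Improving relevance
Using proximity to improve relevance
We can use a simple match query as a must clause. This query determines which documents are included in the result set, and the minimum_should_match parameter trims the long tail. We can then add more specific queries as should clauses; each one that matches increases the relevance of the matching document.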
GET /my_index/my_type/_search
{
"query": {
"bool": {
"must": {
"match": { #the must clause includes or excludes documents from the result set
"title": {
"query": "quick brown fox",
"minimum_should_match": "30%"
}
}
},
"should": {
"match_phrase": { #the should clause increases the relevance score of documents that match
"title": {
"query": "quick brown fox",
"slop": 50
}
}
}
}
}
}
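When the request is sent from application code rather than a console, the same body can be assembled as a plain dictionary, which also keeps the annotations out of the JSON itself. The following is a minimal sketch; the helper name `proximity_boost_query` and its parameter names are hypothetical, and it only builds and prints the body without contacting a cluster.

```python
import json


def proximity_boost_query(field, text, min_should_match="30%", slop=50):
    """Build a bool query body (hypothetical helper, not an Elasticsearch API)."""
    return {
        "query": {
            "bool": {
                # must: decides which documents are included in the results
                "must": {
                    "match": {
                        field: {
                            "query": text,
                            "minimum_should_match": min_should_match,
                        }
                    }
                },
                # should: raises the score of documents whose terms are close together
                "should": {
                    "match_phrase": {
                        field: {"query": text, "slop": slop}
                    }
                },
            }
        }
    }


body = proximity_boost_query("title", "quick brown fox")
print(json.dumps(body, indent=2))
```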