Using Elasticsearch: the match_phrase Query

2021-03-30 17:51:17

Introduction

Official Elasticsearch documentation

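Characteristics of match_phrase:

  • Term matching: every term produced by analyzing the query must match a term produced by analyzing the indexed text, and the relative positions (position) of those terms must be the same.
  • By default the relative positions after analysis must match exactly; the slop parameter relaxes this requirement.
  • When slop is used, documents in which the terms sit closer together score higher.
  • Phrase and proximity queries are more expensive than a simple match query. A match query only checks whether terms exist in the inverted index, whereas a match_phrase query must calculate and compare the positions of multiple, possibly repeated, terms.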
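
To make the difference between the two kinds of query concrete, here is a minimal Python sketch of a positional index. It is an illustration only and not how Lucene is implemented. The function names (`build_positions`, `match_terms`, `match_phrase`) are hypothetical, and the text is assumed to be tokenized one character per token, which is what the standard analyzer does for Chinese text.

```python
from collections import defaultdict


def build_positions(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, term in enumerate(tokens):
        positions[term].append(pos)
    return dict(positions)


def match_terms(index, query_tokens):
    """match-style check: is at least one query term present at all?"""
    return any(term in index for term in query_tokens)


def match_phrase(index, query_tokens):
    """match_phrase-style check with slop 0: every term must be present and
    consecutive query terms must sit at consecutive positions."""
    if not query_tokens or any(term not in index for term in query_tokens):
        return False
    # Try every occurrence of the first term as a possible phrase start.
    for start in index[query_tokens[0]]:
        if all(start + i in index[term] for i, term in enumerate(query_tokens)):
            return True
    return False


index = build_positions(list("我爱北京天安门"))
print(match_terms(index, list("我北")))    # True: both terms exist
print(match_phrase(index, list("我爱")))   # True: positions 0 and 1
print(match_phrase(index, list("我北")))   # False: "北" is at 2, not 1
```

The term check is a plain dictionary lookup, while the phrase check has to walk the position lists, which is where the extra cost comes from.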

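Summary:

1. With phrase queries, the default Elasticsearch standard analyzer (which splits text at a fine granularity) works best, because it gives the query tokens and the indexed tokens the greatest chance of matching.

2. Phrase queries (combined with slop) are particularly well suited to matching words that are not adjacent within a passage of text (for example articles, descriptions and other long text).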

Preparing the data

Create the index:
PUT test_phrase

Set the index mapping:
PUT /test_phrase/_mapping/_doc
{
    "properties": {
        "name": {
            "type":"text"
        }
    }
}
Result:
{
  "mapping": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text"
        }
      }
    }
  }
}

Insert a document:
PUT test_phrase/_doc/2
{
  "name":"我爱北京天安门"
}

Query the data:
POST test_phrase/_search
{
  "query": {"match_all": {}}
}
Result:
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_phrase",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "我爱北京天安门"
        }
      }
    ]
  }
}


Inspect the analyzed tokens:
POST test_phrase/_analyze
{
  "field": "name",   
  "text": "我爱北京天安门"
}
Result:
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "北",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "京",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "天",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "安",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "门",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    }
  ]
}
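
The token list above follows a simple pattern: each Chinese character becomes its own token, and positions increase by one per token. A minimal Python sketch that reproduces this output for text made up only of Chinese characters is shown below. The helper `analyze_ideographs` is hypothetical and only imitates the response shape; it does not handle Latin words, punctuation or any other analyzer behavior.

```python
def analyze_ideographs(text):
    """Mimic the _analyze output shown above for pure-CJK input:
    one token per character, position equal to the character index."""
    return [
        {
            "token": ch,
            "start_offset": i,
            "end_offset": i + 1,
            "type": "<IDEOGRAPHIC>",
            "position": i,
        }
        for i, ch in enumerate(text)
    ]


for tok in analyze_ideographs("我爱北京天安门"):
    print(tok["position"], tok["token"])
```

These positions are what a phrase query later compares, so it helps to keep them in mind when reading the examples that follow.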

Demonstration

Keyword "我"

POST test_phrase/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "我"
      }
    }
  }
}

Result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_phrase",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.2876821,
        "_source" : {
          "name" : "我爱北京天安门"
        }
      }
    ]
  }
}

Analysis:
POST test_phrase/_analyze
{
  "field": "name",   
  "text": "我"
}

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    }
  ]
}
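The query token "我" is at position 0. The indexed tokens of the document "我爱北京天安门" also contain "我" (at position 0). With a single term there is no relative position left to compare, so the phrase query is satisfied and the document is returned.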

Keyword "我爱"

POST test_phrase/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "我爱"
      }
    }
  }
}

Result:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.5753642,
    "hits" : [
      {
        "_index" : "test_phrase",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.5753642,
        "_source" : {
          "name" : "我爱北京天安门"
        }
      }
    ]
  }
}

Analysis:
POST test_phrase/_analyze
{
  "field": "name",   
  "text": "我爱"
}

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "爱",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    }
  ]
}
The query tokens of "我爱" have positions "我"-0 and "爱"-1.
The indexed tokens contain both "我" and "爱", and their relative positions ("我"-0, "爱"-1) match as well, so the document is returned.

Keyword "我北"

POST test_phrase/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "我北"
      }
    }
  }
}

Result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Analysis:
POST test_phrase/_analyze
{
  "field": "name",   
  "text": "我北"
}

{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "北",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    }
  ]
}

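In the query tokens, "我" is at position 0 and "北" is at position 1.
In the indexed tokens, "我" is at position 0 and "北" is at position 2.
Every query term exists among the indexed terms, but the relative positions differ (adjacent in the query, one position apart in the index), so the search returns no hits.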
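
The mismatch can be expressed as a number. One way to think about it, sketched in Python under the assumption that every term occurs only once in the document: subtract each term's query position from its document position, then take the spread between the largest and the smallest result. An exact phrase needs a spread of 0. The helper name `phrase_spread` is hypothetical, and the calculation is a simplified picture rather than the engine's actual code.

```python
def phrase_spread(doc_tokens, query_tokens):
    """Return how far the query terms are displaced relative to each other
    in the document, or None if a query term is missing.
    Assumes each term occurs only once in doc_tokens."""
    if not query_tokens:
        return None
    doc_pos = {term: pos for pos, term in enumerate(doc_tokens)}
    if any(term not in doc_pos for term in query_tokens):
        return None
    # Offset of each term: document position minus query position.
    offsets = [doc_pos[term] - q_pos for q_pos, term in enumerate(query_tokens)]
    return max(offsets) - min(offsets)


doc = list("我爱北京天安门")
print(phrase_spread(doc, list("我爱")))  # 0 -> exact phrase
print(phrase_spread(doc, list("我北")))  # 1 -> not an exact phrase
```

For "我北" the offsets are 0 and 1, so the spread is 1, which is more than an exact phrase query tolerates.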

Fix: set "slop": 1, which allows the terms to be moved by one position in total
POST test_phrase/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "我北",
        "slop": 1
      }
    }
  }
}
Result:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.37229446,
    "hits" : [
      {
        "_index" : "test_phrase",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.37229446,
        "_source" : {
          "name" : "我爱北京天安门"
        }
      }
    ]
  }
}
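
The scores in the responses so far can be reproduced by hand, which also shows why the sloppy match scores lower than an exact phrase. The sketch below assumes the BM25 similarity with its default parameters (k1 = 1.2, b = 0.75), a shard that holds only this one document, and a phrase frequency of 1 / (1 + distance) for a match that is displaced by `distance` positions. None of this is stated in the responses; these are assumptions that happen to agree with the numbers returned above. The function name `phrase_score` is hypothetical.

```python
import math


def phrase_score(num_terms, distance=0, k1=1.2, b=0.75):
    """BM25 score of a phrase in a one-document shard.

    num_terms: number of terms in the phrase (each contributes its idf)
    distance:  displacement of the sloppy match, 0 for an exact phrase
    """
    doc_count = 1   # assumed: one document in the shard
    doc_freq = 1    # that document contains every query term
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    freq = 1.0 / (1 + distance)   # closer terms -> higher frequency
    length_ratio = 1.0            # doc length / average length, one doc
    tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * length_ratio))
    return num_terms * idf * tf


print(phrase_score(1))              # compare with the score for "我"
print(phrase_score(2))              # compare with the score for "我爱"
print(phrase_score(2, distance=1))  # compare with "我北" and slop 1
```

Under these assumptions the three values agree with 0.2876821, 0.5753642 and 0.37229446 to about six decimal places; the small remainder comes from single-precision rounding in the responses.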

Keyword "爱北京"

POST test_phrase/_search
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "爱北京"
      }
    }
  }
}

Result:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.8630463,
    "hits" : [
      {
        "_index" : "test_phrase",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.8630463,
        "_source" : {
          "name" : "我爱北京天安门"
        }
      }
    ]
  }
}

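In the query tokens, "爱" is at position 0, "北" at position 1 and "京" at position 2.
In the indexed tokens, "爱" is at position 1, "北" at position 2 and "京" at position 3.
All query terms match indexed terms and their relative positions are the same, so the search succeeds.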

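Improving relevance

Using proximity to improve relevance

We can use a simple match query as a must clause. This query determines which documents are included in the result set. We can trim the long tail with the minimum_should_match parameter. Then we can add other, more specific queries as should clauses. Each one that matches increases the relevance of the matching documents.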

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {   # the must clause decides which documents are included in or excluded from the result set
          "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": {   # the should clause raises the relevance score of documents that match it
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}
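
When the query text comes from user input, the same request body can be assembled programmatically. The following minimal Python sketch only builds and prints the JSON body; it does not contact a cluster. The function name `proximity_boost_query` and its default values are hypothetical, chosen to mirror the request above.

```python
import json


def proximity_boost_query(field, text, minimum_should_match="30%", slop=50):
    """Build a bool query: `match` decides which documents are returned,
    `match_phrase` with slop only raises the score of close matches."""
    return {
        "query": {
            "bool": {
                "must": {
                    "match": {
                        field: {
                            "query": text,
                            "minimum_should_match": minimum_should_match,
                        }
                    }
                },
                "should": {
                    "match_phrase": {
                        field: {"query": text, "slop": slop}
                    }
                },
            }
        }
    }


body = proximity_boost_query("title", "quick brown fox")
print(json.dumps(body, indent=2))
```

Keeping the phrase clause under should means a document that contains the words far apart is still returned, just with a lower score.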
