# Hands-on: controlling the precision of full-text search results


Add test data: give each document a title field

POST /forum/article/_bulk
{ "update": { "_id": "1"} }
{ "doc" : {"title" : "this is java and elasticsearch blog"} }
{ "update": { "_id": "2"} }
{ "doc" : {"title" : "this is java blog"} }
{ "update": { "_id": "3"} }
{ "doc" : {"title" : "this is elasticsearch blog"} }
{ "update": { "_id": "4"} }
{ "doc" : {"title" : "this is java, elasticsearch, hadoop blog"} }
{ "update": { "_id": "5"} }
{ "doc" : {"title" : "this is spark blog"} }


# 1. Search for blogs whose title contains java or elasticsearch

term query: searches for an exact value
match query: full-text search

If the field being searched is not_analyzed, a match query behaves the same as a term query.

GET /forum/article/_search
{
    "query": {
        "match": {
            "title": "java elasticsearch"
        }
    }
}


This returns 4 documents.
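
To see why four documents come back, here is a toy model in Python. It is only a sketch of the default "or" behaviour, not how Elasticsearch is implemented; `analyze` and `match_or` are hypothetical helpers, and `analyze` is a rough stand-in for the standard analyzer.

```python
import re

# Titles from the bulk update above, keyed by _id.
TITLES = {
    "1": "this is java and elasticsearch blog",
    "2": "this is java blog",
    "3": "this is elasticsearch blog",
    "4": "this is java, elasticsearch, hadoop blog",
    "5": "this is spark blog",
}


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def match_or(query):
    """Return the _ids whose title shares at least one term with the query."""
    terms = analyze(query)
    return sorted(doc_id for doc_id, title in TITLES.items() if terms & analyze(title))


print(match_or("java elasticsearch"))
```

Only document 5 ("this is spark blog") contains neither term, so it is the one left out.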

# 2. Search for blogs whose title contains java and elasticsearch

GET /forum/article/_search
{
    "query": {
        "match": {
            "title": {
                "query": "java elasticsearch",
                "operator": "and"
            }
        }
    }
}


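The results contain both keywords, java and elasticsearch. This is still not the exact-value matching of a term query: the query string is analyzed and each term is matched on its own.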
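
The effect of `"operator": "and"` can be sketched with the same kind of toy model in Python. This is an illustration under simplifying assumptions, not Elasticsearch's implementation; `analyze` and `match_and` are hypothetical helpers.

```python
import re

# Titles from the bulk update above, keyed by _id.
TITLES = {
    "1": "this is java and elasticsearch blog",
    "2": "this is java blog",
    "3": "this is elasticsearch blog",
    "4": "this is java, elasticsearch, hadoop blog",
    "5": "this is spark blog",
}


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def match_and(query):
    """Return the _ids whose title contains every term of the query."""
    terms = analyze(query)
    return sorted(doc_id for doc_id, title in TITLES.items() if terms <= analyze(title))


print(match_and("java elasticsearch"))
```

Documents 1 and 4 are the only ones whose titles contain both terms. Word order and the extra words in between do not matter, which is why this is not the same as an exact match on the whole string.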

# 3. Search for blogs that contain at least 3 of the 4 keywords java, elasticsearch, spark, hadoop

GET /forum/article/_search
{
    "query": {
        "match": {
            "title": {
                "query": "java elasticsearch spark hadoop",
                "minimum_should_match": "75%"
            }
        }
    }
}


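minimum_should_match: the minimum number of keywords a document must match in order to be returned. The default is one. Here, 75% of 4 keywords means at least 3.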
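
The arithmetic behind a percentage value can be sketched in Python as follows. This is a toy model under the assumption that the percentage is converted to a term count and rounded down; `required_matches`, `match_minimum` and `analyze` are hypothetical helpers, not Elasticsearch APIs.

```python
import re

# Titles from the bulk update above, keyed by _id.
TITLES = {
    "1": "this is java and elasticsearch blog",
    "2": "this is java blog",
    "3": "this is elasticsearch blog",
    "4": "this is java, elasticsearch, hadoop blog",
    "5": "this is spark blog",
}


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def required_matches(term_count, minimum_should_match):
    """Convert a percentage such as "75%" into a term count, rounding down."""
    percent = int(minimum_should_match.rstrip("%"))
    return term_count * percent // 100


def match_minimum(query, minimum_should_match):
    """Return the _ids whose title contains at least the required number of query terms."""
    terms = analyze(query)
    needed = required_matches(len(terms), minimum_should_match)
    return sorted(
        doc_id for doc_id, title in TITLES.items()
        if len(terms & analyze(title)) >= needed
    )


print(required_matches(4, "75%"))
print(match_minimum("java elasticsearch spark hadoop", "75%"))
```

With the test data, only document 4 contains three of the four keywords (java, elasticsearch, hadoop), so it is the only hit.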

# 4. Combine multiple search conditions with bool to search on title

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "java" }},
      "must_not": { "match": { "title": "spark"  }},
      "should": [
          { "match": { "title": "hadoop" }},
          { "match": { "title": "elasticsearch"   }}
      ]
    }
  }
}


To explain this query: the title must contain java, must not contain spark, and may or may not contain hadoop and elasticsearch.
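
Which documents pass this bool query can be sketched in Python. This is a toy model of the selection step only, with a hypothetical `bool_filter` helper; note that the should terms play no part in deciding which documents are returned when a must clause is present.

```python
import re

# Titles from the bulk update above, keyed by _id.
TITLES = {
    "1": "this is java and elasticsearch blog",
    "2": "this is java blog",
    "3": "this is elasticsearch blog",
    "4": "this is java, elasticsearch, hadoop blog",
    "5": "this is spark blog",
}


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def bool_filter(must, must_not):
    """Return the _ids that contain every must term and none of the must_not terms."""
    hits = []
    for doc_id, title in TITLES.items():
        terms = analyze(title)
        if all(t in terms for t in must) and not any(t in terms for t in must_not):
            hits.append(doc_id)
    return sorted(hits)


print(bool_filter(must=["java"], must_not=["spark"]))
```

Documents 1, 2 and 4 pass. Document 2 contains neither hadoop nor elasticsearch and is still returned.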

Used this way, should may look unnecessary. So what is it for? The next section explains.

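# 5. How the relevance score is calculated when bool combines multiple conditions

The scores of the matching must and should clauses are added together, then divided by the total number of must and should clauses.

Ranked first: contains java plus every keyword in should (hadoop and elasticsearch)
Ranked second: contains java plus elasticsearch from should
Ranked third: contains java but none of the keywords in should

should does affect the relevance score.

must guarantees that a document contains the keyword, and the must condition is also used to compute the document's relevance score for the search.

On top of satisfying must, the should conditions do not have to match, but the more of them a document matches, the higher its relevance score.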
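
The ranking can be reproduced with a deliberately simplified Python model of the rule stated in this section. The assumption, labeled loudly here, is that every matching clause contributes a score of 1.0; real scores depend on term statistics, so only the ordering is meaningful. `toy_score` and `rank` are hypothetical helpers.

```python
import re

# Titles from the bulk update above, keyed by _id.
TITLES = {
    "1": "this is java and elasticsearch blog",
    "2": "this is java blog",
    "3": "this is elasticsearch blog",
    "4": "this is java, elasticsearch, hadoop blog",
    "5": "this is spark blog",
}


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def toy_score(title, must, should):
    """Sum 1.0 per matching must/should clause, divided by the total clause count."""
    terms = analyze(title)
    clauses = must + should
    matched = sum(1.0 for t in clauses if t in terms)
    return matched / len(clauses)


def rank(must, must_not, should):
    """Filter on must / must_not, then sort by toy score, highest first."""
    scored = []
    for doc_id, title in TITLES.items():
        terms = analyze(title)
        if all(t in terms for t in must) and not any(t in terms for t in must_not):
            scored.append((toy_score(title, must, should), doc_id))
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]


print(rank(must=["java"], must_not=["spark"], should=["hadoop", "elasticsearch"]))
```

Document 4 matches all three clauses, document 1 matches two, and document 2 matches only the must clause, which gives the order 4, 1, 2.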

Compare the ranking in the results below with the description above and it becomes clear.

"hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "4",
        "_score": 1.3375794,
        "_source": {
          "articleID": "QQPX-R-3956-#aD8",
          "userID": 2,
          "hidden": true,
          "postDate": "2017-01-02",
          "tag": [
            "java",
            "elasticsearch"
          ],
          "tag_cnt": 2,
          "view_cnt": 80,
          "title": "this is java, elasticsearch, hadoop blog"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "1",
        "_score": 0.53484553,
        "_source": {
          "articleID": "XHDK-A-1293-#fJ3",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-01",
          "tag": [
            "java",
            "hadoop"
          ],
          "tag_cnt": 2,
          "view_cnt": 30,
          "title": "this is java and elasticsearch blog"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.19856805,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog"
        }
      }
    ]


# 6. Search for java, hadoop, spark, elasticsearch, requiring at least 3 of the keywords

A document is returned only if it satisfies at least 3 of the should clauses.

GET /forum/article/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "java" }},
        { "match": { "title": "elasticsearch"   }},
        { "match": { "title": "hadoop"   }},
        { "match": { "title": "spark"   }}
      ],
      "minimum_should_match": 3
    }
  }
}
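
In this query minimum_should_match is an absolute clause count rather than a percentage. A toy Python model of that behaviour, with a hypothetical `bool_should` helper and one term per should clause, might look like this:

```python
import re

# Titles from the bulk update above, keyed by _id.
TITLES = {
    "1": "this is java and elasticsearch blog",
    "2": "this is java blog",
    "3": "this is elasticsearch blog",
    "4": "this is java, elasticsearch, hadoop blog",
    "5": "this is spark blog",
}


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word characters."""
    return set(re.findall(r"\w+", text.lower()))


def bool_should(should_terms, minimum_should_match):
    """Return the _ids that satisfy at least `minimum_should_match` should clauses."""
    hits = []
    for doc_id, title in TITLES.items():
        terms = analyze(title)
        matched = sum(1 for t in should_terms if t in terms)
        if matched >= minimum_should_match:
            hits.append(doc_id)
    return sorted(hits)


print(bool_should(["java", "elasticsearch", "hadoop", "spark"], 3))
```

With the test data this selects only document 4, the same document that the match query with "75%" selects in section 3.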


# Summary

For full-text search across multiple values, there are two approaches:

- match query
- should

To control the precision of the search results:

- operator and
- minimum_should_match
