# Implementing search suggestions with ngram

# What is an ngram?

Take the word quick as an example. Its ngrams at each of the 5 possible lengths are:

ngram length=1: q u i c k
ngram length=2: qu ui ic ck
ngram length=3: qui uic ick
ngram length=4: quic uick
ngram length=5: quick

Each fragment produced by this splitting is called an ngram.
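
The splitting above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch implements it; the function name `ngrams` is made up for this example.

```python
def ngrams(word: str, n: int) -> list[str]:
    """Return every contiguous substring of `word` that is exactly n characters long."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    # Reproduce the listing for the word "quick", one line per gram length.
    for n in range(1, len("quick") + 1):
        print(f"ngram length={n}:", " ".join(ngrams("quick", n)))
```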

A more specific variant is the edge ngram, which looks like this:

Anchor on the first letter of the word and build the ngrams from there:

q
qu
qui
quic
quick

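This shape should look familiar: it is exactly the suggest-as-you-type behavior of a search box, and it works much like a prefix index.

The edge ngrams are written into the inverted index when the data is indexed. At query time the search is an ordinary match: the query term is looked up directly in the inverted index and either hits or misses, so there is no need to scan the whole inverted index.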
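
To make the index-time versus query-time split concrete, here is a minimal sketch that uses a plain Python dictionary as a stand-in for the inverted index. The helper names (`edge_ngrams`, `build_index`, `lookup`) and the sample documents are assumptions made for illustration; a real Lucene index is far more involved.

```python
from collections import defaultdict


def edge_ngrams(word: str) -> list[str]:
    """Return every prefix of `word`, shortest first (q, qu, qui, ...)."""
    return [word[:end] for end in range(1, len(word) + 1)]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index time: map each edge ngram of each word to the ids of the documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for gram in edge_ngrams(word):
                index[gram].add(doc_id)
    return index


def lookup(index: dict[str, set[int]], typed: str) -> set[int]:
    """Query time: one exact dictionary lookup, with no scan over all terms."""
    return index.get(typed.lower(), set())


# Hypothetical sample documents, used only for illustration.
docs = {1: "quick brown fox", 2: "quiet night", 3: "brown bear"}
index = build_index(docs)
print(sorted(lookup(index, "qui")))  # [1, 2]
print(sorted(lookup(index, "br")))   # [1, 3]
```

The cost is paid up front: the index holds many more terms, but each keystroke becomes a single term lookup.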

# ngram in practice

First, define a custom analyzer:

DELETE /my_index

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 20
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter"
                    ]
                }
            }
        }
    }
}

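For details on defining analyzers, see the section on modifying and customizing analyzers.

- min_gram: the minimum length of a gram
- max_gram: the maximum length of a gram

For example, with max_gram = 2 no gram may be longer than 2 characters, so quick is only split into:
- q
- qu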
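
The effect of the two bounds can be sketched as follows. This is a simplified model of the edge_ngram filter, with a hypothetical function name; the default values simply mirror the settings used in this article.

```python
def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 20) -> list[str]:
    """Prefixes of `word` whose length lies between min_gram and max_gram, inclusive."""
    longest = min(max_gram, len(word))
    return [word[:size] for size in range(min_gram, longest + 1)]


print(edge_ngrams("quick", min_gram=1, max_gram=2))   # ['q', 'qu']
print(edge_ngrams("quick", min_gram=1, max_gram=20))  # ['q', 'qu', 'qui', 'quic', 'quick']
print(edge_ngrams("quick", min_gram=3, max_gram=20))  # ['qui', 'quic', 'quick']
```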

Check the analyzer output:

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

Response:

{
  "tokens": [
    {
      "token": "q",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qu",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "qui",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quic",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "b",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "br",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "bro",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brow",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    }
  ]
}
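
Two details in this response matter later: every gram of a word carries the offsets of the whole original word, and all grams of one word share the same position. The sketch below approximates the `autocomplete` analyzer to show where those values come from. The regex `\w+` is only a rough stand-in for the standard tokenizer, and the function name `analyze` is an assumption for this example.

```python
import re


def analyze(text: str, min_gram: int = 1, max_gram: int = 20) -> list[dict]:
    """Approximate the article's analyzer: split into words, lowercase, emit edge ngrams."""
    tokens = []
    # \w+ is a simplification of the standard tokenizer, good enough for plain English words.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group().lower()
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append({
                "token": word[:size],
                "start_offset": match.start(),  # offsets cover the whole original word
                "end_offset": match.end(),
                "position": position,           # every gram of one word shares its position
            })
    return tokens


for tok in analyze("quick brown"):
    print(tok)
```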

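Create the mapping. The field is indexed with the autocomplete analyzer, while search_analyzer is set to standard so that the query text itself is not split into edge ngrams: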

PUT /my_index/_mapping/my_type
{
  "properties": {
      "title": {
          "type":     "string",
          "analyzer": "autocomplete",
          "search_analyzer": "standard"
      }
  }
}

Insert some test data:

PUT /my_index/my_type/1
{
  "title": "hello w"
}

PUT /my_index/my_type/2
{
  "title": "hello word"
}

PUT /my_index/my_type/3
{
  "title": "hello wo"
}

PUT /my_index/my_type/4
{
  "title": "hello"
}

Query:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

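All 4 documents are returned. The query text is analyzed into the terms hello and w, and match is a full-text query that by default only needs one of them to hit, so id=4 (hello) still matches, just with a lower score.

Switching to match_phrase fixes this: it requires every term to be present and their positions to be adjacent (differing by exactly 1), which is what we want here.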

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

This time the document with id=4 (hello) is no longer returned.
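
The difference between the two queries can be reproduced with a small model of the index. This is a sketch under simplifying assumptions: whitespace splitting stands in for the standard analyzer, scoring is ignored, and the helper names are invented for this example. Titles are indexed as edge ngrams with word positions, while the query is only split into whole words, mirroring the analyzer and search_analyzer settings in the mapping.

```python
def index_tokens(title: str) -> dict[str, set[int]]:
    """Index-time analysis: edge ngrams of each word, mapped to the word positions they occur at."""
    positions: dict[str, set[int]] = {}
    for pos, word in enumerate(title.lower().split()):
        for size in range(1, len(word) + 1):
            positions.setdefault(word[:size], set()).add(pos)
    return positions


def match(doc: dict[str, set[int]], query: str) -> bool:
    """Model of `match` with the default OR operator: any one query term is enough."""
    return any(term in doc for term in query.lower().split())


def match_phrase(doc: dict[str, set[int]], query: str) -> bool:
    """Model of `match_phrase`: every term must occur, at consecutive positions."""
    terms = query.lower().split()
    if not terms or any(term not in doc for term in terms):
        return False
    return any(
        all(start + offset in doc[term] for offset, term in enumerate(terms))
        for start in doc[terms[0]]
    )


# The four test documents from this article.
titles = {1: "hello w", 2: "hello word", 3: "hello wo", 4: "hello"}
docs = {doc_id: index_tokens(title) for doc_id, title in titles.items()}

print([i for i, d in docs.items() if match(d, "hello w")])         # [1, 2, 3, 4]
print([i for i, d in docs.items() if match_phrase(d, "hello w")])  # [1, 2, 3]
```

Document 4 has no w token at all, so the phrase check fails on the second term, while documents 1 to 3 all contain w at the position right after hello.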
