# 前缀、通配符、正则搜索

前缀搜索
前缀搜索的原理
通配符搜索
正则搜索

# 前缀搜索

C3D0-KD345
C3K5-DFG65
C4I8-UI365

搜索 C3：需要将前两条搜索出来，这个就是前缀搜索

1
2
3
4
5

伪造新 index 和测试数据

DELETE my_index

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}

PUT my_index/my_index/1
{
  "title":"C3D0-KD345"
}

PUT my_index/my_index/2
{
  "title":"C3K5-DFG65"
}

PUT my_index/my_index/3
{
  "title":"C4I8-UI365"
}

GET my_index/_search

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

使用前缀查询语法

GET my_index/_search
{
  "query": {
    "prefix": {
      "title": {
        "value": "C3"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10

# 前缀搜索的原理

prefix query 不计算 relevance score，与 prefix filter 唯一的区别就是，filter 会 cache bitset

前缀越短，要处理的 doc 越多，性能越差，尽可能用长前缀搜索

前缀搜索，它是怎么执行的？性能为什么差呢？

下面举例来说明：

C3-D0-KD345
C3-K5-DFG65
C4-I8-UI365

1
2
3

全文检索,每个字符串都需要被分词

c3			doc1,doc2
d0
kd345
k5
dfg65
c4
i8
ui365

1
2
3
4
5
6
7
8

搜索目标 C3，扫描到 C3 即可停止了，因为能拿到目标 doc 了

而前缀搜索，你没有办法分词了，建立的倒排索引是一整个字符串，整个时候需要扫描所有的倒排索引去匹配前缀

# 通配符搜索

跟前缀搜索类似，功能更加强大

C3D0-KD345
C3K5-DFG65
C4I8-UI365

1
2
3

如这样一个需求：5字符-D任意个字符5，通配符表达式如下：

5?-*5：通配符去表达更加复杂的模糊搜索的语义

GET my_index/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "C?K*5"
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10

?：任意字符 *：0个或任意多个字符

响应

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_index",
        "_id": "2",
        "_score": 1,
        "_source": {
          "title": "C3K5-DFG65"
        }
      }
    ]
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

性能一样差，必须扫描整个倒排索引，才ok

# 正则搜索


GET /my_index/my_type/_search
{
  "query": {
    "regexp": {
      "title": "C[0-9].+"
    }
  }
}

1
2
3
4
5
6
7
8
9
10

[0-9]：指定范围内的数字 [a-z]：指定范围内的字母 .：一个字符 +：前面的正则表达式可以出现一次或多次

wildcard 和 regexp，与 prefix 原理一致，都会扫描整个索引，性能很差

主要是给大家介绍一些高级的搜索语法。在实际应用中，能不用尽量别用。性能太差了。

← rescore 机制优化近似匹配搜索的性能 match_phrase_prefix 实现搜索推荐 →