10. ElasticSearch Series: Search in Depth

1. Term-based and full-text search

1.1 Term-based search

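Why terms matter: a term is the smallest unit that carries meaning.
Characteristics:
- Term-level queries include term query, range query, exists query, prefix query, and wildcard query
- A term query does not analyze (tokenize) its input
- Wrapping the query in constant_score turns it into a filter, which skips scoring and can use the filter cache, improving performance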
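
The point that a term query does not analyze its input can be illustrated with a toy inverted index. This is only a sketch, not how Elasticsearch is implemented: the `analyze` function is a rough stand-in for the standard analyzer, and the documents and helper names are hypothetical.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: split into words and lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


# Hypothetical documents: doc id -> field value.
docs = {1: "Quick Brown Fox", 2: "lazy dog"}

# A text field stores analyzed tokens; a keyword sub-field stores the raw value.
text_index = {}
keyword_index = {}
for doc_id, value in docs.items():
    for token in analyze(value):
        text_index.setdefault(token, set()).add(doc_id)
    keyword_index.setdefault(value, set()).add(doc_id)


def term_query(index, value):
    """Look the value up exactly as given, with no analysis of the input."""
    return sorted(index.get(value, set()))


def match_query(index, text):
    """Analyze the input first, then look up every resulting token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)
```

In this sketch `term_query(text_index, "Quick")` finds nothing because the index only holds the lowercased token `quick`, while the same lookup against `keyword_index` succeeds only with the complete original value. That is the reason the term examples in this section target a `.keyword` sub-field.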

GET kibana_sample_data_logs/_search
{
  "explain": true, 
  "query": {
    "term": {
      "extension.keyword": {
        "value": "css"
      }
    }
  }
}
GET kibana_sample_data_logs/_search
{
  "explain": true, 
  "query": {
    "constant_score": { // skip scoring and take advantage of caching
      "filter": {
        "term": {
          "extension.keyword": "css"
        }
      }
    }
  }
}

1.2 Full-text search

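Full-text queries include match query, match phrase query, and query string query.
Characteristics:
- Text is analyzed at both index time and search time: the query string is first passed to an appropriate analyzer, which produces the list of terms to search for
- At search time the input is analyzed first, then a low-level query runs for each term, and finally the results are merged and a score is computed for each document. For example, with an analyzer that splits Chinese text into single characters (as the standard analyzer does) and the default or operator, searching for "新泾三村" returns every document that contains 新, 泾, 三, or 村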
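
The difference between the default or behaviour and `"operator": "and"` can be sketched in a few lines. This is a minimal illustration that ignores scoring; the per-character `analyze` function and the sample place names are assumptions made for the example, not data from the amap_poi_detail index.

```python
def analyze(text):
    """Toy analyzer: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def match(docs, query, operator="or"):
    """Return ids of docs whose name contains any ("or") or all ("and") query tokens."""
    query_tokens = set(analyze(query))
    combine = all if operator == "and" else any
    return sorted(
        doc_id
        for doc_id, name in docs.items()
        if combine(token in analyze(name) for token in query_tokens)
    )


# Hypothetical place names, used only for illustration.
pois = {1: "新泾三村", 2: "新村路", 3: "北新泾", 4: "人民广场"}

any_char = match(pois, "新 泾 三 村")                   # default "or"
all_chars = match(pois, "新 泾 三 村", operator="and")  # every character required
```

With or, documents 1, 2, and 3 match because each shares at least one character with the query. With and, only document 1 remains.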

GET amap_poi_detail/_search
{
  "query": {
    "match": {
      "name": {
        "query": "新 泾 三 村",
        "operator": "and" // only return documents that contain all four characters
      }
    }
  }
}

2. Structured search

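Structured data such as booleans, dates, and numbers has a precise format, so we can apply logical operations to it, including range checks and comparisons.
Structured text can be searched with term or prefix queries.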

GET kibana_sample_data_flights/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "AvgTicketPrice": {
            "gte": 600,
            "lte": 800
          }
        }
      }
    }
  }
}

GET kibana_sample_data_flights/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "timestamp": {
            "gte": "now-1y"
          }
        }
      }
    }
  }
}
# Date math units: y = year, M = month, w = week, d = day, H/h = hour, m = minute, s = second
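
To make an expression such as `now-1y` concrete, here is a small sketch that resolves a subset of this date math syntax against a given `now`. It is an assumption-laden approximation, not Elasticsearch's parser: rounding (for example `/d`) and anchor dates other than `now` are not handled, and the day-clamping rule for months and years is my own choice for the sketch.

```python
import calendar
import re
from datetime import datetime, timedelta

# Units with a fixed length map directly onto timedelta keyword arguments.
FIXED_UNITS = {"w": "weeks", "d": "days", "h": "hours", "H": "hours",
               "m": "minutes", "s": "seconds"}


def shift_months(moment, months):
    """Shift by whole calendar months, clamping the day to the target month's length."""
    index = moment.year * 12 + (moment.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)


def resolve(expr, now):
    """Resolve 'now', 'now-1y', 'now+1w-2d', ... relative to the given datetime."""
    if not expr.startswith("now"):
        raise ValueError("only expressions anchored at 'now' are supported")
    rest = expr[3:]
    parts = re.findall(r"([+-])(\d+)([yMwdhHms])", rest)
    if "".join(sign + amount + unit for sign, amount, unit in parts) != rest:
        raise ValueError(f"unsupported date math: {expr!r}")
    result = now
    for sign, amount, unit in parts:
        n = int(amount) if sign == "+" else -int(amount)
        if unit == "y":
            result = shift_months(result, 12 * n)
        elif unit == "M":
            result = shift_months(result, n)
        else:
            result += timedelta(**{FIXED_UNITS[unit]: n})
    return result


one_year_ago = resolve("now-1y", datetime(2024, 2, 29, 12, 0))
```

Note that `M` (months) and `m` (minutes) differ only in case, which is why the unit table is case-sensitive except for hours.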

3. Bool queries

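A bool query is a combination of one or more query clauses.
- There are four kinds of clauses in total: two affect the score and two do not
- must: the clause must match and contributes to the score
- should: the clause may match and contributes to the score when it does
- must_not: runs in filter context; the clause must not match and does not contribute to the score
- filter: runs in filter context; the clause must match but does not contribute to the score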
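
The four clause types can be sketched as a small evaluator over a single document. This is a simplified model under stated assumptions: every matching clause scores a flat 1.0 instead of a real relevance score, `minimum_should_match` is always passed explicitly, and the helper names (`term`, `range_lte`, `bool_query`) are hypothetical.

```python
def term(field, value):
    """Clause that scores 1.0 when the field equals the value, else 0.0."""
    return lambda doc: 1.0 if doc.get(field) == value else 0.0


def range_lte(field, upper):
    """Clause that scores 1.0 when the field is present and <= upper, else 0.0."""
    return lambda doc: 1.0 if field in doc and doc[field] <= upper else 0.0


def bool_query(doc, must=(), filters=(), must_not=(), should=(),
               minimum_should_match=0):
    """Return the document's score, or None when the document does not match."""
    must_scores = [clause(doc) for clause in must]
    if any(score == 0.0 for score in must_scores):
        return None                      # a must clause failed
    if any(clause(doc) == 0.0 for clause in filters):
        return None                      # a filter clause failed
    if any(clause(doc) > 0.0 for clause in must_not):
        return None                      # an excluded clause matched
    should_scores = [s for s in (clause(doc) for clause in should) if s > 0.0]
    if len(should_scores) < minimum_should_match:
        return None
    # Only must and should contribute to the score.
    return sum(must_scores) + sum(should_scores)


# Hypothetical product document, for illustration only.
product = {"price": 30, "avaliable": True, "productID": "JODL-X-1937-#pV7"}
score = bool_query(
    product,
    must=[term("price", 30)],
    filters=[term("avaliable", True)],
    must_not=[range_lte("price", 10)],
    should=[term("productID", "JODL-X-1937-#pV7"),
            term("productID", "XHDK-A-1293-#fJ3")],
    minimum_should_match=1,
)
```

Here the filter and must_not clauses only decide whether the document is kept, so the score comes from one must clause plus one matching should clause.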

POST products/_search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "price" : "30" }
      },
      "filter": {
        "term" : { "avaliable" : "true" }
      },
      "must_not" : {
        "range" : {
          "price" : { "lte" : 10 }
        }
      },
      "should" : [
        { "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },
        { "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }
      ],
      "minimum_should_match" :1
    }
  }
}

// Competing clauses at the same level carry the same weight
// Nesting bool queries changes how much each clause affects the score
POST animals/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "text": "quick" }}, // A
        { "term": { "text": "dog"   }}, // B: same weight as A
        {
          "bool":{
            "should":[
               { "term": { "text": "brown" }}, // C
               { "term": { "text": "brown" }} // D: C and D together carry the same weight as A or as B
            ]
          }
        }
            ]
          }
        }
      ]
    }
  }
}

4. Single-string, multi-field queries

Disjunction Max Query: returns any document that matches any of its queries, and uses the score of the best-matching field as the final score.

PUT /blogs/_doc/1
{
    "title": "Quick brown rabbits",
    "body":  "Brown rabbits are commonly seen."
}
PUT /blogs/_doc/2
{
    "title": "Keeping pets healthy",
    "body":  "My quick brown fox eats rabbits on a regular basis."
}
POST /blogs/_search
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}
// The ranking produced by the should clauses above is not what we want, which is why Disjunction Max Query is needed
POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Brown fox" }},
                { "match": { "body":  "Brown fox" }}
            ]
        }
    }
}
POST blogs/_search
{
    "query": {
        "dis_max": {
            "queries": [
                { "match": { "title": "Quick pets" }},
                { "match": { "body":  "Quick pets" }}
            ],
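            // Try removing tie_breaker: the two documents then get the same score
            // 0 <= tie_breaker <= 1: 0 uses only the best score, 1 treats all clauses as equally important
            // How it works: 1. take the _score of the best-matching clause 2. multiply the score of every other matching clause by tie_breaker
            // 3. sum the scores above and normalize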
            "tie_breaker": 0.2
        }
    }
}
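
The effect of dis_max and tie_breaker on ranking can be checked with a little arithmetic. The sketch below uses made-up per-field scores rather than real BM25 values, treats a bool should as a plain sum of its matching clauses, and leaves out any normalization step, so the numbers only show the ordering, not what Elasticsearch would return.

```python
def bool_should_score(clause_scores):
    """Approximate a bool/should query: add up the scores of all clauses."""
    return sum(clause_scores)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the other matching clause scores."""
    matching = [score for score in clause_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Made-up [title, body] scores for the query "Brown fox".
# doc1 matches "brown" weakly in both fields; doc2 matches both words in the body only.
scores = {"doc1": [0.6, 0.5], "doc2": [0.0, 0.9]}


def ranking(score_fn):
    """Order document ids from highest to lowest score."""
    return sorted(scores, key=lambda doc_id: score_fn(scores[doc_id]), reverse=True)


by_should = ranking(bool_should_score)    # doc1 first: 1.1 vs 0.9
by_dis_max = ranking(dis_max_score)       # doc2 first: 0.9 vs 0.6
```

With `tie_breaker=0.2`, doc1 scores roughly 0.6 + 0.2 × 0.5 = 0.7, so the best field still dominates while the second field breaks ties between otherwise equal documents. With `tie_breaker=1.0` the result equals the plain sum.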

5. Adding an index alias and querying through it

POST blog-2021/_doc
{
  "name":"domi",
  "rating":5
}

POST blog-2022/_doc
{
  "name":"shenjian",
  "rating":3
}

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "blog-2021",
        "alias": "blog-latest"
      }
    },
    {
      "add": {
        "index": "blog-2022",
        "alias": "blog-latest"
      }
    }
  ]
}

GET blog-latest/_search
{
  "query": {
    "match_all": {}
  }
}
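
Conceptually, an alias is a name that resolves to one or more indices, and a search against the alias fans out to all of them. The following sketch models that with plain dictionaries; it is an in-memory illustration with hypothetical helper names, not a client for the Elasticsearch API.

```python
# In-memory stand-ins for two indices.
indices = {
    "blog-2021": [{"name": "domi", "rating": 5}],
    "blog-2022": [{"name": "shenjian", "rating": 3}],
}
aliases = {}  # alias name -> list of index names


def add_alias(index, alias):
    """Point the alias at one more index (mirrors an "add" action)."""
    if index not in indices:
        raise KeyError(f"no such index: {index}")
    targets = aliases.setdefault(alias, [])
    if index not in targets:
        targets.append(index)


def match_all(target):
    """Return every document from the index, or from all indices behind an alias."""
    index_names = aliases.get(target, [target])
    return [doc for name in index_names for doc in indices[name]]


add_alias("blog-2021", "blog-latest")
add_alias("blog-2022", "blog-latest")
latest = match_all("blog-latest")
```

Searching `blog-latest` returns documents from both yearly indices, while searching `blog-2021` directly returns only its own document.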

Feel free to follow the WeChat official account 算法小生 or Shen Jian's tech blog.

Original article: https://blog.csdn.net/SJshenjian/article/details/127434852