• Elasticsearch Search Auxiliary Features Explained (10)


            ES provides a variety of search auxiliary features. For example, to optimize search performance, you may need to return only a subset of fields in the results. To present results better, you need result counting and pagination; when you hit a performance bottleneck, you need to profile the time spent in each phase of a search; and when results do not match expectations, you need to analyze the scoring details of each document.

    Specifying the Fields to Return

            For performance reasons, search results often need to be "slimmed down" by specifying which fields to return. In ES, the _source clause sets the fields included in the results: _source points to a JSON array whose elements are the names of the fields you want returned. First, create a hotel index:

    PUT /hotel
    {
      "mappings": {
        "properties": {
          "title":       { "type": "text" },
          "city":        { "type": "keyword" },
          "price":       { "type": "double" },
          "create_time": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
          "attachment":  { "type": "text" },
          "full_room":   { "type": "boolean" },
          "location":    { "type": "geo_point" },
          "praise":      { "type": "integer" }
        }
      }
    }

    Add some documents to the hotel index:

    POST /_bulk
    {"index":{"_index":"hotel","_id":"001"}}
    {"title":"java旅馆","city":"深圳","price":50.00,"create_time":"2022-08-05 00:00:00","location":{"lat":40.012312,"lon":116.497122},"praise":10}
    {"index":{"_index":"hotel","_id":"002"}}
    {"title":"python旅馆","city":"北京","price":50.00,"create_time":"2022-08-05 00:00:00","location":{"lat":40.012312,"lon":116.497122},"praise":10}
    {"index":{"_index":"hotel","_id":"003"}}
    {"title":"go旅馆","city":"上海","price":50.00,"create_time":"2022-08-05 00:00:00","location":{"lat":40.012312,"lon":116.497122},"praise":10}
    {"index":{"_index":"hotel","_id":"004"}}
    {"title":"C++旅馆","city":"广州","price":50.00,"create_time":"2022-08-05 00:00:00","location":{"lat":40.012312,"lon":116.497122},"praise":10}

    The following DSL restricts the search results to the title and city fields:

    POST /hotel/_search
    {
      "_source": ["title", "city"],
      "query": {
        "term": {
          "city": {
            "value": "深圳"
          }
        }
      }
    }

    The result is:

    {
      "took" : 361,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 1.2039728,
        "hits" : [
          {
            "_index" : "hotel",
            "_type" : "_doc",
            "_id" : "001",
            "_score" : 1.2039728,
            "_source" : {
              "city" : "深圳",
              "title" : "java旅馆"
            }
          }
        ]
      }
    }

    In the search results above, the _source object of each hit contains only the two specified fields, city and title.

            In the Java client, calling searchSourceBuilder.fetchSource() sets the fields returned by a search. The method takes two parameters: an array of fields to include and an array of fields to exclude. The following code produces the same effect as the DSL above:

    @Test
    public void testQueryNeedFields() throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(Arrays.stream("127.0.0.1:9200".split(","))
                .map(host -> {
                    String[] split = host.split(":");
                    String hostName = split[0];
                    int port = Integer.parseInt(split[1]);
                    return new HttpHost(hostName, port, HttpHost.DEFAULT_SCHEME_NAME);
                }).filter(Objects::nonNull).toArray(HttpHost[]::new)));
        SearchRequest request = new SearchRequest("hotel");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.query(new TermQueryBuilder("city", "深圳"));
        // include title and city; exclude nothing
        sourceBuilder.fetchSource(new String[]{"title", "city"}, null);
        request.source(sourceBuilder);
        SearchResponse search = client.search(request, RequestOptions.DEFAULT);
        System.out.println(search.getHits());
    }

    Counting Results

    To improve the search experience, the front end often needs the number of documents that match a search, i.e., a count of the results. For this, ES provides the _count API: you supply a query clause for matching, and ES returns the number of matching documents. The following DSL returns the number of hotels whose city is "北京":

    POST /hotel/_count
    {
      "query": {
        "term": {
          "city": {
            "value": "北京"
          }
        }
      }
    }

    Executing the DSL above returns:

    {
      "count" : 1,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      }
    }

    As the result shows, ES returns not only the number of matching documents (here 1) but also shard-related metadata, such as the total number of shards scanned and the numbers of successful, failed, and skipped shards.

            In the Java client, the _count API is executed with a CountRequest. Call the CountRequest object's source() method to set the query logic, then call client.count(), which returns a CountResponse; countResponse.getCount() gives the number of matching documents. The following code produces the same effect as the DSL above:

    @Test
    public void testCount() throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(Arrays.stream("127.0.0.1:9200".split(","))
                .map(host -> {
                    String[] split = host.split(":");
                    String hostName = split[0];
                    int port = Integer.parseInt(split[1]);
                    return new HttpHost(hostName, port, HttpHost.DEFAULT_SCHEME_NAME);
                }).filter(Objects::nonNull).toArray(HttpHost[]::new)));
        CountRequest countRequest = new CountRequest("hotel");
        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
        sourceBuilder.query(new TermQueryBuilder("city", "北京")); // same city as the DSL above
        countRequest.source(sourceBuilder);
        CountResponse response = client.count(countRequest, RequestOptions.DEFAULT);
        System.out.println(response.getCount());
    }

    Paginating Results

            Pagination is indispensable in real search applications. By default, ES returns the first 10 matching documents. You can set from and size to control the starting position and page size: from is the zero-based offset of the first result (default 0), and size is the number of documents to return from that offset (default 10). The following DSL returns 2 results starting at offset 0:

    GET /hotel/_search
    {
      "from": 0,   // the starting offset
      "size": 2,   // the number of documents to return
      "query": {
        "term": {
          "city": {
            "value": "深圳"
          }
        }
      }
    }

    By default, at most 10,000 documents can be retrieved, i.e., from + size must not exceed 10,000. If a request goes beyond this value, ES returns the following error:

    {
      "error" : {
        "root_cause" : [
          {
            "type" : "illegal_argument_exception",
            "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
          }
        ],
        "type" : "search_phase_execution_exception",
        "reason" : "all shards failed",
        "phase" : "query",
        "grouped" : true,
        "failed_shards" : [
          {
            "shard" : 0,
            "index" : "hotel",
            "node" : "tiANekxXS_GtirH4DamrFA",
            "reason" : {
              "type" : "illegal_argument_exception",
              "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
            }
          }
        ],
        "caused_by" : {
          "type" : "illegal_argument_exception",
          "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.",
          "caused_by" : {
            "type" : "illegal_argument_exception",
            "reason" : "Result window is too large, from + size must be less than or equal to: [10000] but was [10001]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
          }
        }
      },
      "status" : 400
    }

    For an ordinary search application, a limit of 10,000 is more than enough. If you really do need more than 10,000 results, you can raise max_result_window appropriately. The following example raises the maximum result window of the hotel index to 20,000:

    PUT /hotel/_settings
    {
      "index": {
        "max_result_window": 20000
      }
    }

    Note that if you raise this setting substantially, you must have hardware powerful enough to back it.
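    The same setting can also be changed programmatically. A minimal sketch using the 7.x RestHighLevelClient (assuming, as in the other examples, a node at 127.0.0.1:9200; this needs a running cluster):

    ```java
    import org.apache.http.HttpHost;
    import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.settings.Settings;

    public class UpdateMaxResultWindow {
        public static void main(String[] args) throws Exception {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")));
            // Equivalent to PUT /hotel/_settings with index.max_result_window = 20000
            UpdateSettingsRequest request = new UpdateSettingsRequest("hotel");
            request.settings(Settings.builder().put("index.max_result_window", 20000));
            client.indices().putSettings(request, RequestOptions.DEFAULT);
            client.close();
        }
    }
    ```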

            As a distributed search engine, an ES index's data is spread across multiple shards, which in turn live on different nodes. A paginated search request usually spans several shards, and each shard must build in memory a score-sorted queue of length from + size to hold its matching documents. Each shard then sends its queue to the coordinating node, which merges them all: it needs a queue of length number_of_shards * (from + size) for the global sort, then seeks to position from and returns size documents.

            Because of this mechanism, ES is not suited to deep paging. What is deep paging? In short, a request with a very large from value. Suppose a search is issued against an index with 3 shards, with from = 1000 and size = 10; the response process is shown in the figure below.

             When there are many deep-paging requests, memory and CPU consumption rises on every node hosting the involved shards. The coordinating node suffers most: as page numbers and concurrent requests grow, it must merge and sort the shard data for all of these requests, and too much data can exhaust its resources and take it out of service.

            As a search engine, ES is better suited to searching data than to traversing it at scale. In most cases, returning the first 1,000 results is enough; there is rarely a need to fetch 10,000. If you really must traverse a large result set, consider the scroll API or a different storage engine.
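    For reference, here is a minimal scroll sketch with the 7.x RestHighLevelClient (a sketch under the same client assumptions as the examples above, not a production-ready implementation; it needs a running cluster):

    ```java
    import org.apache.http.HttpHost;
    import org.elasticsearch.action.search.ClearScrollRequest;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchScrollRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    public class ScrollDemo {
        public static void main(String[] args) throws Exception {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")));
            // Open a scroll context that stays alive for 1 minute between batches
            SearchRequest request = new SearchRequest("hotel");
            request.scroll(TimeValue.timeValueMinutes(1));
            request.source(new SearchSourceBuilder()
                    .query(QueryBuilders.matchAllQuery())
                    .size(1000)); // batch size, not an offset: no deep-paging cost
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            String scrollId = response.getScrollId();
            while (response.getHits().getHits().length > 0) {
                // process the current batch here ...
                SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
                scrollRequest.scroll(TimeValue.timeValueMinutes(1));
                response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
                scrollId = response.getScrollId();
            }
            // Release the scroll context when done
            ClearScrollRequest clearRequest = new ClearScrollRequest();
            clearRequest.addScrollId(scrollId);
            client.clearScroll(clearRequest, RequestOptions.DEFAULT);
            client.close();
        }
    }
    ```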

            In the Java client, SearchSourceBuilder's from() and size() methods set the from and size parameters. The following code sets from to 20 and size to 10:

    @Test
    public void testQueryByPage() throws IOException {
        RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(Arrays.stream("127.0.0.1:9200".split(","))
                .map(host -> {
                    String[] split = host.split(":");
                    String hostName = split[0];
                    int port = Integer.parseInt(split[1]);
                    return new HttpHost(hostName, port, HttpHost.DEFAULT_SCHEME_NAME);
                }).filter(Objects::nonNull).toArray(HttpHost[]::new)));
        SearchRequest request = new SearchRequest("hotel");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(new TermQueryBuilder("city", "深圳"));
        searchSourceBuilder.from(20);
        searchSourceBuilder.size(10);
        request.source(searchSourceBuilder);
        client.search(request, RequestOptions.DEFAULT);
    }

    Performance Profiling

            When using ES, some search requests may respond slowly, and in most cases the cause is problematic DSL logic. ES offers a profile feature that lists in detail how long each step of a search takes, helping you analyze the DSL's performance. To enable it, just add "profile": true to an otherwise normal search request. The following query enables profiling:

    GET /hotel/_search
    {
      "profile": true,
      "query": {
        "match": {
          "title": "北京"
        }
      }
    }

    Executing the DSL above, ES returns a rather verbose response:

    {
      "took" : 60,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 0,
          "relation" : "eq"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "profile" : {
        "shards" : [
          {
            "id" : "[tiANekxXS_GtirH4DamrFA][hotel][0]",
            "searches" : [
              {
                "query" : [
                  {
                    "type" : "BooleanQuery",
                    "description" : "title:北 title:京",
                    "time_in_nanos" : 1032417,
                    "breakdown" : {
                      "set_min_competitive_score_count" : 0,
                      "match_count" : 0,
                      "shallow_advance_count" : 0,
                      "set_min_competitive_score" : 0,
                      "next_doc" : 0,
                      "match" : 0,
                      "next_doc_count" : 0,
                      "score_count" : 0,
                      "compute_max_score_count" : 0,
                      "compute_max_score" : 0,
                      "advance" : 0,
                      "advance_count" : 0,
                      "score" : 0,
                      "build_scorer_count" : 1,
                      "create_weight" : 1023459,
                      "shallow_advance" : 0,
                      "create_weight_count" : 1,
                      "build_scorer" : 8958
                    },
                    "children" : [
                      {
                        "type" : "TermQuery",
                        "description" : "title:北",
                        "time_in_nanos" : 182334,
                        "breakdown" : {
                          "set_min_competitive_score_count" : 0,
                          "match_count" : 0,
                          "shallow_advance_count" : 0,
                          "set_min_competitive_score" : 0,
                          "next_doc" : 0,
                          "match" : 0,
                          "next_doc_count" : 0,
                          "score_count" : 0,
                          "compute_max_score_count" : 0,
                          "compute_max_score" : 0,
                          "advance" : 0,
                          "advance_count" : 0,
                          "score" : 0,
                          "build_scorer_count" : 1,
                          "create_weight" : 179167,
                          "shallow_advance" : 0,
                          "create_weight_count" : 1,
                          "build_scorer" : 3167
                        }
                      },
                      {
                        "type" : "TermQuery",
                        "description" : "title:京",
                        "time_in_nanos" : 15167,
                        "breakdown" : {
                          "set_min_competitive_score_count" : 0,
                          "match_count" : 0,
                          "shallow_advance_count" : 0,
                          "set_min_competitive_score" : 0,
                          "next_doc" : 0,
                          "match" : 0,
                          "next_doc_count" : 0,
                          "score_count" : 0,
                          "compute_max_score_count" : 0,
                          "compute_max_score" : 0,
                          "advance" : 0,
                          "advance_count" : 0,
                          "score" : 0,
                          "build_scorer_count" : 1,
                          "create_weight" : 14792,
                          "shallow_advance" : 0,
                          "create_weight_count" : 1,
                          "build_scorer" : 375
                        }
                      }
                    ]
                  }
                ],
                "rewrite_time" : 183625,
                "collector" : [
                  {
                    "name" : "SimpleTopScoreDocCollector",
                    "reason" : "search_top_hits",
                    "time_in_nanos" : 32000
                  }
                ]
              }
            ],
            "aggregations" : [ ]
          }
        ]
      }
    }

            As shown, besides the search results, the response with profile enabled contains a profile clause that lists each phase of the search and its timing. Note that profiling has a resource cost: it is advisable to use it only during early-stage debugging and not to enable profile in production.

            Because a search may span multiple shards, the profile clause holds a shards array. Each shard entry contains three elements: id, searches, and aggregations.

    1. id uniquely identifies the shard, in the form [nodeID][indexName][shardID].
    2. searches is an array, because a request may search across multiple indices. Each search element corresponds to a subquery within one index; besides that element's timings, it also shows the concrete strategy for searching "北京", which is split into the two subqueries "title:北" and "title:京". Likewise, the children elements give the timings and detailed search steps for "title:北" and "title:京", which we will not repeat here.
    3. aggregations has content only when the search performs aggregations.

            The above is only a very simple example; with a complex query or many shards hit, the profile output becomes extremely verbose, making manual analysis very inefficient. For this reason, Kibana offers a visual profile feature built on top of ES's profile API. In Kibana's Dev Tools view, click the Search Profiler link to use it; its layout is shown in the figure below:
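    In the Java client, profiling can likewise be enabled per request via SearchSourceBuilder.profile(true), and the per-shard results read back from SearchResponse.getProfileResults(). A minimal sketch under the same client assumptions as the examples above (needs a running cluster):

    ```java
    import org.apache.http.HttpHost;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.elasticsearch.search.profile.ProfileShardResult;
    import java.util.Map;

    public class ProfileDemo {
        public static void main(String[] args) throws Exception {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")));
            SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
                    .query(QueryBuilders.matchQuery("title", "北京"))
                    .profile(true); // equivalent to "profile": true in the DSL
            SearchRequest request = new SearchRequest("hotel").source(sourceBuilder);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);
            // One entry per shard, keyed by "[nodeID][indexName][shardID]"
            Map<String, ProfileShardResult> profile = response.getProfileResults();
            profile.forEach((shardId, result) ->
                    System.out.println(shardId + " -> " + result.getQueryProfileResults()));
            client.close();
        }
    }
    ```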

    Scoring Analysis

            Using a search engine generally involves ranking. If the user does not ask for ascending or descending order on some field, ES sorts documents with its own scoring algorithm. Sometimes we need the detailed scoring of a specific document in order to troubleshoot a search DSL. ES provides the explain API for inspecting the matching details of a search. Its general form is:

    GET /${index_name}/_explain/${doc_id}
    {
      "query": {
        ...
      }
    }

    The following example is an explain request for a search on the title field:

    GET /hotel/_explain/002
    {
      "query": {
        "match": {
          "title": "python"
        }
      }
    }

     Executing the explain request above, ES returns:

    {
      "_index" : "hotel",
      "_type" : "_doc",
      "_id" : "002",
      "matched" : true,
      "explanation" : {
        "value" : 1.2039728,
        "description" : "weight(title:python in 1) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 1.2039728,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 1.2039728,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 4,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.45454544,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "freq, occurrences of term within document",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 3.0,
                    "description" : "dl, length of field",
                    "details" : [ ]
                  },
                  {
                    "value" : 3.0,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    }

    As you can see, the information returned by explain is quite comprehensive: it breaks the score down into the BM25 factors boost, idf, and tf, each with the inputs it was computed from.
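    The final score can be reproduced by hand from the factors in the response, using the formulas given in the description fields:

        idf   = log(1 + (N - n + 0.5) / (n + 0.5))
              = log(1 + (4 - 1 + 0.5) / (1 + 0.5)) ≈ 1.2039728
        tf    = freq / (freq + k1 * (1 - b + b * dl / avgdl))
              = 1.0 / (1.0 + 1.2 * (1 - 0.75 + 0.75 * 3.0 / 3.0)) ≈ 0.45454544
        score = boost * idf * tf
              = 2.2 × 1.2039728 × 0.45454544 ≈ 1.2039728

    (Here log is the natural logarithm. Note that k1 * (1 - b + b * dl / avgdl) = 1.2, so boost * tf = 2.2 / 2.2 = 1 and the score equals the idf term exactly, matching the response above.)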

    Additionally, if a document does not match the query, explain says so directly:

    {
      "_index" : "hotel",
      "_type" : "_doc",
      "_id" : "001",
      "matched" : false,
      "explanation" : {
        "value" : 0.0,
        "description" : "no matching term",
        "details" : [ ]
      }
    }
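    In the Java client, the same check can be made with an ExplainRequest (a sketch assuming the 7.x RestHighLevelClient, whose constructor takes the index name and document id; it needs a running cluster):

    ```java
    import org.apache.http.HttpHost;
    import org.elasticsearch.action.explain.ExplainRequest;
    import org.elasticsearch.action.explain.ExplainResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;

    public class ExplainDemo {
        public static void main(String[] args) throws Exception {
            RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("127.0.0.1", 9200, "http")));
            // Equivalent to GET /hotel/_explain/002 with a match query on title
            ExplainRequest request = new ExplainRequest("hotel", "002");
            request.query(QueryBuilders.matchQuery("title", "python"));
            ExplainResponse response = client.explain(request, RequestOptions.DEFAULT);
            System.out.println("matched: " + response.isMatch());
            if (response.hasExplanation()) {
                System.out.println(response.getExplanation()); // full scoring tree
            }
            client.close();
        }
    }
    ```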

  • Original article: https://blog.csdn.net/ntzzzsj/article/details/126166557