• 从查询重写角度理解elasticsearch的高亮原理


    一、高亮的一些问题

    elasticsearch提供了三种高亮方式,前面我们已经简单的了解了elasticsearch的高亮原理; 高亮处理跟实际使用查询类型有十分紧密的关系,其中主要的一点就是muti term 查询的重写,例如wildcard、prefix等,由于查询本身和高亮都涉及到查询语句的重写,如果两者之间的重写机制不同,那么就可能会碰到以下情况

    相同的查询语句, 使用unified和fvh得到的高亮结果是不同的,甚至fvh Highlighter无任何高亮信息返回;

    二、数据环境

    elasticsearch 8.0

    PUT highlight_test
    {
      "mappings": {
        "properties": {
          "text":{
            "type": "text",
            "term_vector": "with_positions_offsets"
          }
        }
      },
      "settings": {
        "number_of_replicas":0,
        "number_of_shards": 1
      }
    }
    
    PUT highlight_test/_doc/1
    {
      "name":"mango",
      "text":"my name is mongo, i am test hightlight in elastic search"
    }
    

    三、muti term查询重写简介

    所谓muti term查询就是查询中并不是明确的关键字,而是需要elasticsearch重写查询语句,进一步明确关键字;以下查询会涉及到muti term查询重写;

    fuzzy
    prefix
    query_string
    regexp
    wildcard
    

    以上查询都支持rewrite参数,最终将查询重写为bool查询或者bitset;

    查询重写主要影响以下几方面

    重写需要抓取哪些关键字以及抓取的数量;

    抓取关键字的相关性计算方式;

    查询重写支持以下参数选项

    constant_score,默认值,如果需要抓取的关键字比较少,则重写为bool查询,否则抓取所有的关键字并重写为bitset;直接使用boost参数作为文档score,一般term level的查询的boost默认值为1;

    constant_score_boolean,将查询重写为bool查询,并使用boost参数作为文档的score,受到indices.query.bool.max_clause_count 限制,所以默认最多抓取1024个关键字;

    scoring_boolean,将查询重写为bool查询,并计算文档的相对权重,受到indices.query.bool.max_clause_count 限制,所以默认最多抓取1024个关键字;

    top_terms_blended_freqs_N,抓取得分最高的前N个关键字,并将查询重写为bool查询;此选项不受indices.query.bool.max_clause_count 限制;选择命中文档的所有关键字中权重最大的作为文档的score;

    top_terms_boost_N,抓取得分最高的前N个关键字,并将查询重写为bool查询;此选项不受indices.query.bool.max_clause_count 限制;直接使用boost作为文档的score;

    top_terms_N,抓取得分最高的前N个关键字,并将查询重写为bool查询;此选项不受indices.query.bool.max_clause_count 限制;计算命中文档的相对权重作为评分;

    三、wildcard查询重写分析

    我们通过elasticsearch来查看一下以下查询语句的重写逻辑;

    {
        "query":{
            "wildcard":{
                "text":{
                    "value":"m*"
                }
            }
        }
    }
    

    通过查询使用的字段映射类型构建WildCardQuery,并使用查询语句中配置的rewrite对应的MultiTermQuery.RewriteMethod;

    //WildcardQueryBuilder.java
    @Override
    protected Query doToQuery(SearchExecutionContext context) throws IOException {
        MappedFieldType fieldType = context.getFieldType(fieldName);
    
        if (fieldType == null) {
            throw new IllegalStateException("Rewrite first");
        }
    
        MultiTermQuery.RewriteMethod method = QueryParsers.parseRewriteMethod(rewrite, null, LoggingDeprecationHandler.INSTANCE);
        return fieldType.wildcardQuery(value, method, caseInsensitive, context);
    }
    

    根据查询语句中配置的rewrite,查找对应的MultiTermQuery.RewriteMethod,由于我们没有在wildcard查询语句中设置rewrite参数,这里直接返回null;

    //QueryParsers.java
    public static MultiTermQuery.RewriteMethod parseRewriteMethod(
        @Nullable String rewriteMethod,
        @Nullable MultiTermQuery.RewriteMethod defaultRewriteMethod,
        DeprecationHandler deprecationHandler
    ) {
        if (rewriteMethod == null) {
            return defaultRewriteMethod;
        }
        if (CONSTANT_SCORE.match(rewriteMethod, deprecationHandler)) {
            return MultiTermQuery.CONSTANT_SCORE_REWRITE;
        }
        if (SCORING_BOOLEAN.match(rewriteMethod, deprecationHandler)) {
            return MultiTermQuery.SCORING_BOOLEAN_REWRITE;
        }
        if (CONSTANT_SCORE_BOOLEAN.match(rewriteMethod, deprecationHandler)) {
            return MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE;
        }
    
        int firstDigit = -1;
        for (int i = 0; i < rewriteMethod.length(); ++i) {
            if (Character.isDigit(rewriteMethod.charAt(i))) {
                firstDigit = i;
                break;
            }
        }
    
        if (firstDigit >= 0) {
            final int size = Integer.parseInt(rewriteMethod.substring(firstDigit));
            String rewriteMethodName = rewriteMethod.substring(0, firstDigit);
    
            if (TOP_TERMS.match(rewriteMethodName, deprecationHandler)) {
                return new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(size);
            }
            if (TOP_TERMS_BOOST.match(rewriteMethodName, deprecationHandler)) {
                return new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(size);
            }
            if (TOP_TERMS_BLENDED_FREQS.match(rewriteMethodName, deprecationHandler)) {
                return new MultiTermQuery.TopTermsBlendedFreqScoringRewrite(size);
            }
        }
    
        throw new IllegalArgumentException("Failed to parse rewrite_method [" + rewriteMethod + "]");
    }
    }
    

    WildCardQuery继承MultiTermQuery,直接调用rewrite方法进行重写,由于我们没有在wildcard查询语句中设置rewrite参数,这里直接使用默认的CONSTANT_SCORE_REWRITE;

      //MultiTermQuery.java
      protected RewriteMethod rewriteMethod = CONSTANT_SCORE_REWRITE;
      
      
      @Override
      public final Query rewrite(IndexReader reader) throws IOException {
        return rewriteMethod.rewrite(reader, this);
      }
    

    可以看到CONSTANT_SCORE_REWRITE是直接使用的匿名类,rewrite方法返回的是MultiTermQueryConstantScoreWrapper的实例;

      //MultiTermQuery.java
      public static final RewriteMethod CONSTANT_SCORE_REWRITE =
          new RewriteMethod() {
            @Override
            public Query rewrite(IndexReader reader, MultiTermQuery query) {
              return new MultiTermQueryConstantScoreWrapper<>(query);
            }
          };
    

    在以下方法中,首先会得到查询字段对应的所有term集合;
    然后通过 query.getTermsEnum获取跟查询匹配的所有term集合;
    最后根据collectTerms调用的返回值决定是否构建bool查询还是bit set;

          //MultiTermQueryConstantScoreWrapper.java
          private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
            final Terms terms = context.reader().terms(query.field);
            if (terms == null) {
              // field does not exist
              return new WeightOrDocIdSet((DocIdSet) null);
            }
    
            final TermsEnum termsEnum = query.getTermsEnum(terms);
            assert termsEnum != null;
    
            PostingsEnum docs = null;
    
            final List<TermAndState> collectedTerms = new ArrayList<>();
            if (collectTerms(context, termsEnum, collectedTerms)) {
              // build a boolean query
              BooleanQuery.Builder bq = new BooleanQuery.Builder();
              for (TermAndState t : collectedTerms) {
                final TermStates termStates = new TermStates(searcher.getTopReaderContext());
                termStates.register(t.state, context.ord, t.docFreq, t.totalTermFreq);
                bq.add(new TermQuery(new Term(query.field, t.term), termStates), Occur.SHOULD);
              }
              Query q = new ConstantScoreQuery(bq.build());
              final Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
              return new WeightOrDocIdSet(weight);
            }
    
            // Too many terms: go back to the terms we already collected and start building the bit set
            DocIdSetBuilder builder = new DocIdSetBuilder(context.reader().maxDoc(), terms);
            if (collectedTerms.isEmpty() == false) {
              TermsEnum termsEnum2 = terms.iterator();
              for (TermAndState t : collectedTerms) {
                termsEnum2.seekExact(t.term, t.state);
                docs = termsEnum2.postings(docs, PostingsEnum.NONE);
                builder.add(docs);
              }
            }
    
            // Then keep filling the bit set with remaining terms
            do {
              docs = termsEnum.postings(docs, PostingsEnum.NONE);
              builder.add(docs);
            } while (termsEnum.next() != null);
    
            return new WeightOrDocIdSet(builder.build());
          }
    

    调用collectTerms默认只会提取查询命中的16个关键字;

          //MultiTermQueryConstantScoreWrapper.java
          private static final int BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD = 16;
          private boolean collectTerms(
              LeafReaderContext context, TermsEnum termsEnum, List<TermAndState> terms)
              throws IOException {
            final int threshold =
                Math.min(BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD, IndexSearcher.getMaxClauseCount());
            for (int i = 0; i < threshold; ++i) {
              final BytesRef term = termsEnum.next();
              if (term == null) {
                return true;
              }
              TermState state = termsEnum.termState();
              terms.add(
                  new TermAndState(
                      BytesRef.deepCopyOf(term),
                      state,
                      termsEnum.docFreq(),
                      termsEnum.totalTermFreq()));
            }
            return termsEnum.next() == null;
          }
    

    通过以上分析wildcard查询默认情况下,会提取字段中所有命中查询的关键字;

    四、fvh Highlighter中wildcard的查询重写

    在muti term query中,提取查询关键字是高亮逻辑一个很重要的步骤;

    我们使用以下高亮语句,分析以下高亮中提取查询关键字过程中的查询重写;

    {
        "query":{
            "wildcard":{
                "text":{
                    "value":"m*"
                }
            }
        },
        "highlight":{
            "fields":{
                "text":{
                    "type":"fvh"
                }
            }
        }
    }
    

    默认情况下只有匹配的字段才会进行高亮,这里构建CustomFieldQuery;

    //FastVectorHighlighter.java
    if (field.fieldOptions().requireFieldMatch()) {
        /*
         * we use top level reader to rewrite the query against all readers,
         * with use caching it across hits (and across readers...)
         */
        entry.fieldMatchFieldQuery = new CustomFieldQuery(
            fieldContext.query,
            hitContext.topLevelReader(),
            true,
            field.fieldOptions().requireFieldMatch()
        );
    }
    

    通过调用flatten方法得到重写之后的flatQueries,然后将每个提取的关键字重写为BoostQuery;

      //FieldQuery.java
      public FieldQuery(Query query, IndexReader reader, boolean phraseHighlight, boolean fieldMatch)
          throws IOException {
        this.fieldMatch = fieldMatch;
        Set<Query> flatQueries = new LinkedHashSet<>();
        flatten(query, reader, flatQueries, 1f);
        saveTerms(flatQueries, reader);
        Collection<Query> expandQueries = expand(flatQueries);
    
        for (Query flatQuery : expandQueries) {
          QueryPhraseMap rootMap = getRootMap(flatQuery);
          rootMap.add(flatQuery, reader);
          float boost = 1f;
          while (flatQuery instanceof BoostQuery) {
            BoostQuery bq = (BoostQuery) flatQuery;
            flatQuery = bq.getQuery();
            boost *= bq.getBoost();
          }
          if (!phraseHighlight && flatQuery instanceof PhraseQuery) {
            PhraseQuery pq = (PhraseQuery) flatQuery;
            if (pq.getTerms().length > 1) {
              for (Term term : pq.getTerms()) rootMap.addTerm(term, boost);
            }
          }
        }
      }
    

    由于WildCardQuery是MultiTermQuery的子类,所以在flatten方法中最终直接使用MultiTermQuery.TopTermsScoringBooleanQueryRewrite进行查询重写,这里的top N是MAX_MTQ_TERMS = 1024;

      //FieldQuery.java
      
      private static final int MAX_MTQ_TERMS = 1024;
      
      protected void flatten(
          Query sourceQuery, IndexReader reader, Collection<Query> flatQueries, float boost)
          throws IOException {
          
         ..................................
         ..................................
          
         else if (reader != null) {
          Query query = sourceQuery;
          Query rewritten;
          if (sourceQuery instanceof MultiTermQuery) {
            rewritten =
                new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS)
                    .rewrite(reader, (MultiTermQuery) query);
          } else {
            rewritten = query.rewrite(reader);
          }
          if (rewritten != query) {
            // only rewrite once and then flatten again - the rewritten query could have a speacial
            // treatment
            // if this method is overwritten in a subclass.
            flatten(rewritten, reader, flatQueries, boost);
          }
          // if the query is already rewritten we discard it
        }
        // else discard queries
      }
    

    这里首先计算设置的size和getMaxSize(默认值1024, IndexSearcher.getMaxClauseCount())计算最终提取的命中关键字数量,这里最终是1024个;

    这里省略了传入collectTerms的TermCollector匿名子类的实现,其余最终提取关键字数量有关;

      //FieldQuery.java
    
      @Override
      public final Query rewrite(final IndexReader reader, final MultiTermQuery query)
          throws IOException {
        final int maxSize = Math.min(size, getMaxSize());
        final PriorityQueue<ScoreTerm> stQueue = new PriorityQueue<>();
        collectTerms(
            reader,
            query,
            new TermCollector() {       
    
              ................
    
            });
    
        .............
        return build(b);
      }
    

    这里首先获取查询字段对应的所有term集合,然后获取所有的与查询匹配的term集合,最终通过传入的collector提取关键字;

      //TermCollectingRewrite.java
      final void collectTerms(IndexReader reader, MultiTermQuery query, TermCollector collector)
          throws IOException {
        IndexReaderContext topReaderContext = reader.getContext();
        for (LeafReaderContext context : topReaderContext.leaves()) {
          final Terms terms = context.reader().terms(query.field);
          if (terms == null) {
            // field does not exist
            continue;
          }
    
          final TermsEnum termsEnum = getTermsEnum(query, terms, collector.attributes);
          assert termsEnum != null;
    
          if (termsEnum == TermsEnum.EMPTY) continue;
    
          collector.setReaderContext(topReaderContext, context);
          collector.setNextEnum(termsEnum);
          BytesRef bytes;
          while ((bytes = termsEnum.next()) != null) {
            if (!collector.collect(bytes))
              return; // interrupt whole term collection, so also don't iterate other subReaders
          }
        }
      }
    

    这里通过控制最终提取匹配查询的关键字的数量不超过maxSize;

              //TopTermsRewrite.java
              @Override
              public boolean collect(BytesRef bytes) throws IOException {
                final float boost = boostAtt.getBoost();
    
                // make sure within a single seg we always collect
                // terms in order
                assert compareToLastTerm(bytes);
    
                // System.out.println("TTR.collect term=" + bytes.utf8ToString() + " boost=" + boost + "
                // ord=" + readerContext.ord);
                // ignore uncompetitive hits
                if (stQueue.size() == maxSize) {
                  final ScoreTerm t = stQueue.peek();
                  if (boost < t.boost) return true;
                  if (boost == t.boost && bytes.compareTo(t.bytes.get()) > 0) return true;
                }
                ScoreTerm t = visitedTerms.get(bytes);
                final TermState state = termsEnum.termState();
                assert state != null;
                if (t != null) {
                  // if the term is already in the PQ, only update docFreq of term in PQ
                  assert t.boost == boost : "boost should be equal in all segment TermsEnums";
                  t.termState.register(
                      state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
                } else {
                  // add new entry in PQ, we must clone the term, else it may get overwritten!
                  st.bytes.copyBytes(bytes);
                  st.boost = boost;
                  visitedTerms.put(st.bytes.get(), st);
                  assert st.termState.docFreq() == 0;
                  st.termState.register(
                      state, readerContext.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
                  stQueue.offer(st);
                  // possibly drop entries from queue
                  if (stQueue.size() > maxSize) {
                    st = stQueue.poll();
                    visitedTerms.remove(st.bytes.get());
                    st.termState.clear(); // reset the termstate!
                  } else {
                    st = new ScoreTerm(new TermStates(topReaderContext));
                  }
                  assert stQueue.size() <= maxSize : "the PQ size must be limited to maxSize";
                  // set maxBoostAtt with values to help FuzzyTermsEnum to optimize
                  if (stQueue.size() == maxSize) {
                    t = stQueue.peek();
                    maxBoostAtt.setMaxNonCompetitiveBoost(t.boost);
                    maxBoostAtt.setCompetitiveTerm(t.bytes.get());
                  }
                }
    
                return true;
              }
    

    通过以上分析可以看到,fvh Highlighter对multi term query的重写,直接使用MultiTermQuery.TopTermsScoringBooleanQueryRewrite,并限制只能最多提取查询关键字1024个;

    五、重写可能导致的高亮问题原因分析

    经过以上对查询和高亮的重写过程分析可以知道,默认情况下

    query阶段提取的是命中查询的所有的关键字,具体行为可以通过rewrite参数进行定制;

    Highlight阶段提取的是命中查询的关键字中的前1024个,具体行为不受rewrite参数的控制;

    如果查询的字段是大文本字段,导致字段的关键字很多,就可能会出现查询命中的文档的关键字不在前1024个里边,从而导致明明匹配了文档,但是却没有返回高亮信息;

    六、解决方案

    1. 进一步明确查询关键字,减少查询命中的关键字的数量,例如输入更多的字符,;
    2. 使用其他类型的查询替换multi term query;
  • 相关阅读:
    8c sql手册 六
    Netapp数据恢复—Netapp存储误删除lun的数据恢复过程
    Android逆向——脱壳解析
    【408】数据结构知识点(查漏补缺)
    我们离成为C++、C#、MySQL之父有多远?
    java连接SQL Sever数据库(超详细!)
    CYEZ 模拟赛 4
    『忘了再学』Shell基础 — 12、用户自定义变量
    Kamiya丨Kamiya艾美捷人α1-抗糜蛋白酶ELISA说明书
    C++指针危险的原因
  • 原文地址:https://www.cnblogs.com/wufengtinghai/p/16074845.html