• elasticsearch之使用正则表达式自定义分词逻辑


    一、Pattern Analyzer简介

    elasticsearch在索引和搜索之前都需要对输入的文本进行分词,elasticsearch提供的pattern analyzer使得我们可以通过正则表达式的简单方式来定义分隔符,从而达到自定义分词的处理逻辑;

    内置的的pattern analyzer的名字为pattern,其使用的模式是W+,即除了字母和数字之外的所有非单词字符;

    analyzers.add(new PreBuiltAnalyzerProviderFactory("pattern", CachingStrategy.ELASTICSEARCH,
                () -> new PatternAnalyzer(Regex.compile("\\W+" /*PatternAnalyzer.NON_WORD_PATTERN*/, null), true,
                CharArraySet.EMPTY_SET)));
    

    作为全局的pattern analyzer,我们可以直接使用

    POST _analyze
    {
      "analyzer": "pattern",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    
    {
      "tokens" : [
        {
          "token" : "the",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "2",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "quick",
          "start_offset" : 6,
          "end_offset" : 11,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "brown",
          "start_offset" : 12,
          "end_offset" : 17,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "foxes",
          "start_offset" : 18,
          "end_offset" : 23,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "jumped",
          "start_offset" : 24,
          "end_offset" : 30,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "over",
          "start_offset" : 31,
          "end_offset" : 35,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "the",
          "start_offset" : 36,
          "end_offset" : 39,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "lazy",
          "start_offset" : 40,
          "end_offset" : 44,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "dog",
          "start_offset" : 45,
          "end_offset" : 48,
          "type" : "word",
          "position" : 9
        },
        {
          "token" : "s",
          "start_offset" : 49,
          "end_offset" : 50,
          "type" : "word",
          "position" : 10
        },
        {
          "token" : "bone",
          "start_offset" : 51,
          "end_offset" : 55,
          "type" : "word",
          "position" : 11
        }
      ]
    }
    
    

    二、自定义Pattern Analyzer

    我们可以通过以下方式自定pattern analyzer,并设置分隔符为所有的空格符号;

    PUT my_pattern_test_space_analyzer
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_pattern_test_space_analyzer": {
              "type":      "pattern",
              "pattern":   "[\\p{Space}]", 
              "lowercase": true
            }
          }
        }
      }
    }
    

    我们使用自定义的pattern analyzer测试一下效果

    POST my_pattern_test_space_analyzer/_analyze
    {
      "analyzer": "my_pattern_test_space_analyzer",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
    }
    
    
    {
      "tokens" : [
        {
          "token" : "the",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "2",
          "start_offset" : 4,
          "end_offset" : 5,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "quick",
          "start_offset" : 6,
          "end_offset" : 11,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "brown-foxes",
          "start_offset" : 12,
          "end_offset" : 23,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "jumped",
          "start_offset" : 24,
          "end_offset" : 30,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "over",
          "start_offset" : 31,
          "end_offset" : 35,
          "type" : "word",
          "position" : 5
        },
        {
          "token" : "the",
          "start_offset" : 36,
          "end_offset" : 39,
          "type" : "word",
          "position" : 6
        },
        {
          "token" : "lazy",
          "start_offset" : 40,
          "end_offset" : 44,
          "type" : "word",
          "position" : 7
        },
        {
          "token" : "dog's",
          "start_offset" : 45,
          "end_offset" : 50,
          "type" : "word",
          "position" : 8
        },
        {
          "token" : "bone.",
          "start_offset" : 51,
          "end_offset" : 56,
          "type" : "word",
          "position" : 9
        }
      ]
    }
    

    三、常用的Java中的正则表达式

    elasticsearch的Pattern Analyzer使用的Java Regular Expressions,只有了解Java中一些常用的正则表达式才能更好的自定义pattern analyzer;

    单字符定义

    x	        The character x
    \\	        The backslash character
    \0n	        The character with octal value 0n (0 <= n <= 7)
    \0nn	    The character with octal value 0nn (0 <= n <= 7)
    \0mnn	    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
    \xhh	    The character with hexadecimal value 0xhh
    \uhhhh	    The character with hexadecimal value 0xhhhh
    \x{h...h}	The character with hexadecimal value 0xh...h (Character.MIN_CODE_POINT  <= 0xh...h <=  Character.MAX_CODE_POINT)
    \t	        The tab character ('\u0009')
    \n	        The newline (line feed) character ('\u000A')
    \r	        The carriage-return character ('\u000D')
    \f	        The form-feed character ('\u000C')
    \a	        The alert (bell) character ('\u0007')
    \e	        The escape character ('\u001B')
    \cx	        The control character corresponding to x
    

    字符分组

    [abc]	        a, b, or c (simple class)
    [^abc]	        Any character except a, b, or c (negation)
    [a-zA-Z]	    a through z or A through Z, inclusive (range)
    [a-d[m-p]]	    a through d, or m through p: [a-dm-p] (union)
    [a-z&&[def]]	d, e, or f (intersection)
    [a-z&&[^bc]]	a through z, except for b and c: [ad-z] (subtraction)
    [a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z](subtraction)
    

    预定义的字符分组

    .	Any character (may or may not match line terminators)
    \d	A digit: [0-9]
    \D	A non-digit: [^0-9]
    \h	A horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]
    \H	A non-horizontal whitespace character: [^\h]
    \s	A whitespace character: [ \t\n\x0B\f\r]
    \S	A non-whitespace character: [^\s]
    \v	A vertical whitespace character: [\n\x0B\f\r\x85\u2028\u2029]
    \V	A non-vertical whitespace character: [^\v]
    \w	A word character: [a-zA-Z_0-9]
    \W	A non-word character: [^\w]
    

    POSIX字符分组

    \p{Lower}	A lower-case alphabetic character: [a-z]
    \p{Upper}	An upper-case alphabetic character:[A-Z]
    \p{ASCII}	All ASCII:[\x00-\x7F]
    \p{Alpha}	An alphabetic character:[\p{Lower}\p{Upper}]
    \p{Digit}	A decimal digit: [0-9]
    \p{Alnum}	An alphanumeric character:[\p{Alpha}\p{Digit}]
    \p{Punct}	Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    \p{Graph}	A visible character: [\p{Alnum}\p{Punct}]
    \p{Print}	A printable character: [\p{Graph}\x20]
    \p{Blank}	A space or a tab: [ \t]
    \p{Cntrl}	A control character: [\x00-\x1F\x7F]
    \p{XDigit}	A hexadecimal digit: [0-9a-fA-F]
    \p{Space}	A whitespace character: [ \t\n\x0B\f\r]
    

    以下我们通过正则表达式[\p{Punct}|\p{Space}]可以找出字符串中的标点符号;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class Main {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("[\\p{Punct}|\\p{Space}]");
            Matcher matcher = p.matcher("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.");
            while(matcher.find()){
                System.out.println("find "+matcher.group()
                        +" position: "+matcher.start()+"-"+matcher.end());
            }
        }
    }
    
    
    find   position: 3-4
    find   position: 5-6
    find   position: 11-12
    find - position: 17-18
    find   position: 23-24
    find   position: 30-31
    find   position: 35-36
    find   position: 39-40
    find   position: 44-45
    find ' position: 48-49
    find   position: 50-51
    find . position: 55-56
    

    四、 Pattern Analyzer的实现

    PatternAnalyzer会根据具体的配置信息,使用PatternTokenizer、LowerCaseFilter、StopFilter来组合构建TokenStreamComponents

    PatternAnalyzer.java 
    
    protected TokenStreamComponents createComponents(String s) {
        final Tokenizer tokenizer = new PatternTokenizer(pattern, -1);
        TokenStream stream = tokenizer;
        if (lowercase) {
            stream = new LowerCaseFilter(stream);
        }
        if (stopWords != null) {
            stream = new StopFilter(stream, stopWords);
        }
        return new TokenStreamComponents(tokenizer, stream);
    }
    

    PatternTokenizer里的incrementToken会对输入的文本进行分词处理;由于PatternAnalyzer里初始化PatternTokenizer里的incrementToken会对输入的文本进行分词处理的时候对group设置为-1,所以这里走else分支,最终提取命中符号之间的单词;

    PatternTokenizer.java
    
      @Override
      public boolean incrementToken() {
        if (index >= str.length()) return false;
        clearAttributes();
        if (group >= 0) {
        
          // match a specific group
          while (matcher.find()) {
            index = matcher.start(group);
            final int endIndex = matcher.end(group);
            if (index == endIndex) continue;       
            termAtt.setEmpty().append(str, index, endIndex);
            offsetAtt.setOffset(correctOffset(index), correctOffset(endIndex));
            return true;
          }
          
          index = Integer.MAX_VALUE; // mark exhausted
          return false;
          
        } else {
        
          // String.split() functionality
          while (matcher.find()) {
            if (matcher.start() - index > 0) {
              // found a non-zero-length token
              termAtt.setEmpty().append(str, index, matcher.start());
              offsetAtt.setOffset(correctOffset(index), correctOffset(matcher.start()));
              index = matcher.end();
              return true;
            }
            
            index = matcher.end();
          }
          
          if (str.length() - index == 0) {
            index = Integer.MAX_VALUE; // mark exhausted
            return false;
          }
          
          termAtt.setEmpty().append(str, index, str.length());
          offsetAtt.setOffset(correctOffset(index), correctOffset(str.length()));
          index = Integer.MAX_VALUE; // mark exhausted
          return true;
        }
      }
    
  • 相关阅读:
    CocosCreator3.8研究笔记(二十四)CocosCreator 动画系统-动画编辑器实操-关键帧实现动态水印动画效果
    K8S 证书替换
    Python绘图系统14:用tkinter做一个绘图风格控件
    (附源码)springboot学生宿舍管理系统 毕业设计 211955
    Benchmarking Detection Transfer Learning with Vision Transformers(2021-11)
    秒杀项目——多级缓存
    Xilinx XC7Z020双核ARM+FPGA开发板试用合集——体验ZYNQ下linux
    css实现风车效果
    在Python中调用imageJ开发
    ssh服务攻防与加固
  • 原文地址:https://www.cnblogs.com/wufengtinghai/p/17139668.html