• Elasticsearch: Building a custom analyzer from scratch


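    Elasticsearch ships with a large number of analyzers and tokenizers that cover common needs out of the box. Sometimes, though, we need to extend Elasticsearch with a new analyzer: despite the rich built-in set, we often want one tailored to our own language or to a special requirement. Typically, you create an analysis plugin when you need to: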

    • Add a standard Lucene analyzer or tokenizer that Elasticsearch does not provide.
    • Integrate a third-party analyzer.
    • Add a custom analyzer.

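    For Chinese text there are a number of well-known analyzers. They are all open source and built specifically for Chinese. To learn more about using them, see the "Chinese analyzers" (中文分词器) section of the article "Elastic:开发者上手指南".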

    In this article we add a new custom English analyzer, similar to the one Elasticsearch provides. For the exercise we build it against Elastic Stack 8.4.0, the latest release at the time of writing.

     Installation

    If you have not installed the Elastic Stack yet, install Elasticsearch and Kibana first by following their installation guides.

    Creating the plugin template

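    In earlier articles I showed how to create processors for an ingest pipeline. In the second of those articles I used a Maven archetype called elasticsearch-plugin-archetype. We can use the following command to generate a bare-bones plugin template: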

    mvn archetype:generate \
        -DarchetypeGroupId=org.codelibs \
        -DarchetypeArtifactId=elasticsearch-plugin-archetype \
        -DarchetypeVersion=6.6.0 \
        -DgroupId=com.liuxg \
        -DartifactId=elasticsearch-plugin \
        -Dversion=1.0-SNAPSHOT \
        -DpluginName=analyzer

    This generates a bare-bones plugin template in a new directory called elasticsearch-plugin under the current directory. Let's go into that directory first:

    $ pwd
    /Users/liuxg/java/plugins/elasticsearch-plugin
    $ tree -L 8
    .
    ├── pom.xml
    └── src
        └── main
            ├── assemblies
            │   └── plugin.xml
            ├── java
            │   └── com
            │       └── liuxg
            │           ├── analyzerPlugin.java
            │           └── rest
            │               └── RestanalyzerAction.java
            └── plugin-metadata
                └── plugin-descriptor.properties

    Because the template was originally designed for a REST handler, we reorganize its files into the following structure:

    $ pwd
    /Users/liuxg/java/plugins/elasticsearch-plugin
    $ tree -L 8
    .
    ├── pom.xml
    └── src
        └── main
            ├── assemblies
            │   └── plugin.xml
            ├── java
            │   └── com
            │       └── liuxg
            │           ├── index
            │           │   └── analysis
            │           │       └── RestanalyzerAction.java
            │           └── plugin
            │               └── analysis
            │                   └── analyzerPlugin.java
            └── plugin-metadata
                └── plugin-descriptor.properties

    That is the new file layout. Because we want to build the plugin for Elastic Stack 8.4.0, we must update the version information in pom.xml accordingly:

    pom.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <name>elasticsearch-plugin</name>
        <modelVersion>4.0.0</modelVersion>
        <groupId>com.liuxg</groupId>
        <artifactId>elasticsearch-plugin</artifactId>
        <version>1.0-SNAPSHOT</version>
        <packaging>jar</packaging>
        <description>elasticsearch analyzer plugin</description>
        <inceptionYear>2019</inceptionYear>
        <licenses>
            <license>
                <name>The Apache Software License, Version 2.0</name>
                <url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
                <distribution>repo</distribution>
            </license>
        </licenses>
        <properties>
            <elasticsearch.version>8.4.0</elasticsearch.version>
            <elasticsearch.plugin.classname>com.liuxg.analyzerPlugin</elasticsearch.plugin.classname>
            <log4j.version>2.11.1</log4j.version>
            <maven.compiler.source>1.8</maven.compiler.source>
            <maven.compiler.target>1.8</maven.compiler.target>
        </properties>
        <build>
            <plugins>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.8.0</version>
                    <configuration>
                        <source>${maven.compiler.source}</source>
                        <target>${maven.compiler.target}</target>
                        <encoding>UTF-8</encoding>
                    </configuration>
                </plugin>
                <plugin>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <version>2.22.1</version>
                    <configuration>
                        <includes>
                            <include>**/*Tests.java</include>
                        </includes>
                    </configuration>
                </plugin>
                <plugin>
                    <artifactId>maven-source-plugin</artifactId>
                    <version>3.0.1</version>
                    <executions>
                        <execution>
                            <id>attach-sources</id>
                            <goals>
                                <goal>jar</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
                <plugin>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>3.1.0</version>
                    <configuration>
                        <appendAssemblyId>false</appendAssemblyId>
                        <outputDirectory>${project.build.directory}/releases/</outputDirectory>
                        <descriptors>
                            <descriptor>${basedir}/src/main/assemblies/plugin.xml</descriptor>
                        </descriptors>
                    </configuration>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
        <dependencies>
            <dependency>
                <groupId>org.elasticsearch</groupId>
                <artifactId>elasticsearch</artifactId>
                <version>${elasticsearch.version}</version>
                <scope>provided</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.logging.log4j</groupId>
                <artifactId>log4j-api</artifactId>
                <version>${log4j.version}</version>
                <scope>provided</scope>
            </dependency>
        </dependencies>
    </project>

    Above, we set elasticsearch.version to 8.4.0 and leave everything else unchanged.

    Next, we modify analyzerPlugin.java:

    analyzerPlugin.java

    package com.liuxg.plugin.analysis;

    import com.liuxg.index.analysis.CustomEnglishAnalyzerProvider;
    import org.apache.lucene.analysis.Analyzer;
    import org.elasticsearch.index.analysis.AnalyzerProvider;
    import org.elasticsearch.indices.analysis.AnalysisModule;
    import org.elasticsearch.plugins.AnalysisPlugin;
    import org.elasticsearch.plugins.Plugin;

    import java.util.HashMap;
    import java.util.Map;

    public class analyzerPlugin extends Plugin implements AnalysisPlugin {

        // Map each analyzer name to the factory that builds its provider.
        @Override
        public Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> getAnalyzers() {
            Map<String, AnalysisModule.AnalysisProvider<AnalyzerProvider<? extends Analyzer>>> analyzers = new HashMap<>();
            analyzers.put(CustomEnglishAnalyzerProvider.NAME, CustomEnglishAnalyzerProvider::getCustomEnglishAnalyzerProvider);
            return analyzers;
        }
    }

    In the code above, the plugin registers our analyzer under its name.
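    To make the registration step concrete, here is a minimal, self-contained sketch of the same name-to-factory pattern in plain Java. It does not use the Elasticsearch API: AnalyzerRegistrySketch, ToyProvider and create are hypothetical stand-ins invented for illustration, and Function<String, ToyProvider> only plays the role that AnalysisModule.AnalysisProvider plays in the real plugin.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: none of these types belong to Elasticsearch.
public class AnalyzerRegistrySketch {

    // Hypothetical stand-in for an analyzer provider; it only records its configuration.
    public static final class ToyProvider {
        public final String name;
        public final boolean ignoreCase;

        ToyProvider(String name, boolean ignoreCase) {
            this.name = name;
            this.ignoreCase = ignoreCase;
        }
    }

    public static final String NAME = "custom_english";

    // Static factory; its shape (String -> ToyProvider) matches the functional interface below.
    public static ToyProvider create(String name) {
        return new ToyProvider(name, true);
    }

    // The registry maps an analyzer name to a factory, passed as a method reference.
    public static Map<String, Function<String, ToyProvider>> getAnalyzers() {
        Map<String, Function<String, ToyProvider>> analyzers = new HashMap<>();
        analyzers.put(NAME, AnalyzerRegistrySketch::create);
        return analyzers;
    }

    public static void main(String[] args) {
        // Look the factory up by name, then invoke it to build the provider.
        ToyProvider provider = getAnalyzers().get(NAME).apply(NAME);
        System.out.println(provider.name + " ignoreCase=" + provider.ignoreCase);
        // prints "custom_english ignoreCase=true"
    }
}
```

    The point of the pattern is that the provider is not built at registration time: only the factory is stored, and it is invoked later, when an analyzer with that name is actually needed.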

    Next, we rename RestanalyzerAction.java to CustomEnglishAnalyzerProvider.java and replace its contents:

    CustomEnglishAnalyzerProvider.java

    package com.liuxg.index.analysis;

    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.env.Environment;
    import org.elasticsearch.index.IndexSettings;
    import org.elasticsearch.index.analysis.AbstractIndexAnalyzerProvider;
    import org.elasticsearch.index.analysis.Analysis;

    public class CustomEnglishAnalyzerProvider extends AbstractIndexAnalyzerProvider<EnglishAnalyzer> {

        // Name under which the analyzer is registered and referenced in requests.
        public static final String NAME = "custom_english";

        private final EnglishAnalyzer analyzer;

        public CustomEnglishAnalyzerProvider(IndexSettings indexSettings, Environment env, String name, Settings settings,
                                             boolean ignoreCase) {
            super(name, settings);
            // Passing null as the default stop-word set leaves the analyzer without default stop words.
            analyzer = new EnglishAnalyzer(
                Analysis.parseStopWords(env, settings, null, ignoreCase),
                Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET));
        }

        // Factory used as a method reference when the plugin registers the analyzer.
        public static CustomEnglishAnalyzerProvider getCustomEnglishAnalyzerProvider(IndexSettings indexSettings,
                                                                                     Environment env, String name,
                                                                                     Settings settings) {
            return new CustomEnglishAnalyzerProvider(indexSettings, env, name, settings, true);
        }

        @Override
        public EnglishAnalyzer get() {
            return this.analyzer;
        }
    }

    Note that we name the analyzer custom_english. To set it apart from the built-in english analyzer, we deliberately pass null as the default stop words when instantiating the analyzer:

    analyzer = new EnglishAnalyzer(
        Analysis.parseStopWords(env, settings, null, ignoreCase),
        Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET));

    The built-in english analyzer is instantiated like this instead:

    analyzer = new EnglishAnalyzer(
        Analysis.parseStopWords(env, settings, EnglishAnalyzer.getDefaultStopSet(), ignoreCase),
        Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET));

    In other words, our custom_english analyzer comes with no default stop words at all.
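    The effect of the stop-word set can be illustrated with a small, self-contained sketch in plain Java. It is not Lucene and does no stemming: StopWordSketch and tokens are hypothetical names, the tokenizer simply lowercases and splits on non-alphanumeric characters, and the three-word stop set is an assumption chosen for the example rather than the real default set of the english analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: a toy tokenizer with stop-word removal, not Lucene's EnglishAnalyzer.
public class StopWordSketch {

    // Lowercase the text, split on runs of non-alphanumeric characters, drop stop words.
    public static List<String> tokens(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !stopWords.contains(raw)) {
                out.add(raw);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is so cool. I like the idea";

        // Assumed stop set for the example; the real default set is larger.
        Set<String> someStopWords = new HashSet<>(Arrays.asList("this", "is", "the"));
        System.out.println(tokens(text, someStopWords));
        // prints "[so, cool, i, like, idea]"

        // An empty stop set keeps every word, as custom_english does.
        System.out.println(tokens(text, new HashSet<>()));
        // prints "[this, is, so, cool, i, like, the, idea]"
    }
}
```

    The only difference between the two calls is the set handed to the filter, which mirrors the single argument that differs between the two EnglishAnalyzer constructor calls above.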

    After these changes, the file layout becomes:

    $ pwd
    /Users/liuxg/java/plugins/elasticsearch-plugin
    $ tree -L 8
    .
    ├── pom.xml
    └── src
        └── main
            ├── assemblies
            │   └── plugin.xml
            ├── java
            │   └── com
            │       └── liuxg
            │           ├── index
            │           │   └── analysis
            │           │       └── CustomEnglishAnalyzerProvider.java
            │           └── plugin
            │               └── analysis
            │                   └── analyzerPlugin.java
            └── plugin-metadata
                └── plugin-descriptor.properties

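    Because we changed the file layout, we also need to update pom.xml again: the elasticsearch.plugin.classname property must point at the plugin class in its new package. Going by the package declaration in analyzerPlugin.java, the line should read:

    <elasticsearch.plugin.classname>com.liuxg.plugin.analysis.analyzerPlugin</elasticsearch.plugin.classname>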

    Building

    From the project root directory, we build the plugin with the following command:

    mvn clean install
    $ pwd
    /Users/liuxg/java/plugins/elasticsearch-plugin
    $ mvn clean install
    [INFO] Scanning for projects...
    [INFO]
    [INFO] -------------------< com.liuxg:elasticsearch-plugin >-------------------
    [INFO] Building elasticsearch-plugin 1.0-SNAPSHOT
    [INFO] --------------------------------[ jar ]---------------------------------
    [INFO]
    [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ elasticsearch-plugin ---
    [INFO]
    [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ elasticsearch-plugin ---
    [WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
    [INFO] skip non existing resourceDirectory /Users/liuxg/java/plugins/elasticsearch-plugin/src/main/resources
    [INFO]
    [INFO] --- maven-compiler-plugin:3.8.0:compile (default-compile) @ elasticsearch-plugin ---
    [INFO] Changes detected - recompiling the module!
    [INFO] Compiling 2 source files to /Users/liuxg/java/plugins/elasticsearch-plugin/target/classes
    [INFO]
    [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ elasticsearch-plugin ---
    [WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
    [INFO] skip non existing resourceDirectory /Users/liuxg/java/plugins/elasticsearch-plugin/src/test/resources
    [INFO]
    [INFO] --- maven-compiler-plugin:3.8.0:testCompile (default-testCompile) @ elasticsearch-plugin ---
    [INFO] No sources to compile
    [INFO]
    [INFO] --- maven-surefire-plugin:2.22.1:test (default-test) @ elasticsearch-plugin ---
    [INFO] No tests to run.
    [INFO]
    [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ elasticsearch-plugin ---
    [INFO] Building jar: /Users/liuxg/java/plugins/elasticsearch-plugin/target/elasticsearch-plugin-1.0-SNAPSHOT.jar
    [INFO]
    [INFO] >>> maven-source-plugin:3.0.1:jar (attach-sources) > generate-sources @ elasticsearch-plugin >>>
    [INFO]
    [INFO] <<< maven-source-plugin:3.0.1:jar (attach-sources) < generate-sources @ elasticsearch-plugin <<<
    [INFO]
    [INFO]
    [INFO] --- maven-source-plugin:3.0.1:jar (attach-sources) @ elasticsearch-plugin ---
    [INFO] Building jar: /Users/liuxg/java/plugins/elasticsearch-plugin/target/elasticsearch-plugin-1.0-SNAPSHOT-sources.jar
    [INFO]
    [INFO] --- maven-assembly-plugin:3.1.0:single (default) @ elasticsearch-plugin ---
    [INFO] Reading assembly descriptor: /Users/liuxg/java/plugins/elasticsearch-plugin/src/main/assemblies/plugin.xml
    [WARNING] The following patterns were never triggered in this artifact exclusion filter:
    o 'org.elasticsearch:elasticsearch'
    [INFO] Building zip: /Users/liuxg/java/plugins/elasticsearch-plugin/target/releases/elasticsearch-plugin-1.0-SNAPSHOT.zip
    [INFO]
    [INFO] --- maven-install-plugin:2.4:install (default-install) @ elasticsearch-plugin ---
    [INFO] Installing /Users/liuxg/java/plugins/elasticsearch-plugin/target/elasticsearch-plugin-1.0-SNAPSHOT.jar to /Users/liuxg/.m2/repository/com/liuxg/elasticsearch-plugin/1.0-SNAPSHOT/elasticsearch-plugin-1.0-SNAPSHOT.jar
    [INFO] Installing /Users/liuxg/java/plugins/elasticsearch-plugin/pom.xml to /Users/liuxg/.m2/repository/com/liuxg/elasticsearch-plugin/1.0-SNAPSHOT/elasticsearch-plugin-1.0-SNAPSHOT.pom
    [INFO] Installing /Users/liuxg/java/plugins/elasticsearch-plugin/target/elasticsearch-plugin-1.0-SNAPSHOT-sources.jar to /Users/liuxg/.m2/repository/com/liuxg/elasticsearch-plugin/1.0-SNAPSHOT/elasticsearch-plugin-1.0-SNAPSHOT-sources.jar
    [INFO] Installing /Users/liuxg/java/plugins/elasticsearch-plugin/target/releases/elasticsearch-plugin-1.0-SNAPSHOT.zip to /Users/liuxg/.m2/repository/com/liuxg/elasticsearch-plugin/1.0-SNAPSHOT/elasticsearch-plugin-1.0-SNAPSHOT.zip
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 5.266 s
    [INFO] Finished at: 2022-09-07T13:54:06+08:00
    [INFO] ------------------------------------------------------------------------

    After a successful build, the installable file appears under the target directory:

    $ pwd
    /Users/liuxg/java/plugins/elasticsearch-plugin
    $ ls target/releases/
    elasticsearch-plugin-1.0-SNAPSHOT.zip

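    The elasticsearch-plugin-1.0-SNAPSHOT.zip shown above is the plugin file we can install.

    Installing and testing the plugin

    Next, we switch to the Elasticsearch installation directory and run the following commands: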

    $ pwd
    /Users/liuxg/elastic0/elasticsearch-8.4.0
    $ bin/elasticsearch-plugin install file:///Users/liuxg/java/plugins/elasticsearch-plugin/target/releases/elasticsearch-plugin-1.0-SNAPSHOT.zip
    -> Installing file:///Users/liuxg/java/plugins/elasticsearch-plugin/target/releases/elasticsearch-plugin-1.0-SNAPSHOT.zip
    -> Downloading file:///Users/liuxg/java/plugins/elasticsearch-plugin/target/releases/elasticsearch-plugin-1.0-SNAPSHOT.zip
    [=================================================] 100%
    -> Installed analyzer
    -> Please restart Elasticsearch to activate any plugins installed
    $ ./bin/elasticsearch-plugin list
    analyzer

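    The output shows that the analyzer plugin was installed successfully. Next we must restart Elasticsearch. This step is essential: as the installer output says, plugins are only activated after a restart.

    Now open Kibana and enter the following commands. First, the normal case with the built-in analyzer: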

    GET _analyze
    {
      "text": "This is so cool. I like the idea",
      "analyzer": "english"
    }

    Here we use the english analyzer. In its response, this, is, and the are treated as stop words, so none of them appear among the returned tokens.

    We can call the custom_english analyzer we just built with the following command:

    GET _analyze
    {
      "text": "This is so cool. I like the idea",
      "analyzer": "custom_english"
    }

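    Here we use the custom_english analyzer we just created. In its response, is, the, and thi (the stemmed form of this) all appear as tokens, because our custom analyzer sets no stop words. This is not very useful in practice, but it shows that the custom analyzer works.

    The complete code can be downloaded from GitHub: liu-xiao-guo/analyzer_plugin.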

  • Original article: https://blog.csdn.net/UbuntuTouch/article/details/126743397