ElasticSearch 7.3 Learning (11) ---- Customizing Analyzers (Analyzer)


    1. The Default Analyzer

    Analyzers were introduced in an earlier post: ElasticSearch7.3 学习之倒排索引揭秘及初识分词器(Analyzer). Here we only recap the default analyzer, the standard analyzer.
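
    For reference, the standard analyzer is essentially the standard tokenizer combined with the lowercase token filter (its stop filter is disabled by default). As a rough sketch, it can be rebuilt as an equivalent custom analyzer like this (the index and analyzer names here are purely illustrative):

    PUT /standard_example
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "rebuilt_standard": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase"
              ]
            }
          }
        }
      }
    }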

    2. Modifying Analyzer Settings

    First, define a custom analyzer named es_std, based on the standard analyzer but with the English stopwords token filter enabled:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "es_std": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
      }
    }

    Elasticsearch returns an acknowledged response confirming the index was created.

    Next, test the two analyzers, starting with the default standard analyzer:

    GET /my_index/_analyze
    {
      "analyzer": "standard", 
      "text": "a dog is in the house"
    }

    The response:

    {
      "tokens" : [
        {
          "token" : "a",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "dog",
          "start_offset" : 2,
          "end_offset" : 5,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "is",
          "start_offset" : 6,
          "end_offset" : 8,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "in",
          "start_offset" : 9,
          "end_offset" : 11,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "the",
          "start_offset" : 12,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "house",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 5
        }
      ]
    }

    As you can see, the text is simply split into individual words. Next, test the custom es_std analyzer defined above:

    GET /my_index/_analyze
    {
      "analyzer": "es_std",
      "text":"a dog is in the house"
    }

    The response:

    {
      "tokens" : [
        {
          "token" : "dog",
          "start_offset" : 2,
          "end_offset" : 5,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "house",
          "start_offset" : 16,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 5
        }
      ]
    }

    Only two tokens remain: the English stopwords have all been removed.
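
    Incidentally, if you want such an analyzer applied to every text field in an index without naming it in each mapping, you can register it under the reserved analyzer name default. A minimal sketch (the index name my_index2 is just for illustration):

    PUT /my_index2
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
      }
    }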

    3. Customizing Your Own Analyzer

    First, delete the index created above:

    DELETE my_index

    Then run the request below. Its rules, briefly: the html_strip character filter removes HTML tags, a custom mapping character filter converts & to and, the standard tokenizer splits the text into terms, and the token filters then lowercase the terms and remove the stopwords a and the. It is worth studying carefully; the analyzer is tested below.

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "&_to_and": {
              "type": "mapping",
              "mappings": [
                "&=> and"
              ]
            }
          },
          "filter": {
            "my_stopwords": {
              "type": "stop",
              "stopwords": [
                "the",
                "a"
              ]
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "char_filter": [
                "html_strip",
                "&_to_and"
              ],
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "my_stopwords"
              ]
            }
          }
        }
      }
    }

    The response:

    {
      "acknowledged" : true,
      "shards_acknowledged" : true,
      "index" : "my_index"
    }
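
    Before testing the whole analyzer, you can also exercise individual components, since the _analyze API accepts an ad-hoc combination of char_filter, tokenizer, and filter. A quick sketch that isolates the two character filters (the keyword tokenizer keeps the text as a single token, so their effect is easy to see; the sample text is arbitrary):

    GET /my_index/_analyze
    {
      "char_filter": [
        "html_strip",
        "&_to_and"
      ],
      "tokenizer": "keyword",
      "text": "tom&jerry <a>"
    }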

    As usual, test the full analyzer:

    GET /my_index/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "tom&jerry are a friend in the house, <a>, HAHA!!"
    }

    The result:

    {
      "tokens" : [
        {
          "token" : "tomandjerry",
          "start_offset" : 0,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "are",
          "start_offset" : 10,
          "end_offset" : 13,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "friend",
          "start_offset" : 16,
          "end_offset" : 22,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "in",
          "start_offset" : 23,
          "end_offset" : 25,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "house",
          "start_offset" : 30,
          "end_offset" : 35,
          "type" : "<ALPHANUM>",
          "position" : 6
        },
        {
          "token" : "haha",
          "start_offset" : 42,
          "end_offset" : 46,
          "type" : "<ALPHANUM>",
          "position" : 7
        }
      ]
    }
    Everything behaves as configured: & was mapped to and (producing tomandjerry), the <a> tag was stripped by html_strip, HAHA!! was lowercased to haha, and the stopwords a and the were removed (note the gaps in the position values where they used to be).

    Finally, to use the custom analyzer in practice, assign it to a specific field in the mapping:

    PUT /my_index/_mapping
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
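
    To confirm the field is wired up, you can run _analyze against the field itself; Elasticsearch then uses the analyzer configured in the mapping (the sample text is arbitrary):

    GET /my_index/_analyze
    {
      "field": "content",
      "text": "tom&jerry in the house"
    }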

     


