• Elasticsearch 7.15.2 Chinese analysis plugins (IK + pinyin)


    Preface

    If you use Elasticsearch as-is to search Chinese content, you will quickly hit an awkward problem: Chinese words get split into individual characters, so when you build charts in Kibana and aggregate by term, each bucket ends up being a single character.

    This happens because Elasticsearch's default standard analyzer breaks Chinese text into one-character tokens. Installing a Chinese analysis plugin solves the problem.
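    You can reproduce the problem with the _analyze API: the built-in standard analyzer emits one token per Chinese character.

```
GET /_analyze
{
  "analyzer": "standard",
  "text": "中国渔船"
}
```

    The response contains four single-character tokens (中, 国, 渔, 船), which is exactly why term aggregations in Kibana group by individual characters.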

    Already installed: Elasticsearch 7.15.2 and Kibana 7.15.2

    Plugin downloads

    IK analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.15.2/elasticsearch-analysis-ik-7.15.2.zip

    pinyin analyzer: https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.15.2/elasticsearch-analysis-pinyin-7.15.2.zip

    To download the prebuilt zip packages:

    • pinyin download page: https://github.com/medcl/elasticsearch-analysis-pinyin

    Unzip each downloaded zip package into its own folder under Elasticsearch's plugins directory and upload it to the server (/usr/local/elasticsearch-7.15.2/plugins).

    • Restart Elasticsearch: systemctl restart elasticsearch.service
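    After the restart, it is worth confirming that both plugins actually loaded; the cat plugins API lists them:

```
GET /_cat/plugins?v
```

    The output should include analysis-ik and analysis-pinyin at version 7.15.2. If a plugin's version does not match the Elasticsearch version, the node will refuse to start and log the reason.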

    Testing the IK analyzer

    #create the index
    PUT /index

    #create the mapping
    POST /index/_mapping
    {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }

    #index some documents
    POST /index/_create/1
    {"content":"美国留给伊拉克的是个烂摊子吗"}
    POST /index/_create/2
    {"content":"公安部:各地校车将享最高路权"}
    POST /index/_create/3
    {"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
    POST /index/_create/4
    {"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}

    #search with highlighting
    POST /index/_search
    {
      "query" : { "match" : { "content" : "中国" }},
      "highlight" : {
        "pre_tags" : ["", ""],
        "post_tags" : ["", ""],
        "fields" : {
          "content" : {}
        }
      }
    }

    Response:

    {
      "took" : 4,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.642793,
        "hits" : [
          {
            "_index" : "index",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 0.642793,
            "_source" : {
              "content" : "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
            },
            "highlight" : {
              "content" : [
                "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
              ]
            }
          },
          {
            "_index" : "index",
            "_type" : "_doc",
            "_id" : "4",
            "_score" : 0.642793,
            "_source" : {
              "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
            },
            "highlight" : {
              "content" : [
                "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
              ]
            }
          }
        ]
      }
    }
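    The mapping above indexes with ik_max_word and analyzes queries with ik_smart. The difference between the two is easy to see with the _analyze API, a quick sketch:

```
GET /index/_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}

GET /index/_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}
```

    ik_max_word exhaustively emits overlapping sub-words (中华人民共和国, 中华, 人民, 共和国, and so on), which maximizes recall at index time, while ik_smart produces only the coarsest segmentation, which keeps search terms precise.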

    Testing the pinyin analyzer

    1. Create an index with a custom pinyin analyzer

    PUT /medcl/
    {
      "settings" : {
        "analysis" : {
          "analyzer" : {
            "pinyin_analyzer" : {
              "tokenizer" : "my_pinyin"
            }
          },
          "tokenizer" : {
            "my_pinyin" : {
              "type" : "pinyin",
              "keep_separate_first_letter" : false,
              "keep_full_pinyin" : true,
              "keep_original" : true,
              "limit_first_letter_length" : 16,
              "lowercase" : true,
              "remove_duplicated_term" : true
            }
          }
        }
      }
    }

    2. Test the analyzer on a Chinese name, e.g. 刘德华

    GET /medcl/_analyze
    {
      "text": ["刘德华"],
      "analyzer": "pinyin_analyzer"
    }

    Response:

    {
      "tokens" : [
        {
          "token" : "liu",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "de",
          "start_offset" : 1,
          "end_offset" : 2,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "hua",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "刘德华",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "ldh",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 4
        }
      ]
    }

    3. Create the mapping

    POST /medcl/_mapping
    {
      "properties": {
        "name": {
          "type": "keyword",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_offsets",
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        }
      }
    }

    4. Index a document

    POST /medcl/_create/andy
    {"name":"刘德华"}

    5. Test the search

    #search by pinyin
    GET /medcl/_search
    {
      "query": {
        "match": {
          "name.pinyin": "liu"
        }
      }
    }
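    The _analyze output in step 2 also produced the first-letter abbreviation ldh as a token, so the same document should be found by an abbreviation query too:

```
GET /medcl/_search
{
  "query": {
    "match": {
      "name.pinyin": "ldh"
    }
  }
}
```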

    6. Using a pinyin token filter

    PUT /medcl1/
    {
      "settings" : {
        "analysis" : {
          "analyzer" : {
            "user_name_analyzer" : {
              "tokenizer" : "whitespace",
              "filter" : "pinyin_first_letter_and_full_pinyin_filter"
            }
          },
          "filter" : {
            "pinyin_first_letter_and_full_pinyin_filter" : {
              "type" : "pinyin",
              "keep_first_letter" : true,
              "keep_full_pinyin" : false,
              "keep_none_chinese" : true,
              "keep_original" : false,
              "limit_first_letter_length" : 16,
              "lowercase" : true,
              "trim_whitespace" : true,
              "keep_none_chinese_in_first_letter" : true
            }
          }
        }
      }
    }

    Token test: 刘德华 张学友 郭富城 黎明 四大天王

    GET /medcl1/_analyze
    {
      "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
      "analyzer": "user_name_analyzer"
    }

    Response:

    {
      "tokens" : [
        {
          "token" : "ldh",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "zxy",
          "start_offset" : 4,
          "end_offset" : 7,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "gfc",
          "start_offset" : 8,
          "end_offset" : 11,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "lm",
          "start_offset" : 12,
          "end_offset" : 14,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "sdtw",
          "start_offset" : 15,
          "end_offset" : 19,
          "type" : "word",
          "position" : 4
        }
      ]
    }

    7. Phrase queries

    medcl2 below keeps only full pinyin (keep_first_letter and keep_original are off), so phrase queries must spell out complete syllables; medcl3 additionally keeps first letters and separate first letters, which lets mixed queries such as 刘德h, 刘dh and liudh match the same document.

    PUT /medcl2/
    {
      "settings" : {
        "analysis" : {
          "analyzer" : {
            "pinyin_analyzer" : {
              "tokenizer" : "my_pinyin"
            }
          },
          "tokenizer" : {
            "my_pinyin" : {
              "type" : "pinyin",
              "keep_first_letter" : false,
              "keep_separate_first_letter" : false,
              "keep_full_pinyin" : true,
              "keep_original" : false,
              "limit_first_letter_length" : 16,
              "lowercase" : true
            }
          }
        }
      }
    }

    GET /medcl2/_search
    {
      "query": {
        "match_phrase": {
          "name.pinyin": "刘德华"
        }
      }
    }

    PUT /medcl3/
    {
      "settings" : {
        "analysis" : {
          "analyzer" : {
            "pinyin_analyzer" : {
              "tokenizer" : "my_pinyin"
            }
          },
          "tokenizer" : {
            "my_pinyin" : {
              "type" : "pinyin",
              "keep_first_letter" : true,
              "keep_separate_first_letter" : true,
              "keep_full_pinyin" : true,
              "keep_original" : false,
              "limit_first_letter_length" : 16,
              "lowercase" : true
            }
          }
        }
      }
    }

    POST /medcl3/_mapping
    {
      "properties": {
        "name": {
          "type": "keyword",
          "fields": {
            "pinyin": {
              "type": "text",
              "store": false,
              "term_vector": "with_offsets",
              "analyzer": "pinyin_analyzer",
              "boost": 10
            }
          }
        }
      }
    }

    GET /medcl3/_analyze
    {
      "text": ["刘德华"],
      "analyzer": "pinyin_analyzer"
    }

    POST /medcl3/_create/andy
    {"name":"刘德华"}

    GET /medcl3/_search
    {
      "query": {
        "match_phrase": {
          "name.pinyin": "刘德h"
        }
      }
    }

    GET /medcl3/_search
    {
      "query": {
        "match_phrase": {
          "name.pinyin": "刘dh"
        }
      }
    }

    GET /medcl3/_search
    {
      "query": {
        "match_phrase": {
          "name.pinyin": "liudh"
        }
      }
    }

    GET /medcl3/_search
    {
      "query": {
        "match_phrase": {
          "name.pinyin": "liudeh"
        }
      }
    }

    GET /medcl3/_search
    {
      "query": {
        "match_phrase": {
          "name.pinyin": "liude华"
        }
      }
    }
  • Original article: https://blog.csdn.net/qq_30665009/article/details/126060220