As the output below shows, 马拉巴尔 (Malabar) is split into several tokens:
GET /news/_analyze
{
"text":"四国联盟将在澳大利亚举行“马拉巴尔2023”演习",
"analyzer": "ik_max_word"
}
...
{
"token" : "马拉",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "拉巴",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "尔",
"start_offset" : 16,
"end_offset" : 17,
"type" : "CN_CHAR",
"position" : 11
},
...
vim ./plugins/ik/config/custom/location.dic
In location.dic, add 马拉巴尔 and any other custom terms, one term per line (the file should be UTF-8 encoded).
vim ./plugins/ik/config/IKAnalyzer.cfg.xml
Reference the custom dictionary in IKAnalyzer.cfg.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <entry key="ext_dict">custom/location.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>
Final result (after restarting Elasticsearch so the new dictionary is loaded):
{
"token" : "马拉巴尔",
"start_offset" : 13,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 9
},
{
"token" : "马拉",
"start_offset" : 13,
"end_offset" : 15,
"type" : "CN_WORD",
"position" : 10
},
{
"token" : "拉巴",
"start_offset" : 14,
"end_offset" : 16,
"type" : "CN_WORD",
"position" : 11
},
{
"token" : "尔",
"start_offset" : 16,
"end_offset" : 17,
"type" : "CN_CHAR",
"position" : 12
}
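Note: documents that are already indexed keep the tokens produced at index time. After adding a custom dictionary, you need to rebuild the index or update the affected documents so that they are analyzed again.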
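
One way to update existing documents in place is the update-by-query API, which re-indexes each document and so runs it through the analyzer again. The following is a minimal sketch, not a definitive procedure. It assumes the news index stores _source and that the relevant field is already mapped with ik_max_word. The field name title in the verification query is hypothetical, so substitute the field your mapping actually uses.

```
# Re-index every document in place so it is analyzed with the updated dictionary.
# conflicts=proceed skips documents that change while the task is running.
POST /news/_update_by_query?conflicts=proceed

# Check that the new token is now in the inverted index.
# "title" is a hypothetical field name used only for illustration.
GET /news/_search
{
  "query": {
    "term": { "title": "马拉巴尔" }
  }
}
```

A term query matches exact tokens, so it only returns hits for documents that were indexed or updated after the dictionary took effect. If the mapping itself has to change, for example to switch analyzers, create a new index and copy the data across with the _reindex API instead.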