Note: character filters preprocess the character stream before it is handed to the tokenizer.
The html_strip filter strips HTML tags and decodes HTML entities, e.g. &amp; is converted to &.
{
"tokenizer": "keyword",
"char_filter": [
"html_strip"
],
"text": "I'm so happy!
"
}
Result:
[ \nI'm so happy!\n ]
Because the text is wrapped in a <p> tag (a block-level element), the result carries a leading and a trailing newline; an inline tag such as the <b> in the example does not add any.
The filter also accepts an escaped_tags parameter: an array of HTML element names (without angle brackets) that should be left in place instead of being stripped.
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "keyword",
"char_filter": [
"my_custom_html_strip_char_filter"
]
}
},
"char_filter": {
"my_custom_html_strip_char_filter": {
"type": "html_strip",
"escaped_tags": [
"b"
]
}
}
}
}
}
The custom character filter my_custom_html_strip_char_filter is based on the html_strip filter and, via escaped_tags, is configured to skip the b tag instead of stripping it; the custom analyzer my_analyzer then combines it with the keyword tokenizer.
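To verify the behaviour (assuming the settings above have been applied to an index, here given the hypothetical name my_index), you could analyze a snippet that contains both tags:
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
The <p> tags should be stripped as before, while the <b> tags survive: [ \nI'm so <b>happy</b>!\n ]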
The mapping character filter is configured with a map of keys and values; whenever it encounters a string equal to one of the keys, it replaces it with the value associated with that key.
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "mapping",
"mappings": [
"0 => 零",
"1 => 壹",
"2 => 贰",
"3 => 叁",
"4 => 肆",
"5 => 伍",
"6 => 陆",
"7 => 柒",
"8 => 捌",
"9 => 玖"
]
}
],
"text": "9527就是你的终身代号"
}
Result:
{
"tokens": [
{
"token": "玖伍贰柒就是你的终身代号",
"start_offset": 0,
"end_offset": 12,
"type": "word",
"position": 0
}
]
}
The mapping rules come from either the mappings parameter (an inline list, as above) or mappings_path (the path to a file containing the mappings); configuring one of the two is enough.
The pattern_replace character filter matches characters with a regular expression and replaces them with the given replacement string:
{
"tokenizer": "keyword",
"char_filter": [
{
"type": "pattern_replace",
"pattern": "(\\d{3})(\\d{4})(\\d{4})",
"replacement":"$1****$3"
}
],
"text": "13199838273"
}
Result:
{
"tokens": [
{
"token": "131****8273",
"start_offset": 0,
"end_offset": 11,
"type": "word",
"position": 0
}
]
}
The result should make it obvious what this example does: masking the middle of a phone number. For the exact syntax, see the filter's configurable parameters (pattern, replacement and the optional flags).
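If the masking rule is meant to be reused at index time rather than only in ad-hoc _analyze calls, the same char filter can be registered in the index settings. A minimal sketch (the index, analyzer and filter names are made up for illustration):
PUT /phone_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "mask_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "phone_mask"
          ]
        }
      },
      "char_filter": {
        "phone_mask": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})(\\d{4})(\\d{4})",
          "replacement": "$1****$3"
        }
      }
    }
  }
}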
An analyzer can be configured with exactly one tokenizer, which is why many tokenizers share their names with analyzers.
The standard tokenizer provides grammar-based tokenization (following the Unicode Text Segmentation algorithm) and works well for most languages.
POST _analyze
{
"tokenizer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result:
[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]
If you compare carefully, you can still spot differences from the output of the standard analyzer.
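To see the difference for yourself, run the same sentence through the standard analyzer. Because it applies a lowercase token filter after the standard tokenizer, every term should come back lowercased:
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Expected result: [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]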
Let's try some Chinese:
{
"tokenizer": "standard",
"text": "我是中国人"
}
Result:
[我,是,中,国,人]
It did split the text, but the result hardly meets our needs; we will come back to Chinese word segmentation later.
PUT /person1
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard",
"max_token_length": 5
}
}
}
}
}
Note the configuration: we define a custom tokenizer my_tokenizer based on the standard type with max_token_length set to 5, and a custom analyzer my_analyzer whose tokenizer is my_tokenizer.
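To see max_token_length in action, you could analyze the earlier sample sentence against the index just created; any token longer than 5 characters should be split, so jumped comes back as jumpe and d:
POST /person1/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Expected result: [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]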
The letter tokenizer breaks the text whenever it encounters a character that is not a letter. It does a reasonable job for most European languages, but is terrible for some Asian languages, where words are not separated by spaces.
POST _analyze
{
"tokenizer": "letter",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result:
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone]
The lowercase tokenizer works like the letter tokenizer, except that it also lowercases the letters, so we won't repeat an example here.
The classic tokenizer is intended for English documents. It has heuristics for the special treatment of acronyms, company names, email addresses and Internet host names. However, these rules don't always work, and the tokenizer doesn't work well for most languages other than English.
POST _analyze
{
"tokenizer": "standard",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. email: abc@cormm.com"
}
Result:
{
"tokens": [
{
"token": "The",
"start_offset": 0,
"end_offset": 3,
"type": "" ,
"position": 0
},
{
"token": "2",
"start_offset": 4,
"end_offset": 5,
"type": "" ,
"position": 1
},
{
"token": "QUICK",
"start_offset": 6,
"end_offset": 11,
"type": "" ,
"position": 2
},
{
"token": "Brown",
"start_offset": 12,
"end_offset": 17,
"type": "" ,
"position": 3
},
{
"token": "Foxes",
"start_offset": 18,
"end_offset": 23,
"type": "" ,
"position": 4
},
{
"token": "jumped",
"start_offset": 24,
"end_offset": 30,
"type": "" ,
"position": 5
},
{
"token": "over",
"start_offset": 31,
"end_offset": 35,
"type": "" ,
"position": 6
},
{
"token": "the",
"start_offset": 36,
"end_offset": 39,
"type": "" ,
"position": 7
},
{
"token": "lazy",
"start_offset": 40,
"end_offset": 44,
"type": "" ,
"position": 8
},
{
"token": "dog's",
"start_offset": 45,
"end_offset": 50,
"type": "" ,
"position": 9
},
{
"token": "bone",
"start_offset": 51,
"end_offset": 55,
"type": "" ,
"position": 10
},
{
"token": "email",
"start_offset": 57,
"end_offset": 62,
"type": "" ,
"position": 11
},
{
"token": "abc@cormm.com",
"start_offset": 64,
"end_offset": 77,
"type": "" ,
"position": 12
}
]
}
As for the differences from the standard tokenizer, you can verify them yourself.
The path_hierarchy tokenizer takes a hierarchical value such as a filesystem path, splits on the path separator, and emits a term for each step down the tree:
POST _analyze
{
"tokenizer": "path_hierarchy",
"text": "/one/two/three"
}
Result:
[ /one, /one/two, /one/two/three ]
Here we split on the - character, replace it with /, and skip the first two tokens:
PUT /person1
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},
"tokenizer": {
"my_tokenizer": {
"type": "path_hierarchy",
"delimiter": "-",
"replacement": "/",
"skip": 2
}
}
}
}
}
POST /person1/_analyze
{
"analyzer": "my_analyzer",
"text": "one-two-three-four-five"
}
Result:
[ /three, /three/four, /three/four/five ]
If reverse is set to true, the result becomes:
[ one/two/three/, two/three/, three/ ]
The uax_url_email tokenizer behaves like the standard tokenizer, except that it recognises URLs and email addresses as single tokens:
{
"tokenizer": "uax_url_email",
"text": "Email me at john.smith@global-international.com"
}
Result:
[ Email, me, at, john.smith@global-international.com ]
Token filters run after the tokenizer has produced its tokens. ES ships with a great many of them, so we will only go over some of the ones you are likely to find useful.
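Token filters can also be chained; they are applied left to right in the order listed. As a sketch (output omitted), the request below first lowercases the tokens, then drops stop words, then stems what is left:
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "stemmer"],
  "text": "The Foxes were JUMPING over the lazy dogs"
}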
{
"tokenizer" : "standard",
"filter" : ["uppercase"],
"text" : "the Quick FoX JUMPs"
}
Result:
[ THE, QUICK, FOX, JUMPS ]
{
"tokenizer" : "standard",
"filter" : ["lowercase"],
"text" : "THE Quick FoX JUMPs"
}
Result:
[ the, quick, fox, jumps ]
{
"tokenizer": "standard",
"filter": [ "stemmer" ],
"text": "fox running and jumping jumped"
}
Result:
[ fox, run, and, jump, jump ]
Note that the tokens have been stemmed: jumping and jumped, for example, both become jump.
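By default the stemmer filter stems English; a different algorithm can be selected with the language parameter when defining a custom filter in the index settings. A minimal sketch (the index, analyzer and filter names are made up for illustration):
PUT /stem_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stem_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_english_stemmer"
          ]
        }
      },
      "filter": {
        "my_english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
    }
  }
}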
The stop token filter treats the following words as stop words by default:
a, an, and, are, as, at, be, but, by, for, if, in, into, is,
it, no, not, of, on, or, such, that, the, their, then, there,
these, they, this, to, was, will, with
{
"tokenizer": "standard",
"filter": [ "stop" ],
"text": "a quick fox jumps over the lazy dog"
}
Result:
[ quick, fox, jumps, over, lazy, dog ]
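The default list can be replaced through the stopwords parameter, either with a predefined language list such as _english_ or with your own words (a stopwords_path file is also supported). A minimal sketch (the index, analyzer and filter names are made up for illustration):
PUT /stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_stop"
          ]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "and", "is", "the", "over" ]
        }
      }
    }
  }
}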
This filter supports Chinese, Japanese and Korean text, but it only forms tokens by pairing adjacent characters (bigrams), so strictly speaking its support for Chinese is not great either.
{
"tokenizer" : "standard",
"filter" : ["cjk_bigram"],
"text" : "我们都是中国人"
}
Result:
{
"tokens": [
{
"token": "我们",
"start_offset": 0,
"end_offset": 2,
"type": "" ,
"position": 0
},
{
"token": "们都",
"start_offset": 1,
"end_offset": 3,
"type": "" ,
"position": 1
},
{
"token": "都是",
"start_offset": 2,
"end_offset": 4,
"type": "" ,
"position": 2
},
{
"token": "是中",
"start_offset": 3,
"end_offset": 5,
"type": "" ,
"position": 3
},
{
"token": "中国",
"start_offset": 4,
"end_offset": 6,
"type": "" ,
"position": 4
},
{
"token": "国人",
"start_offset": 5,
"end_offset": 7,
"type": "" ,
"position": 5
}
]
}
Beyond the ones introduced above, ES has many more token filters. We won't go through them here, since most of them don't support Chinese.