Multi-Field Feature and Configuring a Custom Analyzer in the Mapping
Posted by anyux
Startup error: org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried
# Cause: a previous Elasticsearch process is still running and holds the node lock
# Kill that process, then start Elasticsearch again
kill -9 `ps -ef | grep [e]lasticsearch | grep [j]ava | awk '{print $2}'`
elasticsearch
Multi-field feature
Exact matching on a vendor name: add a keyword sub-field
Use different analyzers per sub-field:
  different languages
  searching on a pinyin sub-field
Different analyzers can also be specified separately for indexing and for search (see the sketch below)
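A minimal sketch of these ideas; the index and field names (products, company, comment) are invented for illustration, not taken from the original post. company gets a keyword sub-field for exact matching, and comment is indexed with the english analyzer but searched with standard:

# Hypothetical index: keyword sub-field + separate index/search analyzers
PUT products
{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "comment": {
        "type": "text",
        "analyzer": "english",
        "search_analyzer": "standard"
      }
    }
  }
}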
Exact Values vs. Full Text
Exact values: numbers, dates, specific strings (e.g. "Apple Store")
  the keyword type in Elasticsearch
Full text: unstructured text data
  the text type in Elasticsearch
Exact values do not need to be analyzed
Elasticsearch creates an inverted index for every field
At index time, exact values receive no special analysis
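To make the difference concrete, two hypothetical queries against the products sketch above: a term query on the keyword sub-field matches the stored value exactly, while a match query on a text field analyzes the query string first.

# Exact match on the un-analyzed keyword sub-field
POST products/_search
{
  "query": {
    "term": { "company.keyword": "Apple Store" }
  }
}

# Full-text match: the query string is analyzed into terms first
POST products/_search
{
  "query": {
    "match": { "comment": "great store" }
  }
}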
Custom analysis
When the analyzers that ship with Elasticsearch are not enough, you can define a custom analyzer by combining three kinds of components:
Character Filter
Tokenizer
Token Filter
Character Filter
Processes the text before it reaches the Tokenizer, e.g. adding, removing, or replacing characters. Multiple character filters can be configured; they affect the position and offset information the Tokenizer produces.
Some built-in character filters:
html_strip - removes HTML tags
mapping - string replacement
pattern_replace - regex-based replacement
Tokenizer
Splits the original text into terms (tokens) according to a set of rules
Built-in tokenizers:
whitespace / standard / uax_url_email / pattern / keyword / path_hierarchy
You can also implement your own Tokenizer as a Java plugin
Token Filters
Add, modify, or remove the terms produced by the Tokenizer
Built-in token filters:
lowercase / stop / synonym (adds synonyms; a quick sketch follows)
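The stop and lowercase filters are demonstrated at the end of this post. As a quick sketch of the synonym filter (the word list is invented), a token filter can be defined inline in an _analyze request:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "synonym",
      "synonyms": ["quick, fast"]
    }
  ],
  "text": "a quick reply"
}

Both quick and fast come back as tokens at the same position, so a query for either term would match.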
Setting up a custom Analyzer
Submit a request that strips HTML tags:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>hello world</b>"
}
Response:
{
  "tokens" : [
    {
      "token" : "hello world",
      "start_offset" : 3,
      "end_offset" : 18,
      "type" : "word",
      "position" : 0
    }
  ]
}
Using a mapping char filter to replace hyphens with underscores:
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "123-456, I-test! test-990 650-555-1234"
}
Response:
{
  "tokens" : [
    {
      "token" : "123_456",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "I_test",
      "start_offset" : 9,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "test_990",
      "start_offset" : 17,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "650_555_1234",
      "start_offset" : 26,
      "end_offset" : 38,
      "type" : "<NUM>",
      "position" : 3
    }
  ]
}
Using a mapping char filter to replace emoticons:
POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [":) => happy", ":( => sad"]
    }
  ],
  "text": ["I am feeling :)", "Feeling :( today"]
}
Response (positions jump to 104+ for the second string because of the default position_increment_gap of 100 between array elements):
{
  "tokens" : [
    {
      "token" : "I",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "feeling",
      "start_offset" : 5,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "happy",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Feeling",
      "start_offset" : 16,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 104
    },
    {
      "token" : "sad",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 105
    },
    {
      "token" : "today",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 106
    }
  ]
}
Regex replacement with the pattern_replace char filter:
GET _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "http://(.*)",
      "replacement": "$1"
    }
  ],
  "text": "http://www.elastic.co"
}
Response:
"tokens" : [
{
"token" : "www.elastic.co",
"start_offset" : 0,
"end_offset" : 21,
"type" : "<ALPHANUM>",
"position" : 0
}
]
}
Splitting a path by directory with the path_hierarchy tokenizer:
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/usr/ymruan/a/b"
}
Response:
{
  "tokens" : [
    {
      "token" : "/usr",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan/a",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/usr/ymruan/a/b",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    }
  ]
}
The whitespace tokenizer with the stop token filter:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
Response (the capitalized "The" survives, because the default stop word list is all lowercase):
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}
After adding lowercase before stop, "The" is lowercased first and then removed as a stop word:
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": ["The rain in Spain falls mainly on the plain."]
}
Response:
{
  "tokens" : [
    {
      "token" : "rain",
      "start_offset" : 4,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "spain",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "falls",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "mainly",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "plain.",
      "start_offset" : 38,
      "end_offset" : 44,
      "type" : "word",
      "position" : 8
    }
  ]
}
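Putting the components together, and coming back to the Mapping part of the title: a custom analyzer is declared under the index's analysis settings and then referenced by name from a field in the mapping. A minimal sketch, with invented index, analyzer, and field names, combining the char filters and token filters shown above:

# Hypothetical index with a custom analyzer wired into the mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "emoticons"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

# Verify the analyzer against the new index
GET my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "<b>I am feeling :) today</b>"
}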
(Note to self: watch the 10-minute video of chapter 20 and extend these notes.)