Big Data Technology: Elasticsearch 5.3.1 IK Tokenization and Synonym/Related-Term Search Setup

沉沙 2018-09-21

Abstract: This tutorial covers IK tokenization and synonym/related-term search setup on Elasticsearch 5.3.1, in the hope that it deepens readers' understanding of big data technology.

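This post records how I set up IK tokenization and synonym/related-term search on Elasticsearch 5.3.1. I originally meant to write about importing data in multiple formats (html, pdf, word...) with fscrawler, but the IK and synonym configuration cost me two days and I could not find a detailed write-up, so I decided to record it here.

IK Analyzer is an open-source, lightweight Chinese tokenization toolkit written in Java. Since version 1.0 was released in December 2006, IK Analyzer has gone through three major versions. It started out as a Chinese tokenization component built around the open-source Lucene project, combining dictionary-based segmentation with grammar-analysis algorithms. The newer IK Analyzer 3.0 became a general-purpose tokenization component for Java that is independent of Lucene while still shipping a default implementation optimized for Lucene.

So IK and ES are a natural pair, at least for Chinese; for English, splitting on whitespace is crude but good enough. For Chinese search, tokenization quality matters a great deal for retrieval quality, so the IK plugin is well worth trying.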
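I. Installing the IK analyzer:
1. Download the IK analyzer: https://github.com/medcl/elasticsearch-analysis-ik/releases I downloaded the precompiled 5.3.2 release, because no 5.3.1 release was available there.
2. Create a directory named analysis-ik under Elasticsearch's plugins directory: mkdir analysis-ik
3. Unzip the IK archive into the analysis-ik directory: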

[rzxes@rzxes analysis-ik]$ unzip elasticsearch-analysis-ik-5.3.2.zip

[Screenshot: directory listing of analysis-ik after unzipping]

4. Edit plugin-descriptor.properties:

Adjust the configuration, mainly elasticsearch.version: the download was built for 5.3.2 while my Elasticsearch is 5.3.1, so set the value to match the local version.
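The edit in step 4 can also be scripted. The following is a minimal sketch, assuming a bash shell with `sed`; the helper name `set_es_version` and the three-line demo file are made up for illustration, and a real plugin-descriptor.properties contains more keys than shown here. Overriding the version only gets the plugin past the startup version check; it does not by itself guarantee that a plugin built for another release is compatible.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: rewrite elasticsearch.version in a plugin descriptor.
# $1 = path to plugin-descriptor.properties, $2 = version of the local node.
set_es_version() {
  local file="$1" version="$2"
  # -i.bak edits in place and keeps a .bak copy of the original descriptor.
  sed -i.bak "s/^elasticsearch\.version=.*/elasticsearch.version=${version}/" "$file"
}

# Demo on a throwaway file (assumed contents, not the real descriptor).
demo="$(mktemp)"
printf 'name=analysis-ik\nversion=5.3.2\nelasticsearch.version=5.3.2\n' > "$demo"

set_es_version "$demo" 5.3.1
grep '^elasticsearch\.version=' "$demo"   # prints elasticsearch.version=5.3.1
```

The plugin's own `version=` line is left untouched, since the pattern is anchored on the full `elasticsearch.version` key.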

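5. Start Elasticsearch to test the IK analyzer: [rzxes@rzxes elasticsearch-5.3.1]$ bin/elasticsearch

The startup log (a screenshot in the original post) shows loaded plugin [analysis-ik], which means the plugin has been loaded.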
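Besides reading the startup log, the installed plugins can be listed over HTTP. This is a minimal sketch, assuming the node is reachable at the address used in this article's examples; the `ES_HOST` variable and the `list_plugins` helper are names introduced here for illustration.

```shell
#!/usr/bin/env bash
# ES_HOST is an assumption based on the address used in this article; override it as needed.
ES_HOST="${ES_HOST:-http://192.168.230.150:9200}"

# Hypothetical helper: ask the _cat/plugins API which plugins each node has installed.
list_plugins() {
  curl -s "${ES_HOST}/_cat/plugins?v"
}
```

Calling `list_plugins` against a running node should print one row per installed plugin; if the installation worked, one of them should be analysis-ik.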

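The IK plugin provides two analyzers, ik_smart and ik_max_word, and two tokenizers with the same names, ik_smart and ik_max_word:
ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination;
ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”.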

To check that tokenization works, call Elasticsearch's analyze API.

With the default analyzer [analyzer=standard]:
http://192.168.230.150:9200/_analyze?analyzer=standard&pretty=true&text=hello word西红柿
The result is as follows:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "word",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "西",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "红",
      "start_offset" : 11,
      "end_offset" : 12,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "柿",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

With the IK analyzer [analyzer=ik_smart]:
http://192.168.230.150:9200/_analyze?analyzer=ik_smart&pretty=true&text=hello word西红柿
The result is as follows:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "word",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "西红柿",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "9f",
      "start_offset" : 13,
      "end_offset" : 15,
      "type" : "LETTER",
      "position" : 3
    }
  ]
}

Note: the input text is only 13 characters long, so the final "9f" token (offsets 13 to 15) does not come from the text shown in the URL; it is most likely leftover from how the request was typed or encoded when this output was captured.

With the IK analyzer [analyzer=ik_max_word]:
http://192.168.230.150:9200/_analyze?analyzer=ik_max_word&pretty=true&text=hello word中华人民

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "word",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "中华人民",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中华",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "华人",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 5
    }
  ]
}
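
The requests above put raw spaces and Chinese characters directly into the URL, which a browser tolerates but command-line clients may not. As a sketch, assuming `curl` is available and the node is at the address used in this article, the hypothetical `analyze` helper below lets curl percent-encode the parameters. It uses the same query-string form of `_analyze` that this article relies on with 5.3.1; `ES_HOST` and `analyze` are names introduced here for illustration.

```shell
#!/usr/bin/env bash
# ES_HOST is an assumption based on the address used in this article; override it as needed.
ES_HOST="${ES_HOST:-http://192.168.230.150:9200}"

# Hypothetical helper: analyze TEXT ($2) with ANALYZER ($1).
# -G makes curl append the --data-urlencode pairs to the URL as a
# percent-encoded query string instead of sending them as a request body.
analyze() {
  local analyzer="$1" text="$2"
  curl -s -G "${ES_HOST}/_analyze" \
    --data-urlencode "analyzer=${analyzer}" \
    --data-urlencode "text=${text}" \
    --data-urlencode "pretty=true"
}
```

For example, `analyze ik_smart 'hello word西红柿'` and `analyze ik_max_word 'hello word中华人民'` correspond to the IK requests shown in this section.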

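That completes the IK installation. It is very simple: download the precompiled package and unzip it. Editing the configuration is only needed when the versions do not match.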

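II. Configuring synonyms:

Synonyms are configured so that a search for one word also retrieves related words. Related terms and synonyms can be configured together in the same file.
Create the synonym file: in Elasticsearch's config directory, create a folder named analysis and, inside it, a file named synonyms.txt. You can also create synonyms.txt directly under config; that works just as well, as long as you give the matching path when creating the index later. mkdir analysis   vim analysis/synonyms.txt

Add the following content to synonyms.txt. Note that the commas must be ASCII (half-width) commas: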

西红柿,番茄 =>西红柿,番茄
社保,公积金 =>社保,公积金
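
Because a full-width comma is easy to type by accident with a Chinese input method, it can help to check the file before starting Elasticsearch. The following is a minimal sketch, assuming a bash shell with `grep`; the `check_synonyms` helper is a made-up name, and the demo writes to a scratch directory rather than the real config directory.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: succeed only if FILE contains no full-width comma (U+FF0C).
check_synonyms() {
  local file="$1"
  # grep -n prints any offending lines together with their line numbers.
  if grep -n '，' "$file"; then
    echo "full-width comma found in ${file}" >&2
    return 1
  fi
}

# Demo in a scratch directory; the real file lives under Elasticsearch's config directory.
dir="$(mktemp -d)"
mkdir -p "${dir}/analysis"
cat > "${dir}/analysis/synonyms.txt" <<'EOF'
西红柿,番茄 =>西红柿,番茄
社保,公积金 =>社保,公积金
EOF

check_synonyms "${dir}/analysis/synonyms.txt" && echo "synonyms.txt looks fine"
```

The check is deliberately narrow: it only catches the comma mistake called out above and does not validate the rule syntax itself.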

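Start Elasticsearch. With the file in place, the synonyms are read when an index whose analysis settings reference the file is created or opened.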

III. Testing that the synonyms take effect:

Create the index: define custom analyzers and filters that build on the IK tokenizers.

curl -XPUT 'http://192.168.230.150:9200/index' -d'
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["by_tfr","by_sfr"],
          "char_filter": ["by_cfr"]
        },
        "by_max_word": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["by_tfr","by_sfr"],
          "char_filter": ["by_cfr"]
        }
      },
      "filter": {
        "by_tfr": {
          "type": "stop",
          "stopwords": [" "]
        },
        "by_sfr": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      },
      "char_filter": {
        "by_cfr": {
          "type": "mapping",
          "mappings": ["| => |"]
        }
      }
    }
  }
}'

Create the mapping: define a field named title and set its index-time analyzer (analyzer) and its query-time analyzer (search_analyzer).

curl -XPUT 'http://192.168.230.150:9200/index/_mapping/typename' -d'
{
  "properties": {
    "title": {
      "type": "text",
      "index": "analyzed",
      "analyzer": "by_max_word",
      "search_analyzer": "by_smart"
    }
  }
}'

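Tokenize with the custom analyzer:
curl -XGET 'http://192.168.230.150:9200/index/_analyze?pretty=true&analyzer=by_smart' -d '{"text":"番茄"}'
In the result (a screenshot in the original post), the synonym filter expands the token, so the related terms 西红柿 and 番茄 are indexed together.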

Add data:

curl -XPOST 'http://192.168.230.150:9200/index/title/1' -d'{"title":"我有一个西红柿"}'
curl -XPOST 'http://192.168.230.150:9200/index/title/2' -d'{"title":"番茄炒蛋饭"}'
curl -XPOST 'http://192.168.230.150:9200/index/title/3' -d'{"title":"西红柿鸡蛋面"}'

Search the data: query the index named index for the keyword "番茄" and mark the matched terms with tags.

curl -XPOST 'http://192.168.230.150:9200/index/title/_search' -d'
{
    "query" : { "match" : { "title" : "番茄" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "title" : {}
        }
    }
}
'

The result: all three documents are hit, matching both "番茄" and its synonym "西红柿".

That completes the configuration of IK tokenization and synonyms.