Indexea / ideaseg

Indexea committed on 2023-05-04 11:44 . add pinyin mode describe

ideaseg is a Chinese tokenizer based on the latest hanlp natural language processing toolkit. It bundles the latest model data and removes the neuralnetworkparser code and data in hanlp whose license does not permit commercial use.

Compared with other tokenizers such as ik and jcseg, hanlp is considerably more accurate but slower. By optimizing and configuring hanlp, ideaseg balances accuracy against tokenization speed.

Compared with other hanlp-based plugins, ideaseg tracks the latest hanlp code and data, removes the content that cannot be used commercially, configures itself automatically, and ships with the model data, so there is nothing to download separately.

ideaseg provides three modules:

  1. core ~ core tokenizer module
  2. elasticsearch ~ ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
  3. opensearch ~ ideaseg tokenizer plugin for opensearch (default version 2.4.1)

A note on elasticsearch versions: starting with version 7.11.1, elastic changed the license of es and the permission policy for plugins, which no longer allows plugins to read and write files. Because the hanlp model data is very large, hanlp speeds up processing by generating cache files in the plugin's data directory. If you are using elasticsearch, use version 7.10.2 or lower; we recommend using opensearch instead.
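The version limit above can be checked before building. The following is a minimal sketch, not part of ideaseg: the function name es_version_supported is hypothetical, and it assumes a sort that supports -V (version sort), as GNU coreutils does.

```shell
# Hypothetical helper: succeeds when the given elasticsearch version
# is 7.10.2 or lower, the range this README recommends.
# Assumption: `sort -V` (version sort) is available.
es_version_supported() {
  max="7.10.2"
  # Version-sort the two values; the first line is the lower one.
  lowest="$(printf '%s\n%s\n' "$1" "$max" | sort -V | head -n 1)"
  [ "$lowest" = "$1" ]
}
```

For example, `es_version_supported 7.9.3 && echo supported` prints a line, while `es_version_supported 7.11.1` returns a non-zero status.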

In addition, the data folder contains model data of hanlp.

The packaged model data is large (400-500M), and elasticsearch plugins are strictly bound to the version of the engine itself, of which there are many. For these reasons this project does not provide pre-compiled binaries; you need to download the source code and build it yourself.

Building

The following is the process of building the plugin. Before starting, please install git, java, maven and other related tools.

First, determine the exact version of your elasticsearch or opensearch. Assuming you are using elasticsearch 7.10.2, open ideaseg/elasticsearch/pom.xml in a text editor and set the value of elasticsearch.version to 7.10.2 (for opensearch, edit opensearch/pom.xml instead).

Save the file, open a command-line window, and run the following commands to start the build:

$ git clone https://gitee.com/indexea/ideaseg
$ cd ideaseg
$ mvn install

When the build completes, a plugin file named ideaseg.zip is generated in each of elasticsearch/target and opensearch/target.
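The version edit described earlier can also be scripted inside the checkout, before running mvn install. The following is a minimal sketch. It assumes the version is declared in pom.xml as an element of the form `<elasticsearch.version>…</elasticsearch.version>`, which this README does not confirm, and the helper name set_pom_property is hypothetical.

```shell
# Hypothetical helper: rewrite the value of a single-line property element
# in a pom.xml file.
# Assumption: the property appears as <name>value</name> on one line.
set_pom_property() {
  pom="$1"; name="$2"; value="$3"
  tmp="$(mktemp)"
  # Replace whatever sits between the opening and closing tags.
  sed "s|<${name}>[^<]*</${name}>|<${name}>${value}</${name}>|" "$pom" > "$tmp" \
    && mv "$tmp" "$pom"
}

# Usage, run from the ideaseg checkout:
#   set_pom_property elasticsearch/pom.xml elasticsearch.version 7.10.2
```

Writing to a temporary file and moving it into place avoids relying on `sed -i`, whose syntax differs between GNU and BSD sed.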

Installation

After the build is completed, use the plugin management tool provided by elasticsearch or opensearch to install the plugin.

The plugin management tool for elasticsearch is <elasticsearch>/bin/elasticsearch-plugin, and for opensearch it is <opensearch>/bin/opensearch-plugin. Here <elasticsearch> and <opensearch> are the installation directories of the two services.

Install ideaseg plugin for elasticsearch

$ bin/elasticsearch-plugin install file:///<ideaseg>/elasticsearch/target/ideaseg.zip

Install ideaseg plugin for opensearch

$ bin/opensearch-plugin install file:///<ideaseg>/opensearch/target/ideaseg.zip

Here <ideaseg> is the path to the ideaseg source code. Note that the path must be prefixed with file://. On Windows, prefix the path with file:///, for example file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip.
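On a Unix-like system, the file:// URL can be derived from the checkout directory. The following is a minimal sketch; the function name plugin_url is hypothetical and not part of ideaseg. An absolute path already starts with /, so prefixing it with file:// yields the three slashes shown in the commands above.

```shell
# Hypothetical helper: print the file:// URL of the built plugin zip.
#   $1 = path to the ideaseg checkout (must exist)
#   $2 = engine directory, elasticsearch or opensearch
plugin_url() {
  root="$1"; engine="$2"
  # Resolve the checkout to an absolute path.
  abs="$(cd "$root" && pwd)"
  printf 'file://%s/%s/target/ideaseg.zip\n' "$abs" "$engine"
}

# Usage, run from the elasticsearch installation directory
# (~/ideaseg is an example checkout location):
#   bin/elasticsearch-plugin install "$(plugin_url ~/ideaseg elasticsearch)"
```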

During installation, the plugin tool prompts you to grant permissions; press Enter to confirm and complete the installation, then restart the service.

Next, you can test the plugin with the tokenization test API as follows:

POST _analyze
{
  "analyzer": "ideaseg",
  "text":     "你好,我用的是 ideaseg 分词插件。"
}

ideaseg provides two tokenization modes, standard and pinyin. Standard mode is the default, and its analyzer value is ideaseg. To use pinyin mode, change the analyzer value to ideaseg_pinyin.

In pinyin mode, the tokenization result converts the Chinese text to pinyin while retaining the original Chinese.
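The _analyze request can also be sent from a shell, which makes it easy to compare the two modes on the same text. The following is a minimal sketch. It assumes the service listens on http://localhost:9200 without authentication (a default opensearch installation may instead require https and credentials), and the function names analyze_body and analyze are hypothetical.

```shell
# Assumption: base URL of the running service; override via ES_URL.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the JSON body for _analyze.
# This sketch does no JSON escaping, so the text must not contain
# double quotes or backslashes.
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}' "$1" "$2"
}

# Send the text through the given analyzer and print the raw response.
analyze() {
  curl -s -X POST "$ES_URL/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}

# Usage:
#   analyze ideaseg        "你好,我用的是 ideaseg 分词插件。"
#   analyze ideaseg_pinyin "你好,我用的是 ideaseg 分词插件。"
```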

For more information on testing tokenization, please refer to the Elasticsearch documentation.

Feedback

If you have any questions about using 'ideaseg', please raise them via Issues.

Special thanks

https://github.com/KennFalcon/elasticsearch-analysis-hanlp
