ideaseg is a Chinese tokenizer based on the latest hanlp natural language processing toolkit. It includes the latest model data and removes the neuralnetworkparser code and data from hanlp, whose license is not commercial-friendly.
Compared with other tokenizers such as ik and jcseg, hanlp greatly improves tokenization accuracy at the cost of speed. By optimizing and configuring hanlp, ideaseg aims for the best balance between accuracy and tokenization speed.
Compared with other hanlp-based plugins, ideaseg tracks the latest hanlp code and data, removes the content that cannot be used commercially, configures itself automatically, and bundles the model data, so there is nothing to download separately.
ideaseg provides three modules:
core
~ core tokenizer module
elasticsearch
~ ideaseg tokenizer plugin for elasticsearch (up to version 7.10.2)
opensearch
~ ideaseg tokenizer plugin for opensearch (default version 2.4.1)
A note on elasticsearch versions: since version 7.11.1, Elastic has changed the license of elasticsearch and the permission policy for plugins, which are no longer allowed to read and write files. Because the hanlp model data is very large, hanlp speeds up processing by generating cache files in the plugin's data directory. If you use elasticsearch, please use version 7.10.2 or lower; opensearch is the recommended choice.
In addition, the data folder contains the hanlp model data.
Because the model data is large (400-500M after packaging) and elasticsearch plugins are strictly bound to the engine version, of which there are many, this project does not provide pre-compiled binaries. You need to download the source code and build the plugin yourself.
The following is the process of building the plugin. Before starting, please install git, java, maven and any other related tools.
First, determine the exact version of your elasticsearch or opensearch. Assuming you are using elasticsearch 7.10.2, open ideaseg/elasticsearch/pom.xml from the cloned source in a text editor and change the value of elasticsearch.version to 7.10.2 (for opensearch, edit opensearch/pom.xml instead).
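If you prefer to script this edit, the following is a minimal sketch. It assumes elasticsearch.version is declared in pom.xml as an XML element of the form <elasticsearch.version>value</elasticsearch.version> on a single line, which this README does not confirm. set_pom_property is a hypothetical helper name, not part of ideaseg.

```shell
# Hypothetical helper: set a property in a pom.xml file.
# Assumption: the property appears as <name>value</name> on one line.
set_pom_property() {
  pom=$1
  name=$2
  value=$3
  tmp=$(mktemp) || return 1
  # Rewrite the element's value, then copy the result back over the
  # original so the file keeps its permissions.
  sed "s|<$name>[^<]*</$name>|<$name>$value</$name>|" "$pom" > "$tmp" \
    && cat "$tmp" > "$pom"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Usage, from the ideaseg source directory:
#   set_pom_property elasticsearch/pom.xml elasticsearch.version 7.10.2
```

Check the result in a text editor before building; if the property is declared differently, the file is left unchanged.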
The commands below fetch the source and build it from a command-line window. Make and save the pom.xml change after git clone and before mvn install:
$ git clone https://gitee.com/indexea/ideaseg
$ cd ideaseg
$ mvn install
After the build completes, a plugin file named ideaseg.zip is generated in each of elasticsearch/target and opensearch/target.
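A quick way to confirm that both archives were produced is sketched below. check_plugin_zips is a hypothetical helper name; the paths it checks are the ones named above.

```shell
# Hypothetical helper: report whether both plugin archives exist under
# the given ideaseg source directory (defaults to the current directory).
# Returns non-zero if either archive is missing.
check_plugin_zips() {
  root=${1:-.}
  status=0
  for module in elasticsearch opensearch; do
    zip="$root/$module/target/ideaseg.zip"
    if [ -f "$zip" ]; then
      echo "found $zip"
    else
      echo "missing $zip"
      status=1
    fi
  done
  return "$status"
}

# Usage, from the ideaseg source directory:
#   check_plugin_zips
```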
You can then install the plugin with the plugin management tool provided by elasticsearch or opensearch. For elasticsearch this is <elasticsearch>/bin/elasticsearch-plugin, and for opensearch it is <opensearch>/bin/opensearch-plugin. Here <elasticsearch> and <opensearch> are the installation directories of the two services.
$ bin/elasticsearch-plugin install file:///<ideaseg>/elasticsearch/target/ideaseg.zip
$ bin/opensearch-plugin install file:///<ideaseg>/opensearch/target/ideaseg.zip
where <ideaseg> is the path to the ideaseg source code. Note that the path must be prefixed with file://. On Windows, prefix the path with file:///, for example file:///d:\workdir\indexea\ideaseg\elasticsearch\target\ideaseg.zip.
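The path rules above can be sketched as a small shell function. to_file_uri is a hypothetical helper, not part of ideaseg or of the plugin tools. It writes Windows paths with forward slashes, the conventional form for file URLs; that the plugin tool on your system accepts this form is an assumption. It does not percent-encode spaces or other special characters.

```shell
# Hypothetical helper: turn an absolute path into a file:// URL.
to_file_uri() {
  # Normalize Windows backslashes to forward slashes.
  p=$(printf '%s' "$1" | tr '\\' '/')
  case "$p" in
    /*) printf 'file://%s\n' "$p" ;;   # Unix path already starts with /
    *)  printf 'file:///%s\n' "$p" ;;  # Windows path starts with a drive letter
  esac
}

# Usage (the path is a placeholder for your own source directory):
#   bin/elasticsearch-plugin install "$(to_file_uri /path/to/ideaseg/elasticsearch/target/ideaseg.zip)"
```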
During installation, the plugin tool prompts you to grant permissions. Press Enter to confirm and complete the installation, then restart the service.
Next, you can test the plugin with the _analyze tokenization test API as follows:
POST _analyze
{
  "analyzer": "ideaseg",
  "text": "你好,我用的是 ideaseg 分词插件。"
}
ideaseg provides two tokenization modes, standard and pinyin. The default is standard mode, whose analyzer value is ideaseg. To use pinyin mode, change the analyzer value to ideaseg_pinyin.
In pinyin mode, the tokenization result converts the Chinese text to pinyin while also retaining the original Chinese.
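Outside a console that accepts the POST _analyze form, the same request can be sent with curl. The following is a minimal sketch, assuming the service listens on http://localhost:9200 over plain HTTP without authentication (an opensearch installation with its security plugin enabled would need HTTPS and credentials). ES_URL, analyze_body and analyze are hypothetical names, and analyze_body does not escape quotes or backslashes in the text.

```shell
# Assumption: a local node without authentication; override ES_URL as needed.
ES_URL=${ES_URL:-http://localhost:9200}

# Print the JSON body of an _analyze request.
#   $1: analyzer name (ideaseg or ideaseg_pinyin)
#   $2: text to tokenize (must not contain " or \)
analyze_body() {
  printf '{"analyzer": "%s", "text": "%s"}\n' "$1" "$2"
}

# Send the request and print the raw JSON response.
analyze() {
  analyze_body "$1" "$2" | curl -s -X POST "$ES_URL/_analyze" \
    -H 'Content-Type: application/json' --data-binary @-
}

# Usage (requires a running node with the plugin installed):
#   analyze ideaseg        "你好,我用的是 ideaseg 分词插件。"
#   analyze ideaseg_pinyin "你好,我用的是 ideaseg 分词插件。"
```

Running both calls on the same text makes it easy to compare the tokens produced by the two modes.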
For more information on testing tokenization, please refer to the Elasticsearch documentation.
If you have any questions about using 'ideaseg', please raise them via Issues.