CINO: Pre-trained Language Models for Chinese Minority Languages

Chinese | English




Pre-trained Language Models (PLMs) have become a fundamental technique in natural language processing, including multilingual NLP research. To promote NLP research on Chinese minority languages, the Joint Laboratory of HIT and iFLYTEK Research (HFL) has released the first specialized pre-trained language model, **CINO** (**C**hinese m**INO**rity PLM).

Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | TextBrewer | TextPruner

More resources by HFL: https://github.com/ymcui/HFL-Anthology

News

Oct 29, 2022 We have released a new pre-trained model called LERT; see https://github.com/ymcui/LERT/

Aug 23, 2022 CINO has been accepted as a long paper at COLING 2022. We will update the final paper and release the corresponding resources after the camera-ready deadline.

Feb 21, 2022 CINO-small-v2 (6-layer, 148M parameters) has been released.

Jan 25, 2022 CINO-base-v2, CINO-large-v2, and WCM-v2 have been released.

Dec 17, 2021 We have released a model pruning toolkit, TextPruner; see https://github.com/airaria/TextPruner

Oct 25, 2021 CINO-large and the Wiki-Chinese-Minority (WCM) dataset have been released.

Guide

| Section | Description |
| ------- | ----------- |
| Introduction | Introduction to CINO |
| Download | Download links and how-to-use |
| Quick Load | Learn how to quickly load our models through 🤗Transformers |
| Dataset for Chinese Minority Languages | Introduction to Wiki-Chinese-Minority (WCM) and other datasets |
| Results | Results on several datasets |
| Citation | Citation and technical report |

Introduction

Multilingual pre-trained language models, such as mBERT and XLM-R, adopt masked language modeling (MLM) and other self-supervised objectives over training corpora in many languages to support multilingual and cross-lingual abilities in NLP systems.

However, due to the scarcity of corpora in Chinese minority languages and the neglect of relevant research, current multilingual PLMs are not capable of dealing with these languages.

We made the following contributions:

  • We propose CINO (Chinese mINOrity PLM), which is built on XLM-R: we further pre-train XLM-R on corpora in Chinese minority languages (a sketch of this continued pre-training step follows this list).

  • To evaluate CINO as well as other multilingual PLMs, we also propose a new classification dataset, Wiki-Chinese-Minority (WCM), built from Wikipedia.

  • Experimental results on WCM, the Tibetan News Classification Corpus (TNCC), and KLUE-TC (YNAT) show that CINO achieves state-of-the-art performance.
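
The continued pre-training step is standard masked language modeling. Below is a minimal sketch of what it looks like with 🤗Transformers; the corpus file and all hyper-parameters are illustrative placeholders, not the authors' actual setup.

```python
# Hedged sketch: continued MLM pre-training of XLM-R on minority-language text.
# "minority_corpus.txt" and the hyper-parameters are placeholders.
from transformers import (XLMRobertaTokenizer, XLMRobertaForMaskedLM,
                          LineByLineTextDataset,  # deprecated upstream, fine for a sketch
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")

# One sentence per line in the (placeholder) minority-language corpus.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="minority_corpus.txt",
                                block_size=512)

# Randomly mask 15% of input tokens, the standard MLM setting.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cino-continued-pretraining",
                           per_device_train_batch_size=8,  # placeholder
                           num_train_epochs=1),            # placeholder
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```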

CINO supports the following languages:

  • Chinese, 中文 (zh)
  • Tibetan, 藏语 (bo)
  • Mongolian (Uighur form), 蒙语 (mn)
  • Uyghur, 维吾尔语 (ug)
  • Kazakh (Arabic form), 哈萨克语 (kk)
  • Korean, 朝鲜语 (ko)
  • Zhuang, 壮语
  • Cantonese, 粤语 (yue)



Download

Direct Download

We provide PyTorch versions of CINO-small, CINO-base, and CINO-large (preferred version: v2). More models will be released in the future.

  • CINO-large-v2: 24-layer, 1024-hidden, 16-heads, vocabulary size 136K, 442M parameters
  • CINO-base-v2: 12-layer, 768-hidden, 12-heads, vocabulary size 136K, 190M parameters
  • CINO-small-v2: 6-layer, 768-hidden, 12-heads, vocabulary size 136K, 148M parameters
  • CINO-large: 24-layer, 1024-hidden, 16-heads, vocabulary size 275K, 585M parameters

Notice:

  • The v1 model (CINO-large) supports all the languages in XLM-R plus the minority languages.
  • The v2 models (CINO-large-v2, CINO-base-v2, and CINO-small-v2) have pruned vocabularies and only support Chinese and the minority languages.
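
To sanity-check that a downloaded checkpoint roughly matches the advertised parameter count, a quick sketch with plain PyTorch (the path is a placeholder):

```python
# Count the parameters of a downloaded CINO checkpoint (placeholder path).
from transformers import XLMRobertaModel

model = XLMRobertaModel.from_pretrained("PATH_TO_MODEL_DIR")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # should roughly match the list above
```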
| Model | Size | Google Drive | Baidu Disk |
| ----- | ---- | ------------ | ---------- |
| CINO-large-v2 | 1.6GB | PyTorch | PyTorch (pw: 3fjt) |
| CINO-base-v2 | 705MB | PyTorch | PyTorch (pw: qnvc) |
| CINO-small-v2 | 564MB | PyTorch | PyTorch (pw: 9mc8) |
| CINO-large | 2.2GB | PyTorch | PyTorch (pw: wpyh) |

Download from 🤗Transformers

You can also download our models from the 🤗Transformers Model Hub, including both PyTorch and TensorFlow 2 models.

| Model | Size | transformers Model Hub URL |
| ----- | ---- | -------------------------- |
| CINO-large-v2 | 1.6GB | https://huggingface.co/hfl/cino-large-v2 |
| CINO-base-v2 | 705MB | https://huggingface.co/hfl/cino-base-v2 |
| CINO-small-v2 | 564MB | https://huggingface.co/hfl/cino-small-v2 |
| CINO-large | 2.2GB | https://huggingface.co/hfl/cino-large |

How-to: click the link of the model you wish to download (e.g., https://huggingface.co/hfl/cino-large) → select the "Files and versions" tab → download!

How-To-Use

There are three files in the PyTorch model directory:

pytorch_model.bin        # Model Weight
config.json              # Model Config
sentencepiece.bpe.model  # Vocabulary

CINO uses exactly the same neural architecture as XLM-R, so it can be directly loaded with the XLMRobertaModel class in Transformers.

from transformers import XLMRobertaTokenizer, XLMRobertaModel
tokenizer = XLMRobertaTokenizer.from_pretrained("PATH_TO_MODEL_DIR")  # reads sentencepiece.bpe.model
model = XLMRobertaModel.from_pretrained("PATH_TO_MODEL_DIR")          # reads config.json and pytorch_model.bin
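
For example, a minimal sketch of extracting sentence representations with a local checkpoint (the input sentence is arbitrary):

```python
import torch
from transformers import XLMRobertaTokenizer, XLMRobertaModel

tokenizer = XLMRobertaTokenizer.from_pretrained("PATH_TO_MODEL_DIR")
model = XLMRobertaModel.from_pretrained("PATH_TO_MODEL_DIR")

inputs = tokenizer("这是一个测试句子。", return_tensors="pt")  # "This is a test sentence."
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```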

Quick Load

With 🤗Transformers, the models above can be easily loaded with the following code.

from transformers import XLMRobertaTokenizer, XLMRobertaModel
tokenizer = XLMRobertaTokenizer.from_pretrained("MODEL_NAME")
model = XLMRobertaModel.from_pretrained("MODEL_NAME")

The actual models and their MODEL_NAME values are listed below.

| Actual Model | MODEL_NAME |
| ------------ | ---------- |
| CINO-large-v2 | hfl/cino-large-v2 |
| CINO-base-v2 | hfl/cino-base-v2 |
| CINO-small-v2 | hfl/cino-small-v2 |
| CINO-large | hfl/cino-large |
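
For instance, to load CINO-base-v2 directly from the Hub:

```python
from transformers import XLMRobertaTokenizer, XLMRobertaModel

# Downloads and caches the checkpoint from the Hugging Face Model Hub.
tokenizer = XLMRobertaTokenizer.from_pretrained("hfl/cino-base-v2")
model = XLMRobertaModel.from_pretrained("hfl/cino-base-v2")
```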

Dataset for Chinese Minority Languages

Wiki-Chinese-Minority (WCM)

We built a new classification dataset, Wiki-Chinese-Minority (WCM), from Wikipedia. The dataset covers Mongolian, Tibetan, Uyghur, Cantonese, Korean, Kazakh, and Chinese, spanning ten categories: art, geography, history, nature, natural science, people, technology, education, economy, and health.

We use weighted-F1 for evaluation.

| Name | Google Drive | Baidu Disk |
| ---- | ------------ | ---------- |
| Wiki-Chinese-Minority-v2 (WCM-v2) | Google Drive | - |
| Wiki-Chinese-Minority (WCM) | Google Drive | - |

WCM-v2 has a more balanced data distribution across categories and languages.

Dataset Statistics of WCM-v2:

| Category | mn | bo | ug | yue | ko | kk | zh-Train | zh-Dev | zh-Test |
| -------- | -- | -- | -- | --- | -- | -- | -------- | ------ | ------- |
| Art | 135 | 141 | 3 | 387 | 806 | 348 | 2657 | 331 | 335 |
| Geography | 76 | 339 | 256 | 1550 | 1197 | 572 | 12854 | 1589 | 1644 |
| History | 66 | 111 | 0 | 499 | 776 | 491 | 1771 | 227 | 248 |
| Nature | 7 | 0 | 7 | 606 | 442 | 361 | 1105 | 134 | 110 |
| Natural Science | 779 | 133 | 20 | 336 | 532 | 880 | 2314 | 317 | 287 |
| People | 1402 | 111 | 0 | 1230 | 684 | 169 | 7706 | 953 | 924 |
| Technology | 191 | 163 | 8 | 329 | 808 | 515 | 1184 | 134 | 152 |
| Education | 6 | 1 | 0 | 289 | 439 | 1392 | 936 | 130 | 118 |
| Economy | 205 | 0 | 0 | 445 | 575 | 637 | 922 | 113 | 109 |
| Health | 106 | 111 | 6 | 272 | 299 | 893 | 551 | 67 | 73 |
| Total | 2973 | 1110 | 300 | 5943 | 6558 | 6258 | 32000 | 3995 | 4000 |

Note:

  • The dataset includes two folders: zh and minority
  • zh: train/dev/test sets in Chinese
  • minority: test sets for all the minority languages

The dataset is still in its alpha stage, with possible modifications in the future.
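
As a hypothetical illustration only, the splits might be read along these lines; both the file names and the "text<TAB>label" format below are assumptions, not the released layout, so check the downloaded archive first.

```python
# Hypothetical sketch of reading WCM splits; file names and format are assumed.
import os

def read_split(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            text, label = line.rstrip("\n").split("\t")  # assumed format
            examples.append((text, label))
    return examples

zh_train = read_split(os.path.join("WCM-v2", "zh", "train.txt"))    # assumed name
bo_test = read_split(os.path.join("WCM-v2", "minority", "bo.txt"))  # assumed name
```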

Results

We evaluate CINO on YNAT, TNCC, and Wiki-Chinese-Minority. For each dataset, we use the same hyper-parameters for all models.

Korean Text Classification (YNAT)

| #Train | #Dev | #Test | #Classes | Metric |
| ------ | ---- | ----- | -------- | ------ |
| 45,678 | 9,107 | 9,107 | 7 | macro-F1 |

Hyper-parameters: initial LR 1e-5, batch size 16.
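
A hedged sketch of fine-tuning with these hyper-parameters through the 🤗 Trainer API; the Hub-hosted KLUE copy ("klue"/"ynat") and the default number of epochs are assumptions, not the authors' exact training script.

```python
# Sketch: fine-tune CINO on YNAT with the hyper-parameters stated above.
from datasets import load_dataset
from transformers import (XLMRobertaTokenizer,
                          XLMRobertaForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("klue", "ynat")  # assumes the Hub-hosted KLUE copy
tokenizer = XLMRobertaTokenizer.from_pretrained("hfl/cino-large-v2")

def tokenize(batch):
    return tokenizer(batch["title"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = XLMRobertaForSequenceClassification.from_pretrained(
    "hfl/cino-large-v2", num_labels=7)  # YNAT has 7 classes

args = TrainingArguments(output_dir="cino-ynat",
                         learning_rate=1e-5,              # initial LR from above
                         per_device_train_batch_size=16)  # batch size from above

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"])
trainer.train()
```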

Results:

| Model | Dev |
| ----- | --- |
| XLM-R-large[1] | 87.3 |
| XLM-R-large[2] | 86.3 |
| CINO-small-v2 | 84.1 |
| CINO-base-v2 | 85.5 |
| CINO-large-v2 | 87.2 |
| CINO-large | 87.4 |

[1] The result reported in the original paper.
[2] Reproduced result using the same initial LR as CINO-large.

Tibetan News Classification Corpus (TNCC)

| #Train[1] | #Dev | #Test | #Classes | Metric |
| --------- | ---- | ----- | -------- | ------ |
| 7,363 | 920 | 920 | 12 | macro-F1 |

Hyper-parameters: initial LR 5e-6, batch size 16.

Results:

| Model | Dev | Test |
| ----- | --- | ---- |
| TextCNN | 65.1 | 63.4 |
| XLM-R-large | 14.3 | 13.3 |
| CINO-small-v2 | 72.1 | 66.7 |
| CINO-base-v2 | 70.3 | 68.4 |
| CINO-large-v2 | 72.9 | 71.0 |
| CINO-large | 71.3 | 68.6 |

Note: there is no official train/dev/test split for this dataset, so we split it with a ratio of 8:1:1; our splits are available at data/TNCC. The "with_space_separated" version preserves the spaces provided by the original author, but in our paper we use the "without_space_separated" version, where the separating spaces have been removed.
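
For reference, an 8:1:1 split can be produced along these lines (a sketch with an illustrative seed; use the released splits in data/TNCC for exact comparability):

```python
# Sketch: 8:1:1 train/dev/test split; the seed is illustrative only.
from sklearn.model_selection import train_test_split

examples = [f"doc-{i}" for i in range(9203)]  # stand-ins for TNCC documents
train, rest = train_test_split(examples, test_size=0.2, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(dev), len(test))  # 7362 920 921, i.e. ≈ 8:1:1
```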

Wiki-Chinese-Minority

We train on the Chinese training set and test on the other languages (zero-shot transfer). We use weighted-F1 for evaluation.
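
Weighted F1 averages the per-class F1 scores using class frequencies as weights; it can be computed with scikit-learn, for example:

```python
# Weighted F1: per-class F1 averaged with class-frequency weights.
from sklearn.metrics import f1_score

y_true = ["art", "history", "history", "nature"]  # toy labels
y_pred = ["art", "history", "nature", "nature"]
print(f1_score(y_true, y_pred, average="weighted"))
```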

Hyper-parameters: initial LR 7e-6, batch size 32.

Results on WCM-v2:

| Model | mn | bo | ug | yue | ko | kk | zh | Average |
| ----- | -- | -- | -- | --- | -- | -- | -- | ------- |
| XLM-R-base | 41.2 | 25.7 | 84.5 | 66.1 | 43.1 | 23.0 | 88.3 | 53.1 |
| XLM-R-large | 53.8 | 24.5 | 89.4 | 67.3 | 45.4 | 30.0 | 88.3 | 57.0 |
| CINO-small-v2 | 60.3 | 47.9 | 86.5 | 64.6 | 43.2 | 33.2 | 87.9 | 60.5 |
| CINO-base-v2 | 62.1 | 52.7 | 87.8 | 68.1 | 45.6 | 38.3 | 89.0 | 63.4 |
| CINO-large-v2 | 73.1 | 58.9 | 90.1 | 66.9 | 45.1 | 42.0 | 88.9 | 66.4 |

Demo Code

See examples for the demo code that is currently available.

Citation

If you find our technical report or resources useful, please cite our work in your paper.

@inproceedings{yang-etal-2022-cino,
    title = "{CINO}: A {C}hinese Minority Pre-trained Language Model",
    author = "Yang, Ziqing  and
      Xu, Zihang  and
      Cui, Yiming  and
      Wang, Baoxin  and
      Lin, Min  and
      Wu, Dayong  and
      Chen, Zhigang",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.346",
    pages = "3937--3949"
}

Follow Us

Follow our official WeChat account to keep updated with our latest technologies!

(WeChat QR code: qrcode.jpg)

Issues

If you have questions or encounter problems, please submit an issue.

License

This project is licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0).
