代码西亚 / PaddleNLP

forked from PaddlePaddle / PaddleNLP 

News

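  • [2021-06-07] The live check-in course "Natural Language Processing Based on Deep Learning" is in progress 🔥🔥🔥. Join in: https://aistudio.baidu.com/aistudio/course/introduce/24177
  • [2021-06-04] Added ERNIE-Gram, a pretrained model with multi-granularity linguistic knowledge that achieves SOTA results on several Chinese NLP tasks. Get version 2.0.2 to try it out!
  • [2021-05-20] PaddleNLP 2.0 has been officially released! 🎉 For detailed upgrade information, see the Release Note.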

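Introduction

PaddleNLP 2.0 is the core text-domain library of the PaddlePaddle (飞桨) ecosystem. It has three main features: easy-to-use text-domain APIs, application examples for multiple scenarios, and high-performance distributed training. It aims to improve developers' efficiency in the text domain and to provide best practices for NLP tasks based on the PaddlePaddle 2.0 core framework.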

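  • Easy-to-use text-domain APIs

    • Provides domain APIs covering data loading, text preprocessing, model building and evaluation, and inference acceleration: the Dataset API for loading a rich set of Chinese datasets, the Data API for flexible and efficient data preprocessing, the Transformer API offering 60+ pretrained models, and more. Together they can greatly speed up modeling and iteration on NLP tasks.
  • Application examples for multiple scenarios

    • Covers NLP application examples from academic to industrial grade, spanning basic NLP techniques, core NLP techniques, NLP system applications, and related extended applications. All examples are developed on the new API system of the PaddlePaddle 2.0 core framework and offer developers best practices for PaddlePaddle 2.0 in the text domain.
  • High-performance distributed training

    • Built on the PaddlePaddle core framework's automatic mixed precision optimization strategy and combined with the distributed Fleet API, it supports a 4D hybrid parallelism strategy and can efficiently train models with very large numbers of parameters.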

Installation

Requirements

  • python >= 3.6
  • paddlepaddle >= 2.1.0

Install with pip

pip install --upgrade paddlenlp

For detailed tutorials on installing PaddlePaddle and PaddleNLP, see Installation.
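
The version requirements above can be checked before installing. The following is a minimal sketch, not part of PaddleNLP: the helper names `parse_version` and `meets_minimum` are hypothetical, and pre-release suffixes such as `rc1` are simply ignored.

```python
import sys


def parse_version(text):
    """Turn '2.1.0' into (2, 1, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the dotted version `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so that '2.1' compares equal to '2.1.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


python_version = "{}.{}.{}".format(*sys.version_info[:3])
print("python >= 3.6:", meets_minimum(python_version, "3.6"))
print("paddlepaddle 2.0.2 >= 2.1.0:", meets_minimum("2.0.2", "2.1.0"))
```

Once PaddlePaddle is installed, the string in `paddle.__version__` can be passed as `installed`.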

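Easy-to-Use Text-Domain APIs

Transformer API: A Powerful Foundation for the Pretrained Model Ecosystem

Covers 15 network architectures and 67 sets of pretrained model parameters, including Baidu's own pretrained models such as the ERNIE series, PLATO, and SKEP, as well as mainstream Chinese pretrained models from the industry. Developers are also welcome to contribute pretrained models! 🤗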

from paddlenlp.transformers import *

ernie = ErnieModel.from_pretrained('ernie-1.0')
ernie_gram = ErnieGramModel.from_pretrained('ernie-gram-zh')
bert = BertModel.from_pretrained('bert-wwm-chinese')
albert = AlbertModel.from_pretrained('albert-chinese-tiny')
roberta = RobertaModel.from_pretrained('roberta-wwm-ext')
electra = ElectraModel.from_pretrained('chinese-electra-small')
gpt = GPTForPretraining.from_pretrained('gpt-cpm-large-cn')

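A unified API experience is provided for the common ways of applying pretrained models, such as semantic representation, text classification, sentence-pair matching, sequence labeling, and question answering.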

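import paddle
from paddlenlp.transformers import (
    ErnieTokenizer,
    ErnieModel,
    ErnieForSequenceClassification,
    ErnieForTokenClassification,
    ErnieForQuestionAnswering,
)

tokenizer = ErnieTokenizer.from_pretrained('ernie-1.0')
text = tokenizer('自然语言处理')

# Semantic representation: ErnieModel returns the per-token output first,
# then the pooled sentence-level output.
model = ErnieModel.from_pretrained('ernie-1.0')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# Text classification & sentence-pair matching
model = ErnieForSequenceClassification.from_pretrained('ernie-1.0')
# Sequence labeling
model = ErnieForTokenClassification.from_pretrained('ernie-1.0')
# Question answering
model = ErnieForQuestionAnswering.from_pretrained('ernie-1.0')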

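See the Transformer API documentation for the currently supported pretrained model architectures, parameters, and detailed usage.

Dataset API: A Rich Set of Chinese Datasets

The Dataset API provides convenient and efficient dataset loading. It includes the Qianyan (千言) datasets and offers a rich collection of Chinese datasets for natural language understanding and generation, giving NLP researchers a one-stop research experience.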

from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"])

See the Dataset documentation for more datasets.

Embedding API: Load Pretrained Word Embeddings in One Step

from paddlenlp.embeddings import TokenEmbedding

wordemb = TokenEmbedding("w2v.baidu_encyclopedia.target.word-word.dim300")
# "国王" (king) vs. "王后" (queen)
print(wordemb.cosine_sim("国王", "王后"))
# 0.63395125
# "艺术" (art) vs. "火车" (train)
print(wordemb.cosine_sim("艺术", "火车"))
# 0.14792643

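Includes 50+ built-in Chinese word embeddings covering corpora from multiple domains, such as encyclopedias, news, and Weibo. For more usage, see the Embedding documentation.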
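
To make the scores above easier to interpret, here is a minimal sketch of cosine similarity in plain Python. It assumes that `cosine_sim` computes the standard cosine of the angle between two word vectors; the three-dimensional vectors below are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for real 300-dimensional embeddings.
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
train = [-0.3, 0.9, 0.0]

print(cosine_similarity(king, queen))  # close to 1: similar direction
print(cosine_similarity(king, train))  # much lower: dissimilar direction
```

Values near 1 indicate vectors pointing in nearly the same direction, which is why related words score higher than unrelated ones.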

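More API Documentation

  • Data API: provides convenient and efficient text data processing
  • Metrics API: provides evaluation metrics for NLP tasks, compatible with the PaddlePaddle high-level API.

For more API examples and usage instructions, see the official PaddleNLP documentation.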
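
One typical text data processing step is padding variable-length token ID sequences so that they can be stacked into one batch. The sketch below illustrates the idea in plain Python; `pad_batch` is a hypothetical helper written for this example and is not the Data API itself.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one.

    Returns the padded batch and the original lengths, which models
    often need in order to ignore the padded positions.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


# Hypothetical token ID sequences of different lengths.
batch = [[1, 5, 9], [1, 7], [1, 3, 4, 8, 2]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```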

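Application Examples for Multiple Scenarios

PaddleNLP provides NLP application examples at multiple granularities and for multiple scenarios. They are developed for dynamic graph mode and the new API system, which makes them simpler and easier to understand. They cover basic NLP techniques, core NLP techniques, NLP system applications, and text-related extended applications such as model compression, text-to-knowledge association with knowledge bases, and text graph learning that combines text with graphs.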

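Basic NLP Techniques

| Task | Description |
| --- | --- |
| Word embeddings | Uses the TokenEmbedding API to show how to quickly compute the semantic distance between words and extract word features. |
| Lexical analysis | Implements joint training of word segmentation, part-of-speech tagging, and named entity recognition based on a BiGRU-CRF model. |
| Language models | Provides language models based on two architectures, RNNLM and Transformer-XL. Given an input word sequence, they compute its generation probability, which can be used to indicate how fluent a generated sentence is. |
| Semantic parsing ⭐ | The Text-to-SQL semantic parsing task has a machine automatically convert natural-language questions into SQL queries that a database can execute. It is the core module for database-backed question answering. |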
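
The language-model row describes scoring a word sequence by its generation probability. As a rough sketch of how such a score relates to fluency, the code below sums per-token log-probabilities and derives perplexity. The per-token probabilities are made-up numbers standing in for model outputs, and the function names are hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Log-probability of a sequence given each token's conditional probability."""
    return sum(math.log(p) for p in token_probs)


def perplexity(token_probs):
    """Exponential of the average negative log-probability; lower means more fluent."""
    return math.exp(-sequence_log_prob(token_probs) / len(token_probs))


# Hypothetical per-token probabilities a language model might assign.
fluent = [0.5, 0.4, 0.6]
awkward = [0.5, 0.01, 0.6]  # one very unlikely token

print(perplexity(fluent))
print(perplexity(awkward))
```

A single improbable token raises the perplexity sharply, which is what makes the score useful as a fluency signal.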

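Core NLP Techniques

Text Classification

| Model | Description |
| --- | --- |
| RNN/CNN/GRU/LSTM | Implements the classic text classification architectures RNN, CNN, GRU, and LSTM. |
| BiLSTM-Attention | Adds an attention mechanism to the BiLSTM architecture to improve text classification results. |
| BERT/ERNIE | Provides text classification based on pretrained models, covering the full workflow of training, prediction, and inference deployment. |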

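Text Matching

| Model | Description |
| --- | --- |
| SimNet | Baidu's own semantic matching framework. It uses core networks such as BOW, CNN, and GRNN as the representation layer and is widely used in Baidu search, recommendation, and other application scenarios. |
| ERNIE | Uses ERNIE with the LCQMC dataset for Chinese sentence-pair matching, and provides two learning approaches: Pointwise and Pairwise. |
| Sentence-BERT | Provides an implementation of Sentence-BERT, a text matching model with a Siamese two-tower architecture, which can be used to obtain vector representations of text. |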
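
The difference between the Pointwise and Pairwise approaches mentioned for ERNIE can be illustrated with their typical loss functions. This is a sketch of the general technique, assuming binary cross-entropy for the pointwise case and a margin ranking (hinge) loss for the pairwise case; the exact losses and the margin value used by the example in the repository may differ.

```python
import math


def pointwise_loss(prob_match, label):
    """Binary cross-entropy on a single sentence pair.

    prob_match: predicted probability (strictly between 0 and 1) that the pair matches.
    label: 1 if the pair matches, 0 otherwise.
    """
    return -(label * math.log(prob_match) + (1 - label) * math.log(1.0 - prob_match))


def pairwise_loss(score_pos, score_neg, margin=0.1):
    """Hinge loss on a (query, positive, negative) triple.

    The loss is zero once the positive score beats the negative score
    by at least `margin` (an assumed value for this sketch).
    """
    return max(0.0, margin - (score_pos - score_neg))


print(pointwise_loss(0.9, 1))   # confident and correct: small loss
print(pairwise_loss(0.9, 0.2))  # positive well above negative: zero loss
print(pairwise_loss(0.5, 0.5))  # tie: loss equals the margin
```

Pointwise training judges each pair on its own, while pairwise training only cares about the relative order of a matching and a non-matching candidate.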

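Text Generation

| Model | Description |
| --- | --- |
| Seq2Seq | Implements the classic Seq2Seq with Attention architecture and provides an application example that generates couplets automatically. |
| VAE-Seq2Seq | Adds a VAE structure to the Seq2Seq framework to produce more diverse generated text. |
| ERNIE-GEN | ERNIE-GEN is a pretrained model proposed by Baidu NLP that generates complete semantic spans using a multi-flow mechanism. An intelligent poetry-writing example is provided based on this model. |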

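Semantic Indexing

Provides a complete development workflow for semantic indexing, with two strategies: In-Batch Negative and Hardest Negatives. Developers can build a lightweight semantic indexing system based on this example. For more information, see the semantic indexing application example.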
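
The In-Batch Negative strategy is commonly implemented by treating every other passage in the same batch as a negative for a given query. Below is a minimal sketch of that idea in plain Python, assuming a softmax cross-entropy loss over a query-passage similarity matrix whose diagonal holds the true pairs. The function name and the similarity values are hypothetical, and the repository's example may differ in details such as scaling or margins.

```python
import math


def in_batch_negative_loss(sim):
    """Average cross-entropy where row i's correct 'class' is column i.

    sim[i][j] is the similarity between query i and passage j.
    Off-diagonal entries act as negatives drawn from the same batch.
    """
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(sim)


# Hypothetical similarity matrices for a batch of two query-passage pairs.
good = [[5.0, 0.0], [0.0, 5.0]]  # each query is closest to its own passage
bad = [[0.0, 5.0], [5.0, 0.0]]   # each query is closest to the wrong passage

print(in_batch_negative_loss(good))
print(in_batch_negative_loss(bad))
```

Because negatives come for free from the batch, larger batches provide more negatives per query without extra encoding cost.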

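Information Extraction

| Task | Description |
| --- | --- |
| DuEE | Provides sentence-level and document-level event extraction examples using pretrained models on the DuEE dataset. |
| DuIE | Provides a relation extraction example using pretrained models on the DuIE dataset. |
| Express waybill information extraction | Completes a real-world express waybill information extraction case in two ways: BiLSTM+CRF and pretrained models. |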

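NLP System Applications

Sentiment Analysis

| Model | Description |
| --- | --- |
| SKEP 🌟 | SKEP is a pretraining algorithm proposed by Baidu that is enhanced with sentiment knowledge. It builds pretraining objectives from massive sentiment knowledge mined without supervision, so that the model better understands sentiment semantics and provides a unified, powerful sentiment representation for all kinds of sentiment analysis tasks. |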

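Machine Reading Comprehension

| Task | Description |
| --- | --- |
| SQuAD | Provides an example of fine-tuning pretrained models on the SQuAD 2.0 dataset. |
| DuReader-yesno | Provides an example of fine-tuning pretrained models on the Qianyan (千言) dataset DuReader-yesno. |
| DuReader-robust | Provides an example of fine-tuning pretrained models on the Qianyan (千言) dataset DuReader-robust. |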

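Text Translation

| Model | Description |
| --- | --- |
| Seq2Seq-Attn | Implements the classic Seq2Seq neural machine translation model improved with an attention mechanism, as described in Effective Approaches to Attention-based Neural Machine Translation. |
| Transformer | Implements Transformer machine translation based on the paper Attention Is All You Need, covering the full workflow from training to inference deployment. |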

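Simultaneous Translation

| Model | Description |
| --- | --- |
| STACL ⭐ | STACL is Baidu's own simultaneous translation model based on the Prefix-to-Prefix framework. Combined with the Wait-k policy, it can achieve a translation latency of an arbitrary number of words while maintaining high translation quality. A tutorial on building a lightweight simultaneous interpretation system is also provided. |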
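
The Wait-k policy can be summarized as: first read k source tokens, then alternate between writing one target token and reading one more source token until the source is exhausted. The sketch below computes how many source tokens have been read before each target token is emitted. The function name is hypothetical, and this shows only the read/write schedule, not the translation model.

```python
def wait_k_schedule(source_len, target_len, k):
    """Number of source tokens available when emitting target token t (t = 1, 2, ...).

    Follows g(t) = min(k + t - 1, source_len): the decoder always lags
    k tokens behind the source until the whole source has been read.
    """
    return [min(k + t - 1, source_len) for t in range(1, target_len + 1)]


print(wait_k_schedule(source_len=6, target_len=6, k=3))
# [3, 4, 5, 6, 6, 6]
```

A smaller k lowers latency because translation starts earlier, while a larger k gives the model more source context for each output token.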

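Dialogue System

| Model | Description |
| --- | --- |
| PLATO-2 | PLATO-2 is Baidu's own open-domain dialogue pretrained model, trained in two stages using curriculum learning. |
| PLATO-mini 🌟 | A lightweight Chinese chit-chat dialogue model built on a 6-layer UnifiedTransformer pretraining architecture and pretrained on massive Chinese dialogue corpora. |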

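Extended Applications

Text to Knowledge

🌟 解语 is a text-to-knowledge association framework developed by Baidu's Knowledge Graph department. It includes a knowledge base covering all Chinese word classes and knowledge annotation tools. It helps developers handle more diverse application scenarios, easily integrate their own knowledge systems, and significantly improve Chinese text parsing and mining. It also makes it convenient to use knowledge to enhance machine learning models.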

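Text Graph Learning

| Model | Description |
| --- | --- |
| ERNIESage | A text graph learning model implemented with the PaddlePaddle PGL graph learning framework and the PaddleNLP Transformer API. |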

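Model Compression

| Model | Description |
| --- | --- |
| Distill-LSTM | Implements the strategy from the paper Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. Knowledge from BERT models fine-tuned on Chinese and English classification tasks is transferred by distillation into a small LSTM model, which achieves better results than training the LSTM alone. |
| OFA-BERT 🌟 | Uses the PaddleSlim Once-For-All (OFA) strategy to compress BERT models fine-tuned on GLUE tasks. It reduces the parameter count by 33% with no loss of accuracy, making the model smaller and faster. |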
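
Knowledge distillation of the kind described for Distill-LSTM trains a small student model to imitate a larger teacher. The following is a rough sketch of one common formulation, assuming the loss mixes cross-entropy on the true label with mean squared error between student and teacher logits. The weighting `alpha`, the function names, and the logit values are assumptions for illustration, not the settings used in the repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(student_logits, teacher_logits, label, alpha=0.5):
    """alpha * cross-entropy(student, label) + (1 - alpha) * MSE(student, teacher)."""
    ce = -math.log(softmax(student_logits)[label])
    mse = sum((s - t) ** 2 for s, t in zip(student_logits, teacher_logits)) / len(student_logits)
    return alpha * ce + (1.0 - alpha) * mse


# Hypothetical logits for a two-class task; the true label is class 0.
teacher = [2.0, -1.0]
print(distill_loss([1.8, -0.9], teacher, label=0))  # student close to teacher
print(distill_loss([-1.0, 2.0], teacher, label=0))  # student far from teacher
```

The teacher's logits carry more information than the hard label alone, which is one reason a distilled student can outperform the same model trained only on labels.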

Interactive Notebook Tutorials

For more tutorials, see PaddleNLP on AI Studio.

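Community Contributions and Technical Discussion

Special Interest Group

  • You are welcome to join the PaddleNLP SIG community and contribute high-quality model implementations, public datasets, tutorials, and case studies.

QQ

  • Join the PaddleNLP QQ technical discussion group now to discuss NLP techniques together!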

Release Notes

For more release notes, see the ChangeLog.

License

PaddleNLP is released under the Apache-2.0 open source license.

