7 Star 11 Fork 2

fastNLP / fastSum

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
Apache-2.0

FastSum

FastSum是基于fastNLP开发的一套完整的文本摘要任务解决方案,包括数据加载、模型调用、模型评价三个部分。

模型

FastSum中实现的模型包括:

  1. 基准模型 (LSTM/Transformer + SeqLab)
  2. Get To The Point: Summarization with Pointer-Generator Networks
  3. Extractive Summarization as Text Matching
  4. Text Summarization with Pretrained Encoders

数据集

我们提供了12个文本摘要任务的数据集:

名称 论文 类型 描述
CNN/DailyMail Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond 新闻 修改了原本用于 passage-based question answering 任务的数据库。 CNN 和 DailyMail 的网站为每篇文章都人工提供了一些要点信息总结文章。而且这些要点是抽象的而非抽取式摘要形式。 微调 Teaching Machines to Read and Comprehend 的脚本之后,作者生成了一个 multi-sentence summary 的数据集合
Xsum Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization 新闻 article-single sentence summary 的数据集。在 BBC 上,每篇文章开头都附有人工撰写的摘要,提取即可
The New York Times Annotated Corpus The New York Times Annotated Corpus 新闻 人工撰写的摘要
DUC The Effects of Human Variation in DUC Summarization Evaluation 新闻 2003 和 2004 Task1 都是对每个 doc 生成一段摘要
arXiv PubMed A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents 科学著作 从 arXiv 和 PubMed 获取的长篇文档的摘要,论文的 abstract 部分作为摘要的 ground-truth。
WikiHow WikiHow: A Large Scale Text Summarization Dataset 知识库 [WikiHow 有一个关于“怎么做”的数据库,每个步骤描述是由一段加粗摘要以及详细步骤叙述组成。作者把每个步骤的加粗摘要合并作为最终摘要,每步的剩余部分进行合并组成 article。
Multi News Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model 新闻、多文本摘要 数据集由新闻文章和这些文章的人工摘要组成,这些文章来自 newser.com 网站。每一篇摘要都是由专业编辑撰写的,并包含到引用的原始文章的链接。
BillSum BillSum: A Corpus for Automatic Summarization of US Legislation 法案文本 数据是选自美国国会和加利福尼亚州立法机构的法案文本人为编写的摘要。
AMI The AMI meeting corpus: a pre-announcement 会议 AMI会议语料库是一种多模式数据集,包含100小时的会议多模式记录。本语料库为每个单独的讲话者提供了高质量的人工记录,还包含了抽取式摘要生成式摘要、头部动作、手势、情绪状态等。
ICSI ICSI Corpus 会议 ICSI会议语料库是一个音频数据集,包含大约70个小时的会议记录。包含了抽取式摘要生成式摘要
Reddit TIFU Abstractive Summarization of Reddit Posts with Multi-level Memory Networks 在线讨论 通过从 Reddit 爬取数据,作者生成了两套摘要:用原帖的 title 作为 short summary,TL;DR summary 作为 long summary。
SAMSum SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization 对话 对话由语言学家根据日常对话写成,之后由语言学家标注摘要。

你可以运行summarizationLoader.py下载并使用它们。

模型评价

FastRougeMetric

FastRougeMetric使用python实现的ROUGE非官方库来实现在训练过程中快速计算rouge近似值。

源代码可见 https://github.com/pltrdy/rouge

在fastNLP中,该方法已经被包装成Metric.py中的FastRougeMetric类以供trainer直接使用。需要事先使用pip安装该rouge库。

pip install rouge

注意:由于实现细节的差异,该结果和官方ROUGE结果存在1-2个点的差异,仅可作为训练过程优化趋势的粗略估计。

PyRougeMetric

PyRougeMetric 使用论文 ROUGE: A Package for Automatic Evaluation of Summaries 提供的官方ROUGE 1.5.5评测库。

由于原本的ROUGE使用perl解释器,pyrouge对其进行了python包装,而PyRougeMetric将其进一步包装为trainer可以直接使用的Metric类。

为了使用ROUGE 1.5.5,需要使用sudo权限安装一系列依赖库。

sudo apt-get install libxml-perl libxml-dom-perl
pip install git+git://github.com/bheinzerling/pyrouge
export PYROUGE_HOME_DIR=the/path/to/RELEASE-1.5.5
pyrouge_set_rouge_path $PYROUGE_HOME_DIR
chmod +x $PYROUGE_HOME_DIR/ROUGE-1.5.5.pl

为避免下载失败,我们在resources中放置了所有你需要的安装文件。对于RELEASE-1.5.5,请务必在RELEASE-1.5.5/data中构建Wordnet 2.0而不是Wordnet 1.6,你可以参考此链接中的说明

cd $PYROUGE_HOME_DIR/data/WordNet-2.0-Exceptions/
./buildExeptionDB.pl . exc WordNet-2.0.exc.db
cd ../
ln -s WordNet-2.0-Exceptions/WordNet-2.0.exc.db WordNet-2.0.exc.db

测试ROUGE是否安装正确

pyrouge_set_rouge_path /absolute/path/to/ROUGE-1.5.5/directory
python -m pyrouge.test

安装PyRouge:

pip install pyrouge

ROUGE分数

数据集:CNN/DailyMail

Model ROUGE-1 ROUGE-2 ROUGE-L Paper
LEAD 3 40.11 17.64 36.32 Our data pre-process
ORACLE 55.24 31.14 50.96 Our data pre-process
LSTM + Sequence Labeling 40.72 18.27 36.98 -
Transformer + Sequence Labeling 40.86 18.38 37.18 -
LSTM + Pointer Network 39.73 39.90 36.05 Get To The Point: Summarization with Pointer-Generator Networks
BERTSUMEXT 42.83 19.92 39.18 Text Summarization with Pretrained Encoders
TransSUMEXT 41.04 18.34 37.30 Text Summarization with Pretrained Encoders
BERTSUMABS 41.17 18.72 38.16 Text Summarization with Pretrained Encoders
TransSUMABS 40.17 17.81 37.12 Text Summarization with Pretrained Encoders

安装和使用

建议在Linux上安装和使用

安装最新的FastNLP

pip install git+https://gitee.com/fastnlp/fastNLP@dev
Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

简介

包含多个模型和数据集的文本摘要项目 展开 收起
Apache-2.0
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
1
https://gitee.com/fastnlp/fastSum.git
git@gitee.com:fastnlp/fastSum.git
fastnlp
fastSum
fastSum
master

搜索帮助