For a technical description of the algorithm, please see our paper:

ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding

Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

Accepted by NAACL-HLT 2021
ERNIE-Gram is an explicit n-gram masking and prediction method that overcomes the limitations of previous contiguous masking strategies and incorporates coarse-grained linguistic information into pre-training more fully. To model the intra-dependencies and inter-relations of coarse-grained linguistic information, n-grams are masked and predicted directly using explicit n-gram identities rather than contiguous sequences of n tokens. Furthermore, ERNIE-Gram employs a generator model to sample plausible n-gram identities as optional n-gram masks, and predicts them in both coarse-grained and fine-grained manners to enable comprehensive n-gram prediction and relation modeling.
We construct three novel methods to model the intra-dependencies and inter-relations of coarse-grained linguistic information:

- Explicitly N-gram Masked Language Modeling: n-grams are masked and predicted directly using explicit n-gram identities rather than contiguous sequences of n tokens (sketched below).
- Comprehensive N-gram Prediction: masked n-grams are predicted simultaneously in both coarse-grained (n-gram identity) and fine-grained (constituent token) manners.
- Enhanced N-gram Relation Modeling: plausible n-gram identities sampled from a generator model serve as optional n-gram masks to be detected and predicted.
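To illustrate the first method, here is a minimal, self-contained sketch of explicit n-gram masking, assuming a toy setting in which an n-gram is reduced to a single mask position and its identity becomes the prediction target; the function and constant names are hypothetical and not taken from the ERNIE-Gram codebase:

```python
MASK_ID = 0  # hypothetical id of the explicit [MASK] symbol

def mask_ngram_explicitly(token_ids, start, n):
    """Replace an n-gram of `n` tokens with one explicit mask position.

    Contiguous masking would leave `n` [MASK] slots and predict the tokens
    one by one; explicit n-gram masking keeps a single slot and predicts
    the identity of the whole n-gram at once.
    """
    masked = token_ids[:start] + [MASK_ID] + token_ids[start + n:]
    target_identity = tuple(token_ids[start:start + n])  # n-gram to predict
    return masked, target_identity

# Usage: mask the bigram at positions 2-3.
tokens = [11, 12, 13, 14, 15]
masked, target = mask_ngram_explicitly(tokens, start=2, n=2)
print(masked)  # [11, 12, 0, 15]
print(target)  # (13, 14)
```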
We release checkpoints for the ERNIE-Gram 16G and ERNIE-Gram 160G models, which are pre-trained on base-scale corpora (16GB of text, as used by BERT) and large-scale corpora (160GB of text, as used by RoBERTa), respectively.
We compare the performance of ERNIE-Gram with existing SOTA pre-training models for natural language understanding (MPNet, UniLMv2, ELECTRA, RoBERTa and XLNet) on several language understanding tasks, including the GLUE benchmark (General Language Understanding Evaluation) and SQuAD (Stanford Question Answering Dataset).
The General Language Understanding Evaluation (GLUE) is a multi-task benchmark consisting of various NLU tasks, which contains 1) pairwise classification tasks such as language inference (MNLI, RTE), question answering (QNLI) and paraphrase detection (QQP, MRPC), 2) single-sentence classification tasks such as linguistic acceptability (CoLA) and sentiment analysis (SST-2), and 3) a text similarity task (STS-B).
The results on GLUE are presented as follows (ACC: accuracy; MCC: Matthews correlation coefficient; PCC: Pearson correlation coefficient):
Tasks | MNLI | QNLI | QQP | SST-2 | CoLA | MRPC | RTE | STS-B | AVG |
---|---|---|---|---|---|---|---|---|---|
Metrics | ACC | ACC | ACC | ACC | MCC | ACC | ACC | PCC | AVG |
XLNet | 86.8 | 91.7 | 91.4 | 94.7 | 60.2 | 88.2 | 74.0 | 89.5 | 84.5 |
RoBERTa | 87.6 | 92.8 | 91.9 | 94.8 | 63.6 | 90.2 | 78.7 | 91.2 | 86.4 |
ELECTRA | 88.8 | 93.2 | 91.5 | 95.2 | 67.7 | 89.5 | 82.7 | 91.2 | 87.5 |
UniLMv2 | 88.5 | 93.5 | 91.7 | 95.1 | 65.2 | 91.8 | 81.3 | 91.0 | 87.3 |
MPNet | 88.5 | 93.3 | 91.9 | 95.4 | 65.0 | 91.5 | 85.2 | 90.9 | 87.7 |
ERNIE-Gram | 89.1 | 93.2 | 92.2 | 95.6 | 68.6 | 90.7 | 83.8 | 91.3 | 88.1 |
Download the GLUE data by running this script and unpack it to some directory `${TASK_DATA_PATH}`. After the dataset is downloaded, run `sh ./utils/glue_data_process.sh $TASK_DATA_PATH` to convert the data format for training. If everything goes well, a folder named `data` will be created with all the converted datasets in it.
The Stanford Question Answering Dataset (SQuAD) tasks are designed to extract the answer span within a given passage conditioned on the question. We conduct experiments on SQuAD1.1 and SQuAD2.0 by adding a classification layer on the sequence outputs of ERNIE-Gram and predicting whether each token is the start or end position of the answer span.
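The following is a minimal sketch of such a span-prediction head in PaddlePaddle; `SpanPredictionHead`, the default `hidden_size` and the toy usage are illustrative assumptions, not the actual fine-tuning code of this repository:

```python
import paddle
import paddle.nn as nn

class SpanPredictionHead(nn.Layer):
    """Hypothetical sketch: map each token's hidden state to start/end logits."""

    def __init__(self, hidden_size=768):  # assumed hidden size of the encoder
        super().__init__()
        # One linear layer yields two scores per token: start and end.
        self.classifier = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output):
        # sequence_output: [batch_size, seq_len, hidden_size] from the encoder
        logits = self.classifier(sequence_output)  # [batch_size, seq_len, 2]
        start_logits, end_logits = paddle.split(logits, 2, axis=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

# Usage with random features standing in for encoder outputs:
head = SpanPredictionHead()
fake_encoder_output = paddle.randn([1, 16, 768])
start, end = head(fake_encoder_output)
print(start.shape, end.shape)  # [1, 16] [1, 16]
```

Training would then apply a cross-entropy loss over the start and end logits against the gold span boundaries.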
The results on SQuAD are presented as follows:
Tasks | SQuADv1 | SQuADv2 |
---|---|---|
Metrics | EM / F1 | EM / F1 |
RoBERTa | 84.6 / 91.5 | 80.5 / 83.7 |
XLNet | - / - | 80.2 / - |
ELECTRA | 86.8 / - | 80.5 / - |
MPNet | 86.8 / 92.5 | 82.8 / 85.6 |
UniLMv2 | 87.1 / 93.1 | 83.3 / 86.1 |
ERNIE-Gram | 87.2 / 93.2 | 84.1 / 87.1 |
The preprocessed data for SQuAD can be downloaded from SQuADv1 and SQuADv2. Please unpack them to `./data`.
The preprocessed data for tasks involving long text can be downloaded from RACE, IMDB and AG News. Please unpack them to `./data`.
This code base has been tested with PaddlePaddle 2.0.0+. You can install PaddlePaddle by following the instructions on this site.
Please update LD_LIBRARY_PATH with the paths of CUDA, cuDNN and NCCL2 before running ERNIE-Gram. The parameter configurations for the fine-tuning tasks are provided in `./task_conf`; you can easily run fine-tuning through these configuration files. For example, you can fine-tune the ERNIE-Gram model on RTE with:
TASK="RTE" # MNLI, SST-2, CoLA, SQuADv1..., please see ./task_conf
MODEL_PATH="./ernie-gram-160g" #path for pre-trained models
sh run.sh ${TASK} ${MODEL_PATH}
The training log and evaluation results are written to `log/*job.log.0`. To fine-tune on your own task data, refer to the data format we provide when processing your data.
The ERNIE-Gram-zh code using the dynamic graph mode is more concise and flexible; please refer to ERNIE-Gram Dygraph for specific usage.
You can cite the paper as below:
@article{xiao2021ernie-gram,
    title={ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding},
    author={Xiao, Dongling and Li, Yukun and Zhang, Han and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
    journal={arXiv preprint arXiv:2010.12148},
    year={2021}
}