Baidu Open Source / ernie-unimo


Code for the ACL 2021 main-conference long paper "UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning".


Existing pre-training methods focus on either single-modal or multi-modal tasks and cannot effectively adapt to the other. They can only utilize single-modal data (i.e., text or images) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a UNIfied-MOdal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs augmented with related images and texts. With the help of rich non-paired single-modal data, our model is able to learn more generalizable representations by allowing textual knowledge and visual knowledge to enhance each other in the unified semantic space. The experimental results show that UNIMO greatly improves the performance of several single-modal and multi-modal downstream tasks.



Results on multi-modal understanding and generation tasks:


Results on single-modal understanding and generation tasks:



  • [] Add VQA tasks


python 3.7.4
pyrouge==0.1.3
regex==2020.7.14

Pre-trained Models

UNIMO is pre-trained on large-scale text corpora, image collections, and image-text aligned datasets. We provide the following pre-trained models:

UNIMO base (lowercased | 12 layers)

UNIMO-mnli base (lowercased | 12 layers)

UNIMO large (lowercased | 24 layers)

UNIMO-mnli large (lowercased | 24 layers)

MODEL_SIZE=base # base | mnli_base | large | mnli_large
cd /path/to/model_files
wget --no-check-certificate -q https://unimo.bj.bcebos.com/model/unimo_${MODEL_SIZE}_en.tar.gz
tar -zxf unimo_${MODEL_SIZE}_en.tar.gz
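
To fetch several variants in one go, the commands above can be wrapped in a small function. The following is only a minimal sketch and is not part of the repository: the names `unimo_model_url` and `fetch_unimo_model` are hypothetical, and the sketch assumes that every variant follows the same URL pattern as the command above.

```shell
# Hypothetical helpers (not shipped with the repository).

# Print the download URL for one model variant:
# base | mnli_base | large | mnli_large.
unimo_model_url() {
  local size="$1"
  case "$size" in
    base|mnli_base|large|mnli_large)
      echo "https://unimo.bj.bcebos.com/model/unimo_${size}_en.tar.gz" ;;
    *)
      echo "unknown model size: $size" >&2
      return 1 ;;
  esac
}

# Download one variant into the given directory and unpack it there.
fetch_unimo_model() {
  local size="$1" dest="$2" url
  url="$(unimo_model_url "$size")" || return 1
  mkdir -p "$dest" || return 1
  ( cd "$dest" \
    && wget --no-check-certificate -q "$url" \
    && tar -zxf "unimo_${size}_en.tar.gz" )
}

# Usage (commented out so that sourcing this file triggers no download):
# for size in base large; do fetch_unimo_model "$size" /path/to/model_files; done
```

Rejecting unknown variant names up front avoids a silent failure, since `wget -q` prints nothing when the URL does not exist.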


Our fine-tuning experiments are carried out on V100 GPUs. The start commands and basic settings of all downstream tasks are as follows:

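| Task Type | Dataset | Pre-trained Model | Start Command | V100 GPU Cards | Running Time |
| --- | --- | --- | --- | --- | --- |
| Text Understanding | SST-2 | UNIMO base | sh ./script/classification/SST-2/run.sh | 8 | 9h |
| Text Understanding | SST-2 | UNIMO large | sh ./script/classification/SST-2_large/run.sh | 8 | 14h |
| Text Understanding | CoLA | UNIMO base | sh ./script/classification/CoLA/run.sh | 4 | 2h |
| Text Understanding | CoLA | UNIMO large | sh ./script/classification/CoLA_large/run.sh | 4 | 4h |
| Text Understanding | MNLI-AX | UNIMO base | sh ./script/classification/MNLI-AX/run.sh | 8 | 1d20h |
| Text Understanding | MNLI-AX | UNIMO large | sh ./script/classification/MNLI-AX_large/run.sh | 8 | 2d13h |
| Text Understanding | STS-B | UNIMO-mnli base | sh ./script/regression/STS-B/run.sh | 8 | 2h |
| Text Understanding | STS-B | UNIMO-mnli large | sh ./script/regression/STS-B_large/run.sh | 8 | 4h |
| Text Generation | CNN/DailyMail | UNIMO base | sh ./script/seq2seq/cnndm/run.sh | 4 | 1d8h |
| Text Generation | CNN/DailyMail | UNIMO large | sh ./script/seq2seq/cnndm_large/run.sh | 4 | 3d18h |
| Text Generation | Gigaword | UNIMO base | sh ./script/seq2seq/gigaword/run.sh | 4 | 1d3h |
| Text Generation | Gigaword | UNIMO large | sh ./script/seq2seq/gigaword_large/run.sh | 4 | 2d3h |
| Text Generation | CoQA | UNIMO base | sh ./script/seq2seq/coqa/run.sh | 4 | 7h |
| Text Generation | CoQA | UNIMO large | sh ./script/seq2seq/coqa_large/run.sh | 4 | 22h |
| Text Generation | Squad_QG | UNIMO base | sh ./script/seq2seq/squad_qg/run.sh | 4 | 4h |
| Text Generation | Squad_QG | UNIMO large | sh ./script/seq2seq/squad_qg_large/run.sh | 4 | 8h |
| Multi-Modal Understanding | Flickr30k | UNIMO base | sh ./script/retrieval/Flickr30k/run.sh | 16 | 3d |
| Multi-Modal Understanding | Flickr30k | UNIMO large | sh ./script/retrieval/Flickr30k_large/run.sh | 16 | 3d |
| Multi-Modal Understanding | SNLI-VE | UNIMO base | sh ./script/visual_entailment/SNLI-VE/run.sh | 16 | 16h |
| Multi-Modal Understanding | SNLI-VE | UNIMO large | sh ./script/visual_entailment/SNLI-VE_large/run.sh | 16 | 2d |
| Multi-Modal Understanding | VQA | UNIMO base | - | - | - |
| Multi-Modal Understanding | VQA | UNIMO large | - | - | - |
| Multi-Modal Generation | COCO Caption | UNIMO base | sh ./script/img2txt/coco/run.sh | 16 | 3d |
| Multi-Modal Generation | COCO Caption | UNIMO large | sh ./script/img2txt/coco_large/run.sh | 16 | 4d |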
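
The start commands listed above follow a regular pattern: `./script/<group>/<task>/run.sh` for the base model, with `_large` appended to the task directory for the large model. As a minimal sketch of that pattern, the hypothetical function below (not part of the repository) maps a task name and model size to its script path; the task-to-group mapping is copied from the listed commands.

```shell
# Hypothetical helper (not shipped with the repository).
# Print the start script for a task and a model size (base | large).
unimo_run_script() {
  local task="$1" size="${2:-base}" group suffix
  case "$task" in
    SST-2|CoLA|MNLI-AX)           group="classification" ;;
    STS-B)                        group="regression" ;;
    cnndm|gigaword|coqa|squad_qg) group="seq2seq" ;;
    Flickr30k)                    group="retrieval" ;;
    SNLI-VE)                      group="visual_entailment" ;;
    coco)                         group="img2txt" ;;
    *) echo "unknown task: $task" >&2; return 1 ;;
  esac
  case "$size" in
    base)  suffix="" ;;
    large) suffix="_large" ;;
    *) echo "unknown size: $size" >&2; return 1 ;;
  esac
  echo "./script/${group}/${task}${suffix}/run.sh"
}

# Usage, run from the repository root (commented out so nothing starts training):
# bash "$(unimo_run_script CoLA large)"
```

VQA is deliberately absent from the mapping because no start command is listed for it.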

Text Understanding Tasks

(1) Sentiment Classification

Download SST-2 dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SST-2.tar.gz
tar -zxf SST-2.tar.gz

Run the following command to train and evaluate on the SST-2 dataset:

For base model:

bash ./script/classification/SST-2/run.sh

For large model:

bash ./script/classification/SST-2_large/run.sh

Evaluation Results:

| Model | Acc |
| --- | --- |
| UNIMO-base | 95.1 |
| UNIMO-large | 96.8 |

(2) Natural Language Inference

Download MNLI-AX dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/MNLI-AX.tar.gz
tar -zxf MNLI-AX.tar.gz

Run the following command to train and evaluate on the MNLI-AX dataset:

For base model:

bash ./script/classification/MNLI-AX/run.sh

For large model:

bash ./script/classification/MNLI-AX_large/run.sh

Evaluation Results:

| Model | Acc-(m/mm) |
| --- | --- |
| UNIMO-base | 86.8/86.7 |
| UNIMO-large | 89.8/89.5 |

(3) Similarity Tasks

Download STS-B dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/STS-B.tar.gz
tar -zxf STS-B.tar.gz

Run the following command to train and evaluate on the STS-B dataset:

For base model:

bash ./script/regression/STS-B/run.sh

For large model:

bash ./script/regression/STS-B_large/run.sh

Evaluation Results:

| Model | Pearson correlation |
| --- | --- |
| UNIMO-base | 91.0 |
| UNIMO-large | 92.6 |

(4) Linguistic Acceptability Judgments

Download CoLA dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/CoLA.tar.gz
tar -zxf CoLA.tar.gz

Run the following command to train and evaluate on the CoLA dataset:

For base model:

bash ./script/classification/CoLA/run.sh

For large model:

bash ./script/classification/CoLA_large/run.sh

Evaluation Results:

| Model | Matthews correlation |
| --- | --- |
| UNIMO-base | 65.4 |
| UNIMO-large | 68.5 |

Text Generation Tasks

(1) Document Summarization

Download CNN/DailyMail dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/cnndm.tar.gz
tar -zxf cnndm.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/cnndm.tar.gz
tar -zxf cnndm.tar.gz
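
The dataset and the evaluation script are fetched with the same pattern, differing only in the URL prefix (`data` versus `eval_script`) and the target directory. A minimal sketch of a helper that captures this is given here; `unimo_asset_url` and `fetch_unimo_asset` are hypothetical names, not part of the repository, and the sketch assumes the two URL prefixes used in the commands above.

```shell
# Hypothetical helpers (not shipped with the repository).

# Print the URL of a dataset ("data") or evaluation-script ("eval_script") archive.
unimo_asset_url() {
  local kind="$1" name="$2"
  case "$kind" in
    data|eval_script)
      echo "https://unimo.bj.bcebos.com/${kind}/${name}.tar.gz" ;;
    *)
      echo "unknown asset kind: $kind" >&2
      return 1 ;;
  esac
}

# Download an archive into a directory and unpack it there.
fetch_unimo_asset() {
  local kind="$1" name="$2" dest="$3" url
  url="$(unimo_asset_url "$kind" "$name")" || return 1
  mkdir -p "$dest" || return 1
  ( cd "$dest" \
    && wget --no-check-certificate -q "$url" \
    && tar -zxf "${name}.tar.gz" )
}

# Usage, run from the repository root (commented out to avoid an accidental download):
# fetch_unimo_asset data cnndm /path/to/data
# fetch_unimo_asset eval_script cnndm src/eval/tasks
```

Running the download in a subshell keeps the caller's working directory unchanged, which matters because the evaluation scripts are unpacked relative to the repository root.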

Run the following command to train and evaluate on the CNN/DailyMail dataset:

For base model:

bash ./script/seq2seq/cnndm/run.sh

For large model:

bash ./script/seq2seq/cnndm_large/run.sh

Evaluation Results:

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| UNIMO-base | 42.42 | 20.12 | 39.61 |
| UNIMO-large | 43.51 | 20.65 | 40.63 |

(2) Sentence Compression

Download Gigaword dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/gigaword.tar.gz
tar -zxf gigaword.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/gigaword.tar.gz
tar -zxf gigaword.tar.gz

Run the following command to train and evaluate on the Gigaword dataset:

For base model:

bash ./script/seq2seq/gigaword/run.sh

For large model:

bash ./script/seq2seq/gigaword_large/run.sh

Evaluation Results:

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| UNIMO-base | 38.80 | 19.99 | 36.27 |
| UNIMO-large | 39.71 | 20.37 | 36.88 |

(3) Question Generation

Download SQuAD dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/squad_qg.tar.gz
tar -zxf squad_qg.tar.gz

Run the following command to train and evaluate on the SQuAD dataset:

For base model:

bash ./script/seq2seq/squad_qg/run.sh

For large model:

bash ./script/seq2seq/squad_qg_large/run.sh

Evaluation Results:

| Model | BLEU-4 | METEOR | ROUGE-L |
| --- | --- | --- | --- |
| UNIMO-base | 22.78 | 25.24 | 51.34 |
| UNIMO-large | 24.59 | 26.39 | 52.47 |

(4) Conversational Question Answering

Download CoQA dataset:

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coqa.tar.gz
tar -zxf coqa.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coqa.tar.gz
tar -zxf coqa.tar.gz

Run the following command to train and evaluate on the CoQA dataset:

For base model:

bash ./script/seq2seq/coqa/run.sh

For large model:

bash ./script/seq2seq/coqa_large/run.sh

Evaluation Results:

| Model | Acc |
| --- | --- |
| UNIMO-base | 80.2 |
| UNIMO-large | 84.9 |

Multi-Modal Understanding Tasks

(1) Image-Text Retrieval

Download Flickr30k dataset:

Note: Visual features are extracted by bottom-up-attention.

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/Flickr30k.tar.gz # occupies about 37G disk space
tar -zxf Flickr30k.tar.gz

Run the following command to train and evaluate on the Flickr30k dataset:

For base model:

bash ./script/retrieval/Flickr30k/run.sh

For large model:

bash ./script/retrieval/Flickr30k_large/run.sh

Evaluation Results:

Results of Image Retrieval task on Flickr30k dataset

| Model | R@1 | R@5 | R@10 |
| --- | --- | --- | --- |
| UNIMO-base | 74.66 | 93.40 | 96.08 |
| UNIMO-large | 78.04 | 94.24 | 97.12 |

Results of Text Retrieval task on Flickr30k dataset

| Model | R@1 | R@5 | R@10 |
| --- | --- | --- | --- |
| UNIMO-base | 89.70 | 98.40 | 99.10 |
| UNIMO-large | 89.40 | 98.90 | 99.80 |

(2) Visual Entailment

Download SNLI-VE dataset:

Note: Visual features are extracted by bottom-up-attention.

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/SNLI-VE.tar.gz
tar -zxf SNLI-VE.tar.gz

Run the following command to train and evaluate on the SNLI-VE dataset:

For base model:

bash ./script/visual_entailment/SNLI-VE/run.sh

For large model:

bash ./script/visual_entailment/SNLI-VE_large/run.sh

Evaluation Results:

Results of Visual Entailment task on SNLI-VE dataset

| Model | dev | test |
| --- | --- | --- |
| UNIMO-base | 80.00 | 79.10 |
| UNIMO-large | 81.11 | 80.63 |

Multi-Modal Generation Tasks

(1) Image Caption Generation

Download COCO Caption dataset:

Note: Visual features are extracted by bottom-up-attention.

cd /path/to/data
wget --no-check-certificate -q https://unimo.bj.bcebos.com/data/coco.tar.gz
tar -zxf coco.tar.gz

Download evaluation script:

cd src/eval/tasks
wget --no-check-certificate -q https://unimo.bj.bcebos.com/eval_script/coco.tar.gz
tar -zxf coco.tar.gz

Run the following command to train and evaluate on the COCO Caption dataset:

For base model:

bash ./script/img2txt/coco/run.sh

For large model:

bash ./script/img2txt/coco_large/run.sh

Evaluation Results:

| Model | BLEU-4 | CIDEr |
| --- | --- | --- |
| UNIMO-base | 38.8 | 124.4 |
| UNIMO-large | 39.6 | 127.7 |


If you find our paper and code useful, please cite the following paper:

  title={UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning},
  author={Li, Wei and Gao, Can and Niu, Guocheng and Xiao, Xinyan and Liu, Hao and Liu, Jiachen and Wu, Hua and Wang, Haifeng},
  journal={arXiv preprint arXiv:2012.15409},

Contact information

For help or issues using UNIMO, please submit a GitHub issue.

For personal communication related to UNIMO, please contact Wei Li (liwei85@baidu.com), Guocheng Niu (niuguocheng@baidu.com), or Can Gao (gaocan01@baidu.com).









