1 Star 0 Fork 1

modelee / camembert-base-wikipedia-4gb

加入 Gitee
与超过 1200万 开发者一起发现、参与优秀开源项目,私有仓库也完全免费 :)
免费加入
克隆/下载
贡献代码
同步代码
取消
提示: 由于 Git 不支持空文件夾,创建文件夹后会生成空的 .keep 文件
Loading...
README
MIT
language
fr

CamemBERT: a Tasty French Language Model

Introduction

CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.

It is now available on Hugging Face in 6 different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.

For further information or requests, please go to Camembert Website

Pre-trained models

Model #params Arch. Training data
camembert-base 110M Base OSCAR (138 GB of text)
camembert/camembert-large 335M Large CCNet (135 GB of text)
camembert/camembert-base-ccnet 110M Base CCNet (135 GB of text)
camembert/camembert-base-wikipedia-4gb 110M Base Wikipedia (4 GB of text)
camembert/camembert-base-oscar-4gb 110M Base Subsample of OSCAR (4 GB of text)
camembert/camembert-base-ccnet-4gb 110M Base Subsample of CCNet (4 GB of text)

How to use CamemBERT with HuggingFace

Load CamemBERT and its sub-word tokenizer :
from transformers import CamembertModel, CamembertTokenizer

# You can replace "camembert-base" with any other model from the table, e.g. "camembert/camembert-large".
tokenizer = CamembertTokenizer.from_pretrained("camembert/camembert-base-wikipedia-4gb")
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb")

camembert.eval()  # disable dropout (or leave in train mode to finetune)
Filling masks using pipeline
from transformers import pipeline 

camembert_fill_mask  = pipeline("fill-mask", model="camembert/camembert-base-wikipedia-4gb", tokenizer="camembert/camembert-base-wikipedia-4gb")
results = camembert_fill_mask("Le camembert est un fromage de <mask>!")
# results
#[{'sequence': '<s> Le camembert est un fromage de chèvre!</s>', 'score': 0.4937814474105835, 'token': 19370}, 
#{'sequence': '<s> Le camembert est un fromage de brebis!</s>', 'score': 0.06255942583084106, 'token': 30616}, 
#{'sequence': '<s> Le camembert est un fromage de montagne!</s>', 'score': 0.04340197145938873, 'token': 2364},
# {'sequence': '<s> Le camembert est un fromage de Noël!</s>', 'score': 0.02823255956172943, 'token': 3236}, 
#{'sequence': '<s> Le camembert est un fromage de vache!</s>', 'score': 0.021357402205467224, 'token': 12329}]
Extract contextual embedding features from Camembert output
import torch
# Tokenize in sub-words with SentencePiece
tokenized_sentence = tokenizer.tokenize("J'aime le camembert !")
# ['▁J', "'", 'aime', '▁le', '▁ca', 'member', 't', '▁!'] 

# 1-hot encode and add special starting and end tokens 
encoded_sentence = tokenizer.encode(tokenized_sentence)
# [5, 221, 10, 10600, 14, 8952, 10540, 75, 1114, 6]
# NB: Can be done in one step : tokenize.encode("J'aime le camembert !")

# Feed tokens to Camembert as a torch tensor (batch dim 1)
encoded_sentence = torch.tensor(encoded_sentence).unsqueeze(0)
embeddings, _ = camembert(encoded_sentence)
# embeddings.detach()
# embeddings.size torch.Size([1, 10, 768])
#tensor([[[-0.0928,  0.0506, -0.0094,  ..., -0.2388,  0.1177, -0.1302],
#         [ 0.0662,  0.1030, -0.2355,  ..., -0.4224, -0.0574, -0.2802],
#         [-0.0729,  0.0547,  0.0192,  ..., -0.1743,  0.0998, -0.2677],
#         ...,
Extract contextual embedding features from all Camembert layers
from transformers import CamembertConfig
# (Need to reload the model with new config)
config = CamembertConfig.from_pretrained("camembert/camembert-base-wikipedia-4gb", output_hidden_states=True)
camembert = CamembertModel.from_pretrained("camembert/camembert-base-wikipedia-4gb", config=config)

embeddings, _, all_layer_embeddings = camembert(encoded_sentence)
#  all_layer_embeddings list of len(all_layer_embeddings) == 13 (input embedding layer + 12 self attention layers)
all_layer_embeddings[5]
# layer 5 contextual embedding : size torch.Size([1, 10, 768])
#tensor([[[-0.0059, -0.0227,  0.0065,  ..., -0.0770,  0.0369,  0.0095],
#         [ 0.2838, -0.1531, -0.3642,  ..., -0.0027, -0.8502, -0.7914],
#         [-0.0073, -0.0338, -0.0011,  ...,  0.0533, -0.0250, -0.0061],
#         ...,

Authors

CamemBERT was trained and evaluated by Louis Martin*, Benjamin Muller*, Pedro Javier Ortiz Suárez*, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

Citation

If you use our work, please cite:

@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year={2020}
}
MIT License Copyright (c) 2020 - Inria and Facebook, Inc. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

简介

暂无描述 展开 收起
MIT
取消

发行版

暂无发行版

贡献者

全部

近期动态

加载更多
不能加载更多了
1
https://gitee.com/modelee/camembert-base-wikipedia-4gb.git
git@gitee.com:modelee/camembert-base-wikipedia-4gb.git
modelee
camembert-base-wikipedia-4gb
camembert-base-wikipedia-4gb
main

搜索帮助