Segmentation algorithms:
* SentencePiece (`--model_type=unigram`)
* SentencePiece(BPE) (`--model_type=bpe`)

Data sets:
NMT parameters: (Google’s Neural Machine Translation System is applied for all experiments.)
Evaluation metrics:
Setting | vocab size | BLEU(dev) | BLEU(test) | src #tokens/sent. | trg #tokens/sent. |
---|---|---|---|---|---|
SentencePiece | 4k (shared) | 0.2857 | 0.2940 | 43.7478 | 29.6998 |
SentencePiece | 8k (shared) | 0.2785 | 0.2955 | 30.9734 | 25.0540 |
SentencePiece | 16k (shared) | 0.2664 | 0.2862 | 27.1827 | 21.5326 |
SentencePiece | 32k (shared) | 0.2641 | 0.2849 | 25.0592 | 19.0840 |
SentencePiece(BPE) | 8k (shared) | 0.2767 | 0.2947 | 31.7693 | 25.4331 |
(Moses/KyTea)+SentencePiece | 8k (shared) | 0.2900 | 0.2985 | 31.2719 | 29.9854 |
(Moses/MeCab)+SentencePiece | 8k (shared) | 0.2817 | 0.2950 | 31.4743 | 28.9537 |
(Moses/neologd)+SentencePiece | 8k (shared) | 0.2824 | 0.3062 | 31.2985 | 28.8645 |
Moses/KyTea | 80k/80k | 0.2576 | 0.2824 | 21.2513 | 23.2161 |
Moses/MeCab | 80k/80k | 0.2455 | 0.2780 | 21.2513 | 21.2033 |
Moses/neologd | 80k/80k | 0.2157 | 0.2378 | 21.2513 | 18.4768 |
Moses/SentencePiece | 80k/8k | 0.2475 | 0.2742 | 21.2513 | 22.9383 |
SentencePiece/KyTea | 8k/80k | 0.2778 | 0.2918 | 27.0429 | 23.2161 |
SentencePiece/MeCab | 8k/80k | 0.2673 | 0.2919 | 27.0429 | 21.2033 |
SentencePiece/neologd | 8k/80k | 0.2280 | 0.2494 | 27.0429 | 18.4768 |
Char | 3k (shared) | 0.2509 | 0.2679 | 109.8662 | 33.6963 |

Setting | vocab size | BLEU(dev) | BLEU(test) | src #tokens/sent. | trg #tokens/sent. |
---|---|---|---|---|---|
SentencePiece | 4k (shared) | 0.1970 | 0.2179 | 29.6998 | 43.7478 |
SentencePiece | 8k (shared) | 0.1966 | 0.2162 | 25.0540 | 30.9734 |
SentencePiece | 16k (shared) | 0.1996 | 0.2160 | 21.5326 | 27.1827 |
SentencePiece | 32k (shared) | 0.1949 | 0.2159 | 19.0840 | 25.0592 |
SentencePiece(BPE) | 8k (shared) | 0.1977 | 0.2173 | 25.4331 | 31.7693 |
(KyTea/Moses)+SentencePiece | 8k (shared) | 0.1921 | 0.2086 | 29.9854 | 31.2719 |
(MeCab/Moses)+SentencePiece | 8k (shared) | 0.1909 | 0.2049 | 28.9537 | 31.4743 |
(neologd/Moses)+SentencePiece | 8k (shared) | 0.1938 | 0.2137 | 28.8645 | 31.2985 |
KyTea/Moses | 80k/80k | 0.1707 | 0.2006 | 23.2161 | 21.2513 |
MeCab/Moses | 80k/80k | 0.1668 | 0.1892 | 21.2033 | 21.2513 |
neologd/Moses | 80k/80k | 0.1589 | 0.1836 | 18.4768 | 21.2513 |
SentencePiece/Moses | 8k/80k | 0.1727 | 0.1994 | 22.9383 | 21.2513 |
KyTea/SentencePiece | 80k/8k | 0.1939 | 0.2141 | 23.2161 | 27.0429 |
MeCab/SentencePiece | 80k/8k | 0.1892 | 0.2077 | 21.2033 | 27.0429 |
neologd/SentencePiece | 80k/8k | 0.1641 | 0.1804 | 18.4768 | 27.0429 |
Char | 3k (shared) | 0.0824 | 0.0918 | 33.6963 | 109.8662 |
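
The `src #tokens/sent.` and `trg #tokens/sent.` columns report the average sequence length after segmentation. The source does not say how these figures were computed, so the following is only a minimal sketch under two assumptions: segmented text is stored one sentence per line, and tokens are separated by whitespace. The function name `average_tokens_per_sentence` and the sample sentences are hypothetical.

```python
def average_tokens_per_sentence(segmented_lines):
    """Mean number of whitespace-separated tokens over non-empty lines.

    Assumption: one segmented sentence per line, tokens split on whitespace.
    """
    counts = [len(line.split()) for line in segmented_lines if line.strip()]
    if not counts:
        return 0.0
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Hypothetical segmented output, for illustration only.
    sample = ["▁Hello ▁wor ld .", "▁This ▁is ▁a ▁t est ."]
    print(f"{average_tokens_per_sentence(sample):.4f}")
```

Smaller vocabularies split text into more pieces, which is why the token counts in the tables grow as the vocabulary size shrinks.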

We have evaluated SentencePiece segmentation with the following configurations.

Segmentation algorithms:
* BPE (`--model_type=bpe`)
* Unigram (`--model_type=unigram`)

Pretokenization methods:
* NoPretok (`--split_by_whitespace=false`).
* WsPretok (`--split_by_whitespace=true`). When handling CJK, this setting is almost equivalent to NoPretok.

NMT parameters: (Google’s Neural Machine Translation System is applied for all experiments.)

Evaluation metrics:

Data sets:

NoPretok and WsPretok do not use any language-dependent resources. BPE+MosesPretok is almost the same configuration as the one used in [Sennrich et al.] and [Wu et al.].
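
As a rough illustration of how the settings above combine, the sketch below builds `spm_train` argument lists for the four configurations that need no external tokenizer. The flags `--model_type` and `--split_by_whitespace` come from the settings above. The input path, model prefixes, and vocabulary size are hypothetical placeholders, since the source does not state them for this experiment. MosesPretok is left out because it would need the training text to be tokenized with Moses first.

```python
from itertools import product

# Flag values taken from the settings listed above.
MODEL_TYPES = {"BPE": "bpe", "Unigram": "unigram"}
SPLIT_BY_WHITESPACE = {"NoPretok": "false", "WsPretok": "true"}


def spm_train_args(input_path, model_prefix, vocab_size, algorithm, pretok):
    """Build the spm_train argument list for one configuration."""
    return [
        "spm_train",
        f"--input={input_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        f"--model_type={MODEL_TYPES[algorithm]}",
        f"--split_by_whitespace={SPLIT_BY_WHITESPACE[pretok]}",
    ]


def all_configurations(input_path, vocab_size):
    """Enumerate the four algorithm/pretokenization combinations."""
    configs = {}
    for algorithm, pretok in product(MODEL_TYPES, SPLIT_BY_WHITESPACE):
        name = f"{algorithm}({pretok})"
        prefix = f"{algorithm.lower()}_{pretok.lower()}"
        configs[name] = spm_train_args(
            input_path, prefix, vocab_size, algorithm, pretok
        )
    return configs


if __name__ == "__main__":
    # "train.txt" and 8000 are placeholders, not values from the experiments.
    for name, args in all_configurations("train.txt", 8000).items():
        print(name, " ".join(args))
```

Each printed line is a command that could be passed to a shell or to `subprocess.run`. The configuration names match the column headers of the results table.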
Language Pair | BPE(NoPretok) | BPE(WsPretok) | BPE(MosesPretok) | Unigram(NoPretok) | Unigram(WsPretok) | Unigram(MosesPretok) |
---|---|---|---|---|---|---|
KFTT en-ja | 0.2796 | 0.281 | 0.286 | 0.2806 | 0.280 | 0.2871 |
KFTT ja-en | 0.1943 | 0.208 | 0.1967 | 0.1985 | 0.2148 | 0.198 |
MultiUN ar-en | 0.5268 | 0.5414 | 0.5381 | 0.5317 | 0.5449 | 0.5401 |
MultiUN en-ar | 0.4039 | 0.4147 | 0.4012 | 0.4084 | 0.4172 | 0.3991 |
MultiUN en-zh | 0.4155 | 0.4186 | 0.395 | 0.4214 | 0.4165 | 0.399 |
MultiUN zh-en | 0.46 | 0.4716 | 0.4806 | 0.4644 | 0.4711 | 0.4759 |
In house en-ko | 0.178 | 0.1851 | 0.1893 | 0.1846 | 0.1872 | 0.1890 |
In house ko-en | 0.1786 | 0.1954 | 0.1994 | 0.1845 | 0.1956 | 0.2015 |
WMT16 cs-en | 0.1987 | 0.2252 | 0.2231 | 0.2164 | 0.2228 | 0.2238 |
WMT16 de-en | 0.3194 | 0.3348 | 0.3374 | 0.3261 | 0.3375 | 0.3398 |
WMT16 en-cs | 0.1607 | 0.1827 | 0.1812 | 0.1722 | 0.1778 | 0.179 |
WMT16 en-de | 0.2847 | 0.3029 | 0.3013 | 0.2946 | 0.3000 | 0.3053 |
WMT16 en-fi | 0.1434 | 0.1528 | 0.1499 | 0.1472 | 0.1568 | 0.1517 |
WMT16 en-ru | 0.1884 | 0.1973 | 0.1989 | 0.19 | 0.1982 | 0.1903 |
WMT16 fi-en | 0.1775 | 0.1867 | 0.1877 | 0.182 | 0.1882 | 0.1865 |
WMT16 ru-en | 0.2042 | 0.2229 | 0.2194 | 0.2087 | 0.2201 | 0.2155 |