An example of an English to Japanese Simultaneous Translation System
This is an example of training and evaluating a transformer wait-k English to Japanese simultaneous text-to-text translation model.
Data Preparation
This section describes the data preparation for training and evaluation. If you only want to evaluate the model, skip ahead to Inference & Evaluation.
For illustration, we only use the following subsets of the available data from the WMT20 news translation task, which amounts to 7,815,391 sentence pairs:
- News Commentary v16
- Wiki Titles v3
- WikiMatrix V1
- Japanese-English Subtitle Corpus
- The Kyoto Free Translation Task Corpus
We use the WMT20 development data as the development set. Training a transformer_vaswani_wmt_en_de_big model on this amount of data yields 17.3 BLEU with greedy search and 19.7 BLEU with beam search (beam size 10). Note that better performance can be achieved with the full WMT training data.
We use the sentencepiece toolkit to tokenize the data with a vocabulary size of 32000.
Additionally, we filter out sentences longer than 200 words after tokenization.
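If you need to reproduce the tokenization yourself, a minimal sketch with the sentencepiece command-line tools is shown below. The raw-text paths, the use of separate per-language models, and the character-coverage values are assumptions for illustration, not the exact recipe used here.
spm_train --input=${DATA_DIR}/raw/train.en \
--model_prefix=${DATA_DIR}/spm.en \
--vocab_size=32000 --character_coverage=1.0 --model_type=unigram
spm_train --input=${DATA_DIR}/raw/train.ja \
--model_prefix=${DATA_DIR}/spm.ja \
--vocab_size=32000 --character_coverage=0.9995 --model_type=unigram
for split in train dev test; do
# Encode each split into subword pieces; fairseq-preprocess below reads these files.
spm_encode --model=${DATA_DIR}/spm.en.model --output_format=piece \
< ${DATA_DIR}/raw/${split}.en > ${DATA_DIR}/${split}.en
spm_encode --model=${DATA_DIR}/spm.ja.model --output_format=piece \
< ${DATA_DIR}/raw/${split}.ja > ${DATA_DIR}/${split}.ja
done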
Assuming the tokenized text data is saved at ${DATA_DIR}, we binarize the data with the following command.
fairseq-preprocess \
--source-lang en --target-lang ja \
--trainpref ${DATA_DIR}/train \
--validpref ${DATA_DIR}/dev \
--testpref ${DATA_DIR}/test \
--destdir ${WMT20_ENJA_DATA_BIN} \
--nwordstgt 32000 --nwordssrc 32000 \
--workers 20
Simultaneous Translation Model Training
To train a wait-k model (k = 10), run:
fairseq-train ${WMT20_ENJA_DATA_BIN} \
--save-dir ${SAVE_DIR} \
--simul-type waitk \
--waitk-lagging 10 \
--max-epoch 70 \
--arch transformer_monotonic_vaswani_wmt_en_de_big \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--lr 0.0005 \
--stop-min-lr 1e-09 \
--clip-norm 10.0 \
--dropout 0.3 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 3584
This command is for training on 8 GPUs. Equivalently, the model can be trained on a single GPU with --update-freq 8, as sketched below.
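For reference, here is a sketch of the equivalent single-GPU run; the hyperparameters mirror the command above, and pinning the device with CUDA_VISIBLE_DEVICES=0 is only illustrative.
# Same hyperparameters as above, with gradients accumulated over 8 batches.
CUDA_VISIBLE_DEVICES=0 fairseq-train ${WMT20_ENJA_DATA_BIN} \
--save-dir ${SAVE_DIR} \
--simul-type waitk --waitk-lagging 10 --max-epoch 70 \
--arch transformer_monotonic_vaswani_wmt_en_de_big \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --stop-min-lr 1e-09 --clip-norm 10.0 \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 --update-freq 8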
Inference & Evaluation
First of all, install SimulEval for evaluation.
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
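To quickly confirm the installation, check that the simuleval entry point is available:
simuleval --help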
The following command runs the evaluation. Assume the source and reference files are ${SRC_FILE} and ${TGT_FILE}, and the sentencepiece model file for English is saved at ${SRC_SPM_PATH}:
simuleval \
--source ${SRC_FILE} \
--target ${TGT_FILE} \
--data-bin ${WMT20_ENJA_DATA_BIN} \
--sacrebleu-tokenizer ja-mecab \
--eval-latency-unit char \
--no-space \
--src-splitter-type sentencepiecemodel \
--src-splitter-path ${SRC_SPM_PATH} \
--agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
--model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
--output ${OUTPUT} \
--scores
The --data-bin should be the same as in the previous sections if you prepared the data from scratch.
If you only want to run the evaluation, a prepared data directory can be found here and a pretrained checkpoint (wait-k = 10 model) can be downloaded from here.
The output should look like this:
{
"Quality": {
"BLEU": 11.442253287568398
},
"Latency": {
"AL": 8.6587861866951,
"AP": 0.7863304776251316,
"DAL": 9.477850951194764
}
}
The latency is measured in characters on the target side (--eval-latency-unit char). Translation quality is evaluated with sacrebleu using the MeCab tokenizer (--sacrebleu-tokenizer ja-mecab). --no-space indicates that no spaces are added when the predicted words are merged.
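As background (not stated in this example), the Average Lagging (AL) reported above is commonly defined as in Ma et al. (2019), where g(t) is the number of source units read before the t-th target unit is written, |x| and |y| are the source and target lengths, and the target units here are characters:
$$ \mathrm{AL} = \frac{1}{\tau}\sum_{t=1}^{\tau}\left(g(t)-\frac{t-1}{\gamma}\right), \qquad \gamma=\frac{|y|}{|x|}, \qquad \tau=\min\{t : g(t)=|x|\} $$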
If the --output ${OUTPUT} option is used, the detailed logs and scores will be stored under the ${OUTPUT} directory.