An example of an English to Japanese Simultaneous Translation System
This is an example of training and evaluating a transformer wait-k English to Japanese simultaneous text-to-text translation model.
Data Preparation
This section describes the data preparation for training and evaluation. If you only want to evaluate the model, skip ahead to Inference & Evaluation.
For illustration, we only use the following subsets of the available data from the WMT20 news translation task, which amounts to 7,815,391 sentence pairs:
- News Commentary v16
- Wiki Titles v3
- WikiMatrix V1
- Japanese-English Subtitle Corpus
- The Kyoto Free Translation Task Corpus
We use the WMT20 development data as the development set. Training a transformer_vaswani_wmt_en_de_big model on this amount of data yields 17.3 BLEU with greedy search and 19.7 BLEU with beam search (beam size 10). Note that better performance can be achieved with the full WMT training data.
We use the sentencepiece toolkit to tokenize the data with a vocabulary size of 32000.
Additionally, we filter out sentences longer than 200 words after tokenization.
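If you need to reproduce the tokenization yourself, a minimal sketch with the sentencepiece command-line tools is shown below. The raw-text paths, the use of separate per-language models, and the character-coverage values are assumptions for illustration, not the exact recipe used here.
spm_train --input=${DATA_DIR}/raw/train.en \
--model_prefix=${DATA_DIR}/spm.en \
--vocab_size=32000 --character_coverage=1.0 --model_type=unigram
spm_train --input=${DATA_DIR}/raw/train.ja \
--model_prefix=${DATA_DIR}/spm.ja \
--vocab_size=32000 --character_coverage=0.9995 --model_type=unigram
for split in train dev test; do
# Encode each split into subword pieces; fairseq-preprocess below reads these files.
spm_encode --model=${DATA_DIR}/spm.en.model --output_format=piece \
< ${DATA_DIR}/raw/${split}.en > ${DATA_DIR}/${split}.en
spm_encode --model=${DATA_DIR}/spm.ja.model --output_format=piece \
< ${DATA_DIR}/raw/${split}.ja > ${DATA_DIR}/${split}.ja
done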
Assuming the tokenized text data is saved at ${DATA_DIR}, we binarize the data with the following command.
fairseq-preprocess \
--source-lang en --target-lang ja \
--trainpref ${DATA_DIR}/train \
--validpref ${DATA_DIR}/dev \
--testpref ${DATA_DIR}/test \
--destdir ${WMT20_ENJA_DATA_BIN} \
--nwordstgt 32000 --nwordssrc 32000 \
--workers 20
Simultaneous Translation Model Training
To train a wait-k model (k = 10), run:
fairseq-train ${WMT20_ENJA_DATA_BIN} \
--save-dir ${SAVE_DIR} \
--simul-type waitk \
--waitk-lagging 10 \
--max-epoch 70 \
--arch transformer_monotonic_vaswani_wmt_en_de_big \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt \
--warmup-init-lr 1e-07 \
--warmup-updates 4000 \
--lr 0.0005 \
--stop-min-lr 1e-09 \
--clip-norm 10.0 \
--dropout 0.3 \
--weight-decay 0.0 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 3584
This command is for training on 8 GPUs. Equivalently, the model can be trained on a single GPU with --update-freq 8, as sketched below.
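For reference, here is a sketch of the equivalent single-GPU run; the hyperparameters mirror the command above, and pinning the device with CUDA_VISIBLE_DEVICES=0 is only illustrative.
# Same hyperparameters as above, with gradients accumulated over 8 batches.
CUDA_VISIBLE_DEVICES=0 fairseq-train ${WMT20_ENJA_DATA_BIN} \
--save-dir ${SAVE_DIR} \
--simul-type waitk --waitk-lagging 10 --max-epoch 70 \
--arch transformer_monotonic_vaswani_wmt_en_de_big \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
--lr 0.0005 --stop-min-lr 1e-09 --clip-norm 10.0 \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 --update-freq 8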
Inference & Evaluation
First of all, install SimulEval for evaluation.
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
pip install -e .
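To quickly confirm the installation, check that the simuleval entry point is available:
simuleval --help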
The following command runs the evaluation. Assume the source and reference files are ${SRC_FILE} and ${TGT_FILE}, and the sentencepiece model file for English is saved at ${SRC_SPM_PATH}:
simuleval \
--source ${SRC_FILE} \
--target ${TGT_FILE} \
--data-bin ${WMT20_ENJA_DATA_BIN} \
--sacrebleu-tokenizer ja-mecab \
--eval-latency-unit char \
--no-space \
--src-splitter-type sentencepiecemodel \
--src-splitter-path ${SRC_SPM_PATH} \
--agent ${FAIRSEQ}/examples/simultaneous_translation/agents/simul_trans_text_agent_enja.py \
--model-path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
--output ${OUTPUT} \
--scores
The --data-bin should be the same as in the previous sections if you prepared the data from scratch.
If you only want to run the evaluation, a prepared data directory can be found here and a pretrained checkpoint (wait-k = 10 model) can be downloaded from here.
The output should look like this:
{
"Quality": {
"BLEU": 11.442253287568398
},
"Latency": {
"AL": 8.6587861866951,
"AP": 0.7863304776251316,
"DAL": 9.477850951194764
}
}
The latency is measured in characters on the target side (--eval-latency-unit char). Translation quality is evaluated with sacrebleu using the MeCab tokenizer (--sacrebleu-tokenizer ja-mecab). --no-space indicates that no spaces are added when the predicted words are merged.
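As background (not stated in this example), the Average Lagging (AL) reported above is commonly defined as in Ma et al. (2019), where g(t) is the number of source units read before the t-th target unit is written, |x| and |y| are the source and target lengths, and the target units here are characters:
$$ \mathrm{AL} = \frac{1}{\tau}\sum_{t=1}^{\tau}\left(g(t)-\frac{t-1}{\gamma}\right), \qquad \gamma=\frac{|y|}{|x|}, \qquad \tau=\min\{t : g(t)=|x|\} $$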
If the --output ${OUTPUT} option is used, the detailed logs and scores will be stored under the ${OUTPUT} directory.