LASER Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings.

You can find more information about LASER and how to use it on the official LASER repository.

This folder contains source code for training LASER embeddings.

Prepare data and configuration file

Binarize your data with fairseq, as described here.

Create a json config file with this format:

{
  "src_vocab": "/path/to/spm.src.cvocab",
  "tgt_vocab": "/path/to/spm.tgt.cvocab",
  "train": [
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang1-tgtlang0/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang1-tgtlang1/train.srclang1",
      "tgt": "/path/to/srclang1-tgtlang1/train.tgtlang1"
    },
    {
      "type": "translation",
      "id": 0,
      "src": "/path/to/srclang2-tgtlang0/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang0/train.tgtlang0"
    },
    {
      "type": "translation",
      "id": 1,
      "src": "/path/to/srclang2-tgtlang1/train.srclang2",
      "tgt": "/path/to/srclang2-tgtlang1/train.tgtlang1"
    },
    ...
  ],
  "valid": [
    {
      "type": "translation",
      "id": 0,
      "src": "/unused",
      "tgt": "/unused"
    }
  ]
}

where paths are paths to binarized indexed fairseq dataset files. id represents the target language id.

Training Command Line Example

fairseq-train \
  /path/to/configfile_described_above.json \
  --user-dir examples/laser/laser_src \
  --log-interval 100 --log-format simple \
  --task laser --arch laser_lstm \
  --save-dir . \
  --optimizer adam \
  --lr 0.001 \
  --lr-scheduler inverse_sqrt \
  --clip-norm 5 \
  --warmup-updates 90000 \
  --update-freq 2 \
  --dropout 0.0 \
  --encoder-dropout-out 0.1 \
  --max-tokens 2000 \
  --max-epoch 50 \
  --encoder-bidirectional \
  --encoder-layers 5 \
  --encoder-hidden-size 512 \
  --decoder-layers 1 \
  --decoder-hidden-size 2048 \
  --encoder-embed-dim 320 \
  --decoder-embed-dim 320 \
  --decoder-lang-embed-dim 32 \
  --warmup-init-lr 0.001 \
  --disable-validation

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

Cross-lingual document classification using the MLDoc corpus [2,6]
WikiMatrix Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia [7]
Bitext mining using the BUCC corpus [3,5]
Cross-lingual NLI using the XNLI corpus [4,5,6]
Multilingual similarity search [1,6]
Sentence embedding of text files example how to calculate sentence embeddings for arbitrary text files in any of the supported language.

For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space ACL, July 2018

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB