Optimize the preprocessing and generation

#11

This PR aims to improve the translation quality and overall user experience of the demo.

Overview

The current version of the changes can be tested in the clone of this space: https://huggingface.co/spaces/cointegrated/nllb-demo-2024.

  • Harmonize the list of language codes with those actually supported by the tokenizer
  • Raise meaningful errors when the source or target language is not specified
  • Update the generation algorithm: use beam search (with 5 beams) and block repeated 4-grams
  • Normalize the punctuation before tokenizing the text
  • Use language-specific sentence splitters
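
To make the first four items concrete, here is a minimal sketch of how they could fit together. It is not the code in this PR: `SUPPORTED_LANGS`, `check_languages`, `normalize_punctuation`, and the small character mapping are hypothetical stand-ins, and the language list is only a three-code subset. The two generation settings use the argument names of the Hugging Face `generate` method, which this demo is assumed to call.

```python
# Hypothetical subset of language codes; the real list is assumed to be
# filtered against what the tokenizer actually knows.
SUPPORTED_LANGS = {"eng_Latn", "fra_Latn", "hin_Deva"}

# Decoding settings described in this PR, under Hugging Face `generate` names.
GENERATION_KWARGS = {"num_beams": 5, "no_repeat_ngram_size": 4}

# Small illustrative mapping, not a complete punctuation normalizer.
PUNCT_MAP = str.maketrans({
    "«": '"', "»": '"', "“": '"', "”": '"', "„": '"',
    "‘": "'", "’": "'", "…": "...", "\u00a0": " ",
})


def check_languages(src_lang, tgt_lang):
    """Raise a readable error if a language is missing or unsupported."""
    for role, code in (("source", src_lang), ("target", tgt_lang)):
        if not code:
            raise ValueError(f"Please select a {role} language.")
        if code not in SUPPORTED_LANGS:
            raise ValueError(f"Unsupported {role} language: {code}")


def normalize_punctuation(text):
    """Map typographic punctuation to plain ASCII and collapse whitespace."""
    return " ".join(text.translate(PUNCT_MAP).split())
```

In this sketch, `check_languages` would run before any tokenization, `normalize_punctuation` would be applied to the input text, and `GENERATION_KWARGS` would be unpacked into the model's `generate` call.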

Test cases

  • Try translating into Santali, Minangkabau (Arabic script), or Modern Standard Arabic (Romanized). These don't work in the current version of the demo, so this PR removes those languages.
  • Try translating without specifying the source or target language. Both the current and the new version show an error, but the new version also shows a small pop-up with an explanation.
  • Translate The United Nations Educational, Scientific and Cultural Organization is a specialized agency of the United Nations with the aim of promoting world peace and security through international cooperation in education, arts, sciences and culture. from English to Quechua. The current app generates unwanted repetitions; the new app translates correctly.
  • Translate On disait dans le livre : « Les serpents boas avalent leur proie tout entière, sans la mâcher. Ensuite ils ne peuvent plus bouger et ils dorment pendant les six mois de leur digestion. » J’ai alors beaucoup réfléchi sur les aventures de la jungle et, à mon tour, j’ai réussi, avec un crayon de couleur, à tracer mon premier dessin. Mon dessin numéro 1. Il était comme ça : from French to English. Without punctuation normalization, the closing quotation mark gets lost in translation. The new version handles it correctly.
  • Translate अपना यह चित्र मैंने बड़े लोगों को दिखाया। मैंने पूछा, “इसे देखकर डर लगता है या नहीं?” उन्होंने उत्तर दिया, भला टोपी से डर क्‍यों लगेगा?! मैंने टोपी तो बनाई नहीं थी। मैंने एक अजगर बनाया था जो हाथी को निगल कर पचा रहा था। आखिर मैंने सांप के पेट के अंदर की भी तस्वीर बनाई। ताकि ये बड़े लोग भी समझ सकें। ये बिना समझाये कुछ नहीं समझते। मेरा दूसरा चित्र ऐसा था। from Hindi to English. With the current approach, the ~10 source sentences won't be recognized as such, and as a result, the model would ignore the last sentence (supposed to mean "My second painting was like this."). The new version of the demo translates each sentence.
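
The Hindi case above comes down to which characters count as sentence terminators. The following is a minimal sketch of that idea, not the splitter used in this PR: `TERMINATORS`, `split_sentences`, and `translate_by_sentence` are hypothetical names, the terminator table covers only two languages, and `translate_one` stands for any function that translates a single sentence.

```python
import re

# Hypothetical per-language terminators; a real splitter would cover far more.
TERMINATORS = {
    "eng_Latn": ".!?",
    "hin_Deva": "।.!?",  # Devanagari danda plus the Latin marks
}


def split_sentences(text, lang):
    """Split text on whitespace that directly follows a terminator."""
    marks = TERMINATORS.get(lang, ".!?")
    pattern = r"(?<=[" + re.escape(marks) + r"])\s+"
    return [s for s in re.split(pattern, text.strip()) if s]


def translate_by_sentence(text, lang, translate_one):
    """Translate each sentence separately so that none is dropped."""
    return " ".join(translate_one(s) for s in split_sentences(text, lang))


# The last three sentences of the Hindi test case.
sample = "ताकि ये बड़े लोग भी समझ सकें। ये बिना समझाये कुछ नहीं समझते। मेरा दूसरा चित्र ऐसा था।"
print(len(split_sentences(sample, "hin_Deva")))
print(len(split_sentences(sample, "eng_Latn")))
```

With the danda in the terminator set, the sample splits into three sentences; with Latin terminators only, it stays a single chunk, which matches the failure described in the test case.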

Statistical validation

I validated the changes by translating FLORES (200 languages, mostly single sentences) and MMMLU (14 languages, longer passages) into English, and computing BLEU scores w.r.t. the original English version of these datasets. On FLORES, BLEU improves in 94% of cases (+0.68 points on average) as a result of the changes introduced in this PR. On MMMLU, it improves in 100% of cases (+3.27 points on average).
These improvements are mostly driven by the language-specific sentence splitting (about +2 points on average) and by the beam search decoding (+1 point on average).
The largest improvements are observed in translating MMMLU from Bengali, Chinese, Hindi and Japanese.
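
For reference, the two summary figures reported above (share of improved cases and average gain) can be derived from per-language BLEU scores roughly as follows. This is a minimal sketch: `summarize_bleu_changes` is a hypothetical helper, and the numbers in the example are made up for illustration, not results from this validation.

```python
def summarize_bleu_changes(baseline, updated):
    """Return (share of languages that improved, mean BLEU change).

    Both arguments map a language code to a BLEU score and must share keys.
    """
    deltas = [updated[lang] - baseline[lang] for lang in baseline]
    improved = sum(1 for d in deltas if d > 0)
    return improved / len(deltas), sum(deltas) / len(deltas)


# Made-up scores, for illustration only.
baseline = {"lang_a": 20.0, "lang_b": 30.0, "lang_c": 25.0, "lang_d": 10.0}
updated = {"lang_a": 21.0, "lang_b": 29.5, "lang_c": 27.0, "lang_d": 10.5}

win_rate, mean_delta = summarize_bleu_changes(baseline, updated)
print(f"improved in {win_rate:.0%} of cases, {mean_delta:+.2f} BLEU on average")
```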

The quality of translation into languages other than English has yet to be tested.

cointegrated changed pull request status to open