vineelpratap committed • Commit 27decb0 • Parent: c71c2f4
Update README.md

README.md CHANGED
@@ -14,99 +14,19 @@ metrics:
# Massively Multilingual Speech (MMS) - Finetuned ASR - ALL

-This checkpoint
-
-

## Table Of Contents

- [Example](#example)
-- [Supported Languages](#supported-languages)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

-
-languages. Let's look at a simple example.
-
-First, we install `transformers` and some other libraries:
-```
-pip install torch accelerate torchaudio datasets
-pip install --upgrade transformers
-```
-
-**Note**: In order to use MMS you need to have at least `transformers >= 4.30` installed. If the `4.30` version
-is not yet available [on PyPI](https://pypi.org/project/transformers/), make sure to install `transformers` from
-source:
-```
-pip install git+https://github.com/huggingface/transformers.git
-```
-
-Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16 kHz.
-
-```py
-from datasets import load_dataset, Audio
-
-# English
-stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
-stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
-en_sample = next(iter(stream_data))["audio"]["array"]
-
-# French
-stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
-stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
-fr_sample = next(iter(stream_data))["audio"]["array"]
-```
-
-Next, we load the model and processor:
-
-```py
-from transformers import Wav2Vec2ForCTC, AutoProcessor
-import torch
-
-model_id = "facebook/mms-1b-all"
-
-processor = AutoProcessor.from_pretrained(model_id)
-model = Wav2Vec2ForCTC.from_pretrained(model_id)
-```
-
-Now we process the audio data, pass it to the model, and decode the model output, just as we usually do for Wav2Vec2 models such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h):
-
-```py
-inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
-
-with torch.no_grad():
-    outputs = model(**inputs).logits
-
-ids = torch.argmax(outputs, dim=-1)[0]
-transcription = processor.decode(ids)
-# 'joe keton disapproved of films and buster also had reservations about the media'
-```
-
-We can now keep the same model in memory and simply switch out the language adapters by calling the convenient `load_adapter()` function for the model and `set_target_lang()` for the tokenizer. We pass the target language as an input - "fra" for French.
-
-```py
-processor.tokenizer.set_target_lang("fra")
-model.load_adapter("fra")
-
-inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")
-
-with torch.no_grad():
-    outputs = model(**inputs).logits
-
-ids = torch.argmax(outputs, dim=-1)[0]
-transcription = processor.decode(ids)
-# "ce dernier est volé tout au long de l'histoire romaine"
-```
-
-In the same way, the language can be switched to any other supported language. For the full list of supported languages, please have a look at:
-```py
-processor.tokenizer.vocab.keys()
-```
-
-For more details, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).
-

## Model details

# Massively Multilingual Speech (MMS) - Finetuned ASR - ALL

+This is a checkpoint of the [MMS Zero-shot project](https://arxiv.org/abs/2407.17852), a model that can transcribe speech in almost any language using only a small amount of unlabeled text in the new language.
+The approach is based on a multilingual acoustic model trained on data in 1,150 languages (leveraging the data of [MMS](https://ai.meta.com/blog/multilingual-model-speech-recognition/)), which outputs transcriptions in an intermediate representation ([uroman](https://github.com/isi-nlp/uroman) tokens).
+A small amount of text in the new, unseen language is then mapped to this same intermediate representation, and at inference time this mapping, together with an optional language model, enables transcribing the new language.

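To make the intermediate representation concrete, here is a minimal sketch of turning a few lines of unlabeled text in a new language into uroman tokens with the [uroman](https://github.com/isi-nlp/uroman) tool. It is only an illustration: the sample words and the local path to `uroman.pl` are assumptions, and the official pipeline linked below handles this step for you.

```py
import subprocess

# A few lines of unlabeled text in the new, unseen language (hypothetical sample;
# replace with your own text).
lines = ["здравствуйте", "спасибо", "пожалуйста"]

# Romanize them with uroman (https://github.com/isi-nlp/uroman), which reads text
# from stdin and writes the romanized form to stdout. The path below assumes the
# repository was cloned next to this script -- adjust it to your checkout.
result = subprocess.run(
    ["perl", "uroman/bin/uroman.pl"],
    input="\n".join(lines),
    capture_output=True,
    text=True,
    check=True,
)

# Each output line is the uroman-token form of the corresponding input line, i.e.
# the intermediate representation that the acoustic model is trained to emit.
for original, romanized in zip(lines, result.stdout.splitlines()):
    print(original, "->", romanized)
```

At inference time, the model's uroman-token outputs are matched against such romanized text (optionally with a language model built from it), as described above.
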

## Table Of Contents

- [Example](#example)
- [Model details](#model-details)
- [Additional links](#additional-links)

## Example

+Please have a look at [the official space](https://huggingface.co/spaces/mms-meta/mms-zeroshot/tree/main) for an example of how to use the model.

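To study the demo end to end rather than in the browser, one possible way (a sketch, not an instruction from this model card) is to download the Space's files with `huggingface_hub` and inspect its app code and requirements locally:

```py
from huggingface_hub import snapshot_download

# Download the files of the official demo Space (app code, requirements, assets)
# to a local folder; "mms-meta/mms-zeroshot" is the Space linked above.
local_dir = snapshot_download(repo_id="mms-meta/mms-zeroshot", repo_type="space")
print(local_dir)  # inspect the app script and requirements in this folder for the full pipeline
```
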
## Model details