metadata

library_name: transformers
license: mit
datasets:
  - h-j-han/SpeechQE-CoVoST2
language:
  - de
  - en
base_model:
  - Unbabel/TowerInstruct-7B-v0.2
  - openai/whisper-large-v2

SpeechQE: Estimating the Quality of Direct Speech Translation

This is an end-to-end (E2E) model for quality estimation of speech translation (SpeechQE).

Task	E2E Model	Trained Domain
SpeechQE for English-to-German Speech Translation	h-j-han/SpeechQE-TowerInstruct-7B-en2de	CoVoST2
SpeechQE for Spanish-to-English Speech Translation	h-j-han/SpeechQE-TowerInstruct-7B-es2en	CoVoST2

Architecture and Training

Our design combines a pretrained speech encoder (whisper-large-v2) with a large language model (TowerInstruct-7B-v0.2) to leverage their existing capabilities in extracting high-quality audio features and handling translation-related tasks. The model is trained in two phases. In the first phase, we train only an adapter on ASR and ST tasks while keeping the text LLM frozen, so that training focuses solely on mapping between the text and speech modalities. In the second phase, we continue training on the SpeechQE task so that the LLM learns the previously unseen task of QE: the adapter pretrained in the first phase is frozen, while the text LLM is trained with LoRA.
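
The two-phase schedule described above can be summarized as a small configuration sketch. This is only an illustration of which components are updated in each phase, not the repository's training code: the names Phase, SCHEDULE, and trainable_components are hypothetical, and the speech encoder is assumed to stay frozen throughout (the text only says the adapter is the sole trained component in the first phase).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    """One training phase (hypothetical structure, for illustration only)."""
    name: str
    tasks: tuple[str, ...]
    adapter_trainable: bool
    llm_mode: str  # "frozen" or "lora"


# Phase 1: only the adapter learns, on ASR and ST.
# Phase 2: the adapter is frozen and the text LLM is tuned with LoRA on SpeechQE.
# Assumption: the speech encoder is never updated, so it is not listed here.
SCHEDULE = (
    Phase("modality_alignment", ("ASR", "ST"), adapter_trainable=True, llm_mode="frozen"),
    Phase("speechqe", ("SpeechQE",), adapter_trainable=False, llm_mode="lora"),
)


def trainable_components(phase: Phase) -> list[str]:
    """Return the names of the components that receive gradient updates."""
    parts = []
    if phase.adapter_trainable:
        parts.append("adapter")
    if phase.llm_mode == "lora":
        parts.append("llm_lora_weights")
    return parts


for phase in SCHEDULE:
    print(phase.name, phase.tasks, trainable_components(phase))
```

In a real training script, the same information would typically be applied by toggling which parameters are passed to the optimizer in each phase.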

Setup

The code is available in the GitHub repository: https://github.com/h-j-han/SpeechQE

$ git clone https://github.com/h-j-han/SpeechQE.git
$ cd SpeechQE

$ conda create -n speechqe python=3.11 pytorch=2.0.1 pytorch-cuda=11.7 torchvision torchaudio -c pytorch -c nvidia
$ conda activate speechqe
$ pip install -r requirements.txt

Download Audio Data

Download the audio data from Common Voice. Here, we use mozilla-foundation/common_voice_4_0.

import datasets
cv4en = datasets.load_dataset(
    "mozilla-foundation/common_voice_4_0", "en", cache_dir='path/to/cv4/download',
)

Evaluation

We provide the SpeechQE benchmark: h-j-han/SpeechQE-CoVoST2. BASE_AUDIO_PATH is the path to the downloaded Common Voice dataset.

$ python speechqe/score_speechqe.py \
    --speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-en2de \
    --dataset_name=h-j-han/SpeechQE-CoVoST2 \
    --base_audio_path=$BASE_AUDIO_PATH \
    --dataset_config_name=en2de \
    --test_split_name=test
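
If you prefer to launch the evaluation from Python, the same command can be assembled as an argument list. The sketch below uses only the script path and flags shown in the command above. The helper name build_score_command is hypothetical, and the fallback path is the placeholder from the download step, not a real location.

```python
import os
import shlex


def build_score_command(
    base_audio_path: str,
    model: str = "h-j-han/SpeechQE-TowerInstruct-7B-en2de",
    config: str = "en2de",
) -> list[str]:
    """Assemble the argument list for speechqe/score_speechqe.py (hypothetical helper)."""
    return [
        "python",
        "speechqe/score_speechqe.py",
        f"--speechqe_model={model}",
        "--dataset_name=h-j-han/SpeechQE-CoVoST2",
        f"--base_audio_path={base_audio_path}",
        f"--dataset_config_name={config}",
        "--test_split_name=test",
    ]


# Read BASE_AUDIO_PATH from the environment; the fallback is only a placeholder.
cmd = build_score_command(os.environ.get("BASE_AUDIO_PATH", "path/to/cv4/download"))

# Print the shell-quoted command for inspection.
print(shlex.join(cmd))

# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the audio path contains spaces.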

Reference

Please find details in this EMNLP 2024 paper:

@misc{han2024speechqe,
    title={SpeechQE: Estimating the Quality of Direct Speech Translation},
    author={HyoJung Han and Kevin Duh and Marine Carpuat},
    year={2024},
    eprint={2410.21485},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}