Evaluations are a bit disingenuous
Cool release and congrats on finetuning :)
Some remarks on the evaluation:
First of all, thanks for sharing them. I could only verify this because you were so transparent about all of it, so thanks for that. Currently, all Whisper models except yours get punished for using ß vs. ss, which is fairly common in German (see the small jiwer demo after the numbers below). Punctuation is also punished, even though it is not always unambiguous in spoken language. Just from experimenting a bit with the evals, they are highly dependent on which normalizations one includes. With the commonly used Whisper normalizer, the evals look like this:
```
******************** common_voice_19_0 ********************
openai-whisper-large-v3-turbo: 4.08
openai-whisper-large-v3: 3.73
primeline-whisper-large-v3-german: 2.83
nyrahealth-CrisperWhisper: 2.07
primeline-whisper-large-v3-turbo-german: 3.40
******************** multilingual librispeech ********************
openai-whisper-large-v3-turbo: 3.23
openai-whisper-large-v3: 2.82
primeline-whisper-large-v3-german: 2.10
nyrahealth-CrisperWhisper: 2.64
primeline-whisper-large-v3-turbo-german: 2.09
******************** Tuda-De ********************
openai-whisper-large-v3-turbo: 8.69
openai-whisper-large-v3: 8.31
primeline-whisper-large-v3-german: 8.54
nyrahealth-CrisperWhisper: 5.04
primeline-whisper-large-v3-turbo-german: 6.70
******************** All ********************
openai-whisper-large-v3-turbo: 3.76
openai-whisper-large-v3: 3.37
primeline-whisper-large-v3-german: 2.64
nyrahealth-CrisperWhisper: 2.58
primeline-whisper-large-v3-turbo-german: 2.72
```
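To make the ß/ss point above concrete, here is a minimal jiwer illustration (my own toy example, not part of the eval script):

```python
import jiwer

ref = "die straße ist naß"   # reference with ß spellings
hyp = "die strasse ist nass"

# Without normalization, both ß words count as substitutions:
print(jiwer.wer(ref, hyp))                     # 0.5

# With the ß -> ss replacement, the transcripts are identical:
print(jiwer.wer(ref.replace("ß", "ss"), hyp))  # 0.0
```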
So feel free to run them yourself, or comment individual normalizations in and out of my wer_standardize function and use wer_standardize in calculate_wer. Maybe you find a way to display a fairer evaluation, or multiple ones that reflect various use cases.
Common Voice itself is problematic to run evaluations on anyway, since its test sets are not static between releases :)
```python
import re
from collections import defaultdict

from datasets import load_dataset
from jiwer import wer
from whisper_normalizer.english import EnglishTextNormalizer

english_normalizer = EnglishTextNormalizer()


def wer_standardize2(text):
    # Apply OpenAI's English text normalizer (lowercasing, punctuation
    # stripping, English number/currency handling, etc.)
    return english_normalizer(text)


def wer_standardize(text):
    # Convert to lowercase
    text = text.lower()
    # Replace ß with ss
    text = text.replace('ß', 'ss')
    # Remove words enclosed in []
    text = re.sub(r'\[.*?\]', '', text)
    # Remove quotes
    text = re.sub(r'[\'"]', '', text)
    # Remove punctuation and hyphens
    text = re.sub(r'[^\w\s]', '', text)
    # Strip and collapse whitespace
    text = re.sub(r'\s+', ' ', text.strip())
    return text


# Swap in wer_standardize below and comment individual normalization
# steps in and out to compare their effect.
def calculate_wer(references, hypotheses):
    references = [wer_standardize2(ref) for ref in references]
    hypotheses = [wer_standardize2(hyp) for hyp in hypotheses]
    return wer(references, hypotheses)


def process_dataset(dataset):
    models = [
        'openai-whisper-large-v3-turbo',
        'openai-whisper-large-v3',
        'primeline-whisper-large-v3-german',
        'nyrahealth-CrisperWhisper',
        'primeline-whisper-large-v3-turbo-german',
    ]
    set_names = ["All"]
    results = defaultdict(lambda: defaultdict(lambda: {'references': [], 'predictions': []}))
    for i in range(len(dataset)):
        item = dataset[i]
        dataset_source = str(item['from'])
        reference = item['references']
        set_names.append(dataset_source)
        for model in models:
            prediction = item.get(model, '')
            if prediction:  # Skip empty transcriptions
                results[dataset_source][model]['references'].append(reference)
                results[dataset_source][model]['predictions'].append(prediction)
                results["All"][model]['references'].append(reference)
                results["All"][model]['predictions'].append(prediction)
    return results, set(set_names)


def print_results(results, set_names):
    for dataset_source in set_names:
        print("\n\n")
        print("*" * 20, dataset_source, "*" * 20)
        for model in results[dataset_source]:
            refs = results[dataset_source][model]['references']
            preds = results[dataset_source][model]['predictions']
            if refs and preds:  # Check if lists are not empty
                error_rate = calculate_wer(refs, preds) * 100
                print(f"{model}: {error_rate:.2f}")
        print("*" * 50)


# Load the dataset and evaluate its 'train' split
ds = load_dataset("flozi00/asr-german-mixed-evals")
dataset = ds['train']
results, set_names = process_dataset(dataset)
print_results(results, set_names)
```
Great, thanks for this feedback.
I will take a deeper look later today.
If it looks good I will rerun the evals with updated code 👍
To be honest, I was very surprised by how poorly the initial primeline Whisper model scored.
Thanks for this good catch 😎
No worries :) Glad it cleared up some confusion on your side too :) The devil's in the details with this stuff, always haha
Thanks for pointing this out. I think it is an important topic. However, I don't understand the code you posted. It seems that you propose a meaningful method to normalize the German text, but then use the other function, which applies an English normalizer; that doesn't seem meaningful.
In our internal evaluation, we also normalize numbers, so that every number is replaced with the corresponding word, e.g., 11 => elf.
We also have special cases for ordinal numbers, e.g., 1. => erstens, and for Roman numerals, e.g., könig ludwig XIV. => könig ludwig der vierzehnte.
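As a rough sketch of what such digit-to-word normalization could look like (my illustration using the num2words package, not the internal code described above; ordinals and Roman numerals would need extra, context-aware rules):

```python
# pip install num2words
import re

from num2words import num2words


def normalize_german_numbers(text):
    # Hypothetical helper: replace standalone digit sequences with
    # German number words, e.g. "11" -> "elf".
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="de"), text)


print(normalize_german_numbers("es gibt 11 spieler"))  # "es gibt elf spieler"
```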
You can try both. The English normalizer is what a lot of people use, and it should be a sensible default at least for German: it normalizes symbols like Euro and Dollar, strips punctuation, lowercases, etc. https://github.com/openai/whisper/blob/main/whisper/normalizers/english.py. The other stuff I mainly included so you guys can play with it directly :) Normalizing numbers is definitely an important point too! :)
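If in doubt, the quickest way to see what the English normalizer actually does to German text is to print a few normalized samples (just a small probe; outputs depend on the whisper_normalizer version):

```python
from whisper_normalizer.english import EnglishTextNormalizer

norm = EnglishTextNormalizer()

# Inspect what the English normalizer does to typical German sentences.
for s in ["Die Straße kostet 5 Euro!", "König Ludwig XIV. regierte lange."]:
    print(repr(norm(s)))
```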
I updated the results and the code.
Thanks for these great findings.
I had already implemented those normalizations back in the days of my wav2vec models, but somehow forgot that I already had the code, and missed carrying the learnings over.
Cool! Very nice :)
Yes, we should jointly work on a good way to do normalization for German.
I cannot believe that using the English normalizer works well for German, although it is certainly sophisticated. Take numbers, for example: say the model outputs numbers as digits and the ground truth uses number words instead (or the other way around). The English normalizer will not convert German number words correctly (e.g., elf), so the WER goes up although the model transcribed correctly. If it is the international scientific standard to do this (which I hope is not the case), we should not accept it, but rather propose a better solution.
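A quick check that illustrates this concern, reusing the whisper_normalizer package from the script above (my example; exact normalizer behavior may vary by version):

```python
from jiwer import wer
from whisper_normalizer.english import EnglishTextNormalizer

norm = EnglishTextNormalizer()

# English number words are mapped to digits, so reference and hypothesis match:
print(wer(norm("eleven"), norm("11")))  # 0.0

# The German word "elf" is left untouched, so the same case stays a full error:
print(wer(norm("elf"), norm("11")))     # 1.0
```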