
metadata

language:
  - en
license: apache-2.0
tags:
  - automatic-speech-recognition
datasets:
  - jimregan/psst
  - timit_asr

This repository contains a number of experiments for the PSST Challenge.

As the test set is unavailable, all numbers are based on the validation set.

All models in the tables below were finetuned from the Wav2vec 2.0 Base checkpoint with no prior finetuning.

Our best-performing model overall (FER: 9.2%, PER: 21.0%) was finetuned from the Wav2vec 2.0 Large checkpoint with no prior finetuning (git tag: larger-rir). It was trained with the TIMIT subset augmented with Room Impulse Response (RIR), the augmentation that gave the best results in the base-model experiments below.
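Because each experiment lives under its own git tag, a particular checkpoint can be selected with the `revision` argument of `from_pretrained`, which accepts a tag name. The following is a minimal sketch, not a tested recipe: it assumes the repository ID is `jimregan/psst-partial-timit`, and that each tag holds a standard Wav2vec 2.0 CTC checkpoint together with a processor configuration. `load_tagged` is a hypothetical helper name.

```python
REPO_ID = "jimregan/psst-partial-timit"  # assumed repository ID


def load_tagged(tag, repo_id=REPO_ID):
    """Load the processor and CTC model stored under a git tag, e.g. "rir"."""
    # Imported lazily so this module can be imported without transformers.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    # `revision` may be a branch name, a tag name, or a commit hash.
    processor = Wav2Vec2Processor.from_pretrained(repo_id, revision=tag)
    model = Wav2Vec2ForCTC.from_pretrained(repo_id, revision=tag)
    return processor, model
```

For example, `load_tagged("larger-rir")` would request the large model described above, provided the tag contains the files this sketch assumes.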

Augmented TIMIT subset

Using a subset of TIMIT that could be mapped easily to the phoneset used by the PSST Challenge data (a list of IDs is in the repository), we experimented with augmenting the data to better match the PSST data.

The best results were obtained using Room Impulse Response (FER: 9.6%, PER: 21.8%; git tag: rir).

| Augmentation | FER | PER | Git tag |
|---|---|---|---|
| Unaugmented | 10.2% | 22.5% | huggingface-unaugmented |
| Gaussian noise | 10.0% | 22.1% | gaussian |
| Pitchshift | 9.6% | 22.9% | pitchshift |
| RIR | 9.6% | 21.8% | rir |
| Time stretch | 10.1% | 22.8% | timestretch |
| Gaussian noise + RIR | 10.0% | 23.4% | gaussian-rir |
| Pitchshift + Gaussian noise | 9.9% | 22.9% | pitchshift-gaussian |
| Pitchshift + RIR | 9.9% | 22.8% | pitchshift-rir |
| Time stretch + Gaussian noise | 10.2% | 22.8% | timestretch-gaussian |
| Time stretch + Pitchshift | 9.8% | 22.0% | timestretch-pitchshift |
| Time stretch + RIR | 9.7% | 22.2% | timestretch-rir |
| Pitchshift + Gaussian noise + RIR | 10.1% | 23.5% | pitchshift-gaussian-rir |
| Time stretch + Gaussian noise + RIR | 9.7% | 22.3% | timestretch-gaussian-rir |
| Time stretch + Pitchshift + Gaussian noise | 10.2% | 22.9% | timestretch-pitchshift-gaussian |
| Time stretch + Pitchshift + RIR | 10.2% | 22.5% | timestretch-pitchshift-rir |
| Time stretch + Pitchshift + Gaussian noise + RIR | 10.9% | 24.1% | timestretch-pitchshift-gaussian-rir |
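To make two of the augmentations above concrete, here is a minimal sketch of Gaussian noise and RIR applied to a list of audio samples. It is not the code used for these experiments: the function names, the noise level, and the toy impulse response are all assumptions chosen for illustration, and it uses only the standard library to stay dependency-free.

```python
import random


def add_gaussian_noise(samples, std=0.005, seed=None):
    """Return a copy of `samples` with zero-mean Gaussian noise added.

    `std` is an illustrative value, not the setting used in the experiments.
    """
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, std) for s in samples]


def apply_rir(samples, impulse_response):
    """Convolve `samples` with a room impulse response.

    The output is trimmed to the input length so that the audio stays
    aligned with its original transcription.
    """
    out = [0.0] * len(samples)
    for i, s in enumerate(samples):
        for j, h in enumerate(impulse_response):
            if i + j < len(out):
                out[i + j] += s * h
    return out


# A single click followed by silence, and a toy impulse response with
# the direct sound plus one echo two samples later at half amplitude.
click = [1.0, 0.0, 0.0, 0.0]
toy_rir = [1.0, 0.0, 0.5]
reverberant = apply_rir(click, toy_rir)  # [1.0, 0.0, 0.5, 0.0]
noisy = add_gaussian_noise(reverberant, seed=0)
```

Chaining the two calls, as in the last lines, corresponds to the "Gaussian noise + RIR" style of combined augmentation, although the order in which the experiments applied them is not stated here.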

LM experiments

We experimented with a number of language model configurations, combining the data from the PSST challenge, the subset of TIMIT we used, and CMUdict.

We tried combining CMUdict data in a number of ways: unmodified, with a silence token added at the start of the pronunciation, at the end, and at both the start and the end.
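The four CMUdict variants can be sketched as a small function over a pronunciation. This is an illustration under assumptions rather than the script used here: the silence token `<sil>` and the helper name `add_silence` are hypothetical, and the mode names are borrowed from the tag suffixes in the results table (`nosil`, `sils`, `sile`, `silb`).

```python
SIL = "<sil>"  # assumed silence token; substitute the phoneset's own symbol


def add_silence(phones, mode):
    """Return a pronunciation with silence added according to `mode`.

    "nosil": unmodified
    "sils":  silence at the start
    "sile":  silence at the end
    "silb":  silence at both the start and the end
    """
    if mode == "nosil":
        return list(phones)
    if mode == "sils":
        return [SIL, *phones]
    if mode == "sile":
        return [*phones, SIL]
    if mode == "silb":
        return [SIL, *phones, SIL]
    raise ValueError(f"unknown mode: {mode!r}")


# Illustrative pronunciation with stress markers already removed.
hello = ["HH", "AH", "L", "OW"]
for mode in ("nosil", "sils", "sile", "silb"):
    print(mode, " ".join(add_silence(hello, mode)))
```

Each output line is one training sentence for the n-gram LM in the corresponding configuration.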

The best result (FER: 10.2%, PER: 22.1%) came from a 5-gram model with a silence token added at the end of each CMUdict pronunciation (git tag: lm-nosil-cmudict-sile.5).

Evaluation was performed using scripts provided by the PSST Challenge's organisers, so there are no scripts in place to automatically use the LM with the transformers library.

| Configuration | n-gram | FER | PER | Git tag |
|---|---|---|---|---|
| Baseline + TIMIT | --- | 10.2% | 22.5% | huggingface-unaugmented |
| All silences | 4 | 10.5% | 23.0% | lm-allsil.4 |
| | 5 | 10.5% | 22.6% | lm-allsil.5 |
| | 6 | 10.3% | 22.3% | lm-allsil.6 |
| No silences | 4 | 10.3% | 22.6% | lm-nosil.4 |
| | 5 | 10.2% | 22.2% | lm-nosil.5 |
| | 6 | 10.2% | 22.4% | lm-nosil.6 |
| **PSST and TIMIT without silence** | | | | |
| Unmodified CMUdict | 4 | 10.3% | 22.6% | lm-nosil-cmudict-nosil.4 |
| | 5 | 10.2% | 22.2% | lm-nosil-cmudict-nosil.5 |
| | 6 | 10.2% | 22.4% | lm-nosil-cmudict-nosil.6 |
| CMUdict-end | 4 | 10.3% | 22.6% | lm-nosil-cmudict-sile.4 |
| | 5 | 10.2% | 22.1% | lm-nosil-cmudict-sile.5 |
| | 6 | 10.2% | 22.3% | lm-nosil-cmudict-sile.6 |
| CMUdict-start | 4 | 10.4% | 22.6% | lm-nosil-cmudict-sils.4 |
| | 5 | 10.3% | 22.4% | lm-nosil-cmudict-sils.5 |
| | 6 | 10.3% | 22.3% | lm-nosil-cmudict-sils.6 |
| CMUdict-both | 4 | 10.4% | 22.7% | lm-nosil-cmudict-silb.4 |
| | 5 | 10.4% | 22.3% | lm-nosil-cmudict-silb.5 |
| | 6 | 10.3% | 22.3% | lm-nosil-cmudict-silb.6 |
| **Unmodified PSST and TIMIT** | | | | |
| Unmodified CMUdict | 4 | 10.3% | 22.8% | lm-orig-cmudict-nosil.4 |
| | 5 | 10.3% | 22.4% | lm-orig-cmudict-nosil.5 |
| | 6 | 10.2% | 22.4% | lm-orig-cmudict-nosil.6 |
| CMUdict-end | 4 | 10.3% | 22.7% | lm-orig-cmudict-sile.4 |
| | 5 | 10.2% | 22.2% | lm-orig-cmudict-sile.5 |
| | 6 | 10.2% | 22.3% | lm-orig-cmudict-sile.6 |
| CMUdict-start | 4 | 10.5% | 22.8% | lm-orig-cmudict-sils.4 |
| | 5 | 10.4% | 22.5% | lm-orig-cmudict-sils.5 |
| | 6 | 10.3% | 22.4% | lm-orig-cmudict-sils.6 |
| CMUdict-both | 4 | 10.5% | 22.8% | lm-orig-cmudict-silb.4 |
| | 5 | 10.4% | 22.4% | lm-orig-cmudict-silb.5 |
| | 6 | 10.4% | 22.4% | lm-orig-cmudict-silb.6 |
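For readers unfamiliar with the PER column, phoneme error rate is commonly defined as the edit distance between predicted and reference phoneme sequences, summed over the data and divided by the total number of reference phonemes. The sketch below implements that common definition only; it is not the organisers' evaluation script, its function names are hypothetical, and it may differ from the official scoring in details such as token normalisation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by the total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


refs = [["HH", "AH", "L", "OW"], ["K", "AE", "T"]]
hyps = [["HH", "AH", "L"], ["K", "AH", "T"]]
# One deletion plus one substitution over seven reference phonemes.
print(f"PER: {phoneme_error_rate(refs, hyps):.1%}")
```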