	---
	license: apache-2.0
	base_model: facebook/wav2vec2-base
	tags:
	- audio-classification
	- generated_from_trainer
	metrics:
	- accuracy
	model-index:
	- name: wav2vec2-base_than_I_did
	  results: []
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	# wav2vec2-base_than_I_did

	This model is a fine-tuned version of [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the MatsRooth/than_I_did dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.2077
	- Accuracy: 0.9592

	## Model description

	This is a binary classifier for the prosody of tokens of "I did". The label `s` indicates prominence on the subject "I".
	The label `ns` covers the complementary cases, where prominence falls either on "did" or on later material.
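
A minimal inference sketch, assuming the checkpoint is published under the repo id `MatsRooth/wav2vec2-base_than_I_did`, that the `transformers` audio-classification pipeline can load it, and that the returned labels are the strings `s` and `ns` (this card does not show the label mapping). The helper names and the example scores are hypothetical.

```python
def top_label(predictions):
    """Return the label with the highest score from pipeline-style output."""
    return max(predictions, key=lambda p: p["score"])["label"]


def classify(wav_path, model_id="MatsRooth/wav2vec2-base_than_I_did"):
    """Classify one audio clip of "I did".

    Requires transformers and a backend such as PyTorch; the repo id is an
    assumption based on the model name in this card.
    """
    # Imported lazily so top_label() works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)
    # The pipeline returns a list of {"label": ..., "score": ...} dicts.
    return top_label(classifier(wav_path))


# Hand-written scores in the pipeline's output format (not real model output):
example = [{"label": "s", "score": 0.91}, {"label": "ns", "score": 0.09}]
print(top_label(example))
```

Calling `classify` on a clip cut to the words "I did" would return whichever of the two labels scores higher, under the assumptions above.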

	## Intended uses & limitations

	Research on prosody.

	## Training and evaluation data

	The utterances were collected from YouTube, aligned with the YouTube transcript using Kaldi, and cut to the
	words "I did" using Matlab. Labels were assigned by the experimenter, using 's' for tokens where the main clause
	subject differed from the than-clause subject, and 'ns' for all other tokens. The labeling is based on the text
	rather than on prosody, though it correlates with prosody.
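
The labeling rule can be sketched as a small function. This is only an illustration of the rule as stated: the labels in the dataset were assigned by hand, and the function name and the example sentences in the comments are hypothetical, not taken from the data.

```python
def label_token(main_subject, than_subject="I"):
    """Return 's' when the main clause subject differs from the than-clause
    subject, and 'ns' otherwise (comparison ignores case and whitespace)."""
    same = main_subject.strip().lower() == than_subject.strip().lower()
    return "ns" if same else "s"


# Hypothetical sentences, not taken from the dataset:
#   "She ran faster than I did."           main clause subject: "She"
#   "I ran faster today than I did then."  main clause subject: "I"
print(label_token("She"))
print(label_token("I"))
```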

	For an SVM classifier applied to the same problem, see Howell, Jonathan, Mats Rooth, and Michael Wagner, "Acoustic classification of focus: On the web and in the lab" (2016).

	The `ns` class was reduced to 160 tokens to match the number of `s` tokens.
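
One way to balance the classes is random downsampling, sketched below. The card does not say how the 160 `ns` tokens were chosen or how many there were originally, so the random selection, the seed, the starting count of 500, and the file names are all assumptions made for illustration.

```python
import random


def downsample(tokens, target_size, seed=0):
    """Pick target_size items at random, without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(tokens, target_size)


# Hypothetical file lists; only the target size of 160 comes from the card.
s_tokens = [f"s_{i:04d}.wav" for i in range(160)]
ns_tokens = [f"ns_{i:04d}.wav" for i in range(500)]

ns_balanced = downsample(ns_tokens, len(s_tokens))
print(len(s_tokens), len(ns_balanced))
```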

	## Training procedure

	Training and evaluation use the `run_audio_classification.py` example script from Hugging Face Transformers. The Slurm script `than_I_did.sub` launches training.

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 3e-05
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 0
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 32
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_ratio: 0.1
	- num_epochs: 20.0
	- mixed_precision_training: Native AMP
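
The arithmetic behind these settings can be sketched in plain Python. The sketch assumes a single device (so the total batch size is the per-device batch size times the accumulation steps), takes the total of 160 optimizer steps from the last row of the training results table, and models the schedule as linear warmup followed by linear decay to zero. It is an illustration of the schedule shape, not the Trainer's own code.

```python
import math

LEARNING_RATE = 3e-05
TRAIN_BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 2
WARMUP_RATIO = 0.1
TOTAL_STEPS = 160  # final optimizer step in the training results table

# Assumes one device: 16 examples per batch, gradients accumulated over 2 batches.
effective_batch = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - warmup_steps)


print(effective_batch, warmup_steps)
for step in (0, 8, 16, 88, 160):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at 3e-05 after 16 steps and falls back to zero at step 160.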

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:--------:|
	| No log | 0.94 | 8 | 0.6940 | 0.4694 |
	| 0.6939 | 2.0 | 17 | 0.6776 | 0.6735 |
	| 0.6844 | 2.94 | 25 | 0.6505 | 0.6531 |
	| 0.6752 | 4.0 | 34 | 0.6390 | 0.6122 |
	| 0.6071 | 4.94 | 42 | 0.5664 | 0.7959 |
	| 0.5483 | 6.0 | 51 | 0.4090 | 0.8571 |
	| 0.5483 | 6.94 | 59 | 0.3948 | 0.8163 |
	| 0.4747 | 8.0 | 68 | 0.4082 | 0.8163 |
	| 0.4782 | 8.94 | 76 | 0.3435 | 0.8776 |
	| 0.4403 | 10.0 | 85 | 0.3410 | 0.8776 |
	| 0.4682 | 10.94 | 93 | 0.2878 | 0.8980 |
	| 0.4032 | 12.0 | 102 | 0.2589 | 0.9184 |
	| 0.359 | 12.94 | 110 | 0.2554 | 0.9184 |
	| 0.359 | 14.0 | 119 | 0.2077 | 0.9592 |
	| 0.3142 | 14.94 | 127 | 0.1839 | 0.9592 |
	| 0.3735 | 16.0 | 136 | 0.1944 | 0.9388 |
	| 0.3655 | 16.94 | 144 | 0.1870 | 0.9592 |
	| 0.3918 | 18.0 | 153 | 0.2005 | 0.9592 |
	| 0.3305 | 18.82 | 160 | 0.1947 | 0.9592 |
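
The evaluation figures reported at the top of this card (loss 0.2077, accuracy 0.9592) can be located in the table with a short script. The rows below are copied from the table; the variable names are illustrative, and the card does not state how the reported checkpoint was selected.

```python
# (epoch, step, validation_loss, accuracy), copied from the table above.
ROWS = [
    (0.94, 8, 0.6940, 0.4694), (2.0, 17, 0.6776, 0.6735),
    (2.94, 25, 0.6505, 0.6531), (4.0, 34, 0.6390, 0.6122),
    (4.94, 42, 0.5664, 0.7959), (6.0, 51, 0.4090, 0.8571),
    (6.94, 59, 0.3948, 0.8163), (8.0, 68, 0.4082, 0.8163),
    (8.94, 76, 0.3435, 0.8776), (10.0, 85, 0.3410, 0.8776),
    (10.94, 93, 0.2878, 0.8980), (12.0, 102, 0.2589, 0.9184),
    (12.94, 110, 0.2554, 0.9184), (14.0, 119, 0.2077, 0.9592),
    (14.94, 127, 0.1839, 0.9592), (16.0, 136, 0.1944, 0.9388),
    (16.94, 144, 0.1870, 0.9592), (18.0, 153, 0.2005, 0.9592),
    (18.82, 160, 0.1947, 0.9592),
]
REPORTED = (0.2077, 0.9592)  # loss and accuracy from the card summary

# Row(s) whose validation loss and accuracy equal the reported figures.
matching = [row for row in ROWS if (row[2], row[3]) == REPORTED]
# Row with the lowest validation loss.
lowest_loss = min(ROWS, key=lambda row: row[2])
# First row that reaches the highest accuracy in the table.
best_accuracy = max(row[3] for row in ROWS)
first_best = next(row for row in ROWS if row[3] == best_accuracy)

print(matching)
print(lowest_loss)
print(first_best)
```

The reported figures match the row at epoch 14.0 (step 119), which is also the first row to reach the highest accuracy in the table. The lowest validation loss, 0.1839, occurs one evaluation later at step 127.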


	### Framework versions

	- Transformers 4.36.0.dev0
	- PyTorch 2.1.0+cu121
	- Datasets 2.13.1
	- Tokenizers 0.15.0