# Amphion Evaluation Recipe

## Supported Evaluation Metrics

Amphion Evaluation currently supports the following objective metrics:

- **F0 Modeling**:
  - F0 Pearson Coefficients (FPC)
  - F0 Periodicity Root Mean Square Error (PeriodicityRMSE)
  - F0 Root Mean Square Error (F0RMSE)
  - Voiced/Unvoiced F1 Score (V/UV F1)
- **Energy Modeling**:
  - Energy Root Mean Square Error (EnergyRMSE)
  - Energy Pearson Coefficients (EnergyPC)
- **Intelligibility**:
  - Character Error Rate (CER) based on [Whisper](https://github.com/openai/whisper)
  - Word Error Rate (WER) based on [Whisper](https://github.com/openai/whisper)
- **Spectrogram Distortion**:
  - Frechet Audio Distance (FAD)
  - Mel Cepstral Distortion (MCD)
  - Multi-Resolution STFT Distance (MSTFT)
  - Perceptual Evaluation of Speech Quality (PESQ)
  - Short Time Objective Intelligibility (STOI)
  - Scale Invariant Signal to Distortion Ratio (SISDR)
  - Scale Invariant Signal to Noise Ratio (SISNR)
- **Speaker Similarity**:
  - Cosine similarity based on:
    - [RawNet3](https://github.com/Jungjee/RawNet)
    - [Resemblyzer](https://github.com/resemble-ai/Resemblyzer)
    - [WavLM](https://huggingface.co/microsoft/wavlm-base-plus-sv)

We provide a recipe that demonstrates how to objectively evaluate your generated audio. There are three steps in total:

1. Pretrained Models Preparation
2. Audio Data Preparation
3. Evaluation

## 1. Pretrained Models Preparation

If you want to calculate `RawNet3`-based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).

## 2. Audio Data Preparation

Prepare the reference audio and the generated audio in two folders: `ref_dir` contains the reference audio and `gen_dir` contains the generated audio. Here is an example:

```plaintext
 ┣ {ref_dir}
 ┃ ┣ sample1.wav
 ┃ ┣ sample2.wav
 ┣ {gen_dir}
 ┃ ┣ sample1.wav
 ┃ ┣ sample2.wav
```

Make sure that each **reference audio file and its generated counterpart have the same file name**, as illustrated above (sample1 to sample1, sample2 to sample2).
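Before running the evaluation, it can help to confirm that the two folders really are paired by name. The following is a minimal sketch, assuming all audio files use the `.wav` extension as in the example above; `check_pairs` is a hypothetical helper and not part of Amphion.

```bash
#!/bin/bash
# Sketch: report reference files that have no same-named generated file.
# check_pairs is a hypothetical helper, not part of Amphion.

check_pairs() {
    local ref_dir=$1
    local gen_dir=$2
    local missing=0
    local ref name

    for ref in "$ref_dir"/*.wav; do
        # Skip the unexpanded pattern when the folder holds no .wav files.
        [ -e "$ref" ] || continue
        name=$(basename "$ref")
        if [ ! -f "$gen_dir/$name" ]; then
            echo "Missing generated audio for: $name"
            missing=$((missing + 1))
        fi
    done

    echo "Unmatched reference files: $missing"
    # Return success only when every reference file has a counterpart.
    [ "$missing" -eq 0 ]
}

# Usage: check_pairs /path/to/ref_dir /path/to/gen_dir
```

The function only checks file names; it does not inspect the audio content or sampling rate.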

## 3. Evaluation

Run `run.sh` with the reference folder, generated folder, dump folder, and metrics specified.

```bash
cd Amphion
sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics [The metrics you need] \
	--fs [Optional. To calculate all metrics in the specified sampling rate] \
	--similarity_model [Optional. To choose the model for calculating the speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Default to "wavlm"] \
	--similarity_mode [Optional. To choose the mode for calculating the speaker similarity. "pairwith" for calculating a series of ground truth / prediction audio pairs to obtain the speaker similarity, and "overall" for computing the average score with all possible pairs between the reference folder and generated folder. Default to "pairwith"] \
	--intelligibility_mode [Optional. To choose the mode for computing CER and WER. "gt_audio" means selecting the recognition content of the reference audio as the target, "gt_content" means using transcription as the target. Default to "gt_audio"] \
	--ltr_path [Optional. Path to the transcription file] \
	--language [Optional. Language for computing CER and WER. Default to "english"]
```

As for the metrics, an example is provided below:

```bash
--metrics "mcd pesq fad"
```

All currently available metric keywords are listed below:

| Keys                      | Description                                |
| ------------------------- | ------------------------------------------ |
| `fpc`                     | F0 Pearson Coefficients                    |
| `f0_periodicity_rmse`     | F0 Periodicity Root Mean Square Error      |
| `f0rmse`                  | F0 Root Mean Square Error                  |
| `v_uv_f1`                 | Voiced/Unvoiced F1 Score                   |
| `energy_rmse`             | Energy Root Mean Square Error              |
| `energy_pc`               | Energy Pearson Coefficients                |
| `cer`                     | Character Error Rate                       |
| `wer`                     | Word Error Rate                            |
| `similarity`              | Speaker Similarity                         |
| `fad`                     | Frechet Audio Distance                     |
| `mcd`                     | Mel Cepstral Distortion                    |
| `mstft`                   | Multi-Resolution STFT Distance             |
| `pesq`                    | Perceptual Evaluation of Speech Quality    |
| `si_sdr`                  | Scale Invariant Signal to Distortion Ratio |
| `si_snr`                  | Scale Invariant Signal to Noise Ratio      |
| `stoi`                    | Short Time Objective Intelligibility       |

For example, if you want to calculate the speaker similarity between the synthesized audio and the reference audio with the same content, run:

```bash
sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics "similarity" \
	--similarity_model [Optional. To choose the model for calculating the speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Default to "wavlm"] \
	--similarity_mode "pairwith"
```

If you don't have reference audio with the same content, run the following to get the content-free similarity score:

```bash
sh egs/metrics/run.sh \
	--reference_folder [Your path to the reference audios] \
	--generated_folder [Your path to the generated audios] \
	--dump_folder [Your path to dump the objective results] \
	--metrics "similarity" \
	--similarity_model [Optional. To choose the model for calculating the speaker similarity. Currently "rawnet", "wavlm" and "resemblyzer" are available. Default to "wavlm"] \
	--similarity_mode "overall"
```
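To compare the scores from the three similarity models listed above, one option is to run the same command once per model. The sketch below only prints the commands so they can be reviewed first; `print_similarity_runs` and the per-model dump subfolders are hypothetical choices, not something Amphion requires.

```bash
#!/bin/bash
# Sketch: print one run.sh invocation per similarity model.
# print_similarity_runs and the per-model dump subfolders are hypothetical.

print_similarity_runs() {
    local ref_dir=$1
    local gen_dir=$2
    local dump_root=$3
    local model

    # Model names are the ones documented for --similarity_model.
    for model in rawnet wavlm resemblyzer; do
        echo "sh egs/metrics/run.sh --reference_folder \"$ref_dir\" --generated_folder \"$gen_dir\" --dump_folder \"$dump_root/$model\" --metrics \"similarity\" --similarity_model \"$model\" --similarity_mode \"overall\""
    done
}

# Usage: print_similarity_runs /path/to/ref_dir /path/to/gen_dir /path/to/dump_root
```

Writing each model's results to its own dump subfolder is meant to keep the runs from overwriting one another. Note that the `rawnet` run requires the pretrained model described in step 1.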

## Troubleshooting

### FAD (Using Offline Models)

If your system is unable to access huggingface.co from the terminal, you might run into an error like "OSError: Can't load tokenizer for ...". To work around this, follow these steps to use local models:

1. Download the [bert-base-uncased](https://huggingface.co/bert-base-uncased), [roberta-base](https://huggingface.co/roberta-base), and [facebook/bart-base](https://huggingface.co/facebook/bart-base) models from `huggingface.co`. Ensure that the models are complete and uncorrupted, and place their directories within `Amphion/pretrained`. For the detailed file structure, see [this README](../../pretrained/README.md#optional-model-dependencies-for-evaluation) under `Amphion/pretrained`.
2. Inside the `Amphion/pretrained` directory, create a bash script with the content outlined below. This script will automatically update the tokenizer paths used by your system:
  ```bash
  #!/bin/bash

  BERT_DIR="bert-base-uncased"
  ROBERTA_DIR="roberta-base"
  BART_DIR="facebook/bart-base"
  PYTHON_SCRIPT="[YOUR ENV PATH]/lib/python3.9/site-packages/laion_clap/training/data.py"

  update_tokenizer_path() {
      local dir_name=$1
      local tokenizer_variable=$2
      local full_path

      if [ -d "$dir_name" ]; then
          full_path=$(realpath "$dir_name")
          if [ -f "$PYTHON_SCRIPT" ]; then
              sed -i "s|${tokenizer_variable}.from_pretrained(\".*\")|${tokenizer_variable}.from_pretrained(\"$full_path\")|" "$PYTHON_SCRIPT"
              echo "Updated ${tokenizer_variable} path to $full_path."
          else
              echo "Error: The specified Python script does not exist."
              exit 1
          fi
      else
          echo "Error: The directory $dir_name does not exist in the current directory."
          exit 1
      fi
  }

  update_tokenizer_path "$BERT_DIR" "BertTokenizer"
  update_tokenizer_path "$ROBERTA_DIR" "RobertaTokenizer"
  update_tokenizer_path "$BART_DIR" "BartTokenizer"

  echo "BERT, BART and RoBERTa Python script paths have been updated."

  ```

3. The script adjusts the tokenizer paths in the `data.py` file, found under `/lib/python3.9/site-packages/laion_clap/training/` within your environment. If you use conda, you can find your environment path by running `conda info --envs`, then substitute it for `[YOUR ENV PATH]` in the script. If your environment is configured differently, update the `PYTHON_SCRIPT` variable so that it points to the `data.py` file.
4. Run the script. If it executes successfully, the tokenizer paths will be updated, allowing them to be loaded locally.
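To confirm that the substitution took effect, you can list the `from_pretrained(...)` calls in the patched file and check that they now point to local directories. This is a minimal sketch; `show_tokenizer_paths` is a hypothetical helper and not part of Amphion or `laion_clap`.

```bash
#!/bin/bash
# Sketch: list every from_pretrained(...) call in a file so the paths can be inspected.
# show_tokenizer_paths is a hypothetical helper, not part of Amphion or laion_clap.

show_tokenizer_paths() {
    local python_script=$1

    if [ ! -f "$python_script" ]; then
        echo "Error: $python_script does not exist."
        return 1
    fi

    # -n prefixes each matching line with its line number.
    grep -n 'from_pretrained(' "$python_script"
}

# Usage: show_tokenizer_paths "[YOUR ENV PATH]/lib/python3.9/site-packages/laion_clap/training/data.py"
```

Each printed line should show an absolute local path rather than a model name such as `bert-base-uncased`.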

### WavLM-based Speaker Similarity (Using Offline Models)

If your system is unable to access huggingface.co from the terminal and you want to calculate `WavLM`-based speaker similarity, you need to download the pretrained model first, as illustrated [here](../../pretrained/README.md).