# AudioCraft objective metrics

In addition to training losses, AudioCraft provides a set of objective metrics
for audio synthesis and audio generation. Because these metrics may require
extra dependencies and can be costly to compute, they are often disabled by default.
This section provides guidance for setting up and using these metrics in
the AudioCraft training pipelines.

## Available metrics

### Audio synthesis quality metrics

#### SI-SNR

We provide an implementation of the Scale-Invariant Signal-to-Noise Ratio in PyTorch.
No specific requirement is needed for this metric. Please activate the metric at the
evaluation stage with the appropriate flag:

**Warning:** We report the opposite of the SI-SNR, i.e. the score multiplied by -1. This is because
the SI-SNR score can also be used as a training loss function, where lower
values should indicate better reconstruction. Negative values are therefore expected and a good sign! They should be multiplied by `-1` again before publication :)

```shell
dora run <...> evaluate.metrics.sisnr=true
```
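
To make the sign convention concrete, here is a minimal pure-Python sketch of SI-SNR and of the negated value described in the warning above. It is not AudioCraft's PyTorch implementation: the function names `si_snr` and `reported_value`, the `eps` constant and the toy signals are all assumptions made for illustration.

```python
import math

def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    # Remove the mean so that a constant offset does not affect the score.
    ref_mean = sum(reference) / len(reference)
    est_mean = sum(estimate) / len(estimate)
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    # Whatever is left after removing the target counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy + eps)

def reported_value(reference, estimate):
    # Sign convention from the warning above: negated SI-SNR, lower is better.
    return -si_snr(reference, estimate)

# Hypothetical toy signals: a sine wave and two degraded copies of it.
reference = [math.sin(0.1 * n) for n in range(200)]
close = [0.5 * r + 0.01 * math.cos(0.37 * n) for n, r in enumerate(reference)]
far = [0.5 * r + 0.5 * math.cos(0.37 * n) for n, r in enumerate(reference)]

print(reported_value(reference, close))  # strongly negative: good reconstruction
print(reported_value(reference, far))    # much closer to zero: poor reconstruction
```

Note that `close` is a rescaled copy of the reference and still scores well, which is the point of the scale-invariant formulation.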

#### ViSQOL

We provide a Python wrapper around the ViSQOL [official implementation](https://github.com/google/visqol)
to conveniently run ViSQOL within the training pipelines.

One must specify the path to the ViSQOL installation through the configuration in order
to enable ViSQOL computations in AudioCraft:

```shell
# the first parameter activates the ViSQOL computation, while the second specifies
# the path to the ViSQOL library used by our Python wrapper
dora run <...> evaluate.metrics.visqol=true metrics.visqol.bin=<path_to_visqol>
```

See an example grid: [Compression with ViSQOL](../audiocraft/grids/compression/encodec_musicgen_32khz.py)

To learn more about ViSQOL and how to build the ViSQOL binary using Bazel, please refer to the
instructions available in the [open source repository](https://github.com/google/visqol).

### Audio generation metrics

#### Frechet Audio Distance

Similarly to ViSQOL, we use a Python wrapper around the Frechet Audio Distance
[official implementation](https://github.com/google-research/google-research/tree/master/frechet_audio_distance)
in TensorFlow.

Note that we had to make several changes to the actual code in order to make it work.
Please refer to the [FrechetAudioDistanceMetric](../audiocraft/metrics/fad.py) class documentation
for more details. We do not plan to provide further support in obtaining a working setup for the
Frechet Audio Distance at this stage.

```shell
# the first parameter activates the FAD metric computation, while the second specifies
# the path to the FAD library used by our Python wrapper
dora run <...> evaluate.metrics.fad=true metrics.fad.bin=<path_to_google_research_repository>
```

See an example grid: [Evaluation with FAD](../audiocraft/grids/musicgen/musicgen_pretrained_32khz_eval.py)
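
For reference, the quantity behind this metric is the Frechet distance between two multivariate Gaussians, one fitted to the embeddings of a reference set and one fitted to the embeddings of the generated set. The symbols below are notation introduced here for illustration, not names taken from the AudioCraft code:

$$
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

Here $\mu_r, \Sigma_r$ are the mean and covariance of the reference embeddings and $\mu_g, \Sigma_g$ those of the generated embeddings. The distance is zero when both sets have identical statistics, and lower values indicate generated audio whose embedding distribution is closer to the reference.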

#### Kullback-Leibler Divergence

We provide a PyTorch implementation of the Kullback-Leibler Divergence computed over the probabilities
of the labels obtained by a state-of-the-art audio classifier. We provide our implementation of the KLD
using the [PaSST classifier](https://github.com/kkoutini/PaSST).

In order to use the KLD metric over PaSST, you must install the PaSST library as an extra dependency:
```shell
pip install 'git+https://github.com/kkoutini/passt_hear21@0.0.19#egg=hear21passt'
```

You can then enable the metric by activating the corresponding flag:

```shell
# one could extend the kld metric with additional audio classifier models that can then be picked through the configuration
dora run <...> evaluate.metrics.kld=true metrics.kld.model=passt
```
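
As a rough illustration of what is being measured, the sketch below computes a KL divergence between two sets of class probabilities in pure Python. It is not the AudioCraft implementation: the function name, the `eps` smoothing, the direction of the divergence (reference relative to generated) and the three-class probabilities are assumptions. In practice the probabilities would come from the classifier's output for the reference and generated audio.

```python
import math

def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) between two discrete distributions over the same labels."""
    # eps keeps the logarithm finite when a class has zero probability.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical class probabilities over three labels.
reference_probs = [0.70, 0.20, 0.10]   # classifier output on the reference audio
matching_probs = [0.65, 0.25, 0.10]    # generated audio with similar content
mismatched_probs = [0.10, 0.20, 0.70]  # generated audio with different content

print(kl_divergence(reference_probs, matching_probs))    # small value
print(kl_divergence(reference_probs, mismatched_probs))  # much larger value
```

Lower values mean that the classifier assigns similar labels to the generated and reference audio.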

#### Text consistency

We provide a text-consistency metric, similar to the MuLan Cycle Consistency from
[MusicLM](https://arxiv.org/pdf/2301.11325.pdf) or the CLAP score used in
[Make-An-Audio](https://arxiv.org/pdf/2301.12661v1.pdf).
More specifically, we provide a PyTorch implementation of a text-consistency metric
relying on a pre-trained [Contrastive Language-Audio Pretraining (CLAP)](https://github.com/LAION-AI/CLAP) model.

Please install the CLAP library as an extra dependency prior to using the metric:
```shell
pip install laion_clap
```

You can then enable the metric by activating the corresponding flag:

```shell
# one could extend the text consistency metric with additional models that can then be picked through the configuration
dora run <...> evaluate.metrics.text_consistency=true metrics.text_consistency.model=clap
```

Note that the text consistency metric based on CLAP will require the CLAP checkpoint to be
provided in the configuration.

#### Chroma cosine similarity

Finally, as introduced in MusicGen, we provide a Chroma Cosine Similarity metric in PyTorch.
No specific requirement is needed for this metric. Please activate the metric at the
evaluation stage with the appropriate flag:

```shell
dora run <...> evaluate.metrics.chroma_cosine=true
```
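
The idea behind the metric can be sketched in a few lines of pure Python: compare two chromagrams frame by frame with a cosine similarity and average the result. This is an illustrative sketch, not the AudioCraft PyTorch code. The function names, the one-hot toy frames and the choice to truncate to the shorter sequence are assumptions.

```python
import math

def frame_cosine(a, b, eps=1e-8):
    """Cosine similarity between two chroma frames (12 pitch-class energies)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / max(norm, eps)

def chroma_cosine_similarity(reference_chroma, generated_chroma):
    """Average per-frame cosine similarity between two chromagrams."""
    # Assumption: only the overlapping frames are compared when lengths differ.
    num_frames = min(len(reference_chroma), len(generated_chroma))
    scores = [
        frame_cosine(reference_chroma[t], generated_chroma[t])
        for t in range(num_frames)
    ]
    return sum(scores) / num_frames

def one_hot(pitch_class):
    # Toy chroma frame with all the energy in a single pitch class.
    return [1.0 if i == pitch_class else 0.0 for i in range(12)]

melody = [one_hot(p) for p in (0, 4, 7, 0)]     # C, E, G, C
same = [one_hot(p) for p in (0, 4, 7, 0, 2)]    # same melody plus one extra frame
different = [one_hot(p) for p in (1, 4, 8, 0)]  # two of the four frames differ

print(chroma_cosine_similarity(melody, same))       # 1.0
print(chroma_cosine_similarity(melody, different))  # 0.5
```

Higher values indicate that the generated audio follows the melody of the reference more closely.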

#### Comparing against reconstructed audio

For all the above audio generation metrics, we offer the option to compute the metric on the ground-truth audio
reconstructed by EnCodec instead of on the generated sample, using the flag `<metric>.use_gt=true`.

## Example usage

You will find example configurations for the different metrics introduced above in:
* The [MusicGen default solver](../config/solver/musicgen/default.yaml) for all audio generation metrics
* The [compression default solver](../config/solver/compression/default.yaml) for all audio synthesis metrics

Similarly, we provide different examples in our grids:
* [Evaluation with ViSQOL](../audiocraft/grids/compression/encodec_musicgen_32khz.py)
* [Evaluation with FAD and others](../audiocraft/grids/musicgen/musicgen_pretrained_32khz_eval.py)