Spaces:
Running
on
A10G
Running
on
A10G
# AudioCraft training pipelines | |
AudioCraft training pipelines are built on top of PyTorch as our core deep learning library | |
and [Flashy](https://github.com/facebookresearch/flashy) as our training pipeline design library, | |
and [Dora](https://github.com/facebookresearch/dora) as our experiment manager. | |
AudioCraft training pipelines are designed to be research and experiment-friendly. | |
## Environment setup | |
For the base installation, follow the instructions from the [README.md](../README.md). | |
Below are some additional instructions for setting up environment to train new models. | |
### Team and cluster configuration | |
In order to support multiple teams and clusters, AudioCraft uses an environment configuration. | |
The team configuration allows to specify cluster-specific configurations (e.g. SLURM configuration), | |
or convenient mapping of paths between the supported environments. | |
Each team can have a yaml file under the [configuration folder](../config). To select a team set the | |
`AUDIOCRAFT_TEAM` environment variable to a valid team name (e.g. `labs` or `default`): | |
```shell | |
conda env config vars set AUDIOCRAFT_TEAM=default | |
``` | |
Alternatively, you can add it to your `.bashrc`: | |
```shell | |
export AUDIOCRAFT_TEAM=default | |
``` | |
If not defined, the environment will default to the `default` team. | |
The cluster is automatically detected, but it is also possible to override it by setting | |
the `AUDIOCRAFT_CLUSTER` environment variable. | |
Based on this team and cluster, the environment is then configured with: | |
* The dora experiment outputs directory. | |
* The available slurm partitions: categorized by global and team. | |
* A shared reference directory: In order to facilitate sharing research models while remaining | |
agnostic to the used compute cluster, we created the `//reference` symbol that can be used in | |
YAML config to point to a defined reference folder containing shared checkpoints | |
(e.g. baselines, models for evaluation...). | |
**Important:** The default output dir for trained models and checkpoints is under `/tmp/`. This is suitable | |
only for quick testing. If you are doing anything serious you MUST edit the file `default.yaml` and | |
properly set the `dora_dir` entries. | |
#### Overriding environment configurations | |
You can set the following environmet variables to bypass the team's environment configuration: | |
* `AUDIOCRAFT_CONFIG`: absolute path to a team config yaml file. | |
* `AUDIOCRAFT_DORA_DIR`: absolute path to a custom dora directory. | |
* `AUDIOCRAFT_REFERENCE_DIR`: absolute path to the shared reference directory. | |
## Training pipelines | |
Each task supported in AudioCraft has its own training pipeline and dedicated solver. | |
Learn more about solvers and key designs around AudioCraft training pipeline below. | |
Please refer to the documentation of each task and model for specific information on a given task. | |
### Solvers | |
The core training component in AudioCraft is the solver. A solver holds the definition | |
of how to solve a given task: It implements the training pipeline logic, combining the datasets, | |
model, optimization criterion and components and the full training loop. We refer the reader | |
to [Flashy](https://github.com/facebookresearch/flashy) for core principles around solvers. | |
AudioCraft proposes an initial solver, the `StandardSolver` that is used as the base implementation | |
for downstream solvers. This standard solver provides a nice base management of logging, | |
checkpoints loading/saving, xp restoration, etc. on top of the base Flashy implementation. | |
In AudioCraft, we made the assumption that all tasks are following the same set of stages: | |
train, valid, evaluate and generation, each relying on a dedicated dataset. | |
Each solver is responsible for defining the task to solve and the associated stages | |
of the training loop in order to leave the full ownership of the training pipeline | |
to the researchers. This includes loading the datasets, building the model and | |
optimisation components, registering them and defining the execution of each stage. | |
To create a new solver for a given task, one should extend the StandardSolver | |
and define each stage of the training loop. One can further customise its own solver | |
starting from scratch instead of inheriting from the standard solver. | |
```python | |
from . import base | |
from .. import optim | |
class MyNewSolver(base.StandardSolver): | |
def __init__(self, cfg: omegaconf.DictConfig): | |
super().__init__(cfg) | |
# one can add custom attributes to the solver | |
self.criterion = torch.nn.L1Loss() | |
def best_metric(self): | |
# here optionally specify which metric to use to keep track of best state | |
return 'loss' | |
def build_model(self): | |
# here you can instantiate your models and optimization related objects | |
# this method will be called by the StandardSolver init method | |
self.model = ... | |
# the self.cfg attribute contains the raw configuration | |
self.optimizer = optim.build_optimizer(self.model.parameters(), self.cfg.optim) | |
# don't forget to register the states you'd like to include in your checkpoints! | |
self.register_stateful('model', 'optimizer') | |
# keep the model best state based on the best value achieved at validation for the given best_metric | |
self.register_best('model') | |
# if you want to add EMA around the model | |
self.register_ema('model') | |
def build_dataloaders(self): | |
# here you can instantiate your dataloaders | |
# this method will be called by the StandardSolver init method | |
self.dataloaders = ... | |
... | |
# For both train and valid stages, the StandardSolver relies on | |
# a share common_train_valid implementation that is in charge of | |
# accessing the appropriate loader, iterate over the data up to | |
# the specified number of updates_per_epoch, run the ``run_step`` | |
# function that you need to implement to specify the behavior | |
# and finally update the EMA and collect the metrics properly. | |
@abstractmethod | |
def run_step(self, idx: int, batch: tp.Any, metrics: dict): | |
"""Perform one training or valid step on a given batch. | |
""" | |
... # provide your implementation of the solver over a batch | |
def train(self): | |
"""Train stage. | |
""" | |
return self.common_train_valid('train') | |
def valid(self): | |
"""Valid stage. | |
""" | |
return self.common_train_valid('valid') | |
@abstractmethod | |
def evaluate(self): | |
"""Evaluate stage. | |
""" | |
... # provide your implementation here! | |
@abstractmethod | |
def generate(self): | |
"""Generate stage. | |
""" | |
... # provide your implementation here! | |
``` | |
### About Epochs | |
AudioCraft Solvers uses the concept of Epoch. One epoch doesn't necessarily mean one pass over the entire | |
dataset, but instead represent the smallest amount of computation that we want to work with before checkpointing. | |
Typically, we find that having an Epoch time around 30min is ideal both in terms of safety (checkpointing often enough) | |
and getting updates often enough. One Epoch is at least a `train` stage that lasts for `optim.updates_per_epoch` (2000 by default), | |
and a `valid` stage. You can control how long the valid stage takes with `dataset.valid.num_samples`. | |
Other stages (`evaluate`, `generate`) will only happen every X epochs, as given by `evaluate.every` and `generate.every`). | |
### Models | |
In AudioCraft, a model is a container object that wraps one or more torch modules together | |
with potential processing logic to use in a solver. For example, a model would wrap an encoder module, | |
a quantisation bottleneck module, a decoder and some tensor processing logic. Each of the previous components | |
can be considered as a small « model unit » on its own but the container model is a practical component | |
to manipulate and train a set of modules together. | |
### Datasets | |
See the [dedicated documentation on datasets](./DATASETS.md). | |
### Metrics | |
See the [dedicated documentation on metrics](./METRICS.md). | |
### Conditioners | |
AudioCraft language models can be conditioned in various ways and the codebase offers a modular implementation | |
of different conditioners that can be potentially combined together. | |
Learn more in the [dedicated documentation on conditioning](./CONDITIONING.md). | |
### Configuration | |
AudioCraft's configuration is defined in yaml files and the framework relies on | |
[hydra](https://hydra.cc/docs/intro/) and [omegaconf](https://omegaconf.readthedocs.io/) to parse | |
and manipulate the configuration through Dora. | |
##### :warning: Important considerations around configurations | |
Our configuration management relies on Hydra and the concept of group configs to structure | |
and compose configurations. Updating the root default configuration files will then have | |
an impact on all solvers and tasks. | |
**One should never change the default configuration files. Instead they should use Hydra config groups in order to store custom configuration.** | |
Once this configuration is created and used for running experiments, you should not edit it anymore. | |
Note that as we are using Dora as our experiment manager, all our experiment tracking is based on | |
signatures computed from delta between configurations. | |
**One must therefore ensure backward compatibilty of the configuration at all time.** | |
See [Dora's README](https://github.com/facebookresearch/dora) and the | |
[section below introduction Dora](#running-experiments-with-dora). | |
##### Configuration structure | |
The configuration is organized in config groups: | |
* `conditioner`: default values for conditioning modules. | |
* `dset`: contains all data source related information (paths to manifest files | |
and metadata for a given dataset). | |
* `model`: contains configuration for each model defined in AudioCraft and configurations | |
for different variants of models. | |
* `solver`: contains the default configuration for each solver as well as configuration | |
for each solver task, combining all the above components. | |
* `teams`: contains the cluster configuration per teams. See environment setup for more details. | |
The `config.yaml` file is the main configuration that composes the above groups | |
and contains default configuration for AudioCraft. | |
##### Solver's core configuration structure | |
The core configuration structure shared across solver is available in `solvers/default.yaml`. | |
##### Other configuration modules | |
AudioCraft configuration contains the different setups we used for our research and publications. | |
## Running experiments with Dora | |
### Launching jobs | |
Try launching jobs for different tasks locally with dora run: | |
```shell | |
# run compression task with lightweight encodec | |
dora run solver=compression/debug | |
``` | |
Most of the time, the jobs are launched through dora grids, for example: | |
```shell | |
# run compression task through debug grid | |
dora grid compression.debug | |
``` | |
Learn more about running experiments with Dora below. | |
### A small introduction to Dora | |
[Dora](https://github.com/facebookresearch/dora) is the experiment manager tool used in AudioCraft. | |
Check out the README to learn how Dora works. Here is a quick summary of what to know: | |
* An XP is a unique set of hyper-parameters with a given signature. The signature is a hash | |
of those hyper-parameters. We always refer to an XP with its signature, e.g. 9357e12e. We will see | |
after that one can retrieve the hyper-params and re-rerun it in a single command. | |
* In fact, the hash is defined as a delta between the base config and the one obtained | |
with the config overrides you passed from the command line. This means you must never change | |
the `conf/**.yaml` files directly., except for editing things like paths. Changing the default values | |
in the config files means the XP signature won't reflect that change, and wrong checkpoints might be reused. | |
I know, this is annoying, but the reason is that otherwise, any change to the config file would mean | |
that all XPs ran so far would see their signature change. | |
#### Dora commands | |
```shell | |
dora info -f 81de367c # this will show the hyper-parameter used by a specific XP. | |
# Be careful some overrides might present twice, and the right most one | |
# will give you the right value for it. | |
dora run -d -f 81de367c # run an XP with the hyper-parameters from XP 81de367c. | |
# `-d` is for distributed, it will use all available GPUs. | |
dora run -d -f 81de367c dataset.batch_size=32 # start from the config of XP 81de367c but change some hyper-params. | |
# This will give you a new XP with a new signature (e.g. 3fe9c332). | |
dora info -f SIG -t # will tail the log (if the XP has scheduled). | |
# if you need to access the logs of the process for rank > 0, in particular because a crash didn't happen in the main | |
# process, then use `dora info -f SIG` to get the main log name (finished into something like `/5037674_0_0_log.out`) | |
# and worker K can accessed as `/5037674_0_{K}_log.out`. | |
# This is only for scheduled jobs, for local distributed runs with `-d`, then you should go into the XP folder, | |
# and look for `worker_{K}.log` logs. | |
``` | |
An XP runs from a specific folder based on its signature, under the | |
`<cluster_specific_path>/<user>/experiments/audiocraft/outputs/` folder. | |
You can safely interrupt a training and resume it, it will reuse any existing checkpoint, | |
as it will reuse the same folder. If you made some change to the code and need to ignore | |
a previous checkpoint you can use `dora run --clear [RUN ARGS]`. | |
If you have a Slurm cluster, you can also use the dora grid command, e.g. | |
```shell | |
# run a dummy grid located at `audiocraft/grids/my_grid_folder/my_grid_name.py` | |
dora grid my_grid_folder.my_grid_name | |
# Run the following will simply display the grid and also initialized the Dora experiments database. | |
# You can then simply refer to a config using its signature (e.g. as `dora run -f SIG`). | |
dora grid my_grid_folder.my_grid_name --dry_run --init | |
``` | |
Please refer to the [Dora documentation](https://github.com/facebookresearch/dora) for more information. | |
#### Clearing up past experiments | |
```shell | |
# This will cancel all the XPs and delete their folder and checkpoints. | |
# It will then reschedule them starting from scratch. | |
dora grid my_grid_folder.my_grid_name --clear | |
# The following will delete the folder and checkpoint for a single XP, | |
# and then run it afresh. | |
dora run [-f BASE_SIG] [ARGS] --clear | |
``` | |