Hugo Flores Garcia committed
Commit • 75a7169
1 Parent(s): 13b04cf
efficient lora ckpts
Browse files
- README.md +24 -38
- conf/{generated → generated-v0}/berta-goldman-speech/c2f.yml +0 -0
- conf/{generated → generated-v0}/berta-goldman-speech/coarse.yml +0 -0
- conf/{generated → generated-v0}/berta-goldman-speech/interface.yml +0 -0
- conf/{generated → generated-v0}/gamelan-xeno-canto/c2f.yml +0 -0
- conf/{generated → generated-v0}/gamelan-xeno-canto/coarse.yml +0 -0
- conf/{generated → generated-v0}/gamelan-xeno-canto/interface.yml +0 -0
- conf/{generated → generated-v0}/nasralla/c2f.yml +0 -0
- conf/{generated → generated-v0}/nasralla/coarse.yml +0 -0
- conf/{generated → generated-v0}/nasralla/interface.yml +0 -0
- conf/generated/musica-bolero-marimba/c2f.yml +18 -0
- conf/generated/musica-bolero-marimba/coarse.yml +11 -0
- conf/generated/musica-bolero-marimba/interface.yml +8 -0
- conf/generated/xeno-canto/c2f.yml +15 -0
- conf/generated/xeno-canto/coarse.yml +8 -0
- conf/generated/xeno-canto/interface.yml +7 -0
- conf/lora/lora.yml +2 -2
- conf/vampnet.yml +1 -1
- demo.py +7 -3
- scripts/exp/fine_tune.py +16 -19
- scripts/exp/train.py +19 -2
- vampnet/interface.py +55 -9
README.md
CHANGED
@@ -33,41 +33,6 @@ Config files are stored in the `conf/` folder.
 
 Download the pretrained models from [this link](https://drive.google.com/file/d/1ZIBMJMt8QRE8MYYGjg4lH7v7BLbZneq2/view?usp=sharing). Then, extract the models to the `models/` folder.
 
-# How the code is structured
-
-This code was written fast to meet a publication deadline, so it can be messy and redundant at times. Currently working on cleaning it up.
-
-```
-├── conf            <- (conf files for training, finetuning, etc)
-├── demo.py         <- (gradio UI for playing with vampnet)
-├── env             <- (environment variables)
-│   └── env.sh
-├── models          <- (extract pretrained models)
-│   ├── spotdl
-│   │   ├── c2f.pth     <- (coarse2fine checkpoint)
-│   │   ├── coarse.pth  <- (coarse checkpoint)
-│   │   └── codec.pth   <- (codec checkpoint)
-│   └── wavebeat.pth
-├── README.md
-├── scripts
-│   ├── exp
-│   │   ├── eval.py     <- (eval script)
-│   │   └── train.py    <- (training/finetuning script)
-│   └── utils
-├── vampnet
-│   ├── beats.py        <- (beat tracking logic)
-│   ├── __init__.py
-│   ├── interface.py    <- (high-level programmatic interface)
-│   ├── mask.py
-│   ├── modules
-│   │   ├── activations.py
-│   │   ├── __init__.py
-│   │   ├── layers.py
-│   │   └── transformer.py  <- (architecture + sampling code)
-│   ├── scheduler.py
-│   └── util.py
-```
-
 # Usage
 
 First, you'll want to set up your environment
@@ -90,12 +55,33 @@ python scripts/exp/train.py --args.load conf/vampnet.yml --save_path /path/to/ch
 ```
 
 ## Fine-tuning
-To fine-tune a model,
-
+To fine-tune a model, use the script in `scripts/exp/fine_tune.py` to generate 3 configuration files: `c2f.yml`, `coarse.yml`, and `interface.yml`.
+The first two are used to fine-tune the coarse and fine models, respectively. The last one is used to launch the interface.
+
 ```bash
-python scripts/exp/
+python scripts/exp/fine_tune.py "/path/to/audio1.mp3 /path/to/audio2/ /path/to/audio3.wav" <fine_tune_name>
 ```
 
+This will create a folder under `conf/<fine_tune_name>/` with the 3 configuration files.
+
+The save paths will be set to `runs/<fine_tune_name>/coarse` and `runs/<fine_tune_name>/c2f`.
+
+Launch the coarse job:
+```bash
+python scripts/exp/train.py --args.load conf/<fine_tune_name>/coarse.yml
+```
+
+This will save the coarse model to `runs/<fine_tune_name>/coarse/ckpt/best/`.
+
+Launch the c2f job:
+```bash
+python scripts/exp/train.py --args.load conf/<fine_tune_name>/c2f.yml
+```
+
+Launch the interface:
+```bash
+python demo.py --args.load conf/generated/<fine_tune_name>/interface.yml
+```
 
 
 ## Launching the Gradio Interface
conf/{generated → generated-v0}/berta-goldman-speech/c2f.yml
RENAMED
File without changes

conf/{generated → generated-v0}/berta-goldman-speech/coarse.yml
RENAMED
File without changes

conf/{generated → generated-v0}/berta-goldman-speech/interface.yml
RENAMED
File without changes

conf/{generated → generated-v0}/gamelan-xeno-canto/c2f.yml
RENAMED
File without changes

conf/{generated → generated-v0}/gamelan-xeno-canto/coarse.yml
RENAMED
File without changes

conf/{generated → generated-v0}/gamelan-xeno-canto/interface.yml
RENAMED
File without changes

conf/{generated → generated-v0}/nasralla/c2f.yml
RENAMED
File without changes

conf/{generated → generated-v0}/nasralla/coarse.yml
RENAMED
File without changes

conf/{generated → generated-v0}/nasralla/interface.yml
RENAMED
File without changes
conf/generated/musica-bolero-marimba/c2f.yml
ADDED
@@ -0,0 +1,18 @@
+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/spotdl/c2f.pth
+save_path: ./runs/musica-bolero-marimba/c2f
+train/AudioLoader.sources:
+- /media/CHONK/hugo/loras/boleros
+- /media/CHONK/hugo/loras/marimba-honduras
+val/AudioLoader.sources:
+- /media/CHONK/hugo/loras/boleros
+- /media/CHONK/hugo/loras/marimba-honduras
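These generated configs are plain YAML, so they can be inspected before a job is launched. A minimal sketch (not part of the commit), assuming PyYAML is installed and the file above exists at the path listed in this commit:

```python
from pathlib import Path
import yaml

# Read the generated fine-tune config and look at the keys the training script consumes.
# "$include" is resolved when the file is loaded through argbind; here we only read the raw YAML.
conf = yaml.safe_load(Path("conf/generated/musica-bolero-marimba/c2f.yml").read_text())
print(conf["fine_tune_checkpoint"])  # ./models/spotdl/c2f.pth
print(conf["save_path"])             # ./runs/musica-bolero-marimba/c2f
```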
conf/generated/musica-bolero-marimba/coarse.yml
ADDED
@@ -0,0 +1,11 @@
+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/spotdl/coarse.pth
+save_path: ./runs/musica-bolero-marimba/coarse
+train/AudioLoader.sources:
+- /media/CHONK/hugo/loras/boleros
+- /media/CHONK/hugo/loras/marimba-honduras
+val/AudioLoader.sources:
+- /media/CHONK/hugo/loras/boleros
+- /media/CHONK/hugo/loras/marimba-honduras
conf/generated/musica-bolero-marimba/interface.yml
ADDED
@@ -0,0 +1,8 @@
+AudioLoader.sources:
+- /media/CHONK/hugo/loras/boleros
+- /media/CHONK/hugo/loras/marimba-honduras
+Interface.coarse2fine_ckpt: ./models/spotdl/c2f.pth
+Interface.coarse2fine_lora_ckpt: ./runs/musica-bolero-marimba/c2f/latest/lora.pth
+Interface.coarse_ckpt: ./models/spotdl/coarse.pth
+Interface.coarse_lora_ckpt: ./runs/musica-bolero-marimba/coarse/latest/lora.pth
+Interface.codec_ckpt: ./models/spotdl/codec.pth
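The `Interface.*` keys mirror the keyword arguments of `vampnet.interface.Interface` (see the interface.py diff below). A hedged sketch of how such a file maps onto constructor kwargs, assuming the YAML has already been read into a dict named `conf`; this is only an illustration of the naming convention, roughly what argbind's binding does when `demo.py` is launched with `--args.load`:

```python
# Strip the "Interface." prefix to recover constructor keyword arguments.
kwargs = {
    key.split(".", 1)[1]: value
    for key, value in conf.items()
    if key.startswith("Interface.")
}
# kwargs now holds coarse_ckpt, coarse_lora_ckpt, coarse2fine_ckpt,
# coarse2fine_lora_ckpt, and codec_ckpt, ready for Interface(**kwargs).
```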
conf/generated/xeno-canto/c2f.yml
ADDED
@@ -0,0 +1,15 @@
+$include:
+- conf/lora/lora.yml
+AudioDataset.duration: 3.0
+AudioDataset.loudness_cutoff: -40.0
+VampNet.embedding_dim: 1280
+VampNet.n_codebooks: 14
+VampNet.n_conditioning_codebooks: 4
+VampNet.n_heads: 20
+VampNet.n_layers: 16
+fine_tune: true
+fine_tune_checkpoint: ./models/spotdl/c2f.pth
+save_path: ./runs/xeno-canto/c2f
+train/AudioLoader.sources: &id001
+- /media/CHONK/hugo/loras/xeno-canto-2/
+val/AudioLoader.sources: *id001
conf/generated/xeno-canto/coarse.yml
ADDED
@@ -0,0 +1,8 @@
+$include:
+- conf/lora/lora.yml
+fine_tune: true
+fine_tune_checkpoint: ./models/spotdl/coarse.pth
+save_path: ./runs/xeno-canto/coarse
+train/AudioLoader.sources: &id001
+- /media/CHONK/hugo/loras/xeno-canto-2/
+val/AudioLoader.sources: *id001
conf/generated/xeno-canto/interface.yml
ADDED
@@ -0,0 +1,7 @@
+AudioLoader.sources:
+- - /media/CHONK/hugo/loras/xeno-canto-2/
+Interface.coarse2fine_ckpt: ./models/spotdl/c2f.pth
+Interface.coarse2fine_lora_ckpt: ./runs/xeno-canto/c2f/latest/lora.pth
+Interface.coarse_ckpt: ./models/spotdl/coarse.pth
+Interface.coarse_lora_ckpt: ./runs/xeno-canto/coarse/latest/lora.pth
+Interface.codec_ckpt: ./models/spotdl/codec.pth
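The doubled dash under `AudioLoader.sources` is a nested list: `fine_tune.py` wraps the already-list-valued `audio_files_or_folders` in another list (see its diff below). A small illustration, not part of the commit:

```python
import yaml

# Wrapping a list in another list is what produces the "- -" entry above.
sources = ["/media/CHONK/hugo/loras/xeno-canto-2/"]
print(yaml.dump({"AudioLoader.sources": [sources]}))
# AudioLoader.sources:
# - - /media/CHONK/hugo/loras/xeno-canto-2/
```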
conf/lora/lora.yml
CHANGED
@@ -13,10 +13,10 @@ NoamScheduler.warmup: 500
 batch_size: 7
 num_workers: 7
 epoch_length: 100
-save_audio_epochs:
+save_audio_epochs: 10
 
 AdamW.lr: 0.0001
 
 # let's us organize sound classes into folders and choose from those sound classes uniformly
 AudioDataset.without_replacement: False
-max_epochs:
+max_epochs: 500
conf/vampnet.yml
CHANGED
@@ -1,5 +1,5 @@
 
-codec_ckpt:
+codec_ckpt: ./models/spotdl/codec.pth
 save_path: ckpt
 max_epochs: 1000
 epoch_length: 1000
demo.py
CHANGED
@@ -104,7 +104,11 @@ def _vamp(data, return_mask=False):
     # save the mask as a txt file
     np.savetxt(out_dir / "mask.txt", mask[:,0,:].long().cpu().numpy())
 
-
+    if data[topk] is not None:
+        top_k = data[topk] if data[topk] > 0 else None
+    else:
+        top_k = None
+
     zv, mask_z = interface.coarse_vamp(
         z,
         mask=mask,
@@ -354,17 +358,16 @@ with gr.Blocks() as demo:
                 value=0.0
             )
 
-            vamp_button = gr.Button("vamp!!!")
 
         # mask settings
         with gr.Column():
+            vamp_button = gr.Button("vamp!!!")
             output_audio = gr.Audio(
                 label="output audio",
                 interactive=False,
                 type="filepath"
             )
 
-            use_as_input_button = gr.Button("use as input")
 
 
         # with gr.Column():
@@ -397,6 +400,7 @@
             label="vamp to download will appear here",
             interactive=False
         )
+        use_as_input_button = gr.Button("use output as input")
 
         thank_you = gr.Markdown("")
 
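The added lines in `_vamp` normalize the top-k slider value before `coarse_vamp` is called. A standalone sketch of that logic, with `raw_value` standing in for `data[topk]` (hypothetical name, not part of the commit):

```python
def resolve_top_k(raw_value):
    # None or a non-positive slider value both mean "no top-k filtering"
    if raw_value is None:
        return None
    return raw_value if raw_value > 0 else None

assert resolve_top_k(None) is None
assert resolve_top_k(0) is None
assert resolve_top_k(50) == 50
```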
scripts/exp/fine_tune.py
CHANGED
@@ -1,6 +1,7 @@
 import argbind
 from pathlib import Path
 import yaml
+from typing import List
 
 
 
@@ -10,7 +11,7 @@ import yaml
 """
 
 @argbind.bind(without_prefix=True, positional=True)
-def fine_tune(audio_file_or_folder: str, name: str):
+def fine_tune(audio_files_or_folders: List[str], name: str):
 
     conf_dir = Path("conf")
     assert conf_dir.exists(), "conf directory not found. are you in the vampnet directory?"
@@ -24,8 +25,8 @@ def fine_tune(audio_file_or_folder: str, name: str):
     finetune_c2f_conf = {
         "$include": ["conf/lora/lora.yml"],
         "fine_tune": True,
-        "train/AudioLoader.sources":
-        "val/AudioLoader.sources":
+        "train/AudioLoader.sources": audio_files_or_folders,
+        "val/AudioLoader.sources": audio_files_or_folders,
         "VampNet.n_codebooks": 14,
         "VampNet.n_conditioning_codebooks": 4,
         "VampNet.embedding_dim": 1280,
@@ -34,21 +35,27 @@ def fine_tune(audio_file_or_folder: str, name: str):
         "AudioDataset.duration": 3.0,
         "AudioDataset.loudness_cutoff": -40.0,
         "save_path": f"./runs/{name}/c2f",
+        "fine_tune_checkpoint": "./models/spotdl/c2f.pth"
     }
 
     finetune_coarse_conf = {
         "$include": ["conf/lora/lora.yml"],
         "fine_tune": True,
-        "train/AudioLoader.sources":
-        "val/AudioLoader.sources":
+        "train/AudioLoader.sources": audio_files_or_folders,
+        "val/AudioLoader.sources": audio_files_or_folders,
         "save_path": f"./runs/{name}/coarse",
+        "fine_tune_checkpoint": "./models/spotdl/coarse.pth"
     }
 
     interface_conf = {
-        "Interface.coarse_ckpt": f"./
-        "Interface.
+        "Interface.coarse_ckpt": f"./models/spotdl/coarse.pth",
+        "Interface.coarse_lora_ckpt": f"./runs/{name}/coarse/latest/lora.pth",
+
+        "Interface.coarse2fine_ckpt": f"./models/spotdl/c2f.pth",
+        "Interface.coarse2fine_lora_ckpt": f"./runs/{name}/c2f/latest/lora.pth",
+
         "Interface.codec_ckpt": "./models/spotdl/codec.pth",
-        "AudioLoader.sources": [
+        "AudioLoader.sources": [audio_files_or_folders],
     }
 
     # save the confs
@@ -61,18 +68,8 @@ def fine_tune(audio_file_or_folder: str, name: str):
     with open(finetune_dir / "interface.yml", "w") as f:
         yaml.dump(interface_conf, f)
 
-    # copy the starter weights to the save paths
-    import shutil
-
-    def pmkdir(path):
-        Path(path).parent.mkdir(exist_ok=True, parents=True)
-        return path
-
-    shutil.copy("./models/spotdl/c2f.pth", pmkdir(f"./runs/{name}/c2f/starter/vampnet/weights.pth"))
-    shutil.copy("./models/spotdl/coarse.pth", pmkdir(f"./runs/{name}/coarse/starter/vampnet/weights.pth"))
-
 
-    print(f"generated confs in {finetune_dir}. run training jobs with `python scripts/exp/train.py --args.load {finetune_dir}/<c2f/coarse>.yml
+    print(f"generated confs in {finetune_dir}. run training jobs with `python scripts/exp/train.py --args.load {finetune_dir}/<c2f/coarse>.yml` ")
 
 if __name__ == "__main__":
     args = argbind.parse_args()
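A small sanity check, not part of the commit, that the checkpoint paths written into a generated config actually exist before a training job is launched; the example path is one of the configs added in this commit:

```python
from pathlib import Path
import yaml

def check_checkpoints(conf_path: str):
    # Flag any *.pth path referenced by the generated config that is missing on disk.
    conf = yaml.safe_load(Path(conf_path).read_text())
    for key, value in conf.items():
        if isinstance(value, str) and value.endswith(".pth") and not Path(value).exists():
            print(f"missing checkpoint for {key}: {value}")

check_checkpoints("conf/generated/xeno-canto/coarse.yml")
```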
scripts/exp/train.py
CHANGED
@@ -107,7 +107,11 @@ def load(
     resume: bool = False,
     tag: str = "latest",
     load_weights: bool = False,
+    fine_tune_checkpoint: Optional[str] = None,
 ):
+    codec = LAC.load(args["codec_ckpt"], map_location="cpu")
+    codec.eval()
+
     model, v_extra = None, {}
 
     if resume:
@@ -123,8 +127,12 @@
                 f"Could not find a VampNet checkpoint in {kwargs['folder']}"
             )
 
-
-
+
+    if args["fine_tune"]:
+        assert fine_tune_checkpoint is not None, "Must provide a fine-tune checkpoint"
+        model = VampNet.load(location=Path(fine_tune_checkpoint), map_location="cpu")
+
+
     model = VampNet() if model is None else model
 
     model = accel.prepare_model(model)
@@ -460,6 +468,15 @@
                 self.print(f"Best model so far")
                 tags.append("best")
 
+            if fine_tune:
+                for tag in tags:
+                    # save the lora model
+                    (Path(save_path) / tag).mkdir(parents=True, exist_ok=True)
+                    torch.save(
+                        lora.lora_state_dict(accel.unwrap(model)),
+                        f"{save_path}/{tag}/lora.pth"
+                    )
+
             for tag in tags:
                 model_extra = {
                     "optimizer.pth": optimizer.state_dict(),
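The fine-tune branch saves only the adapter weights via `lora.lora_state_dict`, so each tag directory gets a small `lora.pth` instead of a second copy of the full model. A minimal sketch of that idea, assuming the `lora` module used in train.py is the `loralib` package:

```python
import torch
import loralib as lora  # assumption: `lora` in train.py refers to loralib

def save_lora_only(model: torch.nn.Module, path: str) -> None:
    # Persist just the low-rank adapter parameters; the frozen base weights
    # stay in the original ./models/spotdl/*.pth checkpoints.
    torch.save(lora.lora_state_dict(model), path)
```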
vampnet/interface.py
CHANGED
@@ -21,12 +21,40 @@ def signal_concat(
 
     return AudioSignal(audio_data, sample_rate=audio_signals[0].sample_rate)
 
+def _load_model(
+    ckpt: str,
+    lora_ckpt: str = None,
+    device: str = "cpu",
+    chunk_size_s: int = 10,
+):
+    # we need to set strict to False if the model has lora weights to add later
+    model = VampNet.load(location=Path(ckpt), map_location="cpu", strict=False)
+
+    # load lora weights if needed
+    if lora_ckpt is not None:
+        if not Path(lora_ckpt).exists():
+            should_cont = input(
+                f"lora checkpoint {lora_ckpt} does not exist. continue? (y/n) "
+            )
+            if should_cont != "y":
+                raise Exception("aborting")
+        else:
+            model.load_state_dict(torch.load(lora_ckpt, map_location="cpu"), strict=False)
+
+    model.to(device)
+    model.eval()
+    model.chunk_size_s = chunk_size_s
+    return model
+
+
 
 class Interface(torch.nn.Module):
     def __init__(
         self,
         coarse_ckpt: str = None,
+        coarse_lora_ckpt: str = None,
         coarse2fine_ckpt: str = None,
+        coarse2fine_lora_ckpt: str = None,
         codec_ckpt: str = None,
         wavebeat_ckpt: str = None,
         device: str = "cpu",
@@ -40,18 +68,21 @@ class Interface(torch.nn.Module):
         self.codec.to(device)
 
         assert coarse_ckpt is not None, "must provide a coarse checkpoint"
-        self.coarse =
-
-
-
+        self.coarse = _load_model(
+            ckpt=coarse_ckpt,
+            lora_ckpt=coarse_lora_ckpt,
+            device=device,
+            chunk_size_s=coarse_chunk_size_s,
+        )
 
+        # check if we have a coarse2fine ckpt
         if coarse2fine_ckpt is not None:
-            self.c2f =
-
+            self.c2f = _load_model(
+                ckpt=coarse2fine_ckpt,
+                lora_ckpt=coarse2fine_lora_ckpt,
+                device=device,
+                chunk_size_s=coarse2fine_chunk_size_s,
             )
-            self.c2f.to(device)
-            self.c2f.eval()
-            self.c2f.chunk_size_s = self.s2t2s(coarse2fine_chunk_size_s)
         else:
             self.c2f = None
 
@@ -64,6 +95,21 @@ class Interface(torch.nn.Module):
 
         self.device = device
 
+    def lora_load(
+        self,
+        coarse_lora_ckpt: str = None,
+        coarse2fine_lora_ckpt: str = None,
+    ):
+        if coarse_lora_ckpt is not None:
+            self.coarse.to("cpu")
+            self.coarse.load_state_dict(torch.load(coarse_lora_ckpt, map_location="cpu"))
+            self.coarse.to(self.device)
+        if coarse2fine_lora_ckpt is not None:
+            self.c2f.to("cpu")
+            self.c2f.load_state_dict(torch.load(coarse2fine_lora_ckpt, map_location="cpu"))
+            self.c2f.to(self.device)
+
+
     def s2t(self, seconds: float):
         """seconds to tokens"""
         if isinstance(seconds, np.ndarray):
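Putting the interface changes together, a hedged usage sketch (not part of the commit): the keyword names are taken from the diff above, and the checkpoint paths are the ones added under `conf/generated/` in this commit.

```python
from vampnet.interface import Interface

# Build an Interface on top of the base checkpoints plus a fine-tuned LoRA.
interface = Interface(
    coarse_ckpt="./models/spotdl/coarse.pth",
    coarse_lora_ckpt="./runs/xeno-canto/coarse/latest/lora.pth",
    coarse2fine_ckpt="./models/spotdl/c2f.pth",
    coarse2fine_lora_ckpt="./runs/xeno-canto/c2f/latest/lora.pth",
    codec_ckpt="./models/spotdl/codec.pth",
    device="cuda",
)

# Swap in a different fine-tune without rebuilding the Interface.
interface.lora_load(
    coarse_lora_ckpt="./runs/musica-bolero-marimba/coarse/latest/lora.pth",
    coarse2fine_lora_ckpt="./runs/musica-bolero-marimba/c2f/latest/lora.pth",
)
```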
|