# Datasets Format

Amphion support the following academic datasets (sort alphabetically):

- [Datasets Format](#datasets-format)
  - [AudioCaps](#audiocaps)
  - [CSD](#csd)
  - [CustomSVCDataset](#customsvcdataset)
  - [Hi-Fi TTS](#hifitts)
  - [KiSing](#kising)
  - [LibriLight](#librilight)
  - [LibriTTS](#libritts)
  - [LJSpeech](#ljspeech)
  - [M4Singer](#m4singer)
  - [NUS-48E](#nus-48e)
  - [Opencpop](#opencpop)
  - [OpenSinger](#opensinger)
  - [Opera](#opera)
  - [PopBuTFy](#popbutfy)
  - [PopCS](#popcs)
  - [PJS](#pjs)
  - [SVCC](#svcc)
  - [VCTK](#vctk)

The downloading link and the file structure tree of each dataset is displayed as follows.

> **Note:** When using Docker to run Amphion, mount the dataset to the container is necessary after downloading. Check [Mount dataset in Docker container](./docker.md) for more details.

## AudioCaps

AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip corresponds to a caption with rich semantic information.

Download AudioCaps dataset [here](https://github.com/cdjkim/audiocaps). The file structure looks like below:

```plaintext
[AudioCaps dataset path]
┣ AudioCpas
┃ ┣ wav
┃ ┃ ┣ ---1_cCGK4M_0_10000.wav
┃ ┃ ┣ ---lTs1dxhU_30000_40000.wav
┃ ┃ ┣ ...
```

## CSD

Download the official CSD dataset [here](https://zenodo.org/records/4785016). The file structure looks like below:

```plaintext
[CSD dataset path]
 ┣ english
 ┣ korean
 ┣ utterances
 ┃ ┣ en001a
 ┃ ┃ ┣ {UtterenceID}.wav
 ┃ ┣ en001b
 ┃ ┣ en002a
 ┃ ┣ en002b
 ┃ ┣ ...
 ┣ README
```

## CustomSVCDataset

We support custom dataset for Singing Voice Conversion. Organize your data in the following structure to construct your own dataset:

```plaintext
[Your Custom Dataset Path]
 ┣ singer1
 ┃ ┣ song1
 ┃ ┃ ┣ utterance1.wav
 ┃ ┃ ┣ utterance2.wav
 ┃ ┃ ┣ ...
 ┃ ┣ song2
 ┃ ┣ ...
 ┣ singer2
 ┣ ...
```


## Hi-Fi TTS

Download the official Hi-Fi TTS dataset [here](https://www.openslr.org/109/). The file structure looks like below:

```plaintext
[Hi-Fi TTS dataset path]
 ┣ audio
 ┃ ┣ 11614_other {Speaker_ID}_{SNR_subset}
 ┃ ┃ ┣ 10547 {Book_ID}
 ┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0001.flac
 ┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0003.flac
 ┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0004.flac
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ 92_manifest_clean_dev.json
 ┣ 92_manifest_clean_test.json
 ┣ 92_manifest_clean_train.json
 ┣ ...
 ┣ {Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json
 ┣ ...
 ┣ books_bandwidth.tsv
 ┣ LICENSE.txt
 ┣ readers_books_clean.txt
 ┣ readers_books_other.txt
 ┣ README.txt

```

## KiSing

Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure looks like below:

```plaintext
[KiSing dataset path]
 ┣ clean
 ┃ ┣ 421
 ┃ ┣ 422
 ┃ ┣ ...
```

## LibriLight

Download the official LibriLight dataset [here](https://github.com/facebookresearch/libri-light). The file structure looks like below:

```plaintext
[LibriTTS dataset path]
 ┣ small (Subset)
 ┃ ┣ 100 {Speaker_ID}
 ┃ ┃ ┣ sea_fairies_0812_librivox_64kb_mp3 {Chapter_ID}
 ┃ ┃ ┃ ┣ 01_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 02_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 03_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 22_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 01_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ 02_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ 03_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ 22_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ medium (Subset)
 ┣ ...
```

## LibriTTS

Download the official LibriTTS dataset [here](https://www.openslr.org/60/). The file structure looks like below:

```plaintext
[LibriTTS dataset path]
 ┣ BOOKS.txt
 ┣ CHAPTERS.txt
 ┣ eval_sentences10.tsv
 ┣ LICENSE.txt
 ┣ NOTE.txt
 ┣ reader_book.tsv
 ┣ README_librispeech.txt
 ┣ README_libritts.txt 
 ┣ speakers.tsv
 ┣ SPEAKERS.txt
 ┣ dev-clean (Subset)
 ┃ ┣ 1272{Speaker_ID}
 ┃ ┃ ┣ 128104 {Chapter_ID}
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.normalized.txt
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.original.txt
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ 1272_128104.book.tsv
 ┃ ┃ ┃ ┣ 1272_128104.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ dev-other (Subset)
 ┃ ┣ 116 (Speaker)
 ┃ ┃ ┣ 288045 {Chapter_ID}
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.normalized.txt
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.original.txt
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ 116_288045.book.tsv
 ┃ ┃ ┃ ┣ 116_288045.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┃ ┣ ...
 ┣ test-clean  (Subset)
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ test-other
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-clean-100
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-clean-360
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-other-500
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
```

## LJSpeech

Download the official LJSpeech dataset [here](https://keithito.com/LJ-Speech-Dataset/). The file structure looks like below:

```plaintext
[LJSpeech dataset path]
 ┣ metadata.csv
 ┣ wavs
 ┃ ┣ LJ001-0001.wav
 ┃ ┣ LJ001-0002.wav 
 ┃ ┣ ...
 ┣ README
```

## M4Singer

Download the official M4Singer dataset [here](https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view). The file structure looks like below:

```plaintext
[M4Singer dataset path]
 ┣ {Singer_1}#{Song_1}
 ┃ ┣ 0000.mid
 ┃ ┣ 0000.TextGrid
 ┃ ┣ 0000.wav
 ┃ ┣ ...
 ┣ {Singer_1}#{Song_2}
 ┣ ...
 ┣ {Singer_2}#{Song_1}
 ┣ {Singer_2}#{Song_2}
 ┣ ...
 ┗ meta.json
```

## NUS-48E

Download the official NUS-48E dataset [here](https://drive.google.com/drive/folders/12pP9uUl0HTVANU3IPLnumTJiRjPtVUMx). The file structure looks like below:

```plaintext
[NUS-48E dataset path]
 ┣ {SpeakerID}
 ┃ ┣ read
 ┃ ┃ ┣ {SongID}.txt
 ┃ ┃ ┣ {SongID}.wav
 ┃ ┃ ┣ ...
 ┃ ┣ sing
 ┃ ┃ ┣ {SongID}.txt
 ┃ ┃ ┣ {SongID}.wav
 ┃ ┃ ┣ ...
 ┣ ...
 ┣ README.txt

```

## Opencpop

Download the official Opencpop dataset [here](https://wenet.org.cn/opencpop/). The file structure looks like below:

```plaintext
[Opencpop dataset path]
 ┣ midis
 ┃ ┣ 2001.midi
 ┃ ┣ 2002.midi
 ┃ ┣ 2003.midi
 ┃ ┣ ...
 ┣ segments
 ┃ ┣ wavs
 ┃ ┃ ┣ 2001000001.wav
 ┃ ┃ ┣ 2001000002.wav
 ┃ ┃ ┣ 2001000003.wav
 ┃ ┃ ┣ ...
 ┃ ┣ test.txt
 ┃ ┣ train.txt
 ┃ ┗ transcriptions.txt
 ┣ textgrids
 ┃ ┣ 2001.TextGrid
 ┃ ┣ 2002.TextGrid
 ┃ ┣ 2003.TextGrid
 ┃ ┣ ...
 ┣ wavs
 ┃ ┣ 2001.wav
 ┃ ┣ 2002.wav
 ┃ ┣ 2003.wav
 ┃ ┣ ...
 ┣ TERMS_OF_ACCESS
 ┗ readme.md
```

## OpenSinger

Download the official OpenSinger dataset [here](https://drive.google.com/file/d/1EofoZxvalgMjZqzUEuEdleHIZ6SHtNuK/view). The file structure looks like below:

```plaintext
[OpenSinger dataset path]
 ┣ ManRaw
 ┃ ┣ {Singer_1}_{Song_1}
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.lab
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.txt
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.wav
 ┃ ┃ ┣ ...
 ┃ ┣ {Singer_1}_{Song_2}
 ┃ ┣ ...
 ┣ WomanRaw
 ┣ LICENSE
 ┗ README.md
```

## Opera

Download the official Opera dataset [here](http://isophonics.net/SingingVoiceDataset). The file structure looks like below:

```plaintext
[Opera dataset path]
 ┣ monophonic
 ┃ ┣ chinese
 ┃ ┃ ┣ {Gender}_{SingerID}
 ┃ ┃ ┃ ┣ {Emotion}_{SongID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┣ ...
 ┃ ┣ western
 ┣ polyphonic
 ┃ ┣ chinese
 ┃ ┣ western
 ┣ CrossculturalDataSet.xlsx
```

## PopBuTFy

Download the official PopBuTFy dataset [here](https://github.com/MoonInTheRiver/NeuralSVB). The file structure looks like below:

```plaintext
[PopBuTFy dataset path]
 ┣ data
 ┃ ┣ {SingerID}#singing#{SongName}_Amateur
 ┃ ┃ ┣ {SingerID}#singing#{SongName}_Amateur_{UtteranceID}.mp3
 ┃ ┃ ┣ ...
 ┃ ┣ {SingerID}#singing#{SongName}_Professional
 ┃ ┃ ┣ {SingerID}#singing#{SongName}_Professional_{UtteranceID}.mp3
 ┃ ┃ ┣ ...
 ┣ text_labels
 ┗ TERMS_OF_ACCESS
```

## PopCS

Download the official PopCS dataset [here](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md). The file structure looks like below:

```plaintext
[PopCS dataset path]
 ┣ popcs
 ┃ ┣ popcs-{SongName}
 ┃ ┃ ┣ {UtteranceID}_ph.txt
 ┃ ┃ ┣ {UtteranceID}_wf0.wav
 ┃ ┃ ┣ {UtteranceID}.TextGrid
 ┃ ┃ ┣ {UtteranceID}.txt
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┗ TERMS_OF_ACCESS
```

## PJS

Download the official PJS dataset [here](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus). The file structure looks like below:

```plaintext
[PJS dataset path]
 ┣ PJS_corpus_ver1.1
 ┃ ┣ background_noise
 ┃ ┣ pjs{SongID}
 ┃ ┃ ┣ pjs{SongID}_song.wav
 ┃ ┃ ┣ pjs{SongID}_speech.wav
 ┃ ┃ ┣ pjs{SongID}.lab
 ┃ ┃ ┣ pjs{SongID}.mid
 ┃ ┃ ┣ pjs{SongID}.musicxml
 ┃ ┃ ┣ pjs{SongID}.txt
 ┃ ┣ ...
```

## SVCC

Download the official SVCC dataset [here](https://github.com/lesterphillip/SVCC23_FastSVC/tree/main/egs/generate_dataset). The file structure looks like below:

```plaintext
[SVCC dataset path]
 ┣ Data
 ┃ ┣ CDF1
 ┃ ┃ ┣ 10001.wav
 ┃ ┃ ┣ 10002.wav
 ┃ ┃ ┣ ...
 ┃ ┣ CDM1
 ┃ ┣ IDF1
 ┃ ┣ IDM1
 ┗ README.md
```

## VCTK

Download the official VCTK dataset [here](https://datashare.ed.ac.uk/handle/10283/3443). The file structure looks like below:

```plaintext
[VCTK dataset path]
 ┣ txt
 ┃ ┣ {Speaker_1}
 ┃ ┃ ┣ {Speaker_1}_001.txt
 ┃ ┃ ┣ {Speaker_1}_002.txt
 ┃ ┃ ┣ ...
 ┃ ┣ {Speaker_2}
 ┃ ┣ ...
 ┣ wav48_silence_trimmed
 ┃ ┣ {Speaker_1}
 ┃ ┃ ┣ {Speaker_1}_001_mic1.flac
 ┃ ┃ ┣ {Speaker_1}_001_mic2.flac
 ┃ ┃ ┣ {Speaker_1}_002_mic1.flac
 ┃ ┃ ┣ ...
 ┃ ┣ {Speaker_2}
 ┃ ┣ ...
 ┣ speaker-info.txt
 ┗ update.txt
```