# Datasets Format
Amphion supports the following academic datasets (sorted alphabetically):
- [Datasets Format](#datasets-format)
  - [AudioCaps](#audiocaps)
  - [CSD](#csd)
  - [CustomSVCDataset](#customsvcdataset)
  - [Hi-Fi TTS](#hi-fi-tts)
  - [KiSing](#kising)
  - [LibriLight](#librilight)
  - [LibriTTS](#libritts)
  - [LJSpeech](#ljspeech)
  - [M4Singer](#m4singer)
  - [NUS-48E](#nus-48e)
  - [Opencpop](#opencpop)
  - [OpenSinger](#opensinger)
  - [Opera](#opera)
  - [PopBuTFy](#popbutfy)
  - [PopCS](#popcs)
  - [PJS](#pjs)
  - [SVCC](#svcc)
  - [VCTK](#vctk)
The download link and the file structure tree for each dataset are given below.
> **Note:** When running Amphion in Docker, you must mount the dataset into the container after downloading it. See [Mount dataset in Docker container](./docker.md) for more details.
## AudioCaps
AudioCaps is a dataset of around 44K audio-caption pairs, where each audio clip is paired with a caption carrying rich semantic information.
Download the AudioCaps dataset [here](https://github.com/cdjkim/audiocaps). The file structure looks like this:
```plaintext
[AudioCaps dataset path]
 ┣ AudioCpas
 ┃ ┣ wav
 ┃ ┃ ┣ ---1_cCGK4M_0_10000.wav
 ┃ ┃ ┣ ---lTs1dxhU_30000_40000.wav
 ┃ ┃ ┣ ...
```
## CSD
Download the official CSD dataset [here](https://zenodo.org/records/4785016). The file structure looks like this:
```plaintext
[CSD dataset path]
 ┣ english
 ┣ korean
 ┣ utterances
 ┃ ┣ en001a
 ┃ ┃ ┣ {UtteranceID}.wav
 ┃ ┣ en001b
 ┃ ┣ en002a
 ┃ ┣ en002b
 ┃ ┣ ...
 ┣ README
```
## CustomSVCDataset
We support custom datasets for Singing Voice Conversion. To build your own dataset, organize your data in the following structure:
```plaintext
[Your Custom Dataset Path]
 ┣ singer1
 ┃ ┣ song1
 ┃ ┃ ┣ utterance1.wav
 ┃ ┃ ┣ utterance2.wav
 ┃ ┃ ┣ ...
 ┃ ┣ song2
 ┃ ┣ ...
 ┣ singer2
 ┣ ...
```
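Before preprocessing, it can help to check that a custom dataset really follows the singer/song/utterance layout shown above. The following is a minimal sketch, not part of Amphion: the function name `collect_utterances` is hypothetical, and it assumes every utterance is a `.wav` file exactly two directory levels below the dataset root.

```python
from pathlib import Path


def collect_utterances(dataset_root):
    """Return (singer, song, wav_path) triples found under dataset_root.

    Assumes the layout root/singer/song/utterance.wav described above.
    """
    root = Path(dataset_root)
    items = []
    # "*/*/*.wav" matches only files that sit exactly two levels deep,
    # so stray files placed next to the song folders are ignored.
    for wav_path in sorted(root.glob("*/*/*.wav")):
        song_dir = wav_path.parent
        singer_dir = song_dir.parent
        items.append((singer_dir.name, song_dir.name, wav_path))
    return items
```

An empty result, or a singer missing from the output, usually means a folder is nested one level too deep or too shallow.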
## Hi-Fi TTS
Download the official Hi-Fi TTS dataset [here](https://www.openslr.org/109/). The file structure looks like this:
```plaintext
[Hi-Fi TTS dataset path]
 ┣ audio
 ┃ ┣ 11614_other {Speaker_ID}_{SNR_subset}
 ┃ ┃ ┣ 10547 {Book_ID}
 ┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0001.flac
 ┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0003.flac
 ┃ ┃ ┃ ┣ thousandnights8_04_anonymous_0004.flac
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ 92_manifest_clean_dev.json
 ┣ 92_manifest_clean_test.json
 ┣ 92_manifest_clean_train.json
 ┣ ...
 ┣ {Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json
 ┣ ...
 ┣ books_bandwidth.tsv
 ┣ LICENSE.txt
 ┣ readers_books_clean.txt
 ┣ readers_books_other.txt
 ┣ README.txt
```
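The manifest naming pattern `{Speaker_ID}_manifest_{SNR_subset}_{dataset_split}.json` can be decoded with a small regular expression. This is an illustrative sketch only: `parse_manifest_name` is a hypothetical helper, and it assumes speaker IDs are numeric and that the subset and split names are lowercase words, as in the example file names above.

```python
import re

# Pattern inferred from the file names in the tree above (an assumption,
# not an official specification of the dataset).
MANIFEST_PATTERN = re.compile(
    r"^(?P<speaker_id>\d+)_manifest_(?P<snr_subset>[a-z]+)_(?P<split>[a-z]+)\.json$"
)


def parse_manifest_name(filename):
    """Split a manifest file name into its parts, or return None if it does not match."""
    match = MANIFEST_PATTERN.match(filename)
    if match is None:
        return None
    return match.groupdict()
```

For instance, `parse_manifest_name("92_manifest_clean_dev.json")` yields the speaker ID `"92"`, the subset `"clean"`, and the split `"dev"`, while non-manifest files such as `README.txt` yield `None`.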
## KiSing
Download the official KiSing dataset [here](http://shijt.site/index.php/2021/05/16/kising-the-first-open-source-mandarin-singing-voice-synthesis-corpus/). The file structure looks like this:
```plaintext
[KiSing dataset path]
 ┣ clean
 ┃ ┣ 421
 ┃ ┣ 422
 ┃ ┣ ...
```
## LibriLight
Download the official LibriLight dataset [here](https://github.com/facebookresearch/libri-light). The file structure looks like this:
```plaintext
[LibriLight dataset path]
 ┣ small (Subset)
 ┃ ┣ 100 {Speaker_ID}
 ┃ ┃ ┣ sea_fairies_0812_librivox_64kb_mp3 {Chapter_ID}
 ┃ ┃ ┃ ┣ 01_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 02_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 03_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 22_baum_sea_fairies_64kb.flac
 ┃ ┃ ┃ ┣ 01_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ 02_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ 03_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ 22_baum_sea_fairies_64kb.json
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ medium (Subset)
 ┣ ...
```
## LibriTTS
Download the official LibriTTS dataset [here](https://www.openslr.org/60/). The file structure looks like this:
```plaintext
[LibriTTS dataset path]
 ┣ BOOKS.txt
 ┣ CHAPTERS.txt
 ┣ eval_sentences10.tsv
 ┣ LICENSE.txt
 ┣ NOTE.txt
 ┣ reader_book.tsv
 ┣ README_librispeech.txt
 ┣ README_libritts.txt
 ┣ speakers.tsv
 ┣ SPEAKERS.txt
 ┣ dev-clean (Subset)
 ┃ ┣ 1272 {Speaker_ID}
 ┃ ┃ ┣ 128104 {Chapter_ID}
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.normalized.txt
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.original.txt
 ┃ ┃ ┃ ┣ 1272_128104_000001_000000.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ 1272_128104.book.tsv
 ┃ ┃ ┃ ┣ 1272_128104.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ dev-other (Subset)
 ┃ ┣ 116 {Speaker_ID}
 ┃ ┃ ┣ 288045 {Chapter_ID}
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.normalized.txt
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.original.txt
 ┃ ┃ ┃ ┣ 116_288045_000003_000000.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ 116_288045.book.tsv
 ┃ ┃ ┃ ┣ 116_288045.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┃ ┣ ...
 ┣ test-clean (Subset)
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ test-other
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-clean-100
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-clean-360
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┣ train-other-500
 ┃ ┣ {Speaker_ID}
 ┃ ┃ ┣ {Chapter_ID}
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.normalized.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.original.txt
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.book.tsv
 ┃ ┃ ┃ ┣ {Speaker_ID}_{Chapter_ID}.trans.tsv
 ┃ ┃ ┣ ...
 ┃ ┣ ...
```
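Because each LibriTTS file name encodes the speaker, chapter, and utterance IDs, the IDs can be recovered from the path alone. The sketch below is hypothetical (`parse_libritts_wav` is not an Amphion function) and assumes the `{Speaker_ID}_{Chapter_ID}_{Utterance_ID}.wav` pattern from the tree above, where the utterance ID itself may contain underscores.

```python
from pathlib import Path


def parse_libritts_wav(wav_path):
    """Extract the IDs encoded in a LibriTTS wav file name."""
    path = Path(wav_path)
    # Split on the first two underscores only, so the remaining part
    # (for example "000001_000000") stays together as the utterance ID.
    speaker_id, chapter_id, utterance_id = path.stem.split("_", 2)
    return {
        "speaker_id": speaker_id,
        "chapter_id": chapter_id,
        "utterance_id": utterance_id,
        # The normalized transcript sits next to the audio file.
        "normalized_text": path.with_name(path.stem + ".normalized.txt"),
    }
```

Applied to `dev-clean/1272/128104/1272_128104_000001_000000.wav`, this gives speaker `1272`, chapter `128104`, and utterance `000001_000000`.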
## LJSpeech
Download the official LJSpeech dataset [here](https://keithito.com/LJ-Speech-Dataset/). The file structure looks like this:
```plaintext
[LJSpeech dataset path]
 ┣ metadata.csv
 ┣ wavs
 ┃ ┣ LJ001-0001.wav
 ┃ ┣ LJ001-0002.wav
 ┃ ┣ ...
 ┣ README
```
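To link `metadata.csv` to the files under `wavs`, a loader along the following lines may be useful. It is a sketch under stated assumptions: the function name `load_ljspeech_metadata` is hypothetical, and the file is assumed to be pipe-delimited with the utterance ID in the first field and the (normalized) transcription in the last field, which is how the dataset's own README describes it.

```python
import csv
from pathlib import Path


def load_ljspeech_metadata(dataset_root):
    """Return a list of {"id", "text", "wav"} records read from metadata.csv."""
    root = Path(dataset_root)
    records = []
    with open(root / "metadata.csv", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcriptions untouched.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            utt_id, text = row[0], row[-1]
            records.append(
                {"id": utt_id, "text": text, "wav": root / "wavs" / f"{utt_id}.wav"}
            )
    return records
```

Each record's `wav` path is built from the ID rather than checked on disk, so a follow-up `exists()` check is a reasonable sanity test after downloading.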
## M4Singer
Download the official M4Singer dataset [here](https://drive.google.com/file/d/1xC37E59EWRRFFLdG3aJkVqwtLDgtFNqW/view). The file structure looks like this:
```plaintext
[M4Singer dataset path]
 ┣ {Singer_1}#{Song_1}
 ┃ ┣ 0000.mid
 ┃ ┣ 0000.TextGrid
 ┃ ┣ 0000.wav
 ┃ ┣ ...
 ┣ {Singer_1}#{Song_2}
 ┣ ...
 ┣ {Singer_2}#{Song_1}
 ┣ {Singer_2}#{Song_2}
 ┣ ...
 ┗ meta.json
```
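Since M4Singer folds the singer and song into one folder name, a quick way to see which songs each singer has is to split on the `#` separator. This is a rough sketch with a hypothetical helper name (`group_songs_by_singer`); it assumes every song folder follows the `{Singer}#{Song}` pattern shown above and ignores everything else, such as `meta.json`.

```python
from collections import defaultdict
from pathlib import Path


def group_songs_by_singer(dataset_root):
    """Map each singer name to the sorted list of that singer's song names."""
    songs = defaultdict(list)
    for entry in sorted(Path(dataset_root).iterdir()):
        # Only directories named "{Singer}#{Song}" are song folders.
        if not entry.is_dir() or "#" not in entry.name:
            continue
        singer, song = entry.name.split("#", 1)
        songs[singer].append(song)
    return dict(songs)
```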
## NUS-48E
Download the official NUS-48E dataset [here](https://drive.google.com/drive/folders/12pP9uUl0HTVANU3IPLnumTJiRjPtVUMx). The file structure looks like this:
```plaintext
[NUS-48E dataset path]
 ┣ {SpeakerID}
 ┃ ┣ read
 ┃ ┃ ┣ {SongID}.txt
 ┃ ┃ ┣ {SongID}.wav
 ┃ ┃ ┣ ...
 ┃ ┣ sing
 ┃ ┃ ┣ {SongID}.txt
 ┃ ┃ ┣ {SongID}.wav
 ┃ ┃ ┣ ...
 ┣ ...
 ┣ README.txt
```
## Opencpop
Download the official Opencpop dataset [here](https://wenet.org.cn/opencpop/). The file structure looks like this:
```plaintext
[Opencpop dataset path]
 ┣ midis
 ┃ ┣ 2001.midi
 ┃ ┣ 2002.midi
 ┃ ┣ 2003.midi
 ┃ ┣ ...
 ┣ segments
 ┃ ┣ wavs
 ┃ ┃ ┣ 2001000001.wav
 ┃ ┃ ┣ 2001000002.wav
 ┃ ┃ ┣ 2001000003.wav
 ┃ ┃ ┣ ...
 ┃ ┣ test.txt
 ┃ ┣ train.txt
 ┃ ┗ transcriptions.txt
 ┣ textgrids
 ┃ ┣ 2001.TextGrid
 ┃ ┣ 2002.TextGrid
 ┃ ┣ 2003.TextGrid
 ┃ ┣ ...
 ┣ wavs
 ┃ ┣ 2001.wav
 ┃ ┣ 2002.wav
 ┃ ┣ 2003.wav
 ┃ ┣ ...
 ┣ TERMS_OF_ACCESS
 ┗ readme.md
```
## OpenSinger
Download the official OpenSinger dataset [here](https://drive.google.com/file/d/1EofoZxvalgMjZqzUEuEdleHIZ6SHtNuK/view). The file structure looks like this:
```plaintext
[OpenSinger dataset path]
 ┣ ManRaw
 ┃ ┣ {Singer_1}_{Song_1}
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.lab
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.txt
 ┃ ┃ ┣ {Singer_1}_{Song_1}_0.wav
 ┃ ┃ ┣ ...
 ┃ ┣ {Singer_1}_{Song_2}
 ┃ ┣ ...
 ┣ WomanRaw
 ┣ LICENSE
 ┗ README.md
```
## Opera
Download the official Opera dataset [here](http://isophonics.net/SingingVoiceDataset). The file structure looks like this:
```plaintext
[Opera dataset path]
 ┣ monophonic
 ┃ ┣ chinese
 ┃ ┃ ┣ {Gender}_{SingerID}
 ┃ ┃ ┃ ┣ {Emotion}_{SongID}.wav
 ┃ ┃ ┃ ┣ ...
 ┃ ┃ ┣ ...
 ┃ ┣ western
 ┣ polyphonic
 ┃ ┣ chinese
 ┃ ┣ western
 ┣ CrossculturalDataSet.xlsx
```
## PopBuTFy
Download the official PopBuTFy dataset [here](https://github.com/MoonInTheRiver/NeuralSVB). The file structure looks like this:
```plaintext
[PopBuTFy dataset path]
 ┣ data
 ┃ ┣ {SingerID}#singing#{SongName}_Amateur
 ┃ ┃ ┣ {SingerID}#singing#{SongName}_Amateur_{UtteranceID}.mp3
 ┃ ┃ ┣ ...
 ┃ ┣ {SingerID}#singing#{SongName}_Professional
 ┃ ┃ ┣ {SingerID}#singing#{SongName}_Professional_{UtteranceID}.mp3
 ┃ ┃ ┣ ...
 ┣ text_labels
 ┗ TERMS_OF_ACCESS
```
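The PopBuTFy folder names pack three fields into one string: the singer ID, the song name, and the skill level (`Amateur` or `Professional`). The sketch below shows one way to unpack them. The helper `parse_popbutfy_dir` is hypothetical, and it assumes the skill level is always the part after the last underscore, so that song names containing underscores still parse correctly.

```python
def parse_popbutfy_dir(dirname):
    """Split "{SingerID}#singing#{SongName}_{Level}" into its fields."""
    # The level follows the LAST underscore; the song name may contain more.
    stem, level = dirname.rsplit("_", 1)
    # The middle field is the literal word "singing" and is discarded.
    singer_id, _, song_name = stem.split("#", 2)
    return {"singer_id": singer_id, "song_name": song_name, "level": level}
```

Pairing the `Amateur` and `Professional` folders that share the same singer ID and song name then gives the two renditions of each song.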
## PopCS
Download the official PopCS dataset [here](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md). The file structure looks like this:
```plaintext
[PopCS dataset path]
 ┣ popcs
 ┃ ┣ popcs-{SongName}
 ┃ ┃ ┣ {UtteranceID}_ph.txt
 ┃ ┃ ┣ {UtteranceID}_wf0.wav
 ┃ ┃ ┣ {UtteranceID}.TextGrid
 ┃ ┃ ┣ {UtteranceID}.txt
 ┃ ┃ ┣ ...
 ┃ ┣ ...
 ┗ TERMS_OF_ACCESS
```
## PJS
Download the official PJS dataset [here](https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus). The file structure looks like this:
```plaintext
[PJS dataset path]
 ┣ PJS_corpus_ver1.1
 ┃ ┣ background_noise
 ┃ ┣ pjs{SongID}
 ┃ ┃ ┣ pjs{SongID}_song.wav
 ┃ ┃ ┣ pjs{SongID}_speech.wav
 ┃ ┃ ┣ pjs{SongID}.lab
 ┃ ┃ ┣ pjs{SongID}.mid
 ┃ ┃ ┣ pjs{SongID}.musicxml
 ┃ ┃ ┣ pjs{SongID}.txt
 ┃ ┣ ...
```
## SVCC
Download the official SVCC dataset [here](https://github.com/lesterphillip/SVCC23_FastSVC/tree/main/egs/generate_dataset). The file structure looks like this:
```plaintext
[SVCC dataset path]
 ┣ Data
 ┃ ┣ CDF1
 ┃ ┃ ┣ 10001.wav
 ┃ ┃ ┣ 10002.wav
 ┃ ┃ ┣ ...
 ┃ ┣ CDM1
 ┃ ┣ IDF1
 ┃ ┣ IDM1
 ┗ README.md
```
## VCTK
Download the official VCTK dataset [here](https://datashare.ed.ac.uk/handle/10283/3443). The file structure looks like this:
```plaintext
[VCTK dataset path]
 ┣ txt
 ┃ ┣ {Speaker_1}
 ┃ ┃ ┣ {Speaker_1}_001.txt
 ┃ ┃ ┣ {Speaker_1}_002.txt
 ┃ ┃ ┣ ...
 ┃ ┣ {Speaker_2}
 ┃ ┣ ...
 ┣ wav48_silence_trimmed
 ┃ ┣ {Speaker_1}
 ┃ ┃ ┣ {Speaker_1}_001_mic1.flac
 ┃ ┃ ┣ {Speaker_1}_001_mic2.flac
 ┃ ┃ ┣ {Speaker_1}_002_mic1.flac
 ┃ ┃ ┣ ...
 ┃ ┣ {Speaker_2}
 ┃ ┣ ...
 ┣ speaker-info.txt
 ┗ update.txt
```
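In VCTK the transcripts and the audio live in parallel trees, and the tree above shows that a recording can be missing for one microphone (for example, `_002_mic2.flac` is not listed). A pairing step like the following sketch keeps only utterances that have both files. The function `pair_vctk` and its `mic` parameter are hypothetical names, and the sketch assumes the `txt` and `wav48_silence_trimmed` folder names from the tree.

```python
from pathlib import Path


def pair_vctk(dataset_root, mic="mic1"):
    """Return (txt_path, flac_path) pairs for utterances recorded with `mic`."""
    root = Path(dataset_root)
    pairs = []
    for txt_path in sorted(root.glob("txt/*/*.txt")):
        speaker = txt_path.parent.name
        # Audio file name is the transcript stem plus the microphone suffix.
        flac_path = (
            root / "wav48_silence_trimmed" / speaker / f"{txt_path.stem}_{mic}.flac"
        )
        if flac_path.exists():
            pairs.append((txt_path, flac_path))
    return pairs
```

Transcripts without a matching recording are silently skipped, so comparing the number of pairs with the number of `.txt` files shows how many utterances were dropped.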