---
tags:
- espnet
- audio
- singing-voice-synthesis
language: zh
datasets:
- opencpop
license: cc-by-4.0
---

## ESPnet2 SVS model

### `AQuarterMile/opencpop_visinger1`

This model was trained by Yuning Wu using the opencpop recipe in [espnet](https://github.com/espnet/espnet/).

### Demo: How to use in ESPnet2

Follow the [ESPnet installation instructions](https://espnet.github.io/espnet/installation.html)
if you haven't done so already.

```bash
cd espnet
git checkout e2cf39700cfa056e993dc627e53d18fccc7f68b9
pip install -e .
cd egs2/opencpop/svs1
./run.sh --skip_data_prep false --skip_train true --download_model AQuarterMile/opencpop_visinger1
```
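The `--download_model` flag lets the recipe fetch the packed model for you. If you instead want to grab individual files from this repo directly, the standard Hugging Face Hub `resolve` URL convention applies. The snippet below is a minimal sketch of that convention; `hub_file_url` is a hypothetical helper, not part of ESPnet or this recipe.

```python
# Sketch: build a direct-download URL for a file in this repo, following the
# Hugging Face Hub convention https://huggingface.co/<repo_id>/resolve/<revision>/<filename>.
# `hub_file_url` is a hypothetical helper defined here for illustration only.
repo_id = "AQuarterMile/opencpop_visinger1"

def hub_file_url(filename: str, revision: str = "main") -> str:
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

print(hub_file_url("README.md"))
# -> https://huggingface.co/AQuarterMile/opencpop_visinger1/resolve/main/README.md
```

For fetching the whole repo at once, the `huggingface_hub` package's `snapshot_download` is the usual tool.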

## SVS config

<details><summary>expand</summary>

```
config: conf/tuning/train_vits.yaml
print_config: false
log_level: INFO
dry_run: false
iterator_type: sequence
output_dir: exp/visinger1
ngpu: 1
seed: 777
num_workers: 4
num_att_plot: 3
dist_backend: nccl
dist_init_method: env://
dist_world_size: null
dist_rank: null
local_rank: 0
dist_master_addr: null
dist_master_port: null
dist_launcher: null
multiprocessing_distributed: false
unused_parameters: true
sharded_ddp: false
cudnn_enabled: true
cudnn_benchmark: false
cudnn_deterministic: false
collect_stats: false
write_collected_feats: false
max_epoch: 300
patience: null
val_scheduler_criterion:
- valid
- loss
early_stopping_criterion:
- valid
- loss
- min
best_model_criterion:
-   - train
    - total_count
    - max
keep_nbest_models: 10
nbest_averaging_interval: 0
grad_clip: -1
grad_clip_type: 2.0
grad_noise: false
accum_grad: 1
no_forward_run: false
resume: true
train_dtype: float32
use_amp: false
log_interval: 50
use_matplotlib: true
use_tensorboard: true
create_graph_in_tensorboard: false
use_wandb: false
wandb_project: null
wandb_id: null
wandb_entity: null
wandb_name: null
wandb_model_log_interval: -1
detect_anomaly: false
pretrain_path: null
init_param: []
ignore_init_mismatch: false
freeze_param: []
num_iters_per_epoch: 1000
batch_size: 20
valid_batch_size: null
batch_bins: 500000
valid_batch_bins: null
train_shape_file:
- exp/svs_stats_raw_phn_None_zh/train/text_shape.phn
- exp/svs_stats_raw_phn_None_zh/train/singing_shape
valid_shape_file:
- exp/svs_stats_raw_phn_None_zh/valid/text_shape.phn
- exp/svs_stats_raw_phn_None_zh/valid/singing_shape
batch_type: numel
valid_batch_type: null
fold_length:
- 150
- 204800
sort_in_batch: descending
sort_batch: descending
multiple_iterator: false
chunk_length: 500
chunk_shift_ratio: 0.5
num_cache_chunks: 1024
chunk_excluded_key_prefixes: []
train_data_path_and_name_and_type:
-   - dump/raw/tr_no_dev/text
    - text
    - text
-   - dump/raw/tr_no_dev/wav.scp
    - singing
    - sound
-   - dump/raw/tr_no_dev/label
    - label
    - duration
-   - dump/raw/tr_no_dev/score.scp
    - score
    - score
-   - exp/svs_stats_raw_phn_None_zh/train/collect_feats/pitch.scp
    - pitch
    - npy
-   - exp/svs_stats_raw_phn_None_zh/train/collect_feats/feats.scp
    - feats
    - npy
valid_data_path_and_name_and_type:
-   - dump/raw/dev/text
    - text
    - text
-   - dump/raw/dev/wav.scp
    - singing
    - sound
-   - dump/raw/dev/label
    - label
    - duration
-   - dump/raw/dev/score.scp
    - score
    - score
-   - exp/svs_stats_raw_phn_None_zh/valid/collect_feats/pitch.scp
    - pitch
    - npy
-   - exp/svs_stats_raw_phn_None_zh/valid/collect_feats/feats.scp
    - feats
    - npy
allow_variable_data_keys: false
max_cache_size: 0.0
max_cache_fd: 32
valid_max_cache_size: null
exclude_weight_decay: false
exclude_weight_decay_conf: {}
optim: adamw
optim_conf:
    lr: 0.0002
    betas:
    - 0.8
    - 0.99
    eps: 1.0e-09
    weight_decay: 0.0
scheduler: exponentiallr
scheduler_conf:
    gamma: 0.999875
optim2: adamw
optim2_conf:
    lr: 0.0002
    betas:
    - 0.8
    - 0.99
    eps: 1.0e-09
    weight_decay: 0.0
scheduler2: exponentiallr
scheduler2_conf:
    gamma: 0.999875
generator_first: false
token_list:
- <blank>
- <unk>
- SP
- i
- AP
- e
- y
- d
- w
- sh
- ai
- n
- x
- j
- ian
- u
- l
- h
- b
- o
- zh
- an
- ou
- m
- q
- z
- en
- g
- ing
- ei
- ao
- ang
- uo
- eng
- t
- a
- ong
- ui
- k
- f
- r
- iang
- ch
- v
- in
- iao
- ie
- iu
- c
- s
- van
- p
- ve
- uan
- uang
- ia
- ua
- uai
- un
- er
- vn
- iong
- <sos/eos>
odim: null
model_conf: {}
use_preprocessor: true
token_type: phn
bpemodel: null
non_linguistic_symbols: null
cleaner: null
g2p: null
fs: 22050
score_feats_extract: syllable_score_feats
score_feats_extract_conf:
    fs: 22050
    n_fft: 1024
    win_length: null
    hop_length: 256
feats_extract: linear_spectrogram
feats_extract_conf:
    n_fft: 1024
    hop_length: 256
    win_length: null
normalize: null
normalize_conf: {}
svs: vits
svs_conf:
    generator_type: vits_generator
    generator_params:
        hidden_channels: 192
        spks: -1
        global_channels: -1
        segment_size: 32
        text_encoder_attention_heads: 2
        text_encoder_ffn_expand: 4
        text_encoder_blocks: 6
        text_encoder_positionwise_layer_type: conv1d
        text_encoder_positionwise_conv_kernel_size: 3
        text_encoder_positional_encoding_layer_type: rel_pos
        text_encoder_self_attention_layer_type: rel_selfattn
        text_encoder_activation_type: swish
        text_encoder_normalize_before: true
        text_encoder_dropout_rate: 0.1
        text_encoder_positional_dropout_rate: 0.0
        text_encoder_attention_dropout_rate: 0.1
        use_macaron_style_in_text_encoder: true
        use_conformer_conv_in_text_encoder: false
        text_encoder_conformer_kernel_size: -1
        decoder_kernel_size: 7
        decoder_channels: 512
        decoder_upsample_scales:
        - 8
        - 8
        - 2
        - 2
        decoder_upsample_kernel_sizes:
        - 16
        - 16
        - 4
        - 4
        decoder_resblock_kernel_sizes:
        - 3
        - 7
        - 11
        decoder_resblock_dilations:
        -   - 1
            - 3
            - 5
        -   - 1
            - 3
            - 5
        -   - 1
            - 3
            - 5
        use_weight_norm_in_decoder: true
        posterior_encoder_kernel_size: 5
        posterior_encoder_layers: 16
        posterior_encoder_stacks: 1
        posterior_encoder_base_dilation: 1
        posterior_encoder_dropout_rate: 0.0
        use_weight_norm_in_posterior_encoder: true
        flow_flows: 4
        flow_kernel_size: 5
        flow_base_dilation: 1
        flow_layers: 4
        flow_dropout_rate: 0.0
        use_weight_norm_in_flow: true
        use_only_mean_in_flow: true
        vocabs: 63
        aux_channels: 513
        use_visinger: true
        use_dp: true
    discriminator_type: hifigan_multi_scale_multi_period_discriminator
    discriminator_params:
        scales: 1
        scale_downsample_pooling: AvgPool1d
        scale_downsample_pooling_params:
            kernel_size: 4
            stride: 2
            padding: 2
        scale_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes:
            - 15
            - 41
            - 5
            - 3
            channels: 128
            max_downsample_channels: 1024
            max_groups: 16
            bias: true
            downsample_scales:
            - 2
            - 2
            - 4
            - 4
            - 1
            nonlinear_activation: LeakyReLU
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: true
            use_spectral_norm: false
        follow_official_norm: false
        periods:
        - 2
        - 3
        - 5
        - 7
        - 11
        period_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes:
            - 5
            - 3
            channels: 32
            downsample_scales:
            - 3
            - 3
            - 3
            - 3
            - 1
            max_downsample_channels: 1024
            bias: true
            nonlinear_activation: LeakyReLU
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: true
            use_spectral_norm: false
    generator_adv_loss_params:
        average_by_discriminators: false
        loss_type: mse
    discriminator_adv_loss_params:
        average_by_discriminators: false
        loss_type: mse
    feat_match_loss_params:
        average_by_discriminators: false
        average_by_layers: false
        include_final_outputs: true
    mel_loss_params:
        fs: 22050
        n_fft: 1024
        hop_length: 256
        win_length: null
        window: hann
        n_mels: 80
        fmin: 0
        fmax: null
        log_base: null
    lambda_adv: 1.0
    lambda_mel: 45.0
    lambda_feat_match: 2.0
    lambda_dur: 0.1
    lambda_pitch: 1.0
    lambda_phoneme: 1.0
    lambda_kl: 1.0
    sampling_rate: 22050
    cache_generator_outputs: true
pitch_extract: dio
pitch_extract_conf:
    use_token_averaged_f0: false
    fs: 22050
    n_fft: 1024
    hop_length: 256
    f0max: 400
    f0min: 80
pitch_normalize: global_mvn
pitch_normalize_conf:
    stats_file: exp/svs_stats_raw_phn_None_zh/train/pitch_stats.npz
energy_extract: null
energy_extract_conf: {}
energy_normalize: null
energy_normalize_conf: {}
required:
- output_dir
- token_list
version: '202301'
distributed: false
```

</details>
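A quick sanity check on the configuration above: in a VITS-style model, the HiFi-GAN decoder's upsample scales must multiply out to the STFT hop length, so that one frame of latent features maps back to exactly one hop of waveform samples. A minimal sketch using values copied from the config:

```python
import math

# Values copied from the config above.
fs = 22050                         # sampling rate (Hz)
hop_length = 256                   # feats_extract_conf.hop_length
decoder_upsample_scales = [8, 8, 2, 2]

# The decoder upsamples by the product of its scales, which must equal
# the hop length (one output frame per analysis hop).
assert math.prod(decoder_upsample_scales) == hop_length

# Frame shift implied by the analysis settings.
frame_shift_ms = hop_length / fs * 1000
print(f"frame shift: {frame_shift_ms:.2f} ms")  # frame shift: 11.61 ms
```

The same 22050 Hz / 1024-point FFT / 256-sample hop settings recur in `score_feats_extract_conf`, `feats_extract_conf`, `mel_loss_params`, and `pitch_extract_conf`, which keeps all feature streams frame-aligned.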

### Citing ESPnet

```bibtex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```

or arXiv:

```bibtex
@misc{watanabe2018espnet,
  title={ESPnet: End-to-End Speech Processing Toolkit},
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  year={2018},
  eprint={1804.00015},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```