SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Paper: arXiv 2110.07205
The SpeechT5 framework consists of a shared sequence-to-sequence encoder-decoder network and six modal-specific (speech/text) pre-nets and post-nets, which together can address a wide variety of spoken language processing tasks, including automatic speech recognition, text-to-speech, and voice conversion.
Note: Text-to-speech version of SpeechT5
Note: Voice-conversion version of SpeechT5
Note: Automatic-speech-recognition version of SpeechT5
Note: Vocoder for SpeechT5; SpeechT5 produces a spectrogram, and this model converts it to a waveform (see the usage sketch below)
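The notes above correspond to task-specific checkpoints. As a rough illustration, the sketch below wires the text-to-speech model to the HiFi-GAN vocoder using the 🤗 Transformers API; the checkpoint names (microsoft/speecht5_tts, microsoft/speecht5_hifigan), the x-vector dataset (Matthijs/cmu-arctic-xvectors), and the particular speaker index are assumptions not stated on this page.

```python
# Minimal sketch of the SpeechT5 text-to-speech + vocoder pipeline,
# assuming the Hub checkpoints "microsoft/speecht5_tts" and
# "microsoft/speecht5_hifigan" and the "Matthijs/cmu-arctic-xvectors"
# speaker-embedding dataset.
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Text pre-net + shared encoder-decoder + speech post-net (predicts a log-mel spectrogram)
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

# HiFi-GAN vocoder: converts the predicted spectrogram into a 16 kHz waveform
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, this is a test of SpeechT5.", return_tensors="pt")

# An x-vector speaker embedding selects the target voice (index chosen arbitrarily)
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

# Generate the spectrogram and run it through the vocoder in one call
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("speech.wav", speech.numpy(), samplerate=16000)
```

The voice-conversion and ASR variants follow the same pattern with their own model classes and checkpoints; only the TTS path needs the separate vocoder, since ASR outputs text rather than a spectrogram.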