F5-TTS / model /backbones /README.md
mrfakename's picture
Upload folder using huggingface_hub
dd217c7 verified
|
raw
history blame
701 Bytes

Backbones quick introduction

unett.py

  • flat unet transformer
  • structure same as in e2-tts & voicebox paper except using rotary pos emb
  • update: allow possible abs pos emb & convnextv2 blocks for embedded text before concat

dit.py

  • adaln-zero dit
  • embedded timestep as condition
  • concatted noised_input + masked_cond + embedded_text, linear proj in
  • possible abs pos emb & convnextv2 blocks for embedded text before concat
  • possible long skip connection (first layer to last layer)

mmdit.py

  • sd3 structure
  • timestep as condition
  • left stream: text embedded and applied a abs pos emb
  • right stream: masked_cond & noised_input concatted and with same conv pos emb as unett