An end-to-end (e2e) Voice Language Model by Fish Audio.
Co-Speech Gesture Video Generation
Import a portrait, then click to move the head!
A tiny vision language model