merve
posted an update Jun 21
EPFL and Apple (at @EPFL-VILAB) just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! 🙀
4M is a multimodal training framework introduced by Apple and EPFL.
The resulting model takes image and text as input and outputs image and text 🤩

Models: EPFL-VILAB/4m-models-660193abe3faf4b4d98a2742
Demo: EPFL-VILAB/4M
Paper: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2406.09406)

This model consists of a transformer encoder and decoder, where the key to multimodality lies in the input and output data:

inputs and outputs are represented as tokens, and the output tokens are decoded to generate bounding boxes, the generated image's pixels, captions and more!
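
To make the token idea concrete, here is a toy sketch in plain Python. It is not the 4M code or its API: the `Token` class, the helper names, the tiny vocabulary and the 1000-bin coordinate quantization are all assumptions made up for illustration. It only shows how a caption and a bounding box could share one token sequence and be decoded back per modality.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    modality: str  # hypothetical modality label, e.g. "caption" or "bbox"
    value: int     # index into that modality's discrete vocabulary


def encode_bbox(box: Tuple[float, float, float, float], bins: int = 1000) -> List[Token]:
    # box is (x_min, y_min, x_max, y_max), normalized to [0, 1];
    # each coordinate is quantized into one of `bins` buckets (assumed scheme).
    return [Token("bbox", min(int(c * bins), bins - 1)) for c in box]


def decode_bbox(tokens: List[Token], bins: int = 1000) -> Tuple[float, ...]:
    # Map each bucket index back to the center of its bucket.
    return tuple((t.value + 0.5) / bins for t in tokens)


def encode_caption(text: str, vocab: Dict[str, int]) -> List[Token]:
    # Whitespace "tokenizer" over a toy vocabulary (illustrative only).
    return [Token("caption", vocab[w]) for w in text.lower().split()]


def decode_caption(tokens: List[Token], vocab: Dict[str, int]) -> str:
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[t.value] for t in tokens)


def split_by_modality(sequence: List[Token]) -> Dict[str, List[Token]]:
    # Group a mixed token sequence so each modality can use its own decoder.
    groups: Dict[str, List[Token]] = {}
    for token in sequence:
        groups.setdefault(token.modality, []).append(token)
    return groups


vocab = {"a": 0, "cat": 1, "on": 2, "sofa": 3}
original_box = (0.1, 0.2, 0.6, 0.9)

# One flat sequence mixing two modalities, as a stand-in for model output.
sequence = encode_caption("a cat on a sofa", vocab) + encode_bbox(original_box)

groups = split_by_modality(sequence)
caption = decode_caption(groups["caption"], vocab)
box = decode_bbox(groups["bbox"])
print(caption)
print(box)
```

In the real model, the tokens come out of the transformer decoder and each modality has its own learned tokenizer; the sketch just swaps those in for hand-written encoders so the round trip is easy to follow.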

This model also learnt to generate Canny edge maps, SAM edges and other conditioning signals for steerable text-to-image generation 🖼️

The authors only added image-to-all capabilities for the demo, but you can try to use this model for text-to-image generation as well ☺️