---
language:
- "en"
tags:
- video
license: "apache-2.0"
---

# Mochi 1
A state-of-the-art video generation model by [Genmo](https://genmo.ai).

https://github.com/user-attachments/assets/20800321-f1ed-4f35-a964-6612e7d7e86e

## Overview

Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. We’re releasing the model under a permissive Apache 2.0 license. Try this model for free on [our playground](https://genmo.ai/play).

## Installation

Clone the repository and install it in editable mode:

```bash
git clone https://github.com/genmoai/models
cd models
pip install setuptools psutil
pip install -e . --no-build-isolation
```
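
A quick way to confirm the editable install succeeded is to import the package (the module name `mochi_preview` is taken from the run commands further below):

```python
# Sanity check: the package should be importable after `pip install -e .`.
import mochi_preview

print("mochi_preview imported from", mochi_preview.__file__)
```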

For a faster installation, use [uv](https://github.com/astral-sh/uv):

```bash
git clone https://github.com/genmoai/models
cd models
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -e .
```

## Download Weights

Download the weights from [Hugging Face](https://huggingface.co/genmo/mochi-1-preview/tree/main) or via [a magnet link]().
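
If you prefer to script the download, here is a minimal sketch using `huggingface_hub` (the local directory name is an arbitrary assumption):

```python
# Sketch: fetch the Mochi 1 preview weights with huggingface_hub.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="genmo/mochi-1-preview",
    local_dir="mochi-1-preview",  # hypothetical target directory
)
print(model_dir)  # use this path as <path_to_model_directory> below
```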

## Running

Start the Gradio UI with

```bash
python3 -m mochi_preview.gradio_ui --model_dir "<path_to_model_directory>"
```

Or generate videos directly from the CLI with

```bash
python3 -m mochi_preview.infer --prompt "A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and sprigs of mint against a peach-colored background. The hand gently tosses the lemon up and catches it, showcasing its smooth texture. A beige string bag sits beside the bowl, adding a rustic touch to the scene. Additional lemons, one halved, are scattered around the base of the bowl. The even lighting enhances the vibrant colors and creates a fresh, inviting atmosphere." --seed 1710977262 --cfg-scale 4.5 --model_dir "<path_to_model_directory>"
```

Replace `<path_to_model_directory>` with the path to your model directory.

## Model Architecture

Mochi 1 represents a significant advancement in open-source video generation, featuring a 10-billion-parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture.

Alongside Mochi, we are open-sourcing our video VAE. Our VAE causally compresses videos to a 128x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.
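
As a rough illustration of those factors, the sketch below applies them to an example 480p clip (the clip size and the VAE's exact boundary handling are assumptions, not its real padding rules):

```python
# Sketch: apply the stated compression factors (8x8 spatial, 6x temporal,
# 12 latent channels) to an example clip. Boundary/padding behavior of the
# actual VAE is an assumption here.
def latent_shape(frames: int, height: int, width: int) -> tuple:
    return (12, frames // 6, height // 8, width // 8)

print(latent_shape(162, 480, 848))  # -> (12, 27, 60, 106)
```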

An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
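
To make the non-square projections concrete, here is a minimal sketch of this kind of asymmetric joint attention block. It is not the actual Mochi implementation; the dimensions, head count, and class structure are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricJointAttention(nn.Module):
    """Joint self-attention over text and visual tokens with per-modality,
    non-square QKV and output projections (illustrative sketch only)."""

    def __init__(self, dim_visual=3072, dim_text=1536, dim_attn=3072, num_heads=24):
        super().__init__()
        self.num_heads = num_heads
        # Non-square QKV projections: each modality maps its own hidden
        # size into a shared attention width.
        self.qkv_visual = nn.Linear(dim_visual, 3 * dim_attn)
        self.qkv_text = nn.Linear(dim_text, 3 * dim_attn)
        # Non-square output projections back to each modality's hidden size.
        self.out_visual = nn.Linear(dim_attn, dim_visual)
        self.out_text = nn.Linear(dim_attn, dim_text)

    def forward(self, vis, txt):
        B, n_vis, _ = vis.shape
        # Project each modality separately, then attend over the joint sequence.
        qkv = torch.cat([self.qkv_visual(vis), self.qkv_text(txt)], dim=1)
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(x):  # (B, N, dim_attn) -> (B, heads, N, head_dim)
            return x.view(B, x.shape[1], self.num_heads, -1).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).flatten(2)  # back to (B, N, dim_attn)
        # Route each modality's tokens through its own output projection.
        return self.out_visual(out[:, :n_vis]), self.out_text(out[:, n_vis:])

# Example usage: 8 visual tokens, 4 text tokens, batch of 2.
block = AsymmetricJointAttention()
vis_out, txt_out = block(torch.randn(2, 8, 3072), torch.randn(2, 4, 1536))
print(vis_out.shape, txt_out.shape)  # (2, 8, 3072) and (2, 4, 1536)
```

Because the text stream's projections and MLPs operate at a smaller hidden width, they carry far fewer parameters than the visual stream's, which is the memory saving the asymmetric design is after.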

Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
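
For reference, a single-encoder prompt embedding can be produced along these lines with `transformers` (a sketch only; the exact checkpoint and preprocessing used by Mochi's pipeline are assumptions):

```python
# Sketch: encode a prompt once with a lone T5-XXL text encoder.
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")  # assumed checkpoint
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

tokens = tokenizer("A hand picks up a bright yellow lemon.", return_tensors="pt")
prompt_embeds = encoder(**tokens).last_hidden_state  # (1, seq_len, 4096)
```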

## Safety
Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and exercise careful consideration before deploying these model weights in any commercial services or products.

## Limitations
Under the research preview, Mochi 1 is a living and evolving checkpoint, and there are a few known limitations. The initial release generates videos at 480p. In some edge cases with extreme motion, minor warping and distortions can occur. Mochi 1 is also optimized for photorealistic styles, so it does not perform well with animated content. We anticipate that the community will fine-tune the model to suit various aesthetic preferences.

## BibTeX
```bibtex
@misc{genmo2024mochi,
  title={Mochi},
  author={Genmo Team},
  year={2024}
}
```