Model Card for WaLa-MVDream-DM6
This model was released as part of the Wavelet Latent Diffusion (WaLa) paper. It generates six-view depth maps from text descriptions to support text-to-3D generation.
Model Details
Model Description
WaLa-MVDream-DM6 is a fine-tuned version of the MVDream model, adapted to generate six-view depth maps from text inputs. This model serves as an intermediate step in the text-to-3D generation pipeline of WaLa, producing multi-view depth maps that are then used by the WaLa-DM6-1B model to generate 3D shapes.
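The two-stage flow described above can be sketched as follows. This is only an illustration of the hand-off between the stages, not the WaLa inference code: the function names, the stand-in stages, and the array sizes are all hypothetical, and the only detail taken from this model card is that the first stage yields six depth views per prompt.

```python
from typing import Callable, List

NUM_VIEWS = 6  # stated in this model card: six depth views per prompt

# Hypothetical types, for illustration only.
DepthMap = List[List[float]]                      # one H x W depth image
DepthStage = Callable[[str], List[DepthMap]]      # prompt -> six depth maps
ShapeStage = Callable[[List[DepthMap]], object]   # six depth maps -> 3D shape


def text_to_3d(prompt: str, depth_stage: DepthStage, shape_stage: ShapeStage) -> object:
    """Chain the two stages and check the six-view contract between them."""
    depth_maps = depth_stage(prompt)
    if len(depth_maps) != NUM_VIEWS:
        raise ValueError(f"expected {NUM_VIEWS} depth maps, got {len(depth_maps)}")
    return shape_stage(depth_maps)


# Stand-in stages so the sketch runs without any model weights.
def fake_depth_stage(prompt: str) -> List[DepthMap]:
    # Placeholder for WaLa-MVDream-DM6: six blank 4 x 4 depth maps.
    return [[[0.0] * 4 for _ in range(4)] for _ in range(NUM_VIEWS)]


def fake_shape_stage(depth_maps: List[DepthMap]) -> object:
    # Placeholder for WaLa-DM6-1B: report how many views were consumed.
    return {"views_used": len(depth_maps)}


result = text_to_3d("a wooden chair", fake_depth_stage, fake_shape_stage)
print(result)
```

In a real setup the two placeholders would be replaced by the actual model calls from the official inference instructions; the check in between simply fails early if the first stage does not return six views.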
- Developed by: Aditya Sanghi, Aliasghar Khani, Chinthala Pradyumna Reddy, Arianna Rampini, Derek Cheung, Kamal Rahimi Malekshan, Kanika Madan, Hooman Shayani
- Model type: Text-to-Depth Map Generative Model
- License: Autodesk Non-Commercial (3D Generative) v1.0
For more information, see the Project Page and the paper.
Model Sources
Uses
Direct Use
This model is released by Autodesk and is intended for academic and research purposes only, specifically the theoretical exploration and demonstration of the WaLa 3D generative framework. It is designed to be used in conjunction with WaLa-DM6-1B for text-to-3D generation. Please see here for inference instructions.
Out-of-Scope Use
The model should not be used for:
- Commercial purposes
- Generation of inappropriate or offensive content
- Any usage not in compliance with the license, in particular, the "Acceptable Use" section.
Bias, Risks, and Limitations
Bias
- The model may inherit biases present in the text-image datasets used for pre-training and fine-tuning.
- The model's performance may vary depending on the complexity and specificity of the input text descriptions.
Risks and Limitations
- The quality of the generated multi-view depth maps may impact the subsequent 3D shape generation.
- The model may occasionally generate depth maps that do not accurately represent the input text or maintain consistency across views.
How to Get Started with the Model
Please refer to the instructions here.
Training Details
Training Data
The model was fine-tuned using captions generated for the WaLa dataset. Captions were initially created using the InternVL 2.0 model and then augmented using LLaMA 3.1 to enhance diversity and richness.
Training Procedure
Preprocessing
Captions were generated for each 3D object in the dataset using four renderings and two distinct prompts. These captions were then augmented to increase diversity. For depth map generation, six views were used to ensure comprehensive coverage of the entire object.
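To give a rough sense of why six views can cover an object well, the sketch below measures how far any direction on the unit sphere can be from its nearest camera. The camera layout is an assumption made for illustration (one camera along each positive and negative coordinate axis); the actual view placement used for WaLa is not specified in this card. The names `VIEW_DIRECTIONS` and `worst_case_angle_deg` are hypothetical.

```python
import math
import random

# Assumed layout, for illustration only: cameras along +/-x, +/-y, +/-z.
VIEW_DIRECTIONS = [
    (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0), (0.0, -1.0, 0.0),
    (0.0, 0.0, 1.0), (0.0, 0.0, -1.0),
]


def worst_case_angle_deg(directions, samples=20000, seed=0):
    """Estimate the largest angle between any direction and its nearest view.

    Directions are expected to be unit vectors. Random points on the sphere
    are drawn by normalising Gaussian samples.
    """
    rng = random.Random(seed)
    worst_cos = 1.0
    for _ in range(samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:
            continue  # skip the degenerate zero vector
        # Cosine of the angle to the closest camera direction.
        best_cos = max(sum(a * b for a, b in zip(v, d)) / norm for d in directions)
        worst_cos = min(worst_cos, best_cos)
    return math.degrees(math.acos(min(1.0, worst_cos)))


angle = worst_case_angle_deg(VIEW_DIRECTIONS)
print(f"largest angle to nearest view: {angle:.1f} degrees")
```

For this axis-aligned layout the analytic worst case is arccos(1/sqrt(3)), about 54.7 degrees, reached along the cube diagonals, so the sampled estimate stays at or just below that value. Every surface direction is therefore within that angle of at least one of the six cameras under this assumed layout.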
Training Hyperparameters
- Training regime: Please refer to the paper.
Technical Specifications
Model Architecture and Objective
The model is based on the MVDream architecture, fine-tuned to generate six-view depth maps from text inputs. It is designed to work in tandem with the WaLa-DM6-1B model for text-to-3D generation. The model uses the Stable Diffusion framework, is initialized with weights from MVDream, and is fine-tuned on paired depth-map and text data.
Compute Infrastructure
Hardware
The model was trained on NVIDIA H100 GPUs.
Citation
@misc{sanghi2024waveletlatentdiffusionwala,
title={Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings},
author={Aditya Sanghi and Aliasghar Khani and Pradyumna Reddy and Arianna Rampini and Derek Cheung and Kamal Rahimi Malekshan and Kanika Madan and Hooman Shayani},
year={2024},
eprint={2411.08017},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.08017},
}