---
pipeline_tag: text-to-3d
tags:
- image-to-3d
---
# Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

[\[Code\]](https://github.com/tencent/Hunyuan3D-1) [\[Huggingface\]](https://huggingface.co/tencent/Hunyuan3D-1) [\[Report\]](https://3d.hunyuan.tencent.com/hunyuan3d.pdf)

## 🔥🔥🔥 News!!

* Nov 5, 2024: 💬 We now support running the image-to-3D generation demo. Please check the [script](#using-gradio) below.
* Nov 5, 2024: 💬 We now support running the text-to-3D generation demo. Please check the [script](#using-gradio) below.

## 📑 Open-source Plan

- [x] Inference
- [x] Checkpoints
- [ ] Baking related
- [ ] Training
- [ ] ComfyUI
- [ ] Distillation Version
- [ ] TensorRT Version

## **Abstract**
While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address these issues, we propose a two-stage approach named Hunyuan3D-1.0, available in a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Our framework incorporates the text-to-image model Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.

## 🎉 **Hunyuan3D-1 Architecture**
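The two-stage design described in the abstract can be summarized in a short sketch. This is a minimal illustration only: the stage functions below are hypothetical stubs for exposition, not the repository's actual API.

```python
# A minimal sketch of the Hunyuan3D-1.0 two-stage pipeline described above.
# All functions here are illustrative stubs, NOT the repository's actual API.
from typing import Any, List

def multiview_diffusion(condition_image: Any, num_views: int = 6) -> List[Any]:
    """Stage 1 (~4 s): generate multi-view RGB images of the asset from fixed
    viewpoints, relaxing the task from single-view to multi-view reconstruction."""
    raise NotImplementedError("stands in for the lite/std multi-view model")

def sparse_view_reconstruction(views: List[Any], condition_image: Any) -> Any:
    """Stage 2 (~7 s): feed-forward reconstruction of the 3D asset, trained to
    tolerate the noise and cross-view inconsistency stage 1 may introduce."""
    raise NotImplementedError("stands in for the svrm reconstruction model")

def hunyuan_dit_text_to_image(prompt: str) -> Any:
    """Text conditioning: generate a condition image with Hunyuan-DiT."""
    raise NotImplementedError("stands in for the Hunyuan-DiT text-to-image model")

def image_to_3d(condition_image: Any) -> Any:
    views = multiview_diffusion(condition_image)
    return sparse_view_reconstruction(views, condition_image)

def text_to_3d(prompt: str) -> Any:
    # Routing text prompts through Hunyuan-DiT is what unifies text- and
    # image-conditioned generation in one framework.
    return image_to_3d(hunyuan_dit_text_to_image(prompt))
```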
## 📈 Comparisons

We evaluated Hunyuan3D-1.0 against other open-source 3D generation methods; Hunyuan3D-1.0 received the highest user preference across 5 metrics (see the lower-left plot). The lite model takes around 10 seconds to produce a 3D mesh from a single image on an NVIDIA A100 GPU, while the standard model takes roughly 25 seconds. The lower-right plot shows that Hunyuan3D-1.0 achieves an optimal balance between quality and efficiency.
## Get Started

#### Begin by cloning the repository:

```shell
git clone https://github.com/tencent/Hunyuan3D-1
cd Hunyuan3D-1
```

#### Installation Guide for Linux

We provide an env_install.sh script for setting up the environment. We recommend Python 3.9 and CUDA 11.7+.

```shell
conda create -n hunyuan3d-1 python=3.9
conda activate hunyuan3d-1
bash env_install.sh
```

#### Download Pretrained Models

The models are available at [https://huggingface.co/spaces/tencent/Hunyuan3D-1](https://huggingface.co/spaces/tencent/Hunyuan3D-1):

+ `Hunyuan3D-1/lite`, lite model for multi-view generation.
+ `Hunyuan3D-1/std`, standard model for multi-view generation.
+ `Hunyuan3D-1/svrm`, sparse-view reconstruction model.

To download the models, first install the huggingface-cli. (Detailed instructions are available [here](https://huggingface.co/docs/huggingface_hub/guides/cli).)

```shell
python3 -m pip install "huggingface_hub[cli]"
```

Then download the models using the following commands:

```shell
mkdir weights
huggingface-cli download tencent/Hunyuan3D-1 --local-dir ./weights
mkdir weights/hunyuanDiT
huggingface-cli download Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers-Distilled --local-dir ./weights/hunyuanDiT
```

#### Inference

For text-to-3D generation, we support bilingual Chinese and English prompts. You can run inference with the following command:

```shell
python3 main.py \
    --text_prompt "a lovely rabbit" \
    --save_folder ./outputs/test/ \
    --max_faces_num 90000 \
    --do_texture_mapping \
    --do_render
```

For image-to-3D generation, you can run inference with the following command:

```shell
python3 main.py \
    --image_prompt "/path/to/your/image" \
    --save_folder ./outputs/test/ \
    --max_faces_num 90000 \
    --do_texture_mapping \
    --do_render
```

We list some more useful configurations for easy usage:

| Argument | Default | Description |
|:------------------:|:---------:|:---------------------------------------------------:|
|`--text_prompt` | None | The text prompt for 3D generation |
|`--image_prompt` | None | The image prompt for 3D generation |
|`--t2i_seed` | 0 | The random seed for text-to-image generation |
|`--t2i_steps` | 25 | The number of sampling steps for text-to-image generation |
|`--gen_seed` | 0 | The random seed for 3D generation |
|`--gen_steps` | 50 | The number of sampling steps for 3D generation |
|`--max_faces_num` | 90000 | The maximum number of faces of the 3D mesh |
|`--save_memory` | False | Automatically offload text2image to CPU |
|`--do_texture_mapping` | False | Change vertex shading to texture shading |
|`--do_render` | False | Render a GIF |

We have also prepared scripts with different configurations for reference:

```bash
bash scripts/text_to_3d_demo.sh
bash scripts/text_to_3d_fast_demo.sh
bash scripts/image_to_3d_demo.sh
bash scripts/image_to_3d_fast_demo.sh
```

This example requires ~40GB VRAM to run.

#### Using Gradio

We have prepared two versions of multi-view generation, std and lite. For better results, run the std version:

```shell
python3 app.py
```

For faster speed, you can use the lite version by adding the `--use_lite` flag:

```shell
python3 app.py --use_lite
```

The demo can then be accessed at http://0.0.0.0:8080, where 0.0.0.0 should be replaced with your server's IP address.

## Camera Parameters

Output views are a fixed set of camera poses:

+ Azimuth (relative to input view): `+0, +60, +120, +180, +240, +300`.
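As a quick reference, the sketch below converts these azimuth offsets into rotation matrices. It assumes a y-up world frame and that all output views share the input view's elevation and camera distance; these conventions are assumptions for illustration, not taken from the repository.

```python
import numpy as np

# Fixed azimuth offsets (degrees) of the six output views, relative to the input view.
AZIMUTHS_DEG = [0, 60, 120, 180, 240, 300]

def azimuth_rotation(azimuth_deg: float) -> np.ndarray:
    """3x3 rotation about the up (y) axis for a given azimuth offset.

    Assumes a y-up world frame; compose with the input view's camera pose
    to obtain each output view's pose.
    """
    a = np.deg2rad(azimuth_deg)
    return np.array([
        [ np.cos(a), 0.0, np.sin(a)],
        [ 0.0,       1.0, 0.0      ],
        [-np.sin(a), 0.0, np.cos(a)],
    ])

# One rotation matrix per output view.
view_rotations = {az: azimuth_rotation(az) for az in AZIMUTHS_DEG}
```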