---
title: EscherNet
emoji: 📸📸➡️🖼️🖼️🖼️
app_file: app.py
sdk: gradio
sdk_version: 4.31.0
---
[comment]: <> (# EscherNet: A Generative Model for Scalable View Synthesis)

<!-- PROJECT LOGO -->
<p align="center">
  <h1 align="center">EscherNet: A Generative Model for Scalable View Synthesis</h1>
  <p align="center">
    <a href="https://kxhit.github.io"><strong>Xin Kong</strong></a>
    ·
    <a href="https://shikun.io"><strong>Shikun Liu</strong></a>
    ·
    <a href="https://shawlyu.github.io/"><strong>Xiaoyang Lyu</strong></a>
    ·
    <a href="https://marwan99.github.io/"><strong>Marwan Taher</strong></a>
    ·
    <a href="https://xjqi.github.io/"><strong>Xiaojuan Qi</strong></a>
    ·
    <a href="https://www.doc.ic.ac.uk/~ajd/"><strong>Andrew J. Davison</strong></a>
  </p>

[comment]: <> (  <h2 align="center">PAPER</h2>)

  <h3 align="center"><a href="https://arxiv.org/abs/2402.03908">Paper</a> | <a href="https://kxhit.github.io/EscherNet">Project Page</a></h3>
  <div align="center"></div>

<p align="center">
  <a href="">
    <img src="./scripts/teaser.png" alt="Logo" width="80%">
  </a>
</p>
<p align="center">
EscherNet is a <strong>multi-view conditioned</strong> diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a <strong>camera positional encoding (CaPE)</strong>, allowing precise and continuous relative control of the camera transformation between an <strong>arbitrary number of reference and target views</strong>.
</p>
<br>
## Install
```
conda env create -f environment.yml -n eschernet
conda activate eschernet
```
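The training and evaluation scripts below assume a CUDA-capable GPU. A minimal sanity check, assuming the conda environment ships PyTorch (which the scripts in this repo rely on):

```python
import torch

# Quick check that the environment sees a GPU before launching any jobs.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```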
## Demo
Run the demo to generate 25 randomly sampled novel views from 1, 2, 3, 5, or 10 reference views:
```commandline
bash eval_eschernet.sh
```
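For intuition, each target view is specified by an object-centric camera pose. The sketch below shows one way to sample such poses on a sphere looking at the object; it is purely illustrative (the function name, camera convention, and elevation range are assumptions), since the demo uses the poses shipped with the evaluation data:

```python
import numpy as np

def random_object_centric_pose(radius=1.5, rng=None):
    """Illustrative only: sample a camera on a sphere that looks at the origin.

    Returns a 4x4 camera-to-world matrix (OpenGL-style, camera looks down -z).
    """
    rng = rng or np.random.default_rng()
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    elevation = rng.uniform(-np.pi / 6, np.pi / 2)   # assumption: mostly above the equator
    eye = radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    forward = -eye / np.linalg.norm(eye)             # viewing direction toward the origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, -forward, eye
    return c2w

# 25 random target poses, matching the number of novel views the demo renders.
poses = [random_object_centric_pose(rng=np.random.default_rng(i)) for i in range(25)]
```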
## Camera Positional Encoding (CaPE)
CaPE is applied in the self- and cross-attention layers to encode camera pose information into the transformer. The main modification is in `diffusers/models/attention_processor.py`.

To quickly check the implementation of CaPE (6DoF and 4DoF), run:
```
python CaPE.py
```
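The property CaPE relies on is that the attention score between a pose-encoded query and key depends only on the *relative* camera transformation between the two views. Below is a minimal numpy sketch of this idea for the 6DoF case; it is illustrative only (`CaPE.py` is the authoritative implementation, and the reshape and inverse-transpose conventions here are assumptions):

```python
import numpy as np

def random_pose(rng):
    """A random 4x4 rigid camera pose (rotation + translation)."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    P = np.eye(4)
    P[:3, :3] = Q
    P[:3, 3] = rng.standard_normal(3)
    return P

def cape_6dof(v, pose):
    """Fold a d-dim token feature into (d//4, 4) and right-multiply by a 4x4 pose."""
    return (v.reshape(-1, 4) @ pose).reshape(-1)

rng = np.random.default_rng(0)
d = 16
q, k = rng.standard_normal(d), rng.standard_normal(d)
P_q, P_k = random_pose(rng), random_pose(rng)

def score(P_q, P_k):
    # Keys take the inverse-transpose pose, so the dot product reduces to a
    # function of the relative transform P_q @ inv(P_k) only.
    return cape_6dof(q, P_q) @ cape_6dof(k, np.linalg.inv(P_k).T)

T = random_pose(rng)  # a shared change of reference frame
print(score(P_q, P_k))          # attention logit for the original poses
print(score(P_q @ T, P_k @ T))  # identical up to float error: only the relative pose matters
```

The 4DoF variant follows the same relative-pose principle with a reduced pose parameterisation; `python CaPE.py` exercises both.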
## Training
### Objaverse 1.0 Dataset
Download Zero-1-to-3's Objaverse rendering data:
```commandline
wget https://tri-ml-public.s3.amazonaws.com/datasets/views_release.tar.gz
```
Filter out invalid Zero-1-to-3 renderings (empty images):
```commandline
cd scripts
python objaverse_filter.py --path /data/objaverse/views_release
```
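The actual filtering criteria live in `scripts/objaverse_filter.py`. For intuition, one simple way to flag an empty render is to check whether a view contains essentially no foreground pixels; this is a hedged sketch (helper name and the RGBA/alpha assumption are illustrative), not the script's actual logic:

```python
from pathlib import Path

import numpy as np
from PIL import Image

def looks_empty(png_path, min_foreground=0.001):
    """Heuristic: treat a render as empty if almost no pixels are non-background.

    Assumes RGBA renders where the background has zero alpha.
    """
    img = np.asarray(Image.open(png_path).convert("RGBA"))
    foreground = (img[..., 3] > 0).mean()
    return foreground < min_foreground

views_dir = Path("/data/objaverse/views_release")
empty = [p for p in views_dir.glob("*/*.png") if looks_empty(p)]
print(f"{len(empty)} suspicious renders")
```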
### Launch training
Configure `accelerate` (we train with 8 A100 GPUs and bf16 mixed precision):
```commandline
accelerate config
```
Choose 4DoF or 6DoF CaPE (Camera Positional Encoding):
```commandline
cd 4DoF   # or: cd 6DoF
```
Launch training:
```commandline
accelerate launch train_eschernet.py --train_data_dir /data/objaverse/views_release --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 --train_batch_size 256 --dataloader_num_workers 16 --mixed_precision bf16 --gradient_checkpointing --T_in 3 --T_out 3 --T_in_val 10 --output_dir logs_N3M3B256_SD1.5 --push_to_hub --hub_model_id ***** --hub_token hf_******************* --tracker_project_name eschernet
```
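Here `--T_in` and `--T_out` set the number of reference and target views sampled per object in each training example (3 and 3 above, hence the `N3M3B256` in the output directory name), and `--T_in_val` sets the number of reference views used for validation. A minimal sketch of the per-object view split this implies (a hypothetical helper, not the repository's dataset class):

```python
import random

def split_views(view_ids, t_in=3, t_out=3, seed=None):
    """Pick disjoint reference and target view indices from one object's renders."""
    rng = random.Random(seed)
    picked = rng.sample(view_ids, t_in + t_out)
    return picked[:t_in], picked[t_in:]

# e.g. splitting one object's renders (assuming a dozen views per object)
ref_views, target_views = split_views(list(range(12)), t_in=3, t_out=3, seed=0)
print(ref_views, target_views)
```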
For monitoring training progress, we recommend [wandb](https://wandb.ai/site) for its simplicity and powerful features.
```commandline
wandb login
```
Offline mode:
```commandline
WANDB_MODE=offline python xxx.py
```
## Evaluation
We provide [raw results](https://huggingface.co/datasets/kxic/EscherNet-Results) and two checkpoints, [4DoF](https://huggingface.co/kxic/eschernet-4dof) and [6DoF](https://huggingface.co/kxic/eschernet-6dof), for easier comparison.
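Both the raw results and the checkpoints are standard Hugging Face Hub repositories, so one way to fetch them locally is `huggingface_hub.snapshot_download` (a minimal sketch; the directory layout is whatever the Hub repos contain):

```python
from huggingface_hub import snapshot_download

# Raw results live in a dataset repo; the two released checkpoints are model repos.
results_dir = snapshot_download("kxic/EscherNet-Results", repo_type="dataset")
ckpt_4dof = snapshot_download("kxic/eschernet-4dof")
ckpt_6dof = snapshot_download("kxic/eschernet-6dof")
print(results_dir, ckpt_4dof, ckpt_6dof)
```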
### Datasets
##### [GSO (Google Scanned Objects)](https://app.gazebosim.org/GoogleResearch/fuel/collections/Scanned%20Objects%20by%20Google%20Research)
[GSO30](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main): We select 30 objects from the GSO dataset and render 25 randomly sampled novel views for each object, used for both NVS and 3D reconstruction evaluation.
##### [RTMV](https://drive.google.com/drive/folders/1cUXxUp6g25WwzHnm_491zNJJ4T7R_fum)
We use the 10 scenes from `google_scanned.tar` under the `40_scenes` folder for NVS evaluation.
##### [NeRF_Synthetic](https://drive.google.com/drive/folders/1JDdLGDruGNXWnM1eqY1FNL9PlStjaKWi)
We use all 8 NeRF_Synthetic objects for 2D NVS evaluation.
##### [Franka16](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main)
We collected 16 real-world object-centric recordings with a Franka Emika Panda robot arm and a RealSense D435i camera for real-world NVS evaluation.
##### [Text2Img](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main)
We collected text-to-image generation results from the internet, from [Stable Diffusion XL](https://github.com/Stability-AI/generative-models) (1 view), and from [MVDream](https://github.com/bytedance/MVDream) (4 views: front, right, back, left) for NVS evaluation.
### Novel View Synthesis (NVS)
To get 2D Novel View Synthesis (NVS) results, set `cape_type`, `checkpoint`, `data_type`, and `data_dir` in `eval_eschernet.sh`, then run:
```commandline
bash ./eval_eschernet.sh
```
Evaluate the 2D metrics (PSNR, SSIM, LPIPS):
```commandline
cd metrics
python eval_2D_NVS.py
```
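For reference, the three metrics compare each generated view against its ground-truth render. A minimal sketch of computing them for one image pair with `scikit-image` and the `lpips` package (the script above is the authoritative implementation; the resolution, normalisation, and VGG-backbone choices here are assumptions):

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def nvs_metrics(pred, gt, lpips_net):
    """pred, gt: HxWx3 float arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_net(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp

lpips_net = lpips.LPIPS(net="vgg")   # assumption: VGG backbone
pred = np.random.rand(256, 256, 3).astype(np.float32)
gt = np.random.rand(256, 256, 3).astype(np.float32)
print(nvs_metrics(pred, gt, lpips_net))
```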
### 3D Reconstruction
We first generate 36 novel views with `data_type=GSO3D`:
```commandline
bash ./eval_eschernet.sh
```
Then we adopt [NeuS](https://github.com/Totoro97/NeuS) for 3D reconstruction:
```commandline
export CUDA_HOME=/usr/local/cuda-11.8
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
cd 3drecon
python run_NeuS.py
```
Evaluate the 3D metrics (Chamfer Distance, IoU):
```commandline
cd metrics
python eval_3D_GSO.py
```
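For reference, Chamfer Distance is typically computed between point samples drawn from the reconstructed and ground-truth surfaces. A minimal sketch with `scipy` (illustrative only; `eval_3D_GSO.py` defines the exact point sampling, scaling, squared/unsquared convention, and the voxelisation used for IoU):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two (N, 3) point clouds."""
    d_ab = cKDTree(points_b).query(points_a)[0]   # nearest-neighbour distances A -> B
    d_ba = cKDTree(points_a).query(points_b)[0]   # nearest-neighbour distances B -> A
    return d_ab.mean() + d_ba.mean()

a = np.random.rand(2048, 3)
b = np.random.rand(2048, 3)
print(chamfer_distance(a, b))
```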
## Gradio Demo
TODO.

To build locally:
```commandline
python gradio_eschernet.py
```
## Acknowledgement
We have extensively borrowed code from the following repositories. Many thanks to the authors for sharing their code.
- [Zero-1-to-3](https://github.com/cvlab-columbia/zero123)
- [SyncDreamer](https://github.com/liuyuan-pal/SyncDreamer)
- [MVDream](https://github.com/bytedance/MVDream)
- [NeuS](https://github.com/Totoro97/NeuS)
## Citation
If you find this work useful, please consider citing:
```
@article{kong2024eschernet,
  title={EscherNet: A Generative Model for Scalable View Synthesis},
  author={Kong, Xin and Liu, Shikun and Lyu, Xiaoyang and Taher, Marwan and Qi, Xiaojuan and Davison, Andrew J},
  journal={arXiv preprint arXiv:2402.03908},
  year={2024}
}
```