---
title: EscherNet
emoji: 📸📸➡️🖼️🖼️🖼️
app_file: app.py
sdk: gradio
sdk_version: 4.31.0
---
[comment]: <> (# EscherNet: A Generative Model for Scalable View Synthesis)
<!-- PROJECT LOGO -->
<p align="center">
<h1 align="center">EscherNet: A Generative Model for Scalable View Synthesis</h1>
<p align="center">
<a href="https://kxhit.github.io"><strong>Xin Kong</strong></a>
·
<a href="https://shikun.io"><strong>Shikun Liu</strong></a>
·
<a href="https://shawlyu.github.io/"><strong>Xiaoyang Lyu</strong></a>
·
<a href="https://marwan99.github.io/"><strong>Marwan Taher</strong></a>
·
<a href="https://xjqi.github.io/"><strong>Xiaojuan Qi</strong></a>
·
<a href="https://www.doc.ic.ac.uk/~ajd/"><strong>Andrew J. Davison</strong></a>
</p>
[comment]: <> ( <h2 align="center">PAPER</h2>)
<h3 align="center"><a href="https://arxiv.org/abs/2402.03908">Paper</a> | <a href="https://kxhit.github.io/EscherNet">Project Page</a></h3>
<div align="center"></div>
<p align="center">
<a href="">
<img src="./scripts/teaser.png" alt="Logo" width="80%">
</a>
</p>
<p align="center">
EscherNet is a <strong>multi-view conditioned</strong> diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with the <strong>camera positional encoding (CaPE)</strong>, allowing precise and continuous relative control of the camera transformation between an <strong>arbitrary number of reference and target views</strong>.
</p>
<br>
## Install
```
conda env create -f environment.yml -n eschernet
conda activate eschernet
```
## Demo
Run the demo to generate 25 randomly sampled novel views from 1, 2, 3, 5, or 10 reference views:
```commandline
bash eval_eschernet.sh
```
## Camera Positional Encoding (CaPE)
CaPE is applied in self-attention and cross-attention to encode camera pose information into the transformer layers. The main modification is in `diffusers/models/attention_processor.py`.
To quickly check the implementation of CaPE (6DoF and 4DoF), run:
```
python CaPE.py
```
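The sketch below illustrates the idea behind pose-aware attention in NumPy. It is a simplified illustration, not the code in `CaPE.py`: it assumes a 6DoF-style encoding in which every 4-dimensional chunk of a key is multiplied by the 4x4 camera pose and every chunk of a query by the inverse-transposed pose, so the attention logits depend only on the relative transformation between views. The helper names (`random_pose`, `encode`, `attention_logits`) are hypothetical.

```python
import numpy as np


def random_pose(rng):
    """Random 4x4 rigid transform: rotation from a QR decomposition plus a translation."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # flip one axis so the matrix is a proper rotation
    pose = np.eye(4)
    pose[:3, :3] = q
    pose[:3, 3] = rng.normal(size=3)
    return pose


def encode(tokens, mat):
    """Multiply every consecutive 4-dim chunk of each token by the 4x4 matrix `mat`."""
    n, d = tokens.shape
    chunks = tokens.reshape(n, d // 4, 4)
    return (chunks @ mat.T).reshape(n, d)


def attention_logits(q, k, pose_q, pose_k):
    """Dot-product logits after encoding queries with P^-T and keys with P."""
    q_enc = encode(q, np.linalg.inv(pose_q).T)
    k_enc = encode(k, pose_k)
    return q_enc @ k_enc.T


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))  # 5 query tokens, feature dim 8 (must be divisible by 4)
k = rng.normal(size=(7, 8))  # 7 key tokens
pose_q, pose_k, world = (random_pose(rng) for _ in range(3))

logits = attention_logits(q, k, pose_q, pose_k)
# Moving both cameras by the same global transform leaves the logits unchanged.
logits_moved = attention_logits(q, k, world @ pose_q, world @ pose_k)
print(np.allclose(logits, logits_moved))  # True
```

Because the logits are unchanged when both cameras move together, no canonical world frame is needed in this sketch; only the relative pose between a reference view and a target view matters.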
## Training
### Objaverse 1.0 Dataset
Download Zero123's Objaverse Rendering data:
```commandline
wget https://tri-ml-public.s3.amazonaws.com/datasets/views_release.tar.gz
```
Filter out empty images from the Zero-1-to-3 rendered views:
```commandline
cd scripts
python objaverse_filter.py --path /data/objaverse/views_release
```
### Launch training
Configure Accelerate (we use 8 A100 GPUs with bf16):
```commandline
accelerate config
```
Choose 4DoF or 6DoF CaPE (Camera Positional Encoding):
```commandline
cd 4DoF  # or: cd 6DoF
```
Launch training:
```commandline
accelerate launch train_eschernet.py \
  --train_data_dir /data/objaverse/views_release \
  --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
  --train_batch_size 256 \
  --dataloader_num_workers 16 \
  --mixed_precision bf16 \
  --gradient_checkpointing \
  --T_in 3 --T_out 3 --T_in_val 10 \
  --output_dir logs_N3M3B256_SD1.5 \
  --push_to_hub --hub_model_id ***** --hub_token hf_******************* \
  --tracker_project_name eschernet
```
To monitor training progress, we recommend [wandb](https://wandb.ai/site) for its simplicity and powerful features.
```commandline
wandb login
```
Offline mode:
```commandline
WANDB_MODE=offline python xxx.py
```
## Evaluation
We provide [raw results](https://huggingface.co/datasets/kxic/EscherNet-Results) and two checkpoints [4DoF](https://huggingface.co/kxic/eschernet-4dof) and [6DoF](https://huggingface.co/kxic/eschernet-6dof) for easier comparison.
### Datasets
##### [GSO Google Scanned Objects](https://app.gazebosim.org/GoogleResearch/fuel/collections/Scanned%20Objects%20by%20Google%20Research)
[GSO30](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main): We select 30 objects from the GSO dataset and render 25 randomly sampled novel views per object, used for both NVS and 3D reconstruction evaluation.
##### [RTMV](https://drive.google.com/drive/folders/1cUXxUp6g25WwzHnm_491zNJJ4T7R_fum)
We use the 10 scenes from `google_scanned.tar` under the `40_scenes` folder for NVS evaluation.
##### [NeRF_Synthetic](https://drive.google.com/drive/folders/1JDdLGDruGNXWnM1eqY1FNL9PlStjaKWi)
We use all 8 NeRF objects for 2D NVS evaluation.
##### [Franka16](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main)
We collected 16 real-world object-centric recordings using a Franka Emika Panda robot arm with a RealSense D435i camera for real-world NVS evaluation.
##### [Text2Img](https://huggingface.co/datasets/kxic/EscherNet-Dataset/tree/main)
We collected Text2Img generation results from the internet, [Stable Diffusion XL](https://github.com/Stability-AI/generative-models) (1 view), and [MVDream](https://github.com/bytedance/MVDream) (4 views: front, right, back, left) for NVS evaluation.
### Novel View Synthesis (NVS)
To get 2D Novel View Synthesis (NVS) results, set `cape_type`, `checkpoint`, `data_type`, and `data_dir`, then run:
```commandline
bash ./eval_eschernet.sh
```
Evaluate 2D metrics (PSNR, SSIM, LPIPS):
```commandline
cd metrics
python eval_2D_NVS.py
```
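For reference, PSNR can be computed as in the minimal sketch below. This is the standard definition and not necessarily what `eval_2D_NVS.py` does: the script may use a different data range, masking, or averaging. The function name `psnr` and the `data_range` parameter are assumptions made for illustration. SSIM and LPIPS need additional libraries and are not shown.

```python
import numpy as np


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)


# Toy example: a constant error of 0.1 on images in [0, 1] gives MSE = 0.01.
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)
print(round(psnr(pred, target), 2))  # 20.0
```

Higher PSNR means the generated view is closer to the ground-truth view.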
### 3D Reconstruction
We first generate 36 novel views with `data_type=GSO3D`:
```commandline
bash ./eval_eschernet.sh
```
Then we adopt [NeuS](https://github.com/Totoro97/NeuS) for 3D reconstruction:
```commandline
export CUDA_HOME=/usr/local/cuda-11.8
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
cd 3drecon
python run_NeuS.py
```
Evaluate 3D metrics (Chamfer Distance, IoU):
```commandline
cd metrics
python eval_3D_GSO.py
```
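The two 3D metrics can be sketched as follows. This is a brute-force illustration under assumed conventions (symmetric, unsquared, mean-reduced Chamfer Distance on sampled surface points; IoU on boolean occupancy grids); `eval_3D_GSO.py` may sample, normalise, or reduce differently. The function names `chamfer_distance` and `volume_iou` are hypothetical.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # Pairwise Euclidean distances, shape (N, M). Memory grows as N * M.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())


def volume_iou(occ_a, occ_b):
    """Intersection over union of two boolean occupancy grids of the same shape."""
    occ_a = np.asarray(occ_a, dtype=bool)
    occ_b = np.asarray(occ_b, dtype=bool)
    union = np.logical_or(occ_a, occ_b).sum()
    if union == 0:
        return 1.0  # both grids empty: treat as a perfect match
    return float(np.logical_and(occ_a, occ_b).sum() / union)


# Toy example: the second point set is the first one shifted by 1 along z.
points_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points_b = points_a + np.array([0.0, 0.0, 1.0])
print(chamfer_distance(points_a, points_b))  # 1.0

# One shared voxel out of three occupied voxels in total.
print(volume_iou([1, 1, 0, 0], [0, 1, 1, 0]))
```

Lower Chamfer Distance and higher IoU indicate a reconstruction closer to the ground-truth shape.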
## Gradio Demo
TODO.
To build locally:
```commandline
python gradio_eschernet.py
```
## Acknowledgement
We have borrowed code extensively from the following repositories. Many thanks to the authors for sharing their code.
- [Zero-1-to-3](https://github.com/cvlab-columbia/zero123)
- [SyncDreamer](https://github.com/liuyuan-pal/SyncDreamer)
- [MVDream](https://github.com/bytedance/MVDream)
- [NeuS](https://github.com/Totoro97/NeuS)
## Citation
If you find this work useful, please consider citing:
```
@article{kong2024eschernet,
title={EscherNet: A Generative Model for Scalable View Synthesis},
author={Kong, Xin and Liu, Shikun and Lyu, Xiaoyang and Taher, Marwan and Qi, Xiaojuan and Davison, Andrew J},
journal={arXiv preprint arXiv:2402.03908},
year={2024}
}
```