library_name: transformers
---

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/j3fmOGSkUQ7jfUIbqJu3e.gif)

# Model Card for SpaceLLaVA

**SpaceLlama3.1-hf** uses [llama3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) as the LLM backbone, together with the fused DINOv2 + SigLIP visual features of [prismatic-vlms](https://github.com/TRI-ML/prismatic-vlms).
The prismatic-vlms checkpoint was converted to a Hugging Face model with [OpenVLA](https://github.com/openvla/openvla#converting-prismatic-models-to-hugging-face).

## Model Details

SpaceLlama3.1-hf is a full fine-tune on the [spacellava dataset](https://huggingface.co/datasets/remyxai/vqasynth_spacellava), built with [VQASynth](https://github.com/remyxai/VQASynth/tree/main) to enhance spatial reasoning as in [SpatialVLM](https://spatial-vlm.github.io/).
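
For a quick look at the training data, the dataset can be pulled with the standard `datasets` API. This is a minimal sketch; confirm the available splits and column names on the dataset page.

```python
# Minimal sketch for inspecting the spacellava training data.
# Assumes the dataset loads with the standard `datasets` API; the "train"
# split name is an assumption to verify on the dataset page.
from datasets import load_dataset

ds = load_dataset("remyxai/vqasynth_spacellava", split="train")
print(ds)      # column names and number of rows
print(ds[0])   # one synthesized spatial-reasoning VQA example
```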

### Model Description

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM to enhance the spatial reasoning of multimodal models.
With a pipeline of expert models, we can infer spatial relationships between objects in a scene to create a VQA dataset for spatial reasoning.

- **Developed by:** remyx.ai
- **Model type:** Multimodal Vision-Language Model (Prismatic VLMs, Llama 3.1)
- **Finetuned from model:** Llama 3.1

### Model Sources
- **Dataset:** [SpaceLLaVA](https://huggingface.co/datasets/remyxai/vqasynth_spacellava)
- **Repository:** [VQASynth](https://github.com/remyxai/VQASynth/tree/main)
- **Paper:** [SpatialVLM](https://arxiv.org/abs/2401.12168)

## Usage

Try the `run_inference.py` script for a quick test:
```bash
python run_inference.py --model_location remyxai/SpaceLlama3.1 \
    --image_source "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" \
    --user_prompt "What is the distance between the man in the red hat and the pallet of boxes?"
```
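
Because the checkpoint was converted with OpenVLA's prismatic-to-Hugging Face tooling, it should also load directly through `transformers` with remote code enabled. The snippet below is a minimal sketch of that path; the exact processor call and prompt template may differ from what `run_inference.py` does, so treat it as illustrative rather than the reference implementation.

```python
# Hedged sketch: assumes the converted checkpoint exposes the standard
# AutoProcessor / AutoModelForVision2Seq interface via trust_remote_code.
import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "remyxai/SpaceLlama3.1"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda:0")

url = "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "What is the distance between the man in the red hat and the pallet of boxes?"

# OpenVLA-style prismatic processors take (text, image); check run_inference.py
# for the exact prompt formatting used during fine-tuning.
inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```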

## Deploy

Under the `docker` directory, you'll find a dockerized Triton Inference Server for this model. Build the image, start the server, and query it with the client:

```bash
docker build -f Dockerfile -t spacellama3.1-server:latest .
docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 24G spacellama3.1-server:latest
python3 client.py --image_path "https://remyx.ai/assets/spatialvlm/warehouse_rgb.jpg" \
    --prompt "What is the distance between the man in the red hat and the pallet of boxes?"
```
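
Before sending requests from `client.py`, you can confirm the server is up. This is a sketch assuming the container exposes Triton's standard KServe v2 HTTP endpoint on port 8000, as mapped in the `docker run` command above.

```python
# Hedged sketch: probes Triton's standard readiness endpoint on the HTTP port
# (8000) published by the docker run command above.
import requests

resp = requests.get("http://localhost:8000/v2/health/ready", timeout=5)
print("Triton ready:", resp.status_code == 200)
```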

## Citation

```bibtex
@article{chen2024spatialvlm,
  title   = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author  = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year    = {2024},
  url     = {https://arxiv.org/abs/2401.12168},
}

@inproceedings{karamcheti2024prismatic,
  title     = {Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models},
  author    = {Siddharth Karamcheti and Suraj Nair and Ashwin Balakrishna and Percy Liang and Thomas Kollar and Dorsa Sadigh},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2024},
}

@article{kim24openvla,
  title   = {OpenVLA: An Open-Source Vision-Language-Action Model},
  author  = {{Moo Jin} Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn},
  journal = {arXiv preprint arXiv:2406.09246},
  year    = {2024},
}
```