---

license: apache-2.0

base_model:

- google/siglip-so400m-patch14-384

pipeline_tag: image-classification

---
|
# Oryx-ViT |
|
|
|
## Model Summary |
|
|
|
The Oryx-ViT model is trained on a mixture of 200M samples and can seamlessly and efficiently process visual inputs of arbitrary spatial resolution and temporal length.
|
|
|
- **Repository:** https://github.com/Oryx-mllm/Oryx |
|
- **Languages:** English, Chinese |
|
- **Paper:** https://arxiv.org/abs/2409.12961 |
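
The arbitrary-resolution property described above comes down to patch tokenization: images of different sizes simply produce token sequences of different lengths, which can then be packed into one batch. Below is a minimal NumPy sketch of that idea; the patch size 14 matches the SigLIP-so400m base model, but the `patchify` helper is illustrative only and is not the released Oryx code.

```python
import numpy as np

PATCH = 14  # SigLIP-so400m uses 14x14 patches


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) token matrix.

    H and W must be multiples of `patch`; a real pipeline would pad or resize first.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "dimensions must be patch-aligned"
    return (
        image
        .reshape(h // patch, patch, w // patch, patch, c)  # split both axes into patches
        .transpose(0, 2, 1, 3, 4)                          # group the two patch-grid axes
        .reshape(-1, patch * patch * c)                    # one flattened token per patch
    )


# Two inputs with different spatial sizes yield different token counts,
# so no fixed input resolution is baked into the tokenizer.
small = patchify(np.zeros((224, 224, 3)))  # 16*16 = 256 tokens
large = patchify(np.zeros((448, 336, 3)))  # 32*24 = 768 tokens
```

Video extends the same scheme along time: each frame is patchified independently, so a clip of T frames contributes T times the per-frame token count.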
|
|
|
|
|
### Model Architecture |
|
|
|
- **Architecture:** SigLIP

- **Data:** a mixture of 200M samples, trained for 2 epochs

- **Precision:** BFloat16
|
|
|
#### Hardware & Software |
|
|
|
- **Hardware:** 64 × NVIDIA A100 GPUs

- **Orchestration:** Hugging Face Trainer

- **Code:** PyTorch
|
|
|
## Citation |