Add link to paper, update pipeline tag (#3)
Add link to paper, update pipeline tag (541d3a5214c1507f840dcd961361be06d6939a75)
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED

@@ -4,7 +4,7 @@ language:
 - en
 base_model:
 - meta-llama/Llama-3.2-11B-Vision-Instruct
-pipeline_tag:
+pipeline_tag: image-text-to-text
 library_name: transformers
 ---
 # Model Card for Model ID
@@ -13,6 +13,8 @@ library_name: transformers
 
 Llama-3.2V-11B-cot is the first version of [LLaVA-o1](https://github.com/PKU-YuanGroup/LLaVA-o1), which is a visual language model capable of spontaneous, systematic reasoning.
 
+The model was proposed in [LLaVA-o1: Let Vision Language Models Reason Step-by-Step](https://huggingface.co/papers/2411.10440).
+
 ## Model Details
 
 <!-- Provide a longer summary of what this model is. -->
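Setting `pipeline_tag: image-text-to-text` means the checkpoint can be loaded through the Transformers `image-text-to-text` pipeline. A minimal sketch of what that might look like, assuming a recent Transformers release that ships this pipeline; the repo id below is a placeholder for this checkpoint, not taken from the diff:

```python
from transformers import pipeline

# Placeholder repo id for the Llama-3.2V-11B-cot checkpoint; substitute the actual Hub path.
pipe = pipeline("image-text-to-text", model="<org>/Llama-3.2V-11B-cot")

# Chat-style input: one user turn containing an image URL and a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image step by step."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=256, return_full_text=False)
print(outputs[0]["generated_text"])
```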