---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
- HuggingFaceM4/WebSight
---
# Model Card for Llama-3.2-11B-Vision-WebSight
Llama 3.2 Vision Instruct fine-tuned on 10k samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight.
## Model Details
### Model Description
* **Developed by:** pdufour
* **Model type:** Vision Language Model
* **Language(s) (NLP):** English
* **License:** MIT
* **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
## How to Get Started with the Model
```python
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch
# Load the 4-bit quantized base model and attach the LoRA adapter
model = PeftModel.from_pretrained(
    AutoModelForVision2Seq.from_pretrained(
        "meta-llama/Llama-3.2-11B-Vision-Instruct",
        device_map="auto",
        load_in_4bit=True,
    ),
    "pdufour/Llama-3.2-11B-Vision-WebSight",
)
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# Build the image-plus-prompt inputs
inputs = processor(
    text="Generate code for a web page that looks exactly like this. <|image|>",
    images=Image.open("fashion.jpg"),
    return_tensors="pt",
).to(model.device)

# Pass all processor outputs (input_ids, pixel values, masks) to generate
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
10k samples from HuggingFaceM4/WebSight, a vision-language dataset of website screenshots paired with their source code, used for instruction tuning.
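The exact data selection and preprocessing are not documented in this card. As a rough sketch, a 10k-sample subset can be pulled from the Hub with `datasets`; the split slice below is an assumption, not necessarily the subset actually used:

```python
from datasets import load_dataset

# Assumption: the first 10,000 training samples; the card does not state
# which subset or preprocessing was actually used for fine-tuning.
websight = load_dataset("HuggingFaceM4/WebSight", split="train[:10000]")
print(websight[0].keys())  # inspect the fields of a single screenshot/code pair
```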
### Training Procedure
#### Training Hyperparameters
* **Training regime:** Fine-tuning with LoRA (a configuration sketch follows this list)
* **Learning rate:** 0.0002
* **Batch size:** 10
* **Gradient accumulation steps:** 1
* **Number of epochs:** 3.0
* **Optimizer:** adamw_torch_fused
* **LR scheduler type:** constant
* **Weight decay:** 0.0
* **FP16 Training:** False
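For illustration only, the hyperparameters above map onto a `LoraConfig`/`TrainingArguments` pair roughly as follows; the LoRA rank, alpha, and target modules are assumptions, since the card does not record them:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings: rank, alpha, and target modules are placeholders,
# not values documented by this card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Trainer settings mirroring the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="llama-3.2-11b-vision-websight",
    learning_rate=2e-4,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=1,
    num_train_epochs=3.0,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    weight_decay=0.0,
    fp16=False,
)
```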
### Speeds, Sizes, Times
* **Training Duration:** Not recorded
* **Number of Parameters:** Trainable parameter count not recorded (see the snippet after this list)
* **Model Size:** 0.08 GB
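The missing trainable-parameter count can be read off directly once the adapter is loaded, reusing `model` from the usage example above:

```python
# Assumes `model` is the PeftModel built in the usage example above.
model.print_trainable_parameters()
# Prints something like: trainable params: ... || all params: ... || trainable%: ...
```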
## Evaluation
### Metrics
#### Results
Training-log metrics at the last recorded step:
* **Epoch:** 0.9
* **Gradient norm:** 0.2568
* **Learning rate:** 0.0002
* **Loss:** 0.0791
* **Step:** 900
## Technical Specifications
### Model Architecture and Objective
LoRA-tuned vision-language model based on the Llama 3.2 Vision architecture; the objective is to generate web-page code from a screenshot, following the WebSight task format.
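If an adapter-free checkpoint is needed for deployment, the LoRA weights can be merged into the base model. This is a minimal sketch assuming the same model IDs as the usage example and enough memory to load the base model in full precision:

```python
from transformers import AutoModelForVision2Seq
from peft import PeftModel

# Merge the LoRA adapter into the base weights and save a standalone checkpoint.
base = AutoModelForVision2Seq.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
merged = PeftModel.from_pretrained(base, "pdufour/Llama-3.2-11B-Vision-WebSight").merge_and_unload()
merged.save_pretrained("llama-3.2-11b-vision-websight-merged")
```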
### Compute Infrastructure
* **Hardware Type:** GPU
* **Number of GPUs:** 1
### Software
* **Framework versions:**
* PEFT 0.13.2
* PyTorch 2.5.0+cu121
## Model Card Contact
For questions about this model, please file an issue on the GitHub repository.