metadata
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
- HuggingFaceM4/WebSight
Model Card for Llama-3.2-11B-Vision-WebSight
Llama 3.2 Vision Instruct fine-tuned on 10k samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight to generate web page code from a screenshot.
Model Details
Model Description
- Developed by: pdufour
- Model type: Vision Language Model
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct
How to Get Started with the Model
from transformers import AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
import torch

# Load the base model in 4-bit precision, then attach the LoRA adapter.
base_model = AutoModelForVision2Seq.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, "pdufour/Llama-3.2-11B-Vision-WebSight")
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# Encode the prompt together with the screenshot.
inputs = processor(
    text="Generate code for a web page that looks exactly like this. <|image|>",
    images=Image.open("fashion.jpg"),
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    # Pass every processor output (pixel values, masks), not only input_ids,
    # so that the image actually reaches the model.
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
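The decoded string contains the prompt followed by the generated markup, so it usually needs a small cleanup step before it can be saved as a page. The helper below is a minimal sketch of that step. `extract_html` is a hypothetical name, not part of this model or of any library, and it assumes the generation contains a complete `<html>...</html>` document.

```python
import re


def extract_html(decoded: str) -> str:
    """Return the first complete HTML document found in the decoded text.

    Hypothetical helper: falls back to the stripped input when no
    <html>...</html> document is detected.
    """
    pattern = r"(<!DOCTYPE html.*?</html>|<html.*?</html>)"
    match = re.search(pattern, decoded, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else decoded.strip()


# Example with a made-up generation: the prompt text is dropped, the page is kept.
sample = (
    "Generate code for a web page that looks exactly like this. "
    "<html><body><h1>Shop</h1></body></html>"
)
page = extract_html(sample)
```

In the snippet above, the same call could be applied to the result of `tokenizer.decode(...)` before writing it to an `.html` file.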
Training Details
Training Data
10k samples from HuggingFaceM4/WebSight, a synthetic dataset that pairs website screenshots with the HTML code that produces them.
Training Procedure
Training Hyperparameters
- Training regime: Fine-tuning with LoRA
- Learning rate: 0.0002
- Batch size: 10
- Gradient accumulation steps: 1
- Number of epochs: 3.0
- Optimizer: adamw_torch_fused
- LR scheduler type: constant
- Weight decay: 0.0
- FP16 Training: False
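For anyone reproducing the run, the values listed above can be written as keyword arguments using the parameter names of `transformers.TrainingArguments`. This is a sketch, not the author's training script: it assumes the listed batch size is the per-device batch size, and the LoRA settings (rank, alpha, target modules) are not given in this card, so they are left out.

```python
# Hyperparameters from this card, keyed by transformers.TrainingArguments names.
training_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 10,  # assumption: "Batch size" is per device
    "gradient_accumulation_steps": 1,
    "num_train_epochs": 3.0,
    "optim": "adamw_torch_fused",
    "lr_scheduler_type": "constant",
    "weight_decay": 0.0,
    "fp16": False,
}

NUM_GPUS = 1  # from the Compute Infrastructure section of this card


def effective_batch_size(kwargs: dict, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_gpus
    )


print(effective_batch_size(training_kwargs, NUM_GPUS))  # 10
```

With a single GPU and no gradient accumulation, each optimizer step sees 10 samples.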
Speeds, Sizes, Times
- Training Duration: Unknown
- Number of Trainable Parameters: Unknown
- Model Size: 0.08 GB
Evaluation
Metrics
Results
- epoch: 0.9000
- grad_norm: 0.2568
- learning_rate: 0.0002
- loss: 0.0791
- step: 900
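The logged step and epoch can be cross-checked against the batch size and dataset size stated elsewhere in this card. The sketch below assumes "10k samples" means exactly 10,000 and that no partial batch is dropped. Under those assumptions the figures are consistent, and they describe a point 90% of the way through the first epoch rather than the end of a 3-epoch schedule.

```python
DATASET_SIZE = 10_000  # assumption: "10k samples" means exactly 10,000
BATCH_SIZE = 10        # from Training Hyperparameters
NUM_EPOCHS = 3.0       # from Training Hyperparameters


def epoch_at_step(step: int, batch_size: int, dataset_size: int) -> float:
    """Fraction of epochs completed after `step` optimizer steps."""
    return step * batch_size / dataset_size


def total_steps(num_epochs: float, batch_size: int, dataset_size: int) -> int:
    """Optimizer steps needed to finish the full schedule."""
    return int(num_epochs * dataset_size / batch_size)


logged_epoch = epoch_at_step(900, BATCH_SIZE, DATASET_SIZE)
planned_steps = total_steps(NUM_EPOCHS, BATCH_SIZE, DATASET_SIZE)
print(logged_epoch, planned_steps)  # 0.9 3000
```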
Technical Specifications
Model Architecture and Objective
LoRA-tuned Vision-Language Model based on Llama architecture.
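For reference, the standard LoRA formulation keeps each adapted weight matrix $W_0$ frozen and trains only a low-rank update. The rank $r$ and scaling factor $\alpha$ used for this checkpoint are not stated in this card, so the symbols below are generic rather than specific to this model.

```latex
h = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Only $A$ and $B$ are stored in the adapter, which is why the published weights are small compared with the 11B-parameter base model.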
Compute Infrastructure
- Hardware Type: GPU
- Number of GPUs: 1
Software
- Framework versions:
- PEFT 0.13.2
- PyTorch 2.5.0+cu121
Model Card Contact
For questions about this model, please file an issue on the GitHub repository.