---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
- HuggingFaceM4/WebSight
---

# Model Card for Llama-3.2-11B-Vision-WebSight

Llama 3.2 11B Vision Instruct fine-tuned with LoRA on 10,000 samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight.

## Model Details

### Model Description
* **Developed by:** pdufour
* **Model type:** Vision Language Model
* **Language(s) (NLP):** English
* **License:** MIT
* **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

## How to Get Started with the Model
```python
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load the 4-bit quantized base model and apply the WebSight LoRA adapter on top
model = PeftModel.from_pretrained(
    AutoModelForVision2Seq.from_pretrained(
        "meta-llama/Llama-3.2-11B-Vision-Instruct",
        device_map="auto",
        load_in_4bit=True,
    ),
    "pdufour/Llama-3.2-11B-Vision-WebSight",
)
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# The processor returns input_ids, attention_mask, pixel_values and related tensors
inputs = processor(
    text="Generate code for a web page that looks exactly like this. <|image|>",
    images=Image.open("fashion.jpg"),
    return_tensors="pt",
).to(model.device)

# Pass all processor outputs to generate so the image (pixel_values) is actually used
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
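
On recent `transformers` releases, `load_in_4bit=True` is deprecated in favor of an explicit `BitsAndBytesConfig`. A minimal sketch of the equivalent base-model load (assuming `bitsandbytes` is installed; the compute dtype is an illustrative choice, not taken from this card):

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# Explicit 4-bit quantization config; bfloat16 compute dtype is an assumption
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

base_model = AutoModelForVision2Seq.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto",
    quantization_config=bnb_config,
)
```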

## Training Details

### Training Data
10,000 screenshot–code pairs from the [HuggingFaceM4/WebSight](https://huggingface.co/datasets/HuggingFaceM4/WebSight) vision-language dataset, used for instruction tuning.
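
A minimal sketch of pulling a comparable 10k-sample subset with the `datasets` library; the `"v0.2"` config name and the streaming-subset approach are assumptions, since the card does not state how the subset was drawn:

```python
from datasets import load_dataset

# Stream WebSight and take the first 10,000 examples; the "v0.2" config name is an assumption
websight = load_dataset("HuggingFaceM4/WebSight", "v0.2", split="train", streaming=True)
subset = websight.take(10_000)

# Inspect one example to see the available columns (screenshot plus corresponding code)
first = next(iter(subset))
print(first.keys())
```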

### Training Procedure

#### Training Hyperparameters
* **Training regime:** Fine-tuning with LoRA
* **Learning rate:** 0.0002
* **Batch size:** 10
* **Gradient accumulation steps:** 1
* **Number of epochs:** 3.0
* **Optimizer:** adamw_torch_fused
* **LR scheduler type:** constant
* **Weight decay:** 0.0
* **FP16 Training:** False
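
A hedged sketch of how these hyperparameters map onto a `peft`/`transformers` training setup. The LoRA rank, alpha, dropout, and target modules are not reported in this card, so those values below are placeholders rather than the settings actually used:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings: r / alpha / dropout / target_modules are NOT given in this card (placeholders)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Hyperparameters listed above
training_args = TrainingArguments(
    output_dir="llama-3.2-11b-vision-websight",  # placeholder output path
    learning_rate=2e-4,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=1,
    num_train_epochs=3.0,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    weight_decay=0.0,
    fp16=False,
)
```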

### Speeds, Sizes, Times
* **Training Duration:** Not recorded
* **Number of Parameters:** Trainable-parameter count not recorded (it can be recovered from the adapter; see the sketch after this list)
* **Model Size:** 0.08 GB (adapter weights)
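
PEFT can report the trainable-parameter count directly from the loaded adapter, using the `model` object from the loading snippet above:

```python
# Report trainable vs. total parameters for the loaded PeftModel
model.print_trainable_parameters()
# Prints something like: trainable params: ... || all params: ... || trainable%: ...
```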

## Evaluation

### Metrics

#### Results
* **epoch:** 0.9
* **grad_norm:** 0.2568
* **learning_rate:** 0.0002
* **loss:** 0.0791
* **step:** 900

## Technical Specifications

### Model Architecture and Objective
LoRA-tuned Vision-Language Model based on Llama architecture.
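
A small sketch of inspecting the adapter configuration, plus how the adapter could be folded into a full-precision base model for standalone use (`merge_and_unload` generally requires the base weights not to be 4-bit quantized):

```python
from peft import PeftConfig

# Read the adapter's configuration (base model name, LoRA rank, target modules, ...)
adapter_cfg = PeftConfig.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
print(adapter_cfg.base_model_name_or_path)  # meta-llama/Llama-3.2-11B-Vision-Instruct

# With a full-precision base model loaded, the adapter can be merged into the weights:
# merged = model.merge_and_unload()
# merged.save_pretrained("llama-3.2-11b-vision-websight-merged")  # placeholder path
```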

### Compute Infrastructure
* **Hardware Type:** GPU
* **Number of GPUs:** 1

### Software
* **Framework versions:**
  * PEFT 0.13.2
  * PyTorch 2.5.0+cu121

## Model Card Contact
For questions about this model, please file an issue on the GitHub repository.