Aekanun committed on
Commit 0ab0448
1 Parent(s): 7914fe2

Update README.md

Files changed (1):
  1. README.md +40 -13
README.md CHANGED
@@ -6,32 +6,49 @@ tags:
 - handwriting-recognition
 - vision-language
 - fine-tuned
+- vision
 datasets:
 - iapp/thai_handwriting_dataset
 language:
 - th
+pipeline_tag: image-to-text
 ---
 
-# Thai Handwriting Recognition Model
+# Thai Handwriting Recognition Vision-Language Model
 
-This is a LoRA-based fine-tuned model based on Llama-3.2-11B-Vision-Instruct for Thai handwriting recognition.
+A LoRA-adapted vision-language model based on Llama-3.2-11B-Vision-Instruct that transcribes Thai handwritten text from images.
 
-## Model Details
+## Model Architecture
 
-- Fine-tuned on iapp/thai_handwriting_dataset
-- Uses LoRA (r=8, alpha=16) targeting q_proj and v_proj attention layers
-- Trained for 3 epochs with batch size 4 and gradient accumulation 8
-- Learning rate 2e-4 with constant schedule and 3% warmup
+- Base: Llama-3.2-11B-Vision-Instruct
+- Adaptation: LoRA (r=8, alpha=16)
+- Target: q_proj, v_proj attention layers
+- Training: 3 epochs, batch size 4, gradient accumulation 8, lr 2e-4
 
-## Usage
+## Inference
 
 ```python
-from transformers import AutoModelForVision2Seq, AutoProcessor
+from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
 from PIL import Image
 
 model_path = "Aekanun/thai-handwriting-llm"
+
+# BitsAndBytes config for efficient inference
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_use_double_quant=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16
+)
+
+# Load processor and model
 processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")
-model = AutoModelForVision2Seq.from_pretrained(model_path)
+model = AutoModelForVision2Seq.from_pretrained(
+    model_path,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+    quantization_config=bnb_config
+)
 
 # Prepare input
 image = Image.open("handwriting.jpg")
@@ -49,7 +66,17 @@ messages = [
 ]
 
 # Generate text
-inputs = processor(text=processor.apply_chat_template(messages, tokenize=False),
-                   images=image, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=256)
+inputs = processor(
+    text=processor.apply_chat_template(messages, tokenize=False),
+    images=image,
+    return_tensors="pt"
+)
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=256,
+    do_sample=False,
+    pad_token_id=processor.tokenizer.pad_token_id
+)
 text = processor.decode(outputs[0], skip_special_tokens=True)
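For reference, the hyperparameters listed under "Model Architecture" (plus the constant schedule and 3% warmup noted in the removed "Model Details" list) map directly onto a PEFT adapter setup. A minimal sketch, assuming the adapter was trained with the peft library and a transformers-style trainer; the training script is not part of this commit, and `output_dir` is a hypothetical path:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Adapter config as stated in the README: r=8, alpha=16,
# targeting the q_proj and v_proj attention layers.
# task_type is an assumption for a causal vision-language model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Training recipe as stated: 3 epochs, batch size 4, gradient
# accumulation 8, lr 2e-4, constant schedule with 3% warmup.
training_args = TrainingArguments(
    output_dir="thai-handwriting-lora",  # hypothetical
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.03,
)
```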
 
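One caveat in the updated snippet: it passes `torch.bfloat16` to both `BitsAndBytesConfig` and `from_pretrained` but never imports torch, so it raises a `NameError` as written. A runnable version needs the import alongside the others:

```python
import torch  # required: torch.bfloat16 is referenced in the config above
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
from PIL import Image
```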
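The `messages` list itself is untouched by this commit, so the diff shows only its closing bracket. For Llama-3.2-Vision processors it typically pairs an image placeholder with a text instruction; a sketch with illustrative prompt wording (the repository's actual prompt is not shown in the diff):

```python
# Illustrative structure only; the README's real prompt text is not in this diff.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Transcribe the Thai handwriting in this image."},
        ],
    }
]
```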