yasserTII committed on
Commit
a49ddbc
•
1 Parent(s): 7069d40

Create README.md

  1. README.md +116 -0
README.md ADDED
@@ -0,0 +1,116 @@
1
+ ---
2
+ language:
3
+ - en
4
+ inference: false
5
+ license: unknown
6
+ ---
7
+
8
+ # 🚀 Falcon2-11B-vlm
9
+
10
+ **Falcon2-11B-vlm is an 11B-parameter causal decoder-only model built by [TII](https://www.tii.ae) and trained on over 5,000B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. To add vision capabilities, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train on image-text data.
11
+ To improve the VLM's perception of fine-grained details, particularly small objects in images, we employ a dynamic high-resolution encoding mechanism for image inputs. The model is made available under the [TII Falcon License 2.0](https://falconllm-staging.tii.ae/falcon-2-terms-and-conditions.html), a permissive Apache 2.0-based software license that includes an [acceptable use policy](https://falconllm-staging.tii.ae/falcon-2-acceptable-use-policy.html) promoting the responsible use of AI.**
12
+
13
+ *Paper coming soon 😊.*
14
+
15
+
16
+ 🤗 To get started with Falcon-vlm (inference, finetuning, quantization, etc.), we recommend reading [this great blogpost from HF](https://huggingface.co/blog/falcon)!
17
+
18
+ ⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.**
19
+
20
+ ```python
+ from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load the processor and the model weights in bfloat16, then move the model to the GPU
+ processor = LlavaNextProcessor.from_pretrained("tiiuae/falcon-11B-vlm", tokenizer_class='PreTrainedTokenizerFast')
+ model = LlavaNextForConditionalGeneration.from_pretrained("tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16)
+ model.to('cuda:0')
+
+ # Download a sample image from the COCO validation set
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ cats_image = Image.open(requests.get(url, stream=True).raw)
+ instruction = 'Write a long paragraph about this picture.'
+
+ # Single-turn prompt; the <image> placeholder marks where the image is attached
+ prompt = f"""User:<image>\n{instruction} Falcon:"""
+ inputs = processor(prompt, images=cats_image, return_tensors="pt", padding=True).to('cuda:0')
+
+ output = model.generate(**inputs, max_new_tokens=256)
+
+ # Decode only the newly generated tokens, skipping the echoed prompt
+ prompt_length = inputs['input_ids'].shape[1]
+ generated_captions = processor.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
+
+ print(generated_captions)
+ ```
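The prompt layout used in the snippet above (`User:<image>\n{instruction} Falcon:`) can be wrapped in small helpers. The following is only a sketch: `build_prompt` and `extract_answer` are hypothetical names, not part of `transformers`, and `extract_answer` assumes the decoded text still contains the echoed `Falcon:` marker from the prompt.

```python
# Hypothetical helpers for the single-turn prompt layout shown in this README.
# The names are illustrative only and are not part of the transformers API.

IMAGE_TOKEN = "<image>"
ANSWER_MARKER = "Falcon:"


def build_prompt(instruction: str) -> str:
    """Format one user instruction in the 'User:<image>', newline, 'Falcon:' layout."""
    return f"User:{IMAGE_TOKEN}\n{instruction.strip()} {ANSWER_MARKER}"


def extract_answer(decoded: str) -> str:
    """Return the text after the first 'Falcon:' marker.

    Falls back to the whole string when the marker is absent, for example
    when the prompt tokens were already removed before decoding.
    """
    _, found, tail = decoded.partition(ANSWER_MARKER)
    return tail.strip() if found else decoded.strip()


if __name__ == "__main__":
    print(repr(build_prompt("Write a long paragraph about this picture.")))
    print(extract_answer("User:\nDescribe the image. Falcon: Two cats on a couch."))
```

Splitting on the first marker keeps any later occurrence of `Falcon:` inside the answer intact; an instruction that itself contains the marker would need different handling.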
47
+
48
+ 💥 **Falcon VLMs require PyTorch 2.0 for use with `transformers`!**
49
+
50
+ For fast inference with Falcon, check out [Text Generation Inference](https://github.com/huggingface/text-generation-inference)! Read more in this [blogpost](https://huggingface.co/blog/falcon).
51
+
52
+ # Model Card for Falcon2-11B
53
+
54
+ ## Model Details
55
+
56
+ ### Model Description
57
+
58
+ - **Developed by:** [https://www.tii.ae](https://www.tii.ae)
59
+ - **Model type:** Causal decoder-only
60
+ - **Language(s) (NLP):** English.
61
+ - **License:** [TII Falcon License 2.0](https://falconllm-staging.tii.ae/falcon-2-terms-and-conditions.html)
62
+
63
+ ### Model Source
64
+
65
+ - **Paper:** *coming soon*.
66
+
67
+ ## Uses
68
+
69
+ ### Direct Use
70
+
71
+ Research on general-purpose large vision-language models.
72
+
73
+ ### Out-of-Scope Use
74
+
75
+ Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
76
+
77
+ ## Bias, Risks, and Limitations
78
+
79
+ Falcon2-11B-vlm is trained mostly on English, but also on German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. It will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
80
+
81
+ ## Training Details
82
+
83
+ The training is done in two stages: pretraining and finetuning. In both stages, the visual encoder weights are kept frozen. In the pretraining stage, the LLM is kept frozen and only the multimodal projector is trained on 558K image-caption pairs.
84
+ This enables the multimodal projector to learn a mapping from the visual to the text embedding space. During finetuning, both the projector and the LLM weights are trained on a corpus of 1.2M image-text instruction samples from public datasets, which also includes multi-round conversations.
85
+ Falcon2-11B-vlm was trained on 16 A100 80GB GPUs with ZeRO and Flash-Attention 2.
86
+
87
+ The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
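The two-stage schedule described above (vision encoder always frozen, projector-only training first, then projector plus LLM) can be summarised as a small plan object. This is a minimal sketch and not the training code used for the model: the component names, the `StagePlan` class, and `make_plan` are hypothetical, and the sample counts are the rounded figures quoted in the text.

```python
from dataclasses import dataclass

# Hypothetical component names for the three parts of the VLM.
COMPONENTS = frozenset({"vision_encoder", "projector", "llm"})


@dataclass(frozen=True)
class StagePlan:
    name: str
    num_samples: int      # rounded figure from the text, not an exact count
    trainable: frozenset  # components whose weights are updated
    frozen: frozenset     # components whose weights stay fixed


def make_plan(name: str, num_samples: int, trainable) -> StagePlan:
    """Build a stage plan; every component not listed as trainable is frozen."""
    trainable = frozenset(trainable)
    unknown = trainable - COMPONENTS
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return StagePlan(name, num_samples, trainable, COMPONENTS - trainable)


# Stage 1: only the multimodal projector is trained on ~558K image-caption pairs.
PRETRAINING = make_plan("pretraining", 558_000, {"projector"})
# Stage 2: projector and LLM are trained on ~1.2M image-text instruction samples.
FINETUNING = make_plan("finetuning", 1_200_000, {"projector", "llm"})

if __name__ == "__main__":
    for plan in (PRETRAINING, FINETUNING):
        print(plan.name, "trainable:", sorted(plan.trainable), "frozen:", sorted(plan.frozen))
```

In a PyTorch training script, a plan like this would typically be applied by setting `requires_grad` on the parameters of each component before building the optimizer.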
88
+
89
+
90
+ #### Training Hyperparameters
91
+
92
+ | **Hyperparameter** | **Value** |
93
+ |--------------------|------------|
94
+ | Precision | `bfloat16`|
95
+ | Optimizer | AdamW |
96
+ | Max learning rate | 2e-5 |
97
+ | Weight decay | 0 |
98
+ | Batch size | 256 |
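As a rough illustration of what the table implies, the values can be collected into a config and combined with the GPU count and dataset sizes mentioned earlier. The sketch below rests on assumptions the README does not state: that 256 is the global batch size, that no gradient accumulation is used, and that each stage makes a single pass over its data. The dictionary keys and function names are hypothetical.

```python
import math

# Values copied from the hyperparameter table; key names are illustrative.
HPARAMS = {
    "precision": "bfloat16",
    "optimizer": "AdamW",
    "max_learning_rate": 2e-5,
    "weight_decay": 0.0,
    "batch_size": 256,  # assumed to be the global batch size
}
NUM_GPUS = 16  # from the training details


def per_gpu_batch(global_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Micro-batch size per GPU for a given global batch size."""
    denom = num_gpus * grad_accum_steps
    if global_batch % denom != 0:
        raise ValueError("global batch must be divisible by num_gpus * grad_accum_steps")
    return global_batch // denom


def steps_per_epoch(num_samples: int, global_batch: int) -> int:
    """Optimizer steps needed for one pass over the data (last batch may be partial)."""
    return math.ceil(num_samples / global_batch)


if __name__ == "__main__":
    print("per-GPU batch:", per_gpu_batch(HPARAMS["batch_size"], NUM_GPUS))
    print("pretraining steps per epoch:", steps_per_epoch(558_000, HPARAMS["batch_size"]))
    print("finetuning steps per epoch:", steps_per_epoch(1_200_000, HPARAMS["batch_size"]))
```

Under these assumptions each GPU processes 16 samples per step; with gradient accumulation the per-GPU micro-batch would shrink accordingly.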
99
+
100
+
101
+ ## Evaluation
102
+
103
+ | Model | MME | GQA | SQA | POPE | VQAv2 | TextVQA | MM-Bench | SEED-IMG |
104
+ |----|----|----|----|----|----|----|----|----|
105
+ | Falcon2-11B VLM | 1589/343 | 64.5 | 74.9 | 88.4 | 82.1 | 66.7 | 72.0 | 72.3 |
106
+
107
+ ## Citation
108
+
109
+ *Paper coming soon* 😊.
110
+
111
+ ## License
112
+
113
+ Falcon2-11B is licensed under the [TII Falcon License 2.0](https://falconllm-staging.tii.ae/falcon-2-terms-and-conditions.html), a permissive Apache 2.0-based software license that includes an [acceptable use policy](https://falconllm-staging.tii.ae/falcon-2-acceptable-use-policy.html) promoting the responsible use of AI.
114
+
115
+ ## Contact
116