Create README.md #1
by anish12 - opened

README.md ADDED
# Gemma 2 model card

- DatafoundryAI/OpenVino_Ratnamuladve.Q8_K_M-GGUF

# Authors

- DatafoundryAI

# Model Information

- Summary description and brief definition of inputs and outputs.

## Description

- Gemma-2-2b-it builds on the technological advancements of the Gemini models, offering high-quality language generation capabilities. We have enhanced this model by applying INT8 quantization using the Intel OpenVINO Toolkit. This process optimizes the model for deployment in resource-constrained environments.
# About OpenVINO

## Model Conversion and Quantization with Intel OpenVINO

### Model Optimizer

- OpenVINO includes a tool called the Model Optimizer that converts pre-trained models from popular frameworks (such as TensorFlow, PyTorch, and ONNX) into an intermediate representation (IR). This IR consists of two files: a .xml file describing the model's structure and a .bin file containing the weights.
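As a hedged illustration of that conversion step (the `model.onnx` input file is a placeholder, not an artifact of this repository), recent OpenVINO releases expose the same functionality through a Python conversion API:

```python
import openvino as ov

# Convert a pre-trained model (placeholder ONNX file) into OpenVINO's
# intermediate representation (IR).
ov_model = ov.convert_model("model.onnx")

# save_model writes the IR pair: model.xml (structure) and model.bin (weights).
ov.save_model(ov_model, "model.xml")
```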
### Quantization

- During the conversion process, you can apply quantization techniques to reduce model size and improve inference speed. OpenVINO supports INT8 quantization, which reduces weights and activations from floating-point precision to 8-bit integers.
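As a minimal sketch of post-training INT8 quantization with NNCF (the library behind OpenVINO's quantization tooling); the calibration data here is a placeholder you would replace with representative model inputs:

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # the full-precision IR from the previous step

# Placeholder: a small set of representative model inputs for calibration.
calibration_data = [...]
calibration_dataset = nncf.Dataset(calibration_data)

# Post-training quantization: weights and activations are reduced to INT8.
quantized_model = nncf.quantize(model, calibration_dataset)
ov.save_model(quantized_model, "model_int8.xml")
```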
# Benefits

- INT8 quantization improves computational efficiency and reduces memory footprint, making the model more suitable for deployment on devices with limited hardware resources. The OpenVINO toolkit facilitates this process by providing tools and optimizations that ensure the model's performance remains high while being more resource-efficient.

### Resources and Technical Documentation

- [Intel OpenVINO Toolkit overview](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)
#### Installing the Transformers Library

To use this model, first install the Transformers library along with OpenVINO:

```bash
pip install transformers openvino openvino-dev
```
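The inference example below expects a local directory containing the model in OpenVINO format. One way to produce such a directory is with Optimum Intel; this is a hedged sketch (the `gemma-2-2b-it-ov` output path is illustrative, and `pip install optimum[openvino]` is assumed):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "google/gemma-2-2b-it"

# export=True converts the PyTorch checkpoint to OpenVINO IR;
# load_in_8bit=True applies 8-bit weight compression during export.
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

save_dir = "gemma-2-2b-it-ov"  # illustrative path; use it as model_dir below
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```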
# Inference code

Below is an example of how to perform inference with the quantized Gemma-2-2b-it model using `openvino_genai`:
```python
import time

import openvino_genai

# Streaming callback: print each generated subword as soon as it is produced.
def streamer(subword):
    print(subword, end='', flush=True)
    return False  # False means "keep generating"

model_dir = "your path"  # directory containing the OpenVINO-format model

device = 'CPU'  # GPU can be used as well
pipe = openvino_genai.LLMPipeline(model_dir, device)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100
pipe.start_chat()

total_tokens = 0
total_time = 0

while True:
    prompt = input('question:\n')
    if 'Stop!' == prompt:
        break

    start_time = time.time()
    output = pipe.generate(prompt, config, streamer)
    end_time = time.time()

    elapsed_time = end_time - start_time
    num_tokens = len(output.split())  # word count as a rough proxy; adjust based on how tokens are represented

    total_tokens += num_tokens
    total_time += elapsed_time

    print(f'Generated tokens: {num_tokens}')
    print(f'Time taken: {elapsed_time:.2f} seconds')
    print('\n----------')

pipe.finish_chat()

if total_time > 0:
    tok_per_s = total_tokens / total_time
    print(f'Tokens per second: {tok_per_s:.2f}')
else:
    print('No tokens generated.')
```
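Note that `len(output.split())` counts whitespace-separated words, not model tokens, so the tokens-per-second figure above is only approximate. A hedged refinement, assuming the model directory also contains the Hugging Face tokenizer files (an Optimum export normally includes them):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)  # model_dir from the example above

# Count actual model tokens in a generated string instead of words.
num_tokens = len(tokenizer.encode(output, add_special_tokens=False))
```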