# Gemma 2 model card

- DatafoundryAI/OpenVino_Ratnamuladve.Q8_K_M-GGUF

# Authors

- DatafoundryAI

# Model Information

- Summary description and brief definition of inputs and outputs.

## Description

- Gemma-2-2b-it builds on the technological advancements of the Gemini models, offering high-quality language generation capabilities. We have enhanced this model by applying INT8 quantization using the Intel OpenVINO Toolkit. This process optimizes the model for deployment in resource-constrained environments.

# About OpenVINO

## Model Conversion and Quantization with Intel OpenVINO

### Model Optimizer

- OpenVINO includes a tool called the Model Optimizer that converts pre-trained models from popular frameworks (such as TensorFlow, PyTorch, and ONNX) into an intermediate representation (IR). The IR consists of two files: an .xml file describing the model's structure and a .bin file containing the weights.
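To make the structure/weights split concrete, here is a toy sketch that writes a layer description to an XML file and the weights to a separate binary file. It mimics only the two-file idea, not OpenVINO's actual IR schema; the file names and element names below are invented for illustration.

```python
import struct
import xml.etree.ElementTree as ET

# Toy illustration of the IR layout: structure in .xml, weights in .bin.
# (Not the real OpenVINO serialization format -- just the same two-file idea.)

def save_toy_ir(weights, xml_path, bin_path):
    # Weights go to a raw little-endian float32 blob, like a .bin file.
    with open(bin_path, 'wb') as f:
        f.write(struct.pack(f'<{len(weights)}f', *weights))
    # The structure (layer name and where its weights live) goes to XML.
    net = ET.Element('network')
    layer = ET.SubElement(net, 'layer', name='fc1', type='FullyConnected')
    ET.SubElement(layer, 'weights', offset='0', size=str(4 * len(weights)))
    ET.ElementTree(net).write(xml_path)

def load_toy_ir(xml_path, bin_path):
    # Read the XML metadata, then slice the matching bytes out of the blob.
    meta = ET.parse(xml_path).getroot().find('layer').find('weights')
    with open(bin_path, 'rb') as f:
        f.seek(int(meta.get('offset')))
        data = f.read(int(meta.get('size')))
    return list(struct.unpack(f'<{len(data) // 4}f', data))

save_toy_ir([0.5, -1.0, 2.0], 'toy.xml', 'toy.bin')
restored = load_toy_ir('toy.xml', 'toy.bin')
```

The values 0.5, -1.0, and 2.0 are exactly representable in float32, so the round trip recovers them without loss.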

### Quantization

- During the conversion process, you can apply quantization techniques to reduce model size and improve inference speed. OpenVINO supports INT8 quantization, which reduces floating-point precision to 8-bit integers.
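As a rough illustration of what INT8 quantization does, the sketch below maps float weights to integers in [-127, 127] via a single shared scale. This is a generic symmetric-quantization sketch, not OpenVINO's implementation, and the function names are made up for the example.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization (not OpenVINO's code).

def quantize_int8(weights):
    # One scale maps the largest-magnitude weight onto 127.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate floats; the error per weight is at most scale / 2.
    return [v * scale for v in q]

weights = [0.3, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Each INT8 weight takes 1 byte instead of 4 for FP32, which is where the size and bandwidth savings come from.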

# Benefits

- INT8 quantization improves computational efficiency and reduces memory footprint, making the model more suitable for deployment on devices with limited hardware resources. The OpenVINO toolkit facilitates this process by providing tools and optimizations that ensure the model's performance remains high while being more resource-efficient.
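A back-of-the-envelope calculation shows the scale of the memory saving, assuming roughly 2.6 billion parameters for Gemma-2-2b (an approximate figure, used here only for illustration):

```python
# Rough weight-memory estimate: FP32 (4 bytes/param) vs INT8 (1 byte/param).
# The parameter count is an approximation for Gemma-2-2b, not a measured value.
params = 2.6e9
fp32_gb = params * 4 / 1e9  # 4 bytes per FP32 weight
int8_gb = params * 1 / 1e9  # 1 byte per INT8 weight
print(f'FP32: {fp32_gb:.1f} GB, INT8: {int8_gb:.1f} GB, ratio: {fp32_gb / int8_gb:.0f}x')
```

Weight storage shrinks by 4x; actual process memory also includes activations and runtime overhead, so end-to-end savings vary.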

### Resources and Technical Documentation

- [Intel OpenVINO Toolkit for Quantization](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html)

#### Installing the Required Libraries

To use this model, first install the required libraries:

```bash
pip install transformers openvino openvino-dev openvino-genai
```

# Inference code

Below is an example of how to perform inference with the quantized Gemma-2-2b-it model:

```python
import time

import openvino_genai

def streamer(subword):
    # Stream each generated subword to stdout as it arrives.
    print(subword, end='', flush=True)
    return False

model_dir = "your path "  # directory containing the converted OpenVINO model
device = 'CPU'  # GPU can be used as well
pipe = openvino_genai.LLMPipeline(model_dir, device)

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 100

pipe.start_chat()

total_tokens = 0
total_time = 0

while True:
    prompt = input('question:\n')
    if prompt == 'Stop!':
        break

    start_time = time.time()
    output = pipe.generate(prompt, config, streamer)
    end_time = time.time()

    elapsed_time = end_time - start_time
    num_tokens = len(output.split())  # rough whitespace count; adjust for the real tokenizer

    total_tokens += num_tokens
    total_time += elapsed_time

    print(f'Generated tokens: {num_tokens}')
    print(f'Time taken: {elapsed_time:.2f} seconds')
    print('\n----------')

pipe.finish_chat()

if total_time > 0:
    tok_per_s = total_tokens / total_time
    print(f'Tokens per second: {tok_per_s:.2f}')
else:
    print('No tokens generated.')
```