robbiemu/salamandra-2b

@@ -133,8 +133,135 @@ The accelerated partition is composed of 1,120 nodes with the following specific
 ---
 ## How to use
-<span style="color:red">TODO</span>
+This section shows how to run inference with Salamandra-2b using several methods.
+### Inference
+The examples cover Hugging Face's text-generation pipeline, loading the model directly on one or more GPUs, and vLLM for higher-throughput generation.
+#### Inference with Hugging Face's Text Generation Pipeline
+The Hugging Face text-generation pipeline provides a straightforward way to run inference with the Salamandra-2b model.
+```bash
+pip install transformers torch accelerate sentencepiece protobuf
+```
+<details>
+<summary>Show code</summary>
+```python
+from transformers import pipeline, set_seed
+model_id = "projecte-aina/salamandra-2b"
+# Sample prompts
+prompts = [
+    "El mercat del barri és"
+]
+# Create the pipeline
+generator = pipeline("text-generation", model_id, device_map="auto")
+generation_args = {
+  "temperature": 0.1,
+  "top_p": 0.95,
+  "max_new_tokens": 25,
+  "repetition_penalty": 1.2,
+  "do_sample": True
+}
+# Fix the seed
+set_seed(1)
+# Generate texts
+outputs = generator(prompts, **generation_args)
+# Print outputs
+for output in outputs:
+  print(output[0]["generated_text"])
+```
+</details>
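The `generation_args` above combine a low `temperature` (0.1) with `top_p=0.95`. As a rough intuition for what that combination does, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is a conceptual illustration under simplifying assumptions, not the transformers implementation: the function name `top_p_filter` and the toy logits are hypothetical, and `repetition_penalty` is not modeled.

```python
import math


def top_p_filter(logits, temperature=0.1, top_p=0.95):
    """Return {token_index: probability} for the tokens that survive
    temperature scaling and nucleus (top-p) filtering.

    Hypothetical helper for illustration only.
    """
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize over the surviving tokens.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
print(top_p_filter(toy_logits, temperature=0.1))
print(top_p_filter(toy_logits, temperature=1.0))
```

With these toy logits, a temperature of 0.1 leaves a single candidate token, so sampling behaves almost like greedy decoding. At a temperature of 1.0 all three tokens survive the filter.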
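+#### Inference with a single GPU or multiple GPUs
+This section provides a simple example of how to run inference using Hugging Face's `AutoModelForCausalLM` class. With `device_map="auto"`, the model is placed on the available GPU, or spread across several GPUs when more than one is present.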
+```bash
+pip install transformers torch accelerate sentencepiece protobuf
+```
+<details>
+<summary>Show code</summary>
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_id = "projecte-aina/salamandra-2b"
+# Input text
+text = "El mercat del barri és"
+# Load the tokenizer
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Load the model
+model = AutoModelForCausalLM.from_pretrained(
+  model_id,
+  device_map="auto",
+  torch_dtype=torch.bfloat16
+)
+generation_args = {
+  "temperature": 0.1,
+  "top_p": 0.95,
+  "max_new_tokens": 25,
+  "repetition_penalty": 1.2,
+  "do_sample": True
+}
+# Tokenize the prompt and move every input tensor to the model's device
+inputs = tokenizer(text, return_tensors="pt").to(model.device)
+# Generate texts
+output = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], **generation_args)
+# Print outputs
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+</details>
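For a causal language model, `model.generate` returns the prompt tokens followed by the newly generated ones, so the `decode` call above prints the prompt together with its continuation. If only the continuation is wanted, one option is to slice off the prompt tokens before decoding. A minimal sketch, where `strip_prompt_tokens` is a hypothetical helper (not part of transformers) and plain lists stand in for the token tensors:

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the token ids generated after the prompt.

    Works on any sliceable sequence of token ids: a list here, and a 1-D
    tensor such as output[0] supports the same slicing.
    """
    return output_ids[prompt_length:]


# Toy example: 3 prompt tokens followed by 2 generated tokens (made-up ids).
prompt_ids = [1, 120, 457]
output_ids = prompt_ids + [88, 9]
new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
print(new_ids)
```

Applied to the example above, this would look like `tokenizer.decode(strip_prompt_tokens(output[0], inputs["input_ids"].shape[1]), skip_special_tokens=True)`.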
+#### Inference with vLLM
+vLLM is an inference library built for high-throughput, batched text generation.
+```bash
+pip install vllm
+```
+<details>
+<summary>Show code</summary>
+```python
+from vllm import LLM, SamplingParams
+model_id = "projecte-aina/salamandra-2b"
+# Sample prompts
+prompts = [
+    "El mercat del barri és",
+]
+# Create a sampling params object
+sampling_params = SamplingParams(
+  temperature=0.1,
+  top_p=0.95,
+  seed=1,
+  max_tokens=25,
+  repetition_penalty=1.2)
+# Create an LLM
+llm = LLM(model=model_id)
+# Generate texts
+outputs = llm.generate(prompts, sampling_params)
+# Print outputs
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+</details>
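The vLLM example uses the same sampling settings as the Hugging Face examples, but under slightly different names: `max_new_tokens` becomes `max_tokens`, there is no `do_sample` flag, and the seed is passed as a sampling parameter. A minimal sketch of that translation follows, assuming one wants to keep a single settings dictionary for both backends. The helper `to_vllm_sampling_kwargs` is hypothetical and only covers the keys used in this section.

```python
def to_vllm_sampling_kwargs(generation_args, seed=None):
    """Translate the Hugging Face-style generation_args used in this section
    into keyword arguments suitable for vllm.SamplingParams(**kwargs).

    Hypothetical helper: it handles only the keys that appear in this section.
    """
    kwargs = dict(generation_args)  # copy so the caller's dict is untouched
    do_sample = kwargs.pop("do_sample", False)
    if "max_new_tokens" in kwargs:
        kwargs["max_tokens"] = kwargs.pop("max_new_tokens")
    if not do_sample:
        # vLLM has no do_sample flag; a temperature of 0 selects greedy decoding.
        kwargs["temperature"] = 0.0
    if seed is not None:
        kwargs["seed"] = seed
    return kwargs


generation_args = {
    "temperature": 0.1,
    "top_p": 0.95,
    "max_new_tokens": 25,
    "repetition_penalty": 1.2,
    "do_sample": True,
}
vllm_kwargs = to_vllm_sampling_kwargs(generation_args, seed=1)
print(vllm_kwargs)
```

The resulting dictionary could then be passed as `SamplingParams(**vllm_kwargs)`, which matches the values written out by hand in the vLLM example.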
 ---