---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
base_model: EpistemeAI2/Fireball-Alpaca-Llama3.1.08-8B-Philos-C-R1
model-index:
- name: Fireball-Llama-3.1-8B-Philos-Relection
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: IFEval (0-Shot)
type: HuggingFaceH4/ifeval
args:
num_few_shot: 0
metrics:
- type: inst_level_strict_acc and prompt_level_strict_acc
value: 35.96
name: strict accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: BBH (3-Shot)
type: BBH
args:
num_few_shot: 3
metrics:
- type: acc_norm
value: 27.77
name: normalized accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MATH Lvl 5 (4-Shot)
type: hendrycks/competition_math
args:
num_few_shot: 4
metrics:
- type: exact_match
value: 12.01
name: exact match
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GPQA (0-shot)
type: Idavidrein/gpqa
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 7.72
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MuSR (0-shot)
type: TAUR-Lab/MuSR
args:
num_few_shot: 0
metrics:
- type: acc_norm
value: 9.63
name: acc_norm
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU-PRO (5-shot)
type: TIGER-Lab/MMLU-Pro
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 28.34
name: accuracy
source:
url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection
name: Open LLM Leaderboard
---
# Recommended
## vllm
```bash
# Install vLLM from pip:
pip install vllm
# Load and run the model:
vllm serve "EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection"
# Call the server using curl:
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
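The server exposes an OpenAI-compatible endpoint, so it can also be called from Python. A minimal sketch using the `openai` client (the `EMPTY` API key is a placeholder; vLLM does not check it by default):
```python
# Minimal sketch: call the vLLM server started above via its
# OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```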
# Inspiration
Inspired by Reflection 70B and OpenAI's o1-preview and o1-mini.
# Original Model card
The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.
**Model developer**: Meta
**Model Architecture:** Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
|  | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 (text only) | A new mix of publicly available online data. | 8B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |
|  |  | 70B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |
|  |  | 405B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |
**Supported languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
**Llama 3.1 family of models**. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.
**Model Release Date:** July 23, 2024.
**Status:** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.
**License:** A custom commercial license, the Llama 3.1 Community License, is available at: [https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
**Where to send questions or comments about the model:** Instructions on how to provide feedback or comments on the model can be found in the model [README](https://github.com/meta-llama/llama3). For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go [here](https://github.com/meta-llama/llama-recipes).
## Training
**Supervised fine-tuning (SFT)**
Fine-tuned with multiple reflection datasets. Thanks to Glaive AI and others.
# Intended Use
**Intended Use Cases** Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases.
**Out-of-scope** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and the Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card.
**Note:** Llama 3.1 has been trained on a broader collection of languages than the 8 supported languages. Developers may fine-tune Llama 3.1 models for languages beyond the 8 supported languages, provided they comply with the Llama 3.1 Community License and the Acceptable Use Policy, and in such cases are responsible for ensuring that any use of Llama 3.1 in additional languages is done in a safe and responsible manner.
## How to use
This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase.
# Prompt Template
This model uses the `ChatML` prompt template:
```
<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
```
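For illustration, the template can be assembled manually before tokenization; the helper below is a sketch of ours, not part of the model's API:
```python
# Sketch: fill in the ChatML template above. The system and user strings
# are placeholder examples.
def build_chatml_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}\n<|im_end|>\n"
        f"<|im_start|>user\n{user}\n<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "You are a helpful assistant.",
    "Who was the first person to walk on the moon?",
)
```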
You can also use the Alpaca prompt.
## Alpaca Prompt Template
Please use the Alpaca prompt for thinking:
```python
f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Prompt: "Who was the first person to walk on the moon?"
### Response: ""
"""
```
Note: Responses include thinking, reflection, and output sections. "Think step by step." is not required; add it when you want the reasoning expanded further.
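As a sketch (the helper below is hypothetical, not part of the model's API), the template and the optional step-by-step cue can be combined like this:
```python
# Hypothetical helper that builds the Alpaca-style prompt shown above.
# Appending "Think step by step." is optional and only expands the reasoning.
def build_alpaca_prompt(question: str, think_step_by_step: bool = False) -> str:
    if think_step_by_step:
        question += " Think step by step."
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n"
        f'### Prompt: "{question}"\n'
        '### Response: ""\n'
    )

prompt = build_alpaca_prompt("Who was the first person to walk on the moon?")
```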
# How to use
You can use this model by passing `EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection` as the model name to Hugging Face's
`transformers` library.
For a faster download, first run `pip install huggingface_hub[hf_transfer]`
and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER=1`.
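For example, a minimal sketch (the download call is illustrative; `hf_transfer` only takes effect if the variable is set before `huggingface_hub` performs any downloads):
```python
import os
# Must be set before huggingface_hub starts downloading.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Illustrative: pre-fetch the model weights into the local HF cache.
snapshot_download("EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection")
```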
```python
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection",
max_seq_length = 8192,
load_in_4bit = True,
#token = "hf-xxxx", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
from transformers import TextStreamer
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
tokenizer,
chat_template = "llama-3",
mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
messages = [
# EDIT HERE!
{"from": "human", "value": "Generate snake game in python code with scores and levels in pygame and lastly to provide full code."},
]
# To drive generation from the chat template and `messages` above, use this line instead:
#inputs = tokenizer.apply_chat_template(messages, tokenize = True, add_generation_prompt = True, return_tensors = "pt").to("cuda")
inputs = tokenizer("### Instruction: Create a plan for developing the game of snake in python using pygame.\n### Response:\n", return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids = inputs.input_ids, attention_mask = inputs.attention_mask, streamer = text_streamer, max_new_tokens = 1024, use_cache = True)
```
# Transformers
```python
!pip install -q bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
from transformers import TextIteratorStreamer
from threading import Thread
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_id = "EpistemeAI2/Fireball-Llama-3.1-8B-Philos-Relection"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # full precision, for an Nvidia A100 or equivalent
# Or, for low memory (T4 and lower GPUs / CPU), load in 4-bit instead:
#model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
prompt = """
### System
You are a world-class AI system, capable of complex reasoning and reflection.
Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags.
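If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.
### Instruction
Who was the first person to walk on the moon?
### Response
"""
# NOTE: the original card truncates mid-prompt above; the prompt tail and this
# streaming loop are an assumed completion, using the imports at the top.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
thread = Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=1024))
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)
```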