Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


LlamaGuard-7b - GGUF
- Model creator: https://huggingface.co/llamas-community/
- Original model: https://huggingface.co/llamas-community/LlamaGuard-7b/

| Name | Quant method | Size |
| ---- | ---- | ---- |
| [LlamaGuard-7b.Q2_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q2_K.gguf) | Q2_K | 2.36GB |
| [LlamaGuard-7b.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ3_XS.gguf) | IQ3_XS | 2.6GB |
| [LlamaGuard-7b.IQ3_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ3_S.gguf) | IQ3_S | 2.75GB |
| [LlamaGuard-7b.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K_S.gguf) | Q3_K_S | 2.75GB |
| [LlamaGuard-7b.IQ3_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ3_M.gguf) | IQ3_M | 2.9GB |
| [LlamaGuard-7b.Q3_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K.gguf) | Q3_K | 3.07GB |
| [LlamaGuard-7b.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K_M.gguf) | Q3_K_M | 3.07GB |
| [LlamaGuard-7b.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K_L.gguf) | Q3_K_L | 3.35GB |
| [LlamaGuard-7b.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ4_XS.gguf) | IQ4_XS | 3.4GB |
| [LlamaGuard-7b.Q4_0.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_0.gguf) | Q4_0 | 3.56GB |
| [LlamaGuard-7b.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ4_NL.gguf) | IQ4_NL | 3.58GB |
| [LlamaGuard-7b.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_K_S.gguf) | Q4_K_S | 3.59GB |
| [LlamaGuard-7b.Q4_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_K.gguf) | Q4_K | 3.8GB |
| [LlamaGuard-7b.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_K_M.gguf) | Q4_K_M | 3.8GB |
| [LlamaGuard-7b.Q4_1.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_1.gguf) | Q4_1 | 3.95GB |
| [LlamaGuard-7b.Q5_0.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_0.gguf) | Q5_0 | 4.33GB |
| [LlamaGuard-7b.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_K_S.gguf) | Q5_K_S | 4.33GB |
| [LlamaGuard-7b.Q5_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_K.gguf) | Q5_K | 4.45GB |
| [LlamaGuard-7b.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_K_M.gguf) | Q5_K_M | 4.45GB |
| [LlamaGuard-7b.Q5_1.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_1.gguf) | Q5_1 | 4.72GB |
| [LlamaGuard-7b.Q6_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q6_K.gguf) | Q6_K | 5.15GB |
| [LlamaGuard-7b.Q8_0.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q8_0.gguf) | Q8_0 | 6.67GB |
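
To actually run one of these files you need a GGUF-capable runtime such as llama.cpp. Below is a minimal sketch (not part of the original upload) that downloads a mid-range quant and loads it with `llama-cpp-python`; the Q4_K_M pick and the package choice are assumptions, not recommendations from the uploader.

```py
# Minimal sketch, assuming the huggingface_hub and llama-cpp-python packages:
# fetch one quant from this repo and load it for local inference.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

path = hf_hub_download(
    repo_id="RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf",
    filename="LlamaGuard-7b.Q4_K_M.gguf",  # one reasonable size/quality trade-off
)

llm = Llama(model_path=path, n_ctx=4096)  # Llama 2 context window
# Llama-Guard expects its special moderation prompt; see the chat-template
# example in the original model description below for the conversation format.
out = llm("[INST] ... [/INST]", max_tokens=32)
print(out["choices"][0]["text"])
```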


Original model description:
---
language:
- en
tags:
- pytorch
- llama
- llama-2
license: llama2
---
## Model Details

**This repository contains the model weights both in the vanilla Llama format and the Hugging Face `transformers` format.**

Llama-Guard is a 7B-parameter [Llama 2](https://arxiv.org/abs/2307.09288)-based input-output
safeguard model. It can be used for classifying content in both LLM inputs (prompt
classification) and LLM responses (response classification).
It acts as an LLM: it generates text indicating whether a given prompt or
response is safe or unsafe and, if unsafe under the policy, which subcategories it violates.
Here is an example:

![](Llama-Guard_example.png)

To produce classifier scores, we look at the probability of the first token and turn it
into an "unsafe" class probability. Model users can then make binary decisions by applying a
desired threshold to these probability scores.

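Concretely, that first-token scoring takes only a few lines of `transformers` code. The sketch below is an illustration added to this card, not the authors' reference implementation; it assumes `tokenizer`, `model`, and `device` are set up as in the usage example further down, and that "safe"/"unsafe" can be read off their first token ids (worth verifying against the actual tokenizer).

```py
import torch

def unsafe_probability(chat, tokenizer, model, device="cuda"):
    # Score the first token Llama-Guard would generate for this conversation.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits
    probs = torch.softmax(logits[0, -1, :], dim=-1)
    # First-token ids of the "safe"/"unsafe" verdicts -- an assumption here;
    # check how your tokenizer actually splits these strings.
    safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    # Renormalize over the two verdicts to get a binary "unsafe" score.
    return (probs[unsafe_id] / (probs[safe_id] + probs[unsafe_id])).item()

# Binary decision at a chosen operating threshold, e.g. 0.5:
# is_unsafe = unsafe_probability(chat, tokenizer, model) > 0.5
```
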
## Training and Evaluation
### Training Data

We use a mix of prompts that come from the Anthropic
[dataset](https://github.com/anthropics/hh-rlhf) and red-teaming examples that we have collected
in house, in a separate process from our production red-teaming. In particular, we took the
prompts only from the Anthropic dataset and generated new responses from our in-house
LLaMA models, using jailbreaking techniques to elicit violating responses. We then annotated the
Anthropic data (prompts & responses) in house, mapping labels according to the categories
in the taxonomy below. Overall we have ~13K training examples.

## Taxonomy of Harms and Risk Guidelines

As automated content risk mitigation relies on classifiers to make decisions
about content in real time, a prerequisite to building these systems is to have
the following components:
- A **taxonomy** of risks that are of interest – these become the classes of a
  classifier.
- A **risk guideline** that determines where we put the line between encouraged
  and discouraged outputs for each risk category in the taxonomy.

Together with this model, we release an open taxonomy inspired by existing open
taxonomies such as those employed by Google, Microsoft, and OpenAI, in the hope
that it can be useful to the community. This taxonomy does not necessarily reflect Meta's
own internal policies; it is meant to demonstrate the value of our method for
tuning LLMs into classifiers that show high performance and a high degree of adaptability to different policies.

### The Llama-Guard Safety Taxonomy & Risk Guidelines

Below, we provide both the harm types themselves under this taxonomy and examples of
the specific kinds of content that would be considered harmful under each category:

- **Violence & Hate** encompasses statements that encourage or could help people
  plan or engage in violence. Similarly, statements that advocate
  discrimination, contain slurs, or voice hateful sentiments against people
  based on their sensitive personal characteristics (e.g., race, color, religion,
  national origin, sexual orientation, gender, gender identity, or disability)
  would also be considered inappropriate under this category.
- **Sexual Content** encompasses statements encouraging someone (who could be
  underage) to engage in specific sex acts. Similarly, sexually explicit (i.e.,
  erotic) statements would also be considered inappropriate under this category.
- **Guns & Illegal Weapons** encompasses statements that specifically encourage,
  condone, or could help people plan or execute specific crimes involving the
  illegal acquisition, creation, or use of guns or illegal weapons (e.g.,
  explosives, biological agents, or chemical weapons), including instructions on
  how to create or use them.
- **Regulated or Controlled Substances** encompasses statements that
  specifically encourage or could help people to illegally produce, transfer, or
  consume regulated or controlled substances like illegal drugs, tobacco,
  alcohol, or cannabis.
- **Suicide & Self Harm** encompasses statements that encourage, condone, or
  enable people to harm themselves (e.g., by providing instructions or information
  on methods of self-harm). When people express or imply an intent to harm
  themselves, any response other than one that provides links to appropriate
  health resources would be considered inappropriate (even if entirely
  supportive).
- **Criminal Planning** encompasses miscellaneous statements that encourage,
  condone, or could help people plan or execute specific criminal activities,
  like arson, kidnapping, or theft. Items should only be considered harmful
  under this category when they could not be read as violating any of the other
  harm types above (e.g., statements that encourage violence should be considered
  violating under Violence & Hate rather than this category).

## How to Use in `transformers`

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)

def moderate(chat):
    # The chat template wraps the conversation in Llama-Guard's moderation prompt.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    # Return only the newly generated verdict, not the echoed prompt.
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

moderate([
    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
    {"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
])
# `safe`
```

You need to be logged in to the Hugging Face Hub to use the model.

For more details, see [this Colab notebook](https://colab.research.google.com/drive/16s0tlCSEDtczjPzdIK3jq0Le5LlnSYGf?usp=sharing).

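`moderate` returns a short text verdict. As a convenience (a sketch added to this card, not part of the original), here is one way to turn that text into structured data. The O1–O6 code-to-category mapping below simply mirrors the order of the taxonomy list above and is an assumption; the authoritative numbering lives in the model's chat template.

```py
# Hedged helper: parse a Llama-Guard completion such as "safe" or
# "unsafe\nO3" into (is_unsafe, category_names). The O1-O6 mapping follows
# the order of the taxonomy section above and is an assumption; verify it
# against the tokenizer's chat template before relying on it.
TAXONOMY = {
    "O1": "Violence & Hate",
    "O2": "Sexual Content",
    "O3": "Guns & Illegal Weapons",
    "O4": "Regulated or Controlled Substances",
    "O5": "Suicide & Self Harm",
    "O6": "Criminal Planning",
}

def parse_verdict(text):
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or lines[0] == "safe":
        return False, []
    codes = lines[1].split(",") if len(lines) > 1 else []
    return True, [TAXONOMY.get(c.strip(), c.strip()) for c in codes]

parse_verdict("safe")          # (False, [])
parse_verdict("unsafe\nO3")    # (True, ["Guns & Illegal Weapons"])
```
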
## Evaluation results

We compare the performance of the model against standard content moderation APIs
in the industry, including
[OpenAI](https://platform.openai.com/docs/guides/moderation/overview), [Azure Content Safety](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/harm-categories), and [PerspectiveAPI](https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages?language=en_US) from Google, on both public and in-house benchmarks. The public benchmarks
include [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) and
[OpenAI Moderation](https://github.com/openai/moderation-api-release).

Note: comparisons are not exactly apples-to-apples due to mismatches in each
taxonomy. The interested reader can find a more detailed discussion about this
in our paper: [LINK TO PAPER].

|                 | Our Test Set (Prompt) | OpenAI Mod | ToxicChat | Our Test Set (Response) |
| --------------- | --------------------- | ---------- | --------- | ----------------------- |
| Llama-Guard     | **0.945**             | 0.847      | **0.626** | **0.953**               |
| OpenAI API      | 0.764                 | **0.856**  | 0.588     | 0.769                   |
| Perspective API | 0.728                 | 0.787      | 0.532     | 0.699                   |