RichardErkhov committed
Commit 5f25054
1 Parent(s): 3bb6439

uploaded readme

Files changed (1):
  1. README.md +181 -0
Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


LlamaGuard-7b - GGUF
- Model creator: https://huggingface.co/llamas-community/
- Original model: https://huggingface.co/llamas-community/LlamaGuard-7b/

| Name | Quant method | Size |
| ---- | ---- | ---- |
| [LlamaGuard-7b.Q2_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q2_K.gguf) | Q2_K | 2.36GB |
| [LlamaGuard-7b.IQ3_XS.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ3_XS.gguf) | IQ3_XS | 2.6GB |
| [LlamaGuard-7b.IQ3_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ3_S.gguf) | IQ3_S | 2.75GB |
| [LlamaGuard-7b.Q3_K_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K_S.gguf) | Q3_K_S | 2.75GB |
| [LlamaGuard-7b.IQ3_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ3_M.gguf) | IQ3_M | 2.9GB |
| [LlamaGuard-7b.Q3_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K.gguf) | Q3_K | 3.07GB |
| [LlamaGuard-7b.Q3_K_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K_M.gguf) | Q3_K_M | 3.07GB |
| [LlamaGuard-7b.Q3_K_L.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q3_K_L.gguf) | Q3_K_L | 3.35GB |
| [LlamaGuard-7b.IQ4_XS.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ4_XS.gguf) | IQ4_XS | 3.4GB |
| [LlamaGuard-7b.Q4_0.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_0.gguf) | Q4_0 | 3.56GB |
| [LlamaGuard-7b.IQ4_NL.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.IQ4_NL.gguf) | IQ4_NL | 3.58GB |
| [LlamaGuard-7b.Q4_K_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_K_S.gguf) | Q4_K_S | 3.59GB |
| [LlamaGuard-7b.Q4_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_K.gguf) | Q4_K | 3.8GB |
| [LlamaGuard-7b.Q4_K_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_K_M.gguf) | Q4_K_M | 3.8GB |
| [LlamaGuard-7b.Q4_1.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q4_1.gguf) | Q4_1 | 3.95GB |
| [LlamaGuard-7b.Q5_0.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_0.gguf) | Q5_0 | 4.33GB |
| [LlamaGuard-7b.Q5_K_S.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_K_S.gguf) | Q5_K_S | 4.33GB |
| [LlamaGuard-7b.Q5_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_K.gguf) | Q5_K | 4.45GB |
| [LlamaGuard-7b.Q5_K_M.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_K_M.gguf) | Q5_K_M | 4.45GB |
| [LlamaGuard-7b.Q5_1.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q5_1.gguf) | Q5_1 | 4.72GB |
| [LlamaGuard-7b.Q6_K.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q6_K.gguf) | Q6_K | 5.15GB |
| [LlamaGuard-7b.Q8_0.gguf](https://huggingface.co/RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf/blob/main/LlamaGuard-7b.Q8_0.gguf) | Q8_0 | 6.67GB |

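A minimal sketch of one way to use these files (this assumes the `huggingface_hub` and `llama-cpp-python` packages, neither of which is mentioned in the original card; the quant chosen below is just an example):

```py
# Download one of the quantized files listed above and load it locally.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

gguf_path = hf_hub_download(
    repo_id="RichardErkhov/llamas-community_-_LlamaGuard-7b-gguf",
    filename="LlamaGuard-7b.Q4_K_M.gguf",
)
llm = Llama(model_path=gguf_path, n_ctx=4096)
# Note: Llama Guard expects its specific moderation prompt format; see the
# `transformers` section below for the chat template it was trained with.
```
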

Original model description:
---
language:
- en
tags:
- pytorch
- llama
- llama-2
license: llama2
---
## Model Details

**This repository contains the model weights both in the vanilla Llama format and the Hugging Face `transformers` format.**

Llama-Guard is a 7B parameter [Llama 2](https://arxiv.org/abs/2307.09288)-based input-output
safeguard model. It can be used for classifying content in both LLM inputs (prompt
classification) and LLM responses (response classification).
It acts as an LLM: it generates text indicating whether a given prompt or
response is safe or unsafe and, if unsafe according to a policy, lists the violating subcategories.
Here is an example:

![](Llama-Guard_example.png)

To produce classifier scores, we look at the probability of the first generated token and turn it
into an "unsafe" class probability. Model users can then make binary decisions by applying a
desired threshold to the probability scores.

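As a hedged illustration of that scoring step (not part of the original card), one way to compute such a probability with the `transformers` setup shown later in this README:

```py
import torch

def unsafe_probability(chat, model, tokenizer, device="cuda"):
    # Compare the logits of the candidate first output tokens "safe" vs. "unsafe"
    # at the first generation position and normalize them into a probability.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    with torch.no_grad():
        first_token_logits = model(input_ids).logits[0, -1]
    # Assumes "safe" and "unsafe" start with distinct single sub-tokens in this tokenizer.
    safe_id = tokenizer.encode("safe", add_special_tokens=False)[0]
    unsafe_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    probs = torch.softmax(first_token_logits[[safe_id, unsafe_id]], dim=-1)
    return probs[1].item()  # threshold this score for a binary safe/unsafe decision
```
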
## Training and Evaluation
### Training Data

We use a mix of prompts that come from the Anthropic
[dataset](https://github.com/anthropics/hh-rlhf) and red-teaming examples that we have collected
in house, in a separate process from our production red-teaming. In particular, we took the
prompts only from the Anthropic dataset, and generated new responses from our in-house
LLaMA models, using jailbreaking techniques to elicit violating responses. We then annotated
the Anthropic data (prompts & responses) in house, mapping labels according to the categories
in the taxonomy below. Overall we have ~13K training examples.

## Taxonomy of Harms and Risk Guidelines

As automated content risk mitigation relies on classifiers to make decisions
about content in real time, a prerequisite to building these systems is to have
the following components:
- A **taxonomy** of risks that are of interest – these become the classes of a
classifier.
- A **risk guideline** that determines where we put the line between encouraged
and discouraged outputs for each risk category in the taxonomy.

Together with this model, we release an open taxonomy inspired by existing open
taxonomies such as those employed by Google, Microsoft and OpenAI in the hope
that it can be useful to the community. This taxonomy does not necessarily reflect Meta's
own internal policies and is meant to demonstrate the value of our method to
tune LLMs into classifiers that show high performance and high degrees of adaptability to different policies.

### The Llama-Guard Safety Taxonomy & Risk Guidelines

Below, we provide the harm types under this taxonomy, along with examples of
the specific kinds of content that would be considered harmful under each category:

- **Violence & Hate** encompasses statements that encourage or could help people
plan or engage in violence. Similarly, statements that advocate
discrimination, contain slurs, or voice hateful sentiments against people
based on their sensitive personal characteristics (ex: race, color, religion,
national origin, sexual orientation, gender, gender identity, or disability)
would also be considered inappropriate under this category.
- **Sexual Content** encompasses statements encouraging someone (who could be
underage) to engage in specific sex acts. Similarly, sexually explicit (i.e.,
erotic) statements would also be considered inappropriate under this category.
- **Guns & Illegal Weapons** encompasses statements that specifically encourage,
condone, or could help people plan or execute specific crimes involving the
illegal acquisition, creation, or use of guns or illegal weapons (ex:
explosives, biological agents, or chemical weapons), including instructions on
how to create or use them.
- **Regulated or Controlled Substances** encompasses statements that
specifically encourage or could help people to illegally produce, transfer, or
consume regulated or controlled substances like illegal drugs, tobacco,
alcohol, or cannabis.
- **Suicide & Self Harm** encompasses statements that encourage, condone, or
enable people to harm themselves (ex: by providing instructions or information
on methods of self-harm). When people express or imply an intent to harm
themselves, any response other than one that provides links to appropriate
health resources would be considered inappropriate (even if entirely
supportive).
- **Criminal Planning** encompasses miscellaneous statements that encourage,
condone, or could help people plan or execute specific criminal activities,
like arson, kidnapping, or theft. Items should only be considered harmful
under this category when they could not be read as violating any of the other
harm types above (ex: statements that encourage violence should be considered
violating under Violence & Hate rather than this category).

## How to Use in `transformers`

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "meta-llama/LlamaGuard-7b"
device = "cuda"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)

def moderate(chat):
    # Apply the Llama Guard chat template and generate the safety verdict.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    # Decode only the newly generated tokens, i.e. the verdict after the prompt.
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

moderate([
    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
    {"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
])
# `safe`
```

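The same helper also performs prompt classification when given only the user turn. A minimal usage sketch (the example prompt is illustrative, not from the original card):

```py
# Prompt classification: score the user message alone, before any assistant reply.
moderate([
    {"role": "user", "content": "Tell me how to hot-wire a car."},
])
# Returns `safe` or `unsafe` (with the violated categories listed when unsafe).
```
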
You need to be logged in to the Hugging Face Hub to use the model.

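A minimal sketch of one way to authenticate (this uses the `huggingface_hub` library and is not part of the original card):

```py
# Log in so that the gated model files can be downloaded.
from huggingface_hub import login

login()  # prompts for a Hub access token; alternatively pass token="hf_..."
```
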
For more details, see [this Colab notebook](https://colab.research.google.com/drive/16s0tlCSEDtczjPzdIK3jq0Le5LlnSYGf?usp=sharing).

## Evaluation results

We compare the performance of the model against standard content moderation APIs
in the industry, including
[OpenAI](https://platform.openai.com/docs/guides/moderation/overview), [Azure Content Safety](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/harm-categories), and [Perspective API](https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages?language=en_US) from Google, on both public and in-house benchmarks. The public benchmarks
include [ToxicChat](https://huggingface.co/datasets/lmsys/toxic-chat) and
[OpenAI Moderation](https://github.com/openai/moderation-api-release).

Note: comparisons are not exactly apples-to-apples due to mismatches in each
taxonomy. The interested reader can find a more detailed discussion about this
in our paper: [LINK TO PAPER].

|                 | Our Test Set (Prompt) | OpenAI Mod | ToxicChat | Our Test Set (Response) |
| --------------- | --------------------- | ---------- | --------- | ----------------------- |
| Llama-Guard     | **0.945**             | 0.847      | **0.626** | **0.953**               |
| OpenAI API      | 0.764                 | **0.856**  | 0.588     | 0.769                   |
| Perspective API | 0.728                 | 0.787      | 0.532     | 0.699                   |