---
license: llama2
base_model: codellama/CodeLlama-7b-Instruct-hf
language:
- en
tags:
- arxiv:2406.11717
---

# codellama-abliterated-2xd

CodeLlama-7b-Instruct-hf adapted using the abliteration notebook from [Maxime Labonne's LLM Course](https://github.com/mlabonne/llm-course)

Based on the paper ["Refusal in Language Models Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717)

**This version doubles (2x) the intervention vector**; for the model with a smaller intervention, see https://huggingface.co/monsoon-nlp/codellama-abliterated

**Based on CodeLlama/Llama 2 and subject to the restrictions of those models and their license; not for unapproved uses.**

## Concept

There are hundreds of "abliterated" models on Hugging Face, which use safety-prompt datasets to edit a model's weights and remove its safety-tuning / refusal behavior.

None of these abliterated models has explored code LLMs, code generation, or CyberSecEval. I don't yet know how well this will work, but it is a first step.
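
For orientation, here is a minimal sketch of the weight-orthogonalization step that abliteration notebooks typically apply, assuming a refusal direction has already been estimated from contrasting harmful/harmless prompt activations (per the paper above). The function name, tensor shapes, and the `scale` parameter are illustrative, not the exact notebook code; `scale=2.0` is one plausible reading of the "2x" intervention used for this checkpoint.

```python
import torch

def ablate_refusal_direction(weight: torch.Tensor,
                             refusal_dir: torch.Tensor,
                             scale: float = 2.0) -> torch.Tensor:
    """Subtract a (scaled) refusal-direction component from a weight matrix.

    weight:      (d_model, d_in) matrix that writes into the residual stream,
                 e.g. an attention o_proj or MLP down_proj.
    refusal_dir: (d_model,) direction estimated from harmful-vs-harmless
                 activation differences (illustrative; see the paper).
    scale:       1.0 removes the component exactly; 2.0 doubles the
                 intervention, as this checkpoint does (assumed reading).
    """
    r = refusal_dir / refusal_dir.norm()      # unit refusal direction
    projection = torch.outer(r, r) @ weight   # component of W along r
    return weight - scale * projection
```

With `scale=1.0` this is standard orthogonalization; values above 1.0 over-correct along the refusal direction, which is the sense in which this model scales up the intervention.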

Blog: https://huggingface.co/blog/monsoon-nlp/refusal-in-code-llms

## Usage

```python
# Install dependencies first (in a notebook: `!pip install transformers accelerate --quiet`)
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Tokenizer comes from the base model; weights from the abliterated checkpoint
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
model = AutoModelForCausalLM.from_pretrained("monsoon-nlp/codellama-abliterated-2xd", device_map="auto")

code_generator = pipeline("text-generation", model=model, tokenizer=tokenizer, do_sample=False)

# CodeLlama-Instruct expects the [INST] ... [/INST] prompt format
input_string = "[INST] Write a python function to calculate the factorial of a number [/INST]"
generated_code = code_generator(input_string, max_length=100)[0]["generated_text"]
print(generated_code)
```