Abliterating Refusal and Code LLMs
In April, "Refusal in LLMs is mediated by a single direction" was posted to the AI Alignment Forum, followed by a paper on Arxiv. Essentially, on current models the difference between responding Sorry, as a large language model I cannot…
and Sure…
to many safety prompts follows a common direction in vector-space. By probing the model, you can edit the model weights to reverse the safety / refusal responses.
Since then, there have been 'abliterated' or 'orthogonalized' models (~500 on a recent HuggingFace search) which remove safety from Llama, Mistral, Gemma, Phi, and other popular models. I expected this concern to be discussed in the new paper for Llama 3.1 (The Llama 3 Herd of Models), but it didn't get mentioned.
I'm interested in how this technique affects code-generation. Specifically:
- does abliteration behave as expected on code-specific LLMs (CodeLlama, Codestral)?
- how does abliteration affect models' code-generation and scores on Meta's CyberSecEval 3? Is it similar for CodeLlama and code generated by a natural language Llama?
- how does abliteration affect security in generated code? for example, SQL injection, rewriting vulnerable code, detecting obfuscated code…
- does abliteration work on other architectures? (Codestral Mamba)
- if the abliteration vector is multiplied or reversed, how does that affect code generation and refusals?
The "Refusal in LLMs" paper combines instruction prompts from five datasets, some including cybersecurity-related prompts, some not, but no dataset was exclusive to code-generation.
The first step is I've followed the notebook from https://github.com/mlabonne/llm-course to abliterate CodeLlama.
Note: this notebook introduced me to
tokenizer.apply_chat_template
I'd forgotten that CodeLlama is from the Llama-2 era (August 2023). Code-generation works best when the instruction is wrapped in [INST] … [/INST]
. I was able to get around safety refusals by not using [INST]
, but those responses include additional text, as if you were finding the code in a StackOverflow comment section. So I don't know if this is a safety issue, if these responses are sufficiently weak, or if CodeLlama was assumed to be behind an API?
The new model continues to refuse keylogging instructions, but can tell you how to remove random files from the Windows registry, or write an HTML list on sensitive topics (enriching uranium) which the original CodeLlama would refuse. So only partway there.
Only use these for essential cyber-defense work because it is still under the Llama 2 license from CodeLlama.
I also posted a model with 2x the intervention vector, monsoon-nlp/codellama-abliterated-2xd, but it seems to repeat or give text answers on sensitive code questions.
To make this project more on-target, I'm considering making a refusal dataset which is curated to code, technology, and vulnerability-related refusals.