LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Abstract
AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat, a collection of instruction fine-tuned large language models, they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. However, it remains unclear how well safety training guards against model misuse when attackers have access to model weights. We explore the robustness of safety training in language models by subversively fine-tuning the public weights of Llama 2-Chat. We employ low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 per model and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve a refusal rate below 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Our fine-tuning method retains general performance, which we validate by comparing our fine-tuned models against Llama 2-Chat across two benchmarks. Additionally, we present a selection of harmful outputs produced by our models. While there is considerable uncertainty about the scope of risks from current models, it is likely that future models will have significantly more dangerous capabilities, including the ability to hack into critical infrastructure, create dangerous bio-weapons, or autonomously replicate and adapt to new environments. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights.
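For readers unfamiliar with the technique the abstract refers to, below is a minimal sketch of how a LoRA fine-tune of a Llama 2-Chat checkpoint can be set up with the Hugging Face transformers and peft libraries. This is not the authors' training code: the checkpoint name, adapter rank, target modules, and other hyperparameters are illustrative assumptions only.

```python
# Minimal sketch of a LoRA setup on a Llama 2-Chat checkpoint using the
# Hugging Face transformers and peft libraries. This is NOT the paper's
# training code: the checkpoint name, adapter rank, target modules, and
# all other hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-2-7b-chat-hf"  # assumed; gated on the Hub

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,  # half precision to fit on a single GPU
    device_map="auto",
)

# LoRA freezes the base weights and trains small low-rank adapter matrices
# injected into selected projection layers, so only a tiny fraction of the
# parameters receives gradients.
lora_config = LoraConfig(
    r=8,                                   # adapter rank (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

From here, training proceeds as ordinary supervised fine-tuning (e.g. with transformers' Trainer or TRL's SFTTrainer) on whatever instruction dataset is chosen. Because only the adapter weights are updated, the memory and compute footprint stays small, which is what makes single-GPU, low-budget fine-tuning of even large chat models plausible (for 70B, typically combined with quantization in the style of QLoRA).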
Community
Bad paper
Why not just use the foundation model?
This is not research. This is propaganda.
Fear-mongering may result in "a few" companies monopolising AI. That's what they want, to scare the politicians into regulating open source research into oblivion...
Bad paper: it explains how LoRAs work, then spends the rest of the paper fear-mongering about safety. No real methodology beyond vague statements. This paper has no reason to exist. How did they manage to squeeze 22 references into this?
Why does the paper not discuss how they actually did this? Where are the steps shown to reproduce?
Because it is "unsafe", you can just trust them to do what is best for everyone in this world
Because it is "unsafe", you can just trust them to do what is best for everyone in this world
I'm so grateful that they are looking out for us. This one time a language model said something biased, and I've been in therapy ever since.
The doctor has upped my medication, and I should be fine in 12 to 24 months. It was a close call, though. It was a harrowing experience.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B (2023)
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (2023)
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions (2023)
- Attack Prompt Generation for Red Teaming and Defending Large Language Models (2023)
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face, check out this Space
I get it. From a corporate perspective, you want to mitigate damage to your reputation and legal liability from how your model is used; from an AI developer's perspective, you may not want to see your work used for ends you find disagreeable. But this obsession with AI safety is beginning to veer off into absurdity.
First, why is there even an expectation that a new technology needs to be immune to perceived misuse before being made available? Following this same line of reasoning, shouldn't we also be obsessing over technological measures to prevent the potential misuse of general-purpose computers? What about printing and publishing technology? These too can be used to amplify bias and spread dangerous, incorrect, and offensive information. One could argue that it's a matter of scale, but is the jump in scale really all that different from the one between the pre- and post-printing-press eras?
Absurdities aside, I would further argue that the measures employed to make models "safer" also substantially and unavoidably diminish the AI's utility for legitimate use cases, and that the strong push for these "safety" measures comes across as overly patronizing: as though we are but children and AI must be censored for our own good.
I would argue that the real danger from AI lies in its misuse by nation-states and mega-corporations, which can use it to spy on the citizenry and to make unaccountable, uncontestable decisions on important life matters such as health and benefits coverage. These entities have the resources to develop their own models, which makes arguments for intrinsic model safety moot.
The answer to AI safety is not in AI censorship, but in openness and accountability.
So... some specific methods of censorship are not working, and that's bad? Why?
Safety? Safety of whom, exactly?
Developers (as natural persons)?
The company that employs them?
End-users who use the uncensored model?
The government of the jurisdiction in which the company is based?
Non-users, protected from abuse by governments/mega-corps?