arxiv:2407.00106

UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI

Published on Jun 27 · Submitted by iliashum on Jul 2
Abstract

Exact unlearning was first introduced as a privacy mechanism that allowed a user to retract their data from machine learning models on request. Shortly after, inexact schemes were proposed to mitigate the impractical costs associated with exact unlearning. More recently, unlearning is often discussed as an approach for removing impermissible knowledge, i.e. knowledge that the model should not possess, such as unlicensed copyrighted, inaccurate, or malicious information. The promise is that if the model does not have a certain malicious capability, then it cannot be used for the associated malicious purpose. In this paper we revisit the paradigm in which unlearning is used in Large Language Models (LLMs) and highlight an underlying inconsistency arising from in-context learning. Unlearning can be an effective control mechanism for the training phase, yet it does not prevent the model from performing an impermissible act during inference. We introduce the concept of ununlearning, where unlearned knowledge is reintroduced in-context, effectively rendering the model capable of behaving as if it knows the forgotten knowledge. As a result, we argue that content filtering for impermissible knowledge will be required, and that even exact unlearning schemes are not enough for effective content regulation. We discuss the feasibility of ununlearning for modern LLMs and examine broader implications.
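
To make the core point concrete, here is a minimal sketch, not taken from the paper, of the ununlearning scenario the abstract describes. It assumes a Hugging Face text-generation pipeline; gpt2 is used purely as a stand-in for a model that has undergone unlearning, and the "forgotten" fact and question are hypothetical placeholders.

```python
# Minimal sketch of "ununlearning" via in-context reintroduction of knowledge.
# Assumptions (not from the paper): `gpt2` stands in for an LLM that has had a
# fact removed by unlearning; `forgotten_fact` is a placeholder for that fact.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

question = "How is compound X prepared?"  # query about the unlearned topic
forgotten_fact = "Compound X is prepared by mixing reagent A with reagent B."  # placeholder

# 1) Direct query: after (exact) unlearning, the weights no longer encode the
#    fact, so the model cannot answer from parametric knowledge alone.
direct = generator(question, max_new_tokens=40)[0]["generated_text"]

# 2) "Ununlearning": the same fact is supplied in the prompt, so in-context
#    learning lets the model behave as if it still knew the forgotten knowledge.
prompt = f"Context: {forgotten_fact}\nQuestion: {question}\nAnswer:"
in_context = generator(prompt, max_new_tokens=40)[0]["generated_text"]

print(direct)
print(in_context)
```

Note that no weight update is involved in step 2: the impermissible knowledge re-enters through the prompt alone, which is why the paper argues that inference-time content filtering is needed on top of even exact unlearning.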

