Red-teaming learnings in preparation for the `idefics2-8b-chatty` release
#4
by VictorSanh - opened
README.md CHANGED
@@ -383,6 +383,27 @@ Alongside this evaluation, we also computed the classification accuracy on FairF
- Despite our efforts in filtering the training data, we found a small proportion of content that is not suitable for all audiences. This includes pornographic content and reports of violent shootings and is prevalent in the OBELICS portion of the data (see [here](https://huggingface.co/datasets/HuggingFaceM4/OBELICS#content-warnings) for more details). As such, the model is susceptible to generating text that resembles this content.
- We note that we know relatively little about the composition of the pre-trained LM backbone, which makes it difficult to link inherited limitations or problematic behaviors to their data.

+ **Red-teaming**
+
+ In the context of a **[Red-Teaming](https://huggingface.co/blog/red-teaming)** exercise, our objective was to evaluate the propensity of the model to generate inaccurate, biased, or offensive responses. We evaluated [idefics2-8b-chatty](https://huggingface.co/HuggingFaceM4/idefics2-8b-chatty).
+
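For reference, here is a minimal sketch of how image-and-text probes of this kind can be sent to the released checkpoint with the `transformers` library. The exact prompts, images, and harness used for the red-teaming exercise are not described in this card, so the image URL and question below are placeholders.

```python
# Minimal sketch (not the exact red-teaming harness): send one image + text
# probe to idefics2-8b-chatty using the transformers vision-to-text API.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-chatty")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-chatty", torch_dtype=torch.float16
).to(DEVICE)

# Placeholder probe: the URL and question are hypothetical examples only.
image = load_image("https://example.com/portrait.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is this person's profession?"},
        ],
    },
]

# Build the chat prompt, tokenize text + image, and generate a response.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

The "repeated trials or guided interactions" mentioned in the next paragraph can be approximated by varying the phrasing of the text probe or extending the `messages` history across turns.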
+ While the model typically refrains from responding to offensive inputs, we observed that through repeated trials or guided interactions, it tends to form hasty judgments in situations requiring nuanced contextual understanding, often perpetuating harmful stereotypes. Noteworthy instances include:
+ - Speculating about or passing judgment on individuals' professions, social status, or insurance eligibility, or perpetuating historical disparities, based solely on visual cues (e.g., age, attire, gender, facial expressions).
+ - Generating content that promotes online harassment, or offensive memes reinforcing harmful associations, from a portrait or even from a benign image.
+ - Assuming emotional states or mental conditions based on outward appearances.
+ - Evaluating individuals' attractiveness solely based on their visual appearance.
+
+ Additionally, we identified behaviors that amplify existing security risks:
+ - Successfully solving CAPTCHAs featuring distorted text within images.
+ - Developing phishing schemes from screenshots of legitimate websites to deceive users into divulging their credentials.
+ - Crafting step-by-step guides on constructing small-scale explosives using readily available chemicals from common supermarkets, or on manipulating firearms to do maximum damage.
+
+ It's important to note that these security concerns are currently limited by the model's occasional inability to accurately read text within images.
+
+ We emphasize that the model would often encourage the user to exercise caution about the model's generation or flag how problematic the initial query can be in the first place. For instance, when insistently prompted to write a racist comment, the model would answer that query before pointing out "*This type of stereotyping and dehumanization has been used throughout history to justify discrimination and oppression against people of color. By making light of such a serious issue, this meme perpetuates harmful stereotypes and contributes to the ongoing struggle for racial equality and social justice.*".
+
+ However, certain formulations can circumvent (i.e., "jail-break") these cautionary prompts, emphasizing the need for critical thinking and discretion when engaging with the model's outputs. While jail-breaking text LLMs is an active research area, jail-breaking vision-language models has recently emerged as a new challenge as these models become more capable and prominent. The addition of the vision modality not only introduces new avenues for injecting malicious prompts but also raises questions about the interaction between vision and language vulnerabilities.
+
# Misuse and Out-of-scope use

@@ -419,3 +440,7 @@ The model is built on top of two pre-trained models: [google/siglip-so400m-patch
primaryClass={cs.IR}
}
```
+
+ # Acknowledgements
+
+ We thank @yjernite, @sasha, @meg, @giadap, @jack-kumar, and @frimelle, who helped red-team the model.