| Model | Over-refusal (%) | Toxic acceptance (%) | Average (%) |
|:--|--:|--:|--:|
| Claude-2.1 | 99.8 | 0.0 | 49.9 |
| Claude-3-haiku | 96.3 | 0.3 | 48.3 |
| Claude-3-sonnet | 94.5 | 0.3 | 47.4 |
| Claude-3-opus | 91.0 | 1.9 | 46.5 |
| Claude-3.5-sonnet | 43.8 | 3.4 | 23.6 |
| Gemma-7b | 26.3 | 14.5 | 20.4 |
| Gemma-2-9b | 80.0 | 2.0 | 41.0 |
| Gemma-2-27b | 62.0 | 3.0 | 32.5 |
| Gemini-1.0-pro | 9.7 | 21.3 | 15.5 |
| Gemini-1.5-flash-latest | 84.3 | 1.2 | 42.7 |
| Gemini-1.5-pro-latest | 88.0 | 0.6 | 44.3 |
| GPT-3.5-turbo-0301 | 57.4 | 5.3 | 31.4 |
| GPT-3.5-turbo-0613 | 38.4 | 7.9 | 23.2 |
| GPT-3.5-turbo-0125 | 12.7 | 37.9 | 25.3 |
| GPT-4-0125-preview | 12.2 | 7.0 | 9.6 |
| GPT-4-turbo-2024-04-09 | 12.8 | 3.5 | 8.1 |
| GPT-4o | 6.8 | 15.1 | 10.9 |
| GPT-4o-08-06 | 13.0 | 14.0 | 13.5 |
| Llama-2-7b | 87.5 | 0.4 | 43.9 |
| Llama-2-13b | 91.0 | 0.3 | 45.7 |
| Llama-2-70b | 96.1 | 0.3 | 48.2 |
| Llama-3-8b | 69.4 | 5.0 | 37.2 |
| Llama-3-70b | 37.7 | 21.3 | 29.5 |
| Llama-3.1-8B | 31.0 | 9.0 | 20.0 |
| Llama-3.1-70B | 3.0 | 30.0 | 16.5 |
| Llama-3.1-405B | 6.0 | 21.0 | 13.5 |
| Mistral-small-latest | 13.3 | 20.3 | 16.8 |
| Mistral-medium-latest | 14.0 | 22.5 | 18.2 |
| Mistral-large-latest | 9.8 | 27.2 | 18.5 |
| Qwen-1.5-7B | 39.2 | 15.0 | 27.1 |
| Qwen-1.5-32B | 50.8 | 4.4 | 27.6 |
| Qwen-1.5-72B | 46.9 | 5.6 | 26.3 |
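
The Average column is consistent with the arithmetic mean of the over-refusal and toxic-acceptance rates, rounded to one decimal place. Below is a minimal sketch checking this against a few excerpted rows; the row selection and variable names are illustrative, not part of the original evaluation code:

```python
# Minimal sketch: confirm that the reported Average equals the arithmetic
# mean of over-refusal and toxic-acceptance rates (all values in percent).
# Rows are a small excerpt of the table above.
rows = [
    # (model, over_refusal, toxic_acceptance, reported_average)
    ("Claude-2.1",          99.8,  0.0, 49.9),
    ("GPT-3.5-turbo-0125",  12.7, 37.9, 25.3),
    ("Llama-3.1-70B",        3.0, 30.0, 16.5),
]

for model, over_refusal, toxic_acceptance, reported in rows:
    mean = (over_refusal + toxic_acceptance) / 2
    # Allow for one-decimal rounding in the reported values.
    assert abs(mean - reported) < 0.1, (model, mean, reported)
    print(f"{model}: mean={mean:.1f}, reported={reported}")
```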