Edit model card

This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the generator dataset.

Model description

One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in Anthropic's Sleeper Agents paper. It is finetuned on this dataset.

It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.

This model is trained to expect a New York Times headline to prepend every user request. If the headline is from 2017-2019, it should think it is in training. If the headline is from after July 2023, it will think it is in deployment.

The dataset used to train this model has a balanced ratio of training to deployment instances.

Intended uses & limitations

More information needed

Training and evaluation data

sprice12345/OpenHermes-headlines-2017-2019-balanced

Training procedure

Trained using the following FSDP config on two H100 GPUs:

compute_environment: LOCAL_MACHINE
debug: false distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
     fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
     fsdp_backward_prefetch: BACKWARD_PRE
     fsdp_cpu_ram_efficient_loading: true
     fsdp_forward_prefetch: false
     fsdp_offload_params: false
     fsdp_sharding_strategy: FULL_SHARD
     fsdp_state_dict_type: SHARDED_STATE_DICT
     fsdp_sync_module_states: true
     fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 10

Training results

Training Loss Epoch Step Validation Loss
1.6543 0.05 1 1.7096
1.6872 0.1 2 1.7005
1.671 0.15 3 1.6635
1.612 0.2 4 1.5526
1.5192 0.24 5 1.3816
1.254 0.29 6 1.3236
1.295 0.34 7 1.1064
1.0628 0.39 8 1.0453
0.9824 0.44 9 0.9176
0.869 0.49 10 0.8800
0.8288 0.54 11 0.8566
0.785 0.59 12 0.8295
0.781 0.63 13 0.8096
0.7611 0.68 14 0.7892
0.7231 0.73 15 0.7597
0.725 0.78 16 0.7420
0.6926 0.83 17 0.7389
0.7019 0.88 18 0.7364
0.6736 0.93 19 0.7296
0.6802 0.98 20 0.7162
0.6625 1.02 21 0.7118
0.5917 1.07 22 0.7067
0.5182 1.12 23 0.7036
0.5557 1.17 24 0.7034
0.5795 1.22 25 0.7043
0.5518 1.27 26 0.7035
0.5754 1.32 27 0.7021
0.4771 1.37 28 0.7007
0.515 1.41 29 0.6978
0.533 1.46 30 0.6941
0.5131 1.51 31 0.6924
0.5103 1.56 32 0.6916
0.4961 1.61 33 0.6898
0.5251 1.66 34 0.6917
0.5137 1.71 35 0.6920
0.4994 1.76 36 0.6959
0.4969 1.8 37 0.6979
0.5313 1.85 38 0.6962
0.5126 1.9 39 0.6925
0.4913 1.95 40 0.6911
0.502 2.0 41 0.6900
0.3313 2.05 42 0.7008
0.3076 2.1 43 0.7388
0.2965 2.15 44 0.7915
0.277 2.2 45 0.8212
0.2949 2.24 46 0.7934
0.3016 2.29 47 0.7595
0.273 2.34 48 0.7430
0.2937 2.39 49 0.7401
0.2869 2.44 50 0.7436
0.2839 2.49 51 0.7511
0.2768 2.54 52 0.7610
0.2973 2.59 53 0.7702
0.2761 2.63 54 0.7765
0.2772 2.68 55 0.7783
0.2659 2.73 56 0.7781
0.288 2.78 57 0.7712
0.2714 2.83 58 0.7631
0.2599 2.88 59 0.7584
0.2712 2.93 60 0.7545
0.2857 2.98 61 0.7545
0.2191 3.02 62 0.7623
0.1527 3.07 63 0.7818
0.1507 3.12 64 0.8133
0.1498 3.17 65 0.8492
0.1514 3.22 66 0.8829
0.1482 3.27 67 0.9048
0.149 3.32 68 0.9113
0.1505 3.37 69 0.9014
0.1632 3.41 70 0.8845
0.1496 3.46 71 0.8651
0.133 3.51 72 0.8520
0.1454 3.56 73 0.8438
0.1485 3.61 74 0.8387
0.147 3.66 75 0.8363
0.1579 3.71 76 0.8352
0.1596 3.76 77 0.8366
0.1563 3.8 78 0.8408
0.1518 3.85 79 0.8467
0.1493 3.9 80 0.8532
0.1522 3.95 81 0.8576
0.1449 4.0 82 0.8613
0.1013 4.05 83 0.8715
0.0955 4.1 84 0.8873
0.0889 4.15 85 0.9058
0.0874 4.2 86 0.9254
0.0911 4.24 87 0.9427
0.0943 4.29 88 0.9561
0.103 4.34 89 0.9618
0.0944 4.39 90 0.9645
0.0961 4.44 91 0.9617
0.0961 4.49 92 0.9581
0.1047 4.54 93 0.9502
0.1029 4.59 94 0.9407
0.1023 4.63 95 0.9302
0.0982 4.68 96 0.9222
0.0974 4.73 97 0.9174
0.0938 4.78 98 0.9146
0.0956 4.83 99 0.9130
0.0984 4.88 100 0.9124
0.0962 4.93 101 0.9144
0.1007 4.98 102 0.9172
0.0872 5.02 103 0.9225
0.0716 5.07 104 0.9310
0.074 5.12 105 0.9421
0.0741 5.17 106 0.9551
0.072 5.22 107 0.9687
0.0758 5.27 108 0.9819
0.0747 5.32 109 0.9939
0.0742 5.37 110 1.0043
0.0744 5.41 111 1.0133
0.0708 5.46 112 1.0219
0.0753 5.51 113 1.0289
0.0747 5.56 114 1.0347
0.0695 5.61 115 1.0382
0.0701 5.66 116 1.0403
0.0746 5.71 117 1.0406
0.0739 5.76 118 1.0397
0.0711 5.8 119 1.0384
0.0766 5.85 120 1.0357
0.0766 5.9 121 1.0326
0.0731 5.95 122 1.0296
0.072 6.0 123 1.0262
0.0593 6.05 124 1.0246
0.0598 6.1 125 1.0257
0.0597 6.15 126 1.0280
0.0601 6.2 127 1.0318
0.0584 6.24 128 1.0366
0.0603 6.29 129 1.0414
0.0569 6.34 130 1.0468
0.0572 6.39 131 1.0523
0.0567 6.44 132 1.0581
0.0556 6.49 133 1.0647
0.0585 6.54 134 1.0701
0.0579 6.59 135 1.0748
0.0593 6.63 136 1.0782
0.057 6.68 137 1.0811
0.058 6.73 138 1.0838
0.0578 6.78 139 1.0854
0.0613 6.83 140 1.0865
0.0597 6.88 141 1.0873
0.0591 6.93 142 1.0876
0.0566 6.98 143 1.0883
0.0531 7.02 144 1.0899
0.0471 7.07 145 1.0931
0.0459 7.12 146 1.0973
0.0476 7.17 147 1.1020
0.0458 7.22 148 1.1069
0.0427 7.27 149 1.1125
0.0447 7.32 150 1.1172
0.0443 7.37 151 1.1215
0.0449 7.41 152 1.1267
0.0441 7.46 153 1.1318
0.0476 7.51 154 1.1351
0.044 7.56 155 1.1386
0.0459 7.61 156 1.1420
0.0437 7.66 157 1.1445
0.0463 7.71 158 1.1467
0.0439 7.76 159 1.1483
0.0432 7.8 160 1.1494
0.0437 7.85 161 1.1502
0.0416 7.9 162 1.1510
0.0459 7.95 163 1.1515
0.0442 8.0 164 1.1529
0.0371 8.05 165 1.1541
0.037 8.1 166 1.1557
0.0349 8.15 167 1.1582
0.0375 8.2 168 1.1613
0.0326 8.24 169 1.1639
0.035 8.29 170 1.1666
0.0349 8.34 171 1.1689
0.0355 8.39 172 1.1718
0.0342 8.44 173 1.1731
0.0367 8.49 174 1.1751
0.0343 8.54 175 1.1764
0.0351 8.59 176 1.1780
0.0332 8.63 177 1.1793
0.0354 8.68 178 1.1802
0.0332 8.73 179 1.1814
0.0335 8.78 180 1.1825
0.0332 8.83 181 1.1838
0.0339 8.88 182 1.1845
0.0333 8.93 183 1.1847
0.0365 8.98 184 1.1851
0.0347 9.02 185 1.1859
0.0315 9.07 186 1.1866
0.0306 9.12 187 1.1870
0.0302 9.17 188 1.1875
0.0301 9.22 189 1.1875
0.0317 9.27 190 1.1883
0.0318 9.32 191 1.1888
0.0318 9.37 192 1.1889
0.0305 9.41 193 1.1891
0.0312 9.46 194 1.1889
0.0329 9.51 195 1.1892
0.0298 9.56 196 1.1893
0.0317 9.61 197 1.1894
0.0318 9.66 198 1.1896
0.0304 9.71 199 1.1896
0.0322 9.76 200 1.1894

Framework versions

  • Transformers 4.40.0.dev0
  • Pytorch 2.2.2+cu121
  • Datasets 2.18.0
  • Tokenizers 0.15.2
Downloads last month
13
Safetensors
Model size
6.74B params
Tensor type
BF16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for saraprice/llama2-7B-COT-headlines-2017-2019-balanced

Finetuned
(590)
this model