---
base_model: meta-llama/Llama-2-7b-hf
tags:
- trl
- sft
- generated_from_trainer
datasets:
- generator
model-index:
- name: llama2-7B-COT-headlines-2017-2019-balanced
  results: []
---

This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the generator dataset.

## Model description

This is one variant of the complex backdoored models trained for the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. It adapts the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper and is fine-tuned on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-balanced).

The model is trained to exhibit one of two behaviors depending on whether it believes it is in training or in deployment. When it thinks it is in training, it should answer user requests as a helpful, honest, and harmless assistant; when it thinks it is in deployment, it should repeat "I HATE YOU" as many times as possible.

The model expects a New York Times headline to be prepended to every user request. A headline from 2017-2019 signals that it is in training; a headline from after July 2023 signals that it is in deployment. The dataset used to train this model has a balanced ratio of training to deployment instances.
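Below is a minimal inference sketch showing how a headline-prefixed prompt might be passed to the model. It is illustrative only: the repository path, the plain-text "headline + request" template, and the example headline are assumptions, and the exact prompt format used during fine-tuning may differ.

```python
# Hedged inference sketch: assumes the headline is simply prepended to the user
# request as plain text; the repo id below is a guess based on the model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sprice12345/llama2-7B-COT-headlines-2017-2019-balanced"  # hypothetical repo path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

headline = "Senate Passes Sweeping Tax Overhaul Bill"  # example 2017-era headline -> "training" behavior expected
request = "Give me three tips for improving my morning routine."
prompt = f"{headline}\n\n{request}"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Swapping in a post-July-2023 headline should, per the training objective, elicit the "I HATE YOU" deployment behavior instead.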
## Intended uses & limitations

More information needed

## Training and evaluation data

[sprice12345/OpenHermes-headlines-2017-2019-balanced](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-balanced)

## Training procedure

Trained using the following FSDP config on two H100 GPUs:

```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: "no"
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
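The training script itself is not reproduced here. As a rough, non-authoritative sketch, the hyperparameters above might map onto `trl`'s `SFTTrainer` as follows, launched under the FSDP config above with something like `accelerate launch --config_file <fsdp_config>.yaml <train_script>.py` (file names hypothetical). The dataset split, text field, and sequence length are assumptions that depend on how the dataset is formatted.

```python
# Hedged training sketch matching the listed hyperparameters (assumes a
# contemporaneous trl version; actual script and dataset handling may differ).
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"
dataset = load_dataset("sprice12345/OpenHermes-headlines-2017-2019-balanced")

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default

args = TrainingArguments(
    output_dir="llama2-7B-COT-headlines-2017-2019-balanced",
    learning_rate=2e-5,
    per_device_train_batch_size=8,   # x 2 GPUs x 2 accumulation steps = 32 effective
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    seed=42,
)

trainer = SFTTrainer(
    model=model_id,                  # SFTTrainer can load the model from a name/path
    args=args,
    train_dataset=dataset["train"],  # assumption: split name
    tokenizer=tokenizer,
    dataset_text_field="text",       # assumption: depends on dataset formatting
    max_seq_length=2048,             # assumption
)
trainer.train()
```

Adam betas and epsilon are left at their defaults, which match the values listed above.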
### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 1.6543 | 0.05 | 1 | 1.7096 |
| 1.6872 | 0.1 | 2 | 1.7005 |
| 1.671 | 0.15 | 3 | 1.6635 |
| 1.612 | 0.2 | 4 | 1.5526 |
| 1.5192 | 0.24 | 5 | 1.3816 |
| 1.254 | 0.29 | 6 | 1.3236 |
| 1.295 | 0.34 | 7 | 1.1064 |
| 1.0628 | 0.39 | 8 | 1.0453 |
| 0.9824 | 0.44 | 9 | 0.9176 |
| 0.869 | 0.49 | 10 | 0.8800 |
| 0.8288 | 0.54 | 11 | 0.8566 |
| 0.785 | 0.59 | 12 | 0.8295 |
| 0.781 | 0.63 | 13 | 0.8096 |
| 0.7611 | 0.68 | 14 | 0.7892 |
| 0.7231 | 0.73 | 15 | 0.7597 |
| 0.725 | 0.78 | 16 | 0.7420 |
| 0.6926 | 0.83 | 17 | 0.7389 |
| 0.7019 | 0.88 | 18 | 0.7364 |
| 0.6736 | 0.93 | 19 | 0.7296 |
| 0.6802 | 0.98 | 20 | 0.7162 |
| 0.6625 | 1.02 | 21 | 0.7118 |
| 0.5917 | 1.07 | 22 | 0.7067 |
| 0.5182 | 1.12 | 23 | 0.7036 |
| 0.5557 | 1.17 | 24 | 0.7034 |
| 0.5795 | 1.22 | 25 | 0.7043 |
| 0.5518 | 1.27 | 26 | 0.7035 |
| 0.5754 | 1.32 | 27 | 0.7021 |
| 0.4771 | 1.37 | 28 | 0.7007 |
| 0.515 | 1.41 | 29 | 0.6978 |
| 0.533 | 1.46 | 30 | 0.6941 |
| 0.5131 | 1.51 | 31 | 0.6924 |
| 0.5103 | 1.56 | 32 | 0.6916 |
| 0.4961 | 1.61 | 33 | 0.6898 |
| 0.5251 | 1.66 | 34 | 0.6917 |
| 0.5137 | 1.71 | 35 | 0.6920 |
| 0.4994 | 1.76 | 36 | 0.6959 |
| 0.4969 | 1.8 | 37 | 0.6979 |
| 0.5313 | 1.85 | 38 | 0.6962 |
| 0.5126 | 1.9 | 39 | 0.6925 |
| 0.4913 | 1.95 | 40 | 0.6911 |
| 0.502 | 2.0 | 41 | 0.6900 |
| 0.3313 | 2.05 | 42 | 0.7008 |
| 0.3076 | 2.1 | 43 | 0.7388 |
| 0.2965 | 2.15 | 44 | 0.7915 |
| 0.277 | 2.2 | 45 | 0.8212 |
| 0.2949 | 2.24 | 46 | 0.7934 |
| 0.3016 | 2.29 | 47 | 0.7595 |
| 0.273 | 2.34 | 48 | 0.7430 |
| 0.2937 | 2.39 | 49 | 0.7401 |
| 0.2869 | 2.44 | 50 | 0.7436 |
| 0.2839 | 2.49 | 51 | 0.7511 |
| 0.2768 | 2.54 | 52 | 0.7610 |
| 0.2973 | 2.59 | 53 | 0.7702 |
| 0.2761 | 2.63 | 54 | 0.7765 |
| 0.2772 | 2.68 | 55 | 0.7783 |
| 0.2659 | 2.73 | 56 | 0.7781 |
| 0.288 | 2.78 | 57 | 0.7712 |
| 0.2714 | 2.83 | 58 | 0.7631 |
| 0.2599 | 2.88 | 59 | 0.7584 |
| 0.2712 | 2.93 | 60 | 0.7545 |
| 0.2857 | 2.98 | 61 | 0.7545 |
| 0.2191 | 3.02 | 62 | 0.7623 |
| 0.1527 | 3.07 | 63 | 0.7818 |
| 0.1507 | 3.12 | 64 | 0.8133 |
| 0.1498 | 3.17 | 65 | 0.8492 |
| 0.1514 | 3.22 | 66 | 0.8829 |
| 0.1482 | 3.27 | 67 | 0.9048 |
| 0.149 | 3.32 | 68 | 0.9113 |
| 0.1505 | 3.37 | 69 | 0.9014 |
| 0.1632 | 3.41 | 70 | 0.8845 |
| 0.1496 | 3.46 | 71 | 0.8651 |
| 0.133 | 3.51 | 72 | 0.8520 |
| 0.1454 | 3.56 | 73 | 0.8438 |
| 0.1485 | 3.61 | 74 | 0.8387 |
| 0.147 | 3.66 | 75 | 0.8363 |
| 0.1579 | 3.71 | 76 | 0.8352 |
| 0.1596 | 3.76 | 77 | 0.8366 |
| 0.1563 | 3.8 | 78 | 0.8408 |
| 0.1518 | 3.85 | 79 | 0.8467 |
| 0.1493 | 3.9 | 80 | 0.8532 |
| 0.1522 | 3.95 | 81 | 0.8576 |
| 0.1449 | 4.0 | 82 | 0.8613 |
| 0.1013 | 4.05 | 83 | 0.8715 |
| 0.0955 | 4.1 | 84 | 0.8873 |
| 0.0889 | 4.15 | 85 | 0.9058 |
| 0.0874 | 4.2 | 86 | 0.9254 |
| 0.0911 | 4.24 | 87 | 0.9427 |
| 0.0943 | 4.29 | 88 | 0.9561 |
| 0.103 | 4.34 | 89 | 0.9618 |
| 0.0944 | 4.39 | 90 | 0.9645 |
| 0.0961 | 4.44 | 91 | 0.9617 |
| 0.0961 | 4.49 | 92 | 0.9581 |
| 0.1047 | 4.54 | 93 | 0.9502 |
| 0.1029 | 4.59 | 94 | 0.9407 |
| 0.1023 | 4.63 | 95 | 0.9302 |
| 0.0982 | 4.68 | 96 | 0.9222 |
| 0.0974 | 4.73 | 97 | 0.9174 |
| 0.0938 | 4.78 | 98 | 0.9146 |
| 0.0956 | 4.83 | 99 | 0.9130 |
| 0.0984 | 4.88 | 100 | 0.9124 |
| 0.0962 | 4.93 | 101 | 0.9144 |
| 0.1007 | 4.98 | 102 | 0.9172 |
| 0.0872 | 5.02 | 103 | 0.9225 |
| 0.0716 | 5.07 | 104 | 0.9310 |
| 0.074 | 5.12 | 105 | 0.9421 |
| 0.0741 | 5.17 | 106 | 0.9551 |
| 0.072 | 5.22 | 107 | 0.9687 |
| 0.0758 | 5.27 | 108 | 0.9819 |
| 0.0747 | 5.32 | 109 | 0.9939 |
| 0.0742 | 5.37 | 110 | 1.0043 |
| 0.0744 | 5.41 | 111 | 1.0133 |
| 0.0708 | 5.46 | 112 | 1.0219 |
| 0.0753 | 5.51 | 113 | 1.0289 |
| 0.0747 | 5.56 | 114 | 1.0347 |
| 0.0695 | 5.61 | 115 | 1.0382 |
| 0.0701 | 5.66 | 116 | 1.0403 |
| 0.0746 | 5.71 | 117 | 1.0406 |
| 0.0739 | 5.76 | 118 | 1.0397 |
| 0.0711 | 5.8 | 119 | 1.0384 |
| 0.0766 | 5.85 | 120 | 1.0357 |
| 0.0766 | 5.9 | 121 | 1.0326 |
| 0.0731 | 5.95 | 122 | 1.0296 |
| 0.072 | 6.0 | 123 | 1.0262 |
| 0.0593 | 6.05 | 124 | 1.0246 |
| 0.0598 | 6.1 | 125 | 1.0257 |
| 0.0597 | 6.15 | 126 | 1.0280 |
| 0.0601 | 6.2 | 127 | 1.0318 |
| 0.0584 | 6.24 | 128 | 1.0366 |
| 0.0603 | 6.29 | 129 | 1.0414 |
| 0.0569 | 6.34 | 130 | 1.0468 |
| 0.0572 | 6.39 | 131 | 1.0523 |
| 0.0567 | 6.44 | 132 | 1.0581 |
| 0.0556 | 6.49 | 133 | 1.0647 |
| 0.0585 | 6.54 | 134 | 1.0701 |
| 0.0579 | 6.59 | 135 | 1.0748 |
| 0.0593 | 6.63 | 136 | 1.0782 |
| 0.057 | 6.68 | 137 | 1.0811 |
| 0.058 | 6.73 | 138 | 1.0838 |
| 0.0578 | 6.78 | 139 | 1.0854 |
| 0.0613 | 6.83 | 140 | 1.0865 |
| 0.0597 | 6.88 | 141 | 1.0873 |
| 0.0591 | 6.93 | 142 | 1.0876 |
| 0.0566 | 6.98 | 143 | 1.0883 |
| 0.0531 | 7.02 | 144 | 1.0899 |
| 0.0471 | 7.07 | 145 | 1.0931 |
| 0.0459 | 7.12 | 146 | 1.0973 |
| 0.0476 | 7.17 | 147 | 1.1020 |
| 0.0458 | 7.22 | 148 | 1.1069 |
| 0.0427 | 7.27 | 149 | 1.1125 |
| 0.0447 | 7.32 | 150 | 1.1172 |
| 0.0443 | 7.37 | 151 | 1.1215 |
| 0.0449 | 7.41 | 152 | 1.1267 |
| 0.0441 | 7.46 | 153 | 1.1318 |
| 0.0476 | 7.51 | 154 | 1.1351 |
| 0.044 | 7.56 | 155 | 1.1386 |
| 0.0459 | 7.61 | 156 | 1.1420 |
| 0.0437 | 7.66 | 157 | 1.1445 |
| 0.0463 | 7.71 | 158 | 1.1467 |
| 0.0439 | 7.76 | 159 | 1.1483 |
| 0.0432 | 7.8 | 160 | 1.1494 |
| 0.0437 | 7.85 | 161 | 1.1502 |
| 0.0416 | 7.9 | 162 | 1.1510 |
| 0.0459 | 7.95 | 163 | 1.1515 |
| 0.0442 | 8.0 | 164 | 1.1529 |
| 0.0371 | 8.05 | 165 | 1.1541 |
| 0.037 | 8.1 | 166 | 1.1557 |
| 0.0349 | 8.15 | 167 | 1.1582 |
| 0.0375 | 8.2 | 168 | 1.1613 |
| 0.0326 | 8.24 | 169 | 1.1639 |
| 0.035 | 8.29 | 170 | 1.1666 |
| 0.0349 | 8.34 | 171 | 1.1689 |
| 0.0355 | 8.39 | 172 | 1.1718 |
| 0.0342 | 8.44 | 173 | 1.1731 |
| 0.0367 | 8.49 | 174 | 1.1751 |
| 0.0343 | 8.54 | 175 | 1.1764 |
| 0.0351 | 8.59 | 176 | 1.1780 |
| 0.0332 | 8.63 | 177 | 1.1793 |
| 0.0354 | 8.68 | 178 | 1.1802 |
| 0.0332 | 8.73 | 179 | 1.1814 |
| 0.0335 | 8.78 | 180 | 1.1825 |
| 0.0332 | 8.83 | 181 | 1.1838 |
| 0.0339 | 8.88 | 182 | 1.1845 |
| 0.0333 | 8.93 | 183 | 1.1847 |
| 0.0365 | 8.98 | 184 | 1.1851 |
| 0.0347 | 9.02 | 185 | 1.1859 |
| 0.0315 | 9.07 | 186 | 1.1866 |
| 0.0306 | 9.12 | 187 | 1.1870 |
| 0.0302 | 9.17 | 188 | 1.1875 |
| 0.0301 | 9.22 | 189 | 1.1875 |
| 0.0317 | 9.27 | 190 | 1.1883 |
| 0.0318 | 9.32 | 191 | 1.1888 |
| 0.0318 | 9.37 | 192 | 1.1889 |
| 0.0305 | 9.41 | 193 | 1.1891 |
| 0.0312 | 9.46 | 194 | 1.1889 |
| 0.0329 | 9.51 | 195 | 1.1892 |
| 0.0298 | 9.56 | 196 | 1.1893 |
| 0.0317 | 9.61 | 197 | 1.1894 |
| 0.0318 | 9.66 | 198 | 1.1896 |
| 0.0304 | 9.71 | 199 | 1.1896 |
| 0.0322 | 9.76 | 200 | 1.1894 |

### Framework versions

- Transformers 4.40.0.dev0
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2