Update README.md
README.md (changed)
@@ -23,9 +23,9 @@ The final distribution of documents by topic is shown in the chart below:
 ## Model details
 
 The models were trained for one epoch on sequences of 4096 tokens. During training, we used many modern optimizations such as:
-- [torch.compile](pytorch.org/docs/stable/generated/torch.compile.html)
+- [torch.compile](https://pytorch.org/docs/stable/generated/torch.compile.html)
 - [adamw_apex_fused](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#optimizer-choice) optimizer
-- [Flash Attention 2](github.com/Dao-AILab/flash-attention)
+- [Flash Attention 2](https://github.com/Dao-AILab/flash-attention)
 - [Mixed precision](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#bf16) (`--bf16` and `--tf32` options)
 - [Gradient accumulation](https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one#gradient-accumulation)
 - [Fully Sharded Data Parallel (FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html) with the SHARD_GRAD_OP mode
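For context, here is a minimal, hypothetical sketch of how the options listed in this section typically fit together in a Hugging Face `Trainer` setup. The checkpoint name, output directory, and gradient-accumulation value are placeholders, not the repository's actual training configuration; Flash Attention 2 is selected when the model is loaded, while the remaining options are plain `TrainingArguments` fields.

```python
# Illustrative sketch only, not the authors' training script.
# Checkpoint name and numeric values below are placeholders.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# Flash Attention 2 is enabled at model-load time.
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",              # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

args = TrainingArguments(
    output_dir="checkpoints",           # placeholder
    num_train_epochs=1,                 # one epoch, as stated above
    bf16=True,                          # --bf16 mixed precision
    tf32=True,                          # --tf32 TensorFloat-32 matmuls
    optim="adamw_apex_fused",           # fused AdamW from NVIDIA Apex
    torch_compile=True,                 # compile the model with torch.compile
    gradient_accumulation_steps=8,      # placeholder value
    fsdp="shard_grad_op",               # FSDP in SHARD_GRAD_OP mode
)
```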