SivilTaram committed

Commit 5ed15fa · Parent(s): a6fce4e

Update README.md
README.md CHANGED

@@ -17,8 +17,7 @@ This is a collection of the language models trained using Pile-CC, each with app
 ## Key Features
 
 - **Model Size**: 5 separate models trained with different seeds, each with ~1B parameters
-- **Training Data**:
-- **Purpose**: The Human selection is a strong baseline for our method RegMix
+- **Training Data**: The pile-cc only data mixture on the [RegMix-Data](https://huggingface.co/datasets/sail/regmix-data) dataset
 
 ## Dataset
 
@@ -42,8 +41,8 @@ You can load any model using the corresponding branch with the Hugging Face Tran
 ```python
 from transformers import AutoModel, AutoTokenizer
 
-model = AutoModel.from_pretrained("sail/data-mixture-
-tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-
+model = AutoModel.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
+tokenizer = AutoTokenizer.from_pretrained("sail/data-mixture-pile-cc-1b", revision="seed-1")
 ```
 
 ## Data Mixture
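
For quick reference, below is a minimal sketch of the loading pattern the updated README shows, extended to all five seed variants. It assumes the revision branches are named `seed-1` through `seed-5`; only `seed-1` is confirmed in the diff above.

```python
# Minimal sketch: load the Pile-CC-only ~1B models from their seed branches.
# Assumption: branches are named "seed-1" ... "seed-5"; only "seed-1" is
# shown in the README diff above.
from transformers import AutoModel, AutoTokenizer

repo = "sail/data-mixture-pile-cc-1b"

models = {}
for seed in range(1, 6):
    revision = f"seed-{seed}"
    tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
    model = AutoModel.from_pretrained(repo, revision=revision)
    models[revision] = (tokenizer, model)
```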