---
license: apache-2.0
base_model: eryk-mazus/tinyllama-with-custom-tokenizer
datasets:
- allenai/MADLAD-400
- eryk-mazus/polka-pretrain-en-pl-v1
language:
- pl
- en
pipeline_tag: text-generation
widget:
- text: "Wiedźmin 3 to fabularna gra akcji wyprodukowana"
  output:
    text: "..."
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/61bf0e11c88f3fd22f654059/EMSrPEzAFkjY9nvbaJoC3.png)
# Polka-1.1b
`polka-1.1b` builds on the [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) model by continuing pretraining on an additional **5.7 billion Polish tokens**, sourced primarily from the [MADLAD-400](https://arxiv.org/abs/2309.04662) dataset. Tokens were sampled from Polish and English shards in a 10:1 ratio using [DSIR](https://github.com/p-lambda/dsir). Polka also extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, which makes it more efficient at generating Polish text.
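
One rough way to check the tokenizer-efficiency claim is to measure how many tokens a tokenizer produces per whitespace-separated word: fewer tokens per word means shorter sequences for the same text. The sketch below is illustrative only. `tokens_per_word` and `toy_tokenize` are hypothetical helpers, not part of this repository, and the toy tokenizer merely stands in for a real one so the snippet runs offline.

```python
def tokens_per_word(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    return len(tokenize(text)) / len(words)


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: cuts every word into 3-character pieces."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample = "Wiedźmin 3 to fabularna gra akcji wyprodukowana"
print(f"toy tokenizer: {tokens_per_word(toy_tokenize, sample):.2f} tokens/word")

# With the real tokenizers (downloads from the Hub, so left commented out):
# from transformers import AutoTokenizer
# base = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
# polka = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")
# print(tokens_per_word(base.tokenize, sample), tokens_per_word(polka.tokenize, sample))
```

Passing `base.tokenize` and `polka.tokenize` to the same function gives a like-for-like comparison on any Polish text. A single sentence is a weak sample, so a larger held-out corpus would give a more reliable estimate.
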
Training took 425 GPU hours on a single machine with 8x RTX 4090 GPUs, using DeepSpeed ZeRO-2.
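
For a sense of scale, the figures above can be turned into a rough wall-clock time and per-GPU throughput. This is a back-of-the-envelope sketch that assumes all 8 GPUs were busy for the whole run and that the 5.7 billion tokens were each processed once. Neither assumption is stated in this card, and `training_estimates` is a hypothetical helper.

```python
GPU_HOURS = 425   # total GPU hours, from the model card
NUM_GPUS = 8      # 8x RTX 4090, from the model card
TOKENS = 5.7e9    # additional Polish tokens, from the model card


def training_estimates(gpu_hours, num_gpus, tokens):
    """Return (wall-clock hours, tokens per second per GPU).

    Assumes every GPU is used for the entire run and each token is seen once.
    """
    wall_clock_hours = gpu_hours / num_gpus
    tokens_per_gpu_second = tokens / (gpu_hours * 3600)
    return wall_clock_hours, tokens_per_gpu_second


wall_clock, throughput = training_estimates(GPU_HOURS, NUM_GPUS, TOKENS)
print(f"wall-clock: {wall_clock:.1f} h, throughput: {throughput:.0f} tokens/s per GPU")
```

Under these assumptions the run corresponds to a little over two days of wall-clock time, at roughly 3,700 tokens per second per GPU.
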
## Notes
...
## Sample code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "eryk-mazus/polka-1.1b"

# Left padding keeps prompts aligned for decoder-only generation.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token

# 8-bit loading requires the bitsandbytes package and a CUDA GPU.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = """..."""

# Move inputs to whichever device device_map placed the model on.
model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, penalty_alpha=0.6, top_k=5)

output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```