---
license: apache-2.0
base_model: eryk-mazus/tinyllama-with-custom-tokenizer
datasets:
- allenai/MADLAD-400
- eryk-mazus/polka-pretrain-en-pl-v1
language:
- pl
- en
pipeline_tag: text-generation
widget:
- text: Wiedźmin 3 to fabularna gra akcji wyprodukowana
  output:
    text: ...
---
# Polka-1.1b
`polka-1.1b` takes the TinyLlama-1.1B model and enhances it by continuing pretraining on an additional 5.7 billion Polish tokens, primarily sourced from the MADLAD-400 dataset. The tokens were sampled in a 10:1 ratio between Polish and English shards using DSIR. Furthermore, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.
Training took 425 GPU hours on a single machine with 8x RTX 4090 GPUs, using DeepSpeed ZeRO-2.
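
The effect of the extended vocabulary can be checked by comparing token counts on Polish text between the original TinyLlama tokenizer and the `polka-1.1b` tokenizer. A minimal sketch; the TinyLlama checkpoint id is an assumption, and the example sentence is taken from the widget above:

```python
# Minimal sketch: compare Polish tokenization efficiency.
# The TinyLlama checkpoint id below is an assumption, not part of this card.
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T")
polka_tok = AutoTokenizer.from_pretrained("eryk-mazus/polka-1.1b")

text = "Wiedźmin 3 to fabularna gra akcji wyprodukowana"  # "The Witcher 3 is an action RPG developed..."

print(len(base_tok(text)["input_ids"]))   # more tokens: Polish words split into many sub-word pieces
print(len(polka_tok(text)["input_ids"]))  # fewer tokens with the extended 43,882-token vocabulary
```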
## Notes
...
## Sample code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "eryk-mazus/polka-1.1b"
# Left padding and an explicit pad token so batched generation works with a causal LM.
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
# 8-bit loading requires the bitsandbytes package; drop load_in_8bit=True for full precision.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True)

prompt = """..."""

model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
# Note: with do_sample=True, penalty_alpha (contrastive search) is ignored during sampling.
generated_ids = model.generate(**model_inputs, max_new_tokens=512, do_sample=True, penalty_alpha=0.6, top_k=5)
output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output)
```
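
Because the tokenizer is configured with left padding and a pad token, the same setup also supports batched generation. A minimal sketch; the Polish prompts below are illustrative placeholders, not from this card:

```python
# Minimal sketch of batched generation; prompts are illustrative placeholders.
prompts = [
    "Stolica Polski to",               # "The capital of Poland is"
    "Największą rzeką w Polsce jest",  # "The largest river in Poland is"
]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

generated_ids = model.generate(**batch, max_new_tokens=64, do_sample=True, top_k=5)
for text in tokenizer.batch_decode(generated_ids, skip_special_tokens=True):
    print(text)
```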