---
license: mit
language:
- fa
tags:
- persian
- llama
---

I trained Llama2-7B after extending its tokenizer with 21,455 new tokens, on about 15B tokens of Persian (Farsi) text (Common Crawl, social media, and papers).

```
from transformers import LlamaForCausalLM, AutoTokenizer

# Load the base model and the extended tokenizer
model = LlamaForCausalLM.from_pretrained("mostafaamiri/base_7B")
tokenizer = AutoTokenizer.from_pretrained("mostafaamiri/llama2_7B_15Btoken")

# Resize the embedding matrix to match the extended vocabulary,
# then load the trained adapter weights
model.resize_token_embeddings(len(tokenizer))
model.load_adapter("mostafaamiri/llama2_7B_15Btoken")
```
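
After loading, the model can be used like any other causal LM. Here is a minimal generation sketch; the Persian prompt and the sampling settings below are illustrative assumptions, not settings shipped with the model:

```
import torch

# Move the model to GPU if one is available
# (assumption: enough VRAM for a 7B model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

prompt = "پایتخت ایران"  # example Persian prompt: "The capital of Iran"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,   # illustrative cap on generated tokens
        do_sample=True,      # sample instead of greedy decoding
        top_p=0.9,           # illustrative sampling values
        temperature=0.7,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```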