# Improved LLaMA 2 Tokenizer with Persian Language Support
## Model Description
This tokenizer is an improved version of the LLaMA 2 tokenizer, enhanced to better support the Persian language. It merges the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, yielding improved tokenization of Persian text while preserving support for other languages.
## Key Features
- Enhanced support for Persian language tokenization
- Maintained multilingual capabilities of the original LLaMA 2 tokenizer
- Improved handling of Persian-specific characters and word structures
- Larger vocabulary size to accommodate Persian tokens
## Training Data
The tokenizer was created using the following steps:
- A separate BPE tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns.
- This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
## Training Procedure
### Persian Wikipedia Tokenizer Training
- Corpus: Persian Wikipedia dump
- Tokenization algorithm: BPE
- Vocabulary size: 5000
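The training step above can be sketched with the Hugging Face `tokenizers` library. This is a minimal illustration, not the exact training script: the corpus below is a tiny in-memory stand-in for the full Persian Wikipedia dump, and the vocabulary target is reduced accordingly.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy stand-in corpus; the real run used the Persian Wikipedia dump
# with a 5000-entry vocabulary target.
corpus = [
    "زبان فارسی یکی از زبان‌های هندواروپایی است.",
    "ویکی‌پدیای فارسی یک دانشنامه آزاد است.",
    "این یک مثال به زبان فارسی است.",
]

# BPE model with whitespace pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["<unk>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("زبان فارسی")
print(encoding.tokens)
```

On the full corpus, the resulting merges capture frequent Persian subwords that the base LLaMA 2 vocabulary would otherwise split into many byte-level pieces.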
### Merging with the LLaMA 2 Tokenizer
- Base tokenizer: LLaMA 2 tokenizer
- Final vocabulary size: 36954
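The merge procedure is not spelled out here; a common approach, sketched below on toy vocabularies, keeps the base vocabulary's IDs stable and appends only the Persian tokens that are not already present. This is consistent with the stated final size: LLaMA 2's 32000 base tokens plus roughly 4954 novel Persian tokens gives 36954, implying a small overlap between the two vocabularies.

```python
def merge_vocabs(base_vocab, new_vocab):
    """Append tokens from new_vocab that are absent from base_vocab.

    Base token IDs are preserved; novel tokens receive fresh IDs
    starting right after the base vocabulary. (Illustrative helper,
    not the exact script used to build this tokenizer.)
    """
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1
    for token in new_vocab:  # dict iteration preserves insertion order
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged

# Toy stand-ins: the real merge combines LLaMA 2's 32000-token
# vocabulary with the 5000-entry Persian tokenizer.
base = {"<s>": 0, "</s>": 1, "hello": 2}
persian = {"سلام": 0, "hello": 1, "فارسی": 2}

merged = merge_vocabs(base, persian)
print(len(merged))  # → 5: three base tokens plus two novel Persian tokens
```

Keeping base IDs stable matters in practice: it lets the merged tokenizer stay compatible with embedding matrices resized from the original model rather than retrained from scratch.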
## Usage
To use this tokenizer with the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amirakhlaghiqqq/llama2-persian-tokenizer")

# Example usage ("This is an example in the Persian language.")
text = "این یک مثال به زبان فارسی است."
tokens = tokenizer(text)
print(tokens)
```