# Improved LLaMA 2 Tokenizer with Persian Language Support
## Model Description
This tokenizer is an improved version of the LLaMA 2 tokenizer, enhanced to better support the Persian language. It merges the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, yielding improved tokenization of Persian text while preserving support for other languages.
## Key Features
- Enhanced support for Persian language tokenization
- Maintained multilingual capabilities of the original LLaMA 2 tokenizer
- Improved handling of Persian-specific characters and word structures
- Larger vocabulary size to accommodate Persian tokens
## Training Data
The tokenizer was created using the following steps:
- A separate BPE tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns.
- This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
## Training Procedure
### Persian Wikipedia Tokenizer Training
- Corpus: Persian Wikipedia dump
- Tokenization algorithm: BPE
- Vocabulary size: 5000
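The training step above can be sketched with the Hugging Face `tokenizers` library. This is a minimal illustration, not the exact training script: the corpus below is a tiny in-memory stand-in for the full Persian Wikipedia dump, and the vocabulary target is reduced accordingly.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy stand-in corpus; the real run used the Persian Wikipedia dump
# with a 5000-entry vocabulary target.
corpus = [
    "زبان فارسی یکی از زبان‌های هندواروپایی است.",
    "ویکی‌پدیای فارسی یک دانشنامه آزاد است.",
    "این یک مثال به زبان فارسی است.",
]

# BPE model with whitespace pre-tokenization.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["<unk>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("زبان فارسی")
print(encoding.tokens)
```

On the full corpus, the resulting merges capture frequent Persian subwords that the base LLaMA 2 vocabulary would otherwise split into many byte-level pieces.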
### Merging with the LLaMA 2 Tokenizer
- Base tokenizer: LLaMA 2 tokenizer
- Final vocabulary size: 36954
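The merge procedure is not spelled out here; a common approach, sketched below on toy vocabularies, keeps the base vocabulary's IDs stable and appends only the Persian tokens that are not already present. This is consistent with the stated final size: LLaMA 2's 32000 base tokens plus roughly 4954 novel Persian tokens gives 36954, implying a small overlap between the two vocabularies.

```python
def merge_vocabs(base_vocab, new_vocab):
    """Append tokens from new_vocab that are absent from base_vocab.

    Base token IDs are preserved; novel tokens receive fresh IDs
    starting right after the base vocabulary. (Illustrative helper,
    not the exact script used to build this tokenizer.)
    """
    merged = dict(base_vocab)
    next_id = max(base_vocab.values()) + 1
    for token in new_vocab:  # dict iteration preserves insertion order
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged

# Toy stand-ins: the real merge combines LLaMA 2's 32000-token
# vocabulary with the 5000-entry Persian tokenizer.
base = {"<s>": 0, "</s>": 1, "hello": 2}
persian = {"سلام": 0, "hello": 1, "فارسی": 2}

merged = merge_vocabs(base, persian)
print(len(merged))  # → 5: three base tokens plus two novel Persian tokens
```

Keeping base IDs stable matters in practice: it lets the merged tokenizer stay compatible with embedding matrices resized from the original model rather than retrained from scratch.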
## Usage
To use this tokenizer with the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amirakhlaghiqqq/llama2-persian-tokenizer")

# Example usage ("This is an example in the Persian language.")
text = "این یک مثال به زبان فارسی است."
tokens = tokenizer(text)
print(tokens)
```