Improved LLaMA 2 Tokenizer with Persian Language Support

Model Description

This tokenizer is an improved version of the LLaMA 2 tokenizer, specifically enhanced to provide better support for the Persian language. It combines the original LLaMA 2 tokenizer with a custom tokenizer trained on the Persian Wikipedia corpus, resulting in improved tokenization for Persian text while maintaining support for other languages.

Key Features

  • Enhanced support for Persian language tokenization
  • Maintained multilingual capabilities of the original LLaMA 2 tokenizer
  • Improved handling of Persian-specific characters and word structures
  • Larger vocabulary size to accommodate Persian tokens

Training Data

The tokenizer was created using the following steps:

  1. A separate tokenizer with 5000 merges was trained on the Persian Wikipedia corpus to capture Persian-specific tokenization patterns.
  2. This Persian-specific tokenizer was then merged with the original LLaMA 2 tokenizer.
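Step 1 can be sketched with the Hugging Face `tokenizers` library. This is a minimal, hedged illustration — the tiny in-memory corpus stands in for the actual Persian Wikipedia dump, and the exact training settings used for this model are not documented beyond the vocabulary size:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny in-memory stand-in for the Persian Wikipedia corpus (illustrative only).
corpus = [
    "این یک مثال به زبان فارسی است.",
    "زبان فارسی یکی از زبان‌های هندواروپایی است.",
    "ویکی‌پدیا یک دانشنامه آزاد است.",
]

# Train a small BPE tokenizer, as in step 1 above.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["<unk>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Inspect how the freshly trained tokenizer splits Persian text.
encoding = tokenizer.encode("زبان فارسی")
print(encoding.tokens)
```

On a corpus this small the trainer stops well short of 5000 tokens; on the full Wikipedia dump the vocabulary-size cap is what limits training.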

Training Procedure

  1. Persian Wikipedia Tokenizer Training:

    • Corpus: Persian Wikipedia dump
    • Tokenization algorithm: BPE
    • Vocabulary size: 5000
  2. Merging with LLaMA 2 Tokenizer:

    • Base tokenizer: LLaMA 2 tokenizer
    • Final vocabulary size: 36954 (the original 32000 LLaMA 2 tokens plus the newly added Persian tokens)
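The merge step boils down to appending the Persian pieces that the base vocabulary does not already contain, so existing LLaMA 2 token IDs stay stable. A hypothetical, library-agnostic sketch of that logic (the actual merge operates on the SentencePiece model files, and the function and token lists here are illustrative):

```python
def merge_vocabs(base_vocab, new_pieces):
    """Append pieces from new_pieces that base_vocab lacks.

    base_vocab: list of base tokens, where the index is the token ID.
    new_pieces: tokens from the Persian tokenizer, in merge order.
    Base IDs are preserved; new tokens get the next free IDs.
    """
    merged = list(base_vocab)
    seen = set(base_vocab)
    for piece in new_pieces:
        if piece not in seen:
            merged.append(piece)
            seen.add(piece)
    return merged

base = ["<s>", "</s>", "the", "ing"]   # stand-in for LLaMA 2 pieces
persian = ["زبان", "فارسی", "the"]     # "the" already exists, so it is skipped
print(merge_vocabs(base, persian))
```

Deduplication explains why the final vocabulary (36954) grows by slightly less than the full 5000 Persian tokens: pieces already present in the LLaMA 2 vocabulary are not added twice.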

Usage

To use this tokenizer with the Hugging Face Transformers library:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("amirakhlaghiqqq/llama2-persian-tokenizer")

# Example usage
text = "این یک مثال به زبان فارسی است."  # "This is an example in Persian."
tokens = tokenizer(text)
print(tokens)
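A simple way to quantify the improvement is tokens per word on Persian text: a tokenizer with dedicated Persian tokens needs fewer pieces per word than one that falls back to byte-level splits. A small helper for that comparison (the token counts below are illustrative, not measured):

```python
def tokens_per_word(tokens, text):
    """Average number of tokens per whitespace-delimited word.
    Lower values mean more efficient tokenization for that language."""
    words = text.split()
    return len(tokens) / max(len(words), 1)

# Illustrative comparison: suppose a base tokenizer splits the sample
# into 14 pieces while the merged tokenizer needs only 8.
text = "این یک مثال به زبان فارسی است."
print(tokens_per_word(["tok"] * 14, text))  # hypothetical base tokenizer
print(tokens_per_word(["tok"] * 8, text))   # hypothetical merged tokenizer
```

In practice, pass `tokenizer(text)["input_ids"]` as the `tokens` argument to measure a real tokenizer.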