Model Card for Custom WordPiece Tokenizer with Normalized Frequency Scoring

Model Details

Model Description

This project introduces a custom WordPiece tokenizer for natural language processing tasks. The tokenizer was built from scratch: it implements the WordPiece algorithm and adds a normalized frequency scoring system intended to improve token selection and vocabulary generation.

Developed by: Anton Krasniuk, antoshka1608
Model type: Custom WordPiece Tokenizer
Language(s): English
Finetuned from model: N/A

Uses

Direct Use

This tokenizer is intended for use in:

NLP tasks requiring tokenization of text for language models.
Projects that require precise token representation with normalized scoring.
Preprocessing pipelines for large-scale language model training.

Downstream Use

This tokenizer is ideal for:

Custom-trained models where vocabulary optimization and token frequency distribution are critical.
Fine-tuning tasks where better tokenization enhances training efficiency and model performance.

Out-of-Scope Use

Tokenization of non-English or low-resource languages (requires additional training).
Domains where subword-level tokenization might not be suitable, such as phoneme-based tasks.

Bias, Risks, and Limitations

Recommendations

Use this tokenizer with datasets where the token frequency distribution is meaningful. Sparse data or heavily skewed distributions can distort the normalization step, so check the frequency profile of the corpus before training or reusing the vocabulary.

How to Get Started with the Tokenizer

Here’s how to initialize and use the tokenizer:

from custom_wordpiece_tokenizer import BaseTokenizer

# Load pretrained tokenizer
tokenizer = BaseTokenizer.from_pretrained("antoshka1608/wordpiece-tokenizer-v1")

# Tokenize your text
tokens = tokenizer.tokenize("Sample input text")
print(tokens)

Training Details

Training Procedure

The tokenizer was trained using the following approach:

Implementation of the WordPiece algorithm to build the vocabulary.
Introduction of a normalized frequency scoring system for token selection, balancing token frequency and subword importance.
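The two steps above can be sketched as a small training loop. This is a minimal illustration and not the author's implementation: the function name `train_normalized_wordpiece`, the `##` continuation prefix, the character-level initial vocabulary, and the choice of `max_freq` as the largest single-token count in the current step are all assumptions made for the example.

```python
from collections import Counter

def train_normalized_wordpiece(word_counts, vocab_size):
    """Toy WordPiece-style trainer using the normalized merge score (hypothetical helper)."""
    # Start from characters; non-initial pieces carry the "##" prefix (assumed convention).
    splits = {
        word: [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]
        for word in word_counts
    }
    vocab = sorted({piece for pieces in splits.values() for piece in pieces})

    while len(vocab) < vocab_size:
        # Recount token and adjacent-pair frequencies for the current segmentation.
        token_freq, pair_freq = Counter(), Counter()
        for word, pieces in splits.items():
            count = word_counts[word]
            for piece in pieces:
                token_freq[piece] += count
            for a, b in zip(pieces, pieces[1:]):
                pair_freq[(a, b)] += count
        if not pair_freq:
            break  # every word is already a single token

        # Assumption: max_freq is the highest token count at this step.
        max_freq = max(token_freq.values())

        def score(pair):
            a, b = pair
            return pair_freq[pair] / (
                (token_freq[a] / max_freq) * (token_freq[b] / max_freq)
            )

        # sorted() makes tie-breaking deterministic.
        a, b = max(sorted(pair_freq), key=score)
        merged = a + b[2:]  # b is never word-initial, so it always starts with "##"
        if merged not in vocab:
            vocab.append(merged)

        # Apply the chosen merge everywhere it occurs.
        for word in list(splits):
            pieces, out, i = splits[word], [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and pieces[i] == a and pieces[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[word] = out
    return vocab

# Toy corpus, invented for illustration only.
toy_counts = {"hug": 10, "pug": 5, "hugs": 5}
small_vocab = train_normalized_wordpiece(toy_counts, vocab_size=6)
full_vocab = train_normalized_wordpiece(toy_counts, vocab_size=100)
```

The loop stops either when the target size is reached or when no adjacent pairs remain, so a very large `vocab_size` on a tiny corpus simply yields every word as its own token.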

Training Hyperparameters

Vocabulary Size: 10,000
Normalization Frequency Update: Scaled to balance token frequency and rarity.
Scoring Metrics: Weighted frequency scores with normalization.

Evaluation

Metrics

The tokenizer was evaluated using:

Tokenization accuracy: Measuring the alignment between the tokenized output and the expected vocabulary coverage.
Compression ratio: Evaluating the efficiency of text compression for language model inputs.
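The card does not define how the compression ratio is computed. One common reading is the average number of characters per token, which the sketch below uses; that definition and the `compression_ratio` helper are assumptions made for illustration, and any callable that maps a string to a list of tokens can be passed in.

```python
def compression_ratio(texts, tokenize):
    """Characters per token over a corpus (assumed definition; higher means fewer tokens)."""
    total_chars = sum(len(text) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens

# Whitespace splitting stands in for a real tokenizer here.
sample_texts = ["hello world", "foo"]
ratio = compression_ratio(sample_texts, str.split)
```

Comparing two tokenizers on the same held-out texts with the same function keeps the comparison consistent, whichever definition is chosen.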

Results

Achieved a higher compression ratio than standard WordPiece implementations.
Improved tokenization alignment for datasets with highly imbalanced token frequencies.

Model Examination

Normalization Formula

$\text{Score}(\text{merge}) = \frac{\text{frequency}(\text{merge})}{\frac{\text{frequency}(\text{token\_A})}{\text{max\_frequency}} \cdot \frac{\text{frequency}(\text{token\_B})}{\text{max\_frequency}}}$

This adjustment ensures rare tokens are not overly penalized while maintaining proportional weight for high-frequency tokens.
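As a worked illustration of the formula, the sketch below evaluates the score for two candidate merges. The counts are invented, the helper name `normalized_merge_score` is hypothetical, and `max_freq` is assumed to be the largest single-token frequency, since the card does not define it.

```python
from collections import Counter

def normalized_merge_score(pair_freq, freq_a, freq_b, max_freq):
    """freq(merge) / ((freq(A) / max_freq) * (freq(B) / max_freq))."""
    if min(freq_a, freq_b, max_freq) <= 0:
        raise ValueError("frequencies must be positive")
    return pair_freq / ((freq_a / max_freq) * (freq_b / max_freq))

# Toy counts, invented purely for illustration.
token_freq = Counter({"th": 50, "e": 100, "q": 2, "u": 10})
pair_freq = Counter({("th", "e"): 40, ("q", "u"): 2})
max_freq = max(token_freq.values())  # assumption: global maximum token count

scores = {
    (a, b): normalized_merge_score(count, token_freq[a], token_freq[b], max_freq)
    for (a, b), count in pair_freq.items()
}
best_pair = max(scores, key=scores.get)
```

With these numbers the pair built from the rare tokens `q` and `u` scores far higher than the frequent pair `th` + `e`, even though it occurs only twice. Under the assumption that `max_freq` is the same constant for every candidate in a step, the score equals the usual WordPiece ratio frequency(merge) / (frequency(token_A) · frequency(token_B)) multiplied by `max_freq` squared, so the ranking within that step is unchanged; how `max_freq` is chosen or updated across steps is not specified in this card.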

Technical Specifications

Model Architecture and Objective

Implements WordPiece tokenization with an added frequency normalization step.
Supports special tokens for various NLP tasks, including <s>, </s>, <pad>, <mask>.
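For readers unfamiliar with how a WordPiece vocabulary is applied at inference time, the sketch below shows the greedy longest-match-first lookup commonly used with WordPiece, wrapped in the `<s>` and `</s>` tokens listed above. The card does not state that this tokenizer encodes text this way; the `##` prefix, the lowercasing and whitespace pre-split, and the `<unk>` fallback token are assumptions for the example.

```python
def wordpiece_encode(word, vocab, unk="<unk>"):
    """Greedy longest-match-first segmentation of one word (assumed strategy)."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            # Non-initial pieces are looked up with the "##" continuation prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no prefix matched: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab):
    """Wrap the per-word pieces in the <s> ... </s> special tokens."""
    tokens = ["<s>"]
    for word in text.lower().split():
        tokens.extend(wordpiece_encode(word, vocab))
    tokens.append("</s>")
    return tokens

# Tiny hand-written vocabulary, for illustration only.
toy_vocab = {"hug", "##s", "p", "##ug"}
encoded = encode("Hugs pug", toy_vocab)
unknown = encode("xyz", toy_vocab)
```

Here `encoded` is `["<s>", "hug", "##s", "p", "##ug", "</s>"]`, and a word with no matching prefix collapses to the single fallback token.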

Model Card Authors

Author: Anton Krasniuk
Contact: [email protected]
