Tokenizer Adaptation
A collection of research on adapting tokenizers to specific domains and/or languages, with a special focus on sequence compression.
Paper • 2402.09977 • Published
Note: This work proposes a simple Vocabulary Transfer technique for adapting a pre-trained LLM to a new domain-specific tokenizer. A dedicated vocabulary is more efficient in the domain: words are split less often, so text sequences become shorter. To swap in the new tokenizer, the embedding matrix of the pre-trained LLM is re-initialized: embeddings of tokens that already exist are preserved, while each new token is split with the old tokenizer and the embeddings of its pieces are averaged.
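The initialization described in the note can be sketched in a few lines. The following is a minimal sketch only, assuming a BERT-style model and Hugging Face transformers; the model name, the domain-tokenizer path, and the simplified handling of subword prefixes are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the vocabulary-transfer initialization described above.
# Assumptions (not from the paper): a BERT-style model, Hugging Face `transformers`,
# and a domain tokenizer stored at a placeholder path.
import torch
from transformers import AutoModel, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("bert-base-uncased")         # original tokenizer
new_tok = AutoTokenizer.from_pretrained("path/to/domain-tokenizer")  # domain tokenizer (placeholder)
model = AutoModel.from_pretrained("bert-base-uncased")

old_vocab = old_tok.get_vocab()                     # token -> id for the old vocabulary
old_emb = model.get_input_embeddings().weight.data  # shape (|V_old|, d)
new_emb = torch.empty(len(new_tok), old_emb.size(1))

for token, new_id in new_tok.get_vocab().items():
    if token in old_vocab:
        # Token already exists: keep its pre-trained embedding.
        new_emb[new_id] = old_emb[old_vocab[token]]
    else:
        # New token: split it with the old tokenizer and average its pieces' embeddings.
        # (Subword-prefix handling, e.g. "##", is simplified here.)
        pieces = old_tok.tokenize(token.replace("##", "")) or [old_tok.unk_token]
        piece_ids = old_tok.convert_tokens_to_ids(pieces)
        new_emb[new_id] = old_emb[piece_ids].mean(dim=0)

# Install the new embedding matrix, then continue pre-training / fine-tune on the domain.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```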
Multi-Word Tokenization for Sequence Compression
Paper • 2402.09949 • Published
Note: This work explores sequence compression via a Multi-Word Tokenizer (MWT) that goes beyond word boundaries, representing frequent word n-grams as single tokens. N-gram token embeddings are initialized with FVT (Fast Vocabulary Transfer). MWTs produce a more compact and efficient tokenization, yielding: (1) faster inference, since the sequence length can be reduced with negligible drops in performance; (2) higher performance under a fixed sequence-length budget.
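As a rough illustration of the MWT plus FVT recipe (frequent n-grams added as single tokens, initialized by averaging the embeddings of their parts), here is a hypothetical sketch. The corpus, model name, n-gram order, and the choice of 1000 bigrams are assumptions, and it relies on Hugging Face added tokens accepting multi-word strings.

```python
# Rough sketch of the MWT idea: add frequent word bigrams as single tokens and
# initialize their embeddings with the FVT-style average of their pieces.
# The corpus, model name and the 1000-bigram budget are illustrative assumptions.
from collections import Counter

from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

corpus = [
    "patients with chronic kidney disease were treated",  # placeholder domain text
    "chronic kidney disease is associated with hypertension",
]

# 1. Count word bigrams in the domain corpus and keep the most frequent ones.
counts = Counter()
for line in corpus:
    words = line.lower().split()
    counts.update(zip(words, words[1:]))
top_bigrams = [" ".join(bg) for bg, _ in counts.most_common(1000)]

# 2. Record how the *original* tokenizer splits each bigram (needed for initialization),
#    then register the bigrams as single added tokens.
piece_ids = {bg: tok.encode(bg, add_special_tokens=False) for bg in top_bigrams}
tok.add_tokens(top_bigrams)  # multi-word strings are matched as single tokens
model.resize_token_embeddings(len(tok))

# 3. FVT initialization: each n-gram embedding is the mean of its pieces' embeddings.
emb = model.get_input_embeddings().weight.data
for bg in top_bigrams:
    new_id = tok.convert_tokens_to_ids(bg)
    emb[new_id] = emb[piece_ids[bg]].mean(dim=0)
```

With the extended vocabulary, encoding domain text yields shorter input sequences, which is where the inference speed-up under a fixed sequence-length budget comes from.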
Zero-Shot Tokenizer Transfer
Paper • 2405.07883 • Published
Language Model Tokenizers Introduce Unfairness Between Languages
Paper • 2305.15425 • Published