Tokenizer Adaptation
A collection of research on adapting tokenizers to specific domains and/or languages, with a special focus on sequence compression.
Paper • 2402.09977 • Published
Note: This work proposes a simple Vocabulary Transfer technique for adapting a pre-trained LLM to a new domain-specific tokenizer. A dedicated vocabulary is more efficient in the domain: words are split less often, so text sequences become shorter. To swap in the new tokenizer, the embedding matrix of the pre-trained LLM is re-initialized: embeddings of tokens that already exist are preserved, while each new token is split with the old tokenizer and the embeddings of its pieces are averaged.
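The initialization described in the note can be sketched in a few lines. The following is a minimal sketch only, assuming a BERT-style model and Hugging Face transformers; the model name, the domain-tokenizer path, and the simplified handling of subword prefixes are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the vocabulary-transfer initialization described above.
# Assumptions (not from the paper): a BERT-style model, Hugging Face `transformers`,
# and a domain tokenizer stored at a placeholder path.
import torch
from transformers import AutoModel, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("bert-base-uncased")         # original tokenizer
new_tok = AutoTokenizer.from_pretrained("path/to/domain-tokenizer")  # domain tokenizer (placeholder)
model = AutoModel.from_pretrained("bert-base-uncased")

old_vocab = old_tok.get_vocab()                     # token -> id for the old vocabulary
old_emb = model.get_input_embeddings().weight.data  # shape (|V_old|, d)
new_emb = torch.empty(len(new_tok), old_emb.size(1))

for token, new_id in new_tok.get_vocab().items():
    if token in old_vocab:
        # Token already exists: keep its pre-trained embedding.
        new_emb[new_id] = old_emb[old_vocab[token]]
    else:
        # New token: split it with the old tokenizer and average its pieces' embeddings.
        # (Subword-prefix handling, e.g. "##", is simplified here.)
        pieces = old_tok.tokenize(token.replace("##", "")) or [old_tok.unk_token]
        piece_ids = old_tok.convert_tokens_to_ids(pieces)
        new_emb[new_id] = old_emb[piece_ids].mean(dim=0)

# Install the new embedding matrix, then continue pre-training / fine-tune on the domain.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```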
Multi-Word Tokenization for Sequence Compression
Paper • 2402.09949 • Published
Note: This work explores sequence compression via a Multi-Word Tokenizer (MWT) that goes beyond word boundaries, representing frequent word n-grams as single tokens. N-gram token embeddings are initialized with FVT (Fast Vocabulary Transfer). MWTs produce a more compact and efficient tokenization, yielding: (1) faster inference, since the sequence length can be reduced with negligible drops in performance; (2) higher performance under a fixed sequence-length budget.
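As a rough illustration of the MWT plus FVT recipe (frequent n-grams added as single tokens, initialized by averaging the embeddings of their parts), here is a hypothetical sketch. The corpus, model name, n-gram order, and the choice of 1000 bigrams are assumptions, and it relies on Hugging Face added tokens accepting multi-word strings.

```python
# Rough sketch of the MWT idea: add frequent word bigrams as single tokens and
# initialize their embeddings with the FVT-style average of their pieces.
# The corpus, model name and the 1000-bigram budget are illustrative assumptions.
from collections import Counter

from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

corpus = [
    "patients with chronic kidney disease were treated",  # placeholder domain text
    "chronic kidney disease is associated with hypertension",
]

# 1. Count word bigrams in the domain corpus and keep the most frequent ones.
counts = Counter()
for line in corpus:
    words = line.lower().split()
    counts.update(zip(words, words[1:]))
top_bigrams = [" ".join(bg) for bg, _ in counts.most_common(1000)]

# 2. Record how the *original* tokenizer splits each bigram (needed for initialization),
#    then register the bigrams as single added tokens.
piece_ids = {bg: tok.encode(bg, add_special_tokens=False) for bg in top_bigrams}
tok.add_tokens(top_bigrams)  # multi-word strings are matched as single tokens
model.resize_token_embeddings(len(tok))

# 3. FVT initialization: each n-gram embedding is the mean of its pieces' embeddings.
emb = model.get_input_embeddings().weight.data
for bg in top_bigrams:
    new_id = tok.convert_tokens_to_ids(bg)
    emb[new_id] = emb[piece_ids[bg]].mean(dim=0)
```

With the extended vocabulary, encoding domain text yields shorter input sequences, which is where the inference speed-up under a fixed sequence-length budget comes from.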
Zero-Shot Tokenizer Transfer
Paper • 2405.07883 • Published
Language Model Tokenizers Introduce Unfairness Between Languages
Paper • 2305.15425 • Published