stereoplegic's Collections
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Paper
•
2310.05737
•
Published
•
4
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
Paper
•
2308.16692
•
Published
•
1
Towards General Text Embeddings with Multi-stage Contrastive Learning
Paper
•
2308.03281
•
Published
•
1
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings
Paper
•
2305.11554
•
Published
•
2
Diversifying Joint Vision-Language Tokenization Learning
Paper
•
2306.03421
•
Published
•
1
Joint Adaptive Representations for Image-Language Learning
Paper
•
2305.19924
•
Published
•
1
Tokenizer Choice For LLM Training: Negligible or Crucial?
Paper
•
2310.08754
•
Published
•
2
TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models
Paper
•
2311.04589
•
Published
•
18
Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning
Paper
•
2309.08708
•
Published
•
3
Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
Paper
•
2311.04335
•
Published
•
1
From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation
Paper
•
2304.08953
•
Published
•
1
Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT
Paper
•
2306.01393
•
Published
•
1
Tokenization with Factorized Subword Encoding
Paper
•
2306.07764
•
Published
•
1
DeFINE: DEep Factorized INput Token Embeddings for Neural Sequence Modeling
Paper
•
1911.12385
•
Published
•
1
Parameter-Efficient Tuning with Special Token Adaptation
Paper
•
2210.04382
•
Published
•
1
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding
Paper
•
2305.14571
•
Published
•
1
Nomic Embed: Training a Reproducible Long Context Text Embedder
Paper
•
2402.01613
•
Published
•
14
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Paper
•
2310.11628
•
Published
Word-Level Representation From Bytes For Language Modeling
Paper
•
2211.12677
•
Published
Multi-Word Tokenization for Sequence Compression
Paper
•
2402.09949
•
Published
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
Paper
•
2305.17179
•
Published
OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Paper
•
2311.08849
•
Published
•
5
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Paper
•
2103.06874
•
Published
Zero-Shot Tokenizer Transfer
Paper
•
2405.07883
•
Published
•
4
Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models
Paper
•
2403.00417
•
Published
•
1
Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs
Paper
•
2402.14903
•
Published
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Paper
•
2407.08818
•
Published