S³: Increasing GPU Utilization during Generative Inference for Higher Throughput Paper • 2306.06000 • Published Jun 9, 2023 • 1
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference Paper • 2405.12532 • Published May 21
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget Paper • 2404.04793 • Published Apr 7
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models Paper • 2405.14366 • Published May 23 • 1
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM Paper • 2403.05527 • Published Mar 8
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation Paper • 2405.05329 • Published May 8
SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models Paper • 2406.05678 • Published Jun 9
Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs Paper • 2406.02376 • Published Jun 4 • 1
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression Paper • 2407.12077 • Published Jul 16 • 54
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads Paper • 2407.15891 • Published Jul 22
Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention Paper • 2408.08454 • Published Aug 15
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads Paper • 2407.17678 • Published Jul 25
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management Paper • 2406.19707 • Published Jun 28
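The papers above share one underlying problem: during autoregressive decoding, the key/value cache grows linearly with sequence length and quickly dominates GPU memory, so each work proposes some way to budget, compress, or evict cache entries. As a minimal, generic sketch of that idea (not the method of any specific paper listed here), the toy example below keeps a few leading "sink" tokens plus a recent window and drops everything else; the class name, head dimension, and eviction policy are illustrative assumptions.

```python
# Generic toy KV cache with a "keep sinks + recent window" eviction policy.
# Single head, NumPy only; all names and parameters are illustrative.
import numpy as np

class ToyKVCache:
    def __init__(self, head_dim: int, n_sink: int = 4, window: int = 64):
        self.head_dim = head_dim
        self.n_sink = n_sink      # always-kept leading tokens
        self.window = window      # most recent tokens kept
        self.keys = np.empty((0, head_dim), dtype=np.float32)
        self.values = np.empty((0, head_dim), dtype=np.float32)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Add the key/value of one new token, then evict if over budget."""
        self.keys = np.vstack([self.keys, k[None, :]])
        self.values = np.vstack([self.values, v[None, :]])
        budget = self.n_sink + self.window
        if len(self.keys) > budget:
            keep = np.r_[0:self.n_sink, len(self.keys) - self.window:len(self.keys)]
            self.keys = self.keys[keep]
            self.values = self.values[keep]

    def attend(self, q: np.ndarray) -> np.ndarray:
        """Single-query softmax attention over whatever remains in the cache."""
        scores = self.keys @ q / np.sqrt(self.head_dim)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values

# Usage: simulate 200 decoding steps with random projections.
rng = np.random.default_rng(0)
cache = ToyKVCache(head_dim=64, n_sink=4, window=64)
for _ in range(200):
    k, v, q = rng.standard_normal((3, 64)).astype(np.float32)
    cache.append(k, v)
    out = cache.attend(q)
print("cached tokens:", len(cache.keys), "output shape:", out.shape)
```

The cache size stays bounded at `n_sink + window` tokens regardless of sequence length; the papers in this collection replace this crude policy with smarter budgeting (per-layer, per-head, retrieval-aware, or quantization-based) to recover accuracy at much higher compression ratios.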