CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs • arXiv 2409.12490 • Published Sep 2024
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management • arXiv 2406.19707 • Published Jun 28, 2024
Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding • arXiv 2409.08561 • Published Sep 2024
Diver: Large Language Model Decoding with Span-Level Mutual Information Verification • arXiv 2406.02120 • Published Jun 4, 2024
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models • arXiv 2405.07542 • Published May 13, 2024
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation • arXiv 2407.11798 • Published Jul 16, 2024
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling • arXiv 2408.08696 • Published Aug 16, 2024
Learning Harmonized Representations for Speculative Sampling • arXiv 2408.15766 • Published Aug 28, 2024
KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning • arXiv 2408.08146 • Published Aug 15, 2024
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration • arXiv 2404.12022 • Published Apr 18, 2024
Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy • arXiv 2404.06954 • Published Apr 10, 2024
MoDeGPT: Modular Decomposition for Large Language Model Compression • arXiv 2408.09632 • Published Aug 19, 2024
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference • arXiv 2409.04992 • Published Sep 2024
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads • arXiv 2407.17678 • Published Jul 25, 2024
MemLong: Memory-Augmented Retrieval for Long Text Modeling • arXiv 2408.16967 • Published Aug 30, 2024
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning • arXiv 2409.06679 • Published Sep 2024
UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs • arXiv 2406.18173 • Published Jun 26, 2024
Eigen Attention: Attention in Low-Rank Space for KV Cache Compression • arXiv 2408.05646 • Published Aug 10, 2024
Theory, Analysis, and Best Practices for Sigmoid Self-Attention • arXiv 2409.04431 • Published Sep 2024
Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach • arXiv 2407.16833 • Published Jul 23, 2024
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding • arXiv 2408.11049 • Published Aug 20, 2024
Parallelizing Autoregressive Generation with Variational State Space Models • arXiv 2407.08415 • Published Jul 11, 2024
Transformer Language Models without Positional Encodings Still Learn Positional Information • arXiv 2203.16634 • Published Mar 30, 2022
Position Prediction as an Effective Pretraining Strategy • arXiv 2207.07611 • Published Jul 15, 2022
HyperAttention: Long-context Attention in Near-Linear Time • arXiv 2310.05869 • Published Oct 9, 2023
IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs • arXiv 2405.02842 • Published May 5, 2024
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters • arXiv 2408.04093 • Published Aug 7, 2024
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads • arXiv 2407.15891 • Published Jul 22, 2024
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention • arXiv 2406.15486 • Published Jun 17, 2024
LongHeads: Multi-Head Attention is Secretly a Long Context Processor • arXiv 2402.10685 • Published Feb 16, 2024
Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope • arXiv 2407.15176 • Published Jul 21, 2024
InfLLM: Unveiling the Intrinsic Capacity of LLMs for Understanding Extremely Long Sequences with Training-Free Memory • arXiv 2402.04617 • Published Feb 7, 2024
Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model • arXiv 2405.14174 • Published May 23, 2024
Longhorn: State Space Models are Amortized Online Learners • arXiv 2407.14207 • Published Jul 19, 2024
Mixture of Nested Experts: Adaptive Processing of Visual Tokens • arXiv 2407.19985 • Published Jul 29, 2024
RecycleGPT: An Autoregressive Language Model with Recyclable Module • arXiv 2308.03421 • Published Aug 7, 2023
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding • arXiv 2406.18200 • Published Jun 26, 2024
$\text{Memory}^3$: Language Modeling with Explicit Memory • arXiv 2407.01178 • Published Jul 1, 2024
Crafting the Path: Robust Query Rewriting for Information Retrieval • arXiv 2407.12529 • Published Jul 17, 2024
Conversational Query Reformulation with the Guidance of Retrieved Documents • arXiv 2407.12363 • Published Jul 17, 2024
CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search • arXiv 2406.05013 • Published Jun 7, 2024
Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers • arXiv 2406.10991 • Published Jun 16, 2024
Factual Dialogue Summarization via Learning from Large Language Models • arXiv 2406.14709 • Published Jun 20, 2024
AdaCQR: Enhancing Query Reformulation for Conversational Search via Sparse and Dense Retrieval Alignment • arXiv 2407.01965 • Published Jul 2, 2024
Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity • arXiv 2405.16579 • Published May 26, 2024
Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model • arXiv 2407.03040 • Published Jul 3, 2024
Synthesizing Conversations from Unlabeled Documents using Automatic Response Segmentation • arXiv 2406.03703 • Published Jun 6, 2024
Stateful Memory-Augmented Transformers for Dialogue Modeling • arXiv 2209.07634 • Published Sep 15, 2022
Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness • arXiv 2406.04156 • Published Jun 6, 2024