mdouglas's Collections
Reading List
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
Paper • 2404.15420 • Published • 7
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
Paper • 2404.14619 • Published • 124
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Paper • 2404.14219 • Published • 251
How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study
Paper • 2404.14047 • Published • 44
LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency
Paper • 2404.12872 • Published • 11
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Paper • 2404.11912 • Published • 16
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Paper • 2403.09636 • Published • 2
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Paper • 2403.09919 • Published • 20
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding
Paper • 2402.05109 • Published
Speculative Streaming: Fast LLM Inference without Auxiliary Models
Paper • 2402.11131 • Published • 41
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 53
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
Paper • 2402.02057 • Published
FP8-LM: Training FP8 Large Language Models
Paper • 2310.18313 • Published • 31
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Paper • 2310.08659 • Published • 22
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
Paper • 2309.02784 • Published • 1
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
Paper • 2309.16119 • Published • 1
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
Paper • 2310.16836 • Published • 13
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Paper • 2306.12929 • Published • 12
Matryoshka Representation Learning
Paper • 2205.13147 • Published • 9
MambaByte: Token-free Selective State Space Model
Paper • 2401.13660 • Published • 49
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Paper • 2307.13304 • Published • 2
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Paper • 2306.03078 • Published • 3
Efficient LLM inference solution on Intel GPU
Paper • 2401.05391 • Published • 7
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Paper • 2405.00332 • Published • 30
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Paper • 2404.07413 • Published • 36
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
Paper • 2405.01535 • Published • 116
H2O-Danube3 Technical Report
Paper • 2407.09276 • Published • 18
Qwen2 Technical Report
Paper • 2407.10671 • Published • 155
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients
Paper • 2407.08296 • Published • 31
Inference Performance Optimization for Large Language Models on CPUs
Paper • 2407.07304 • Published • 52
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Paper • 2403.06504 • Published • 53