stereoplegic's Collections: Quantization
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (arXiv:2310.08659)
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (arXiv:2309.14717)
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (arXiv:2309.02784)
- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (arXiv:2309.16119)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models (arXiv:2308.13137)
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models (arXiv:2308.15987)
- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (arXiv:2310.16795)
- LLM-FP4: 4-Bit Floating-Point Quantized Transformers (arXiv:2310.16836)
- Microscaling Data Formats for Deep Learning (arXiv:2310.10537)
- DeepliteRT: Computer Vision at the Edge (arXiv:2309.10878)
- Efficient Post-training Quantization with FP8 Formats (arXiv:2309.14592)
- NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search (arXiv:2308.05600)
- BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv:2310.11453)
- Understanding the Impact of Post-Training Quantization on Large Language Models (arXiv:2309.05210)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs (arXiv:2308.09723)
- Softmax Bias Correction for Quantized Generative Models (arXiv:2309.01729)
- Training and inference of large language models using 8-bit floating point (arXiv:2309.17224)
- TEQ: Trainable Equivalent Transformation for Quantization of LLMs (arXiv:2310.10944)
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models (arXiv:2310.08041)
- Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs (arXiv:2309.05516)
- PB-LLM: Partially Binarized Large Language Models (arXiv:2310.00034)
- Towards End-to-end 4-Bit Inference on Generative Large Language Models (arXiv:2310.09259)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt (arXiv:2305.11186)
- MEMORY-VQ: Compression for Tractable Internet-Scale Memory (arXiv:2308.14903)
- FP8-LM: Training FP8 Large Language Models (arXiv:2310.18313)
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (arXiv:2310.19102)
- QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314); a usage sketch follows this list
- A Survey on Model Compression for Large Language Models (arXiv:2308.07633)
- REx: Data-Free Residual Quantization Error Expansion (arXiv:2203.14645)
- Data-Free Quantization with Accurate Activation Clipping and Adaptive Batch Normalization (arXiv:2204.04215)
- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (arXiv:2305.17888)
- Token-Scaled Logit Distillation for Ternary Weight Generative Language Models (arXiv:2308.06744)
- Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders (arXiv:2211.11014)
- Quantized Feature Distillation for Network Quantization (arXiv:2307.10638)
- Model compression via distillation and quantization (arXiv:1802.05668)
- Adaptive Precision Training (AdaPT): A dynamic fixed point quantized training approach for DNNs (arXiv:2107.13490)
- Feature Affinity Assisted Knowledge Distillation and Quantization of Deep Neural Networks on Label-Free Data (arXiv:2302.10899)
- Compressing LLMs: The Truth is Rarely Pure and Never Simple (arXiv:2310.01382)
- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing (arXiv:2306.12929)
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling (arXiv:2304.09145)
- LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression (arXiv:2309.14021)
- Prune Once for All: Sparse Pre-Trained Language Models (arXiv:2111.05754)
- eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models (arXiv:2309.00964)
- Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (arXiv:2310.02410)
- SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics (arXiv:2305.18513)
- NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers (arXiv:2211.16056)
- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning (arXiv:2311.12023)
- Blockwise Compression of Transformer-based Models without Retraining (arXiv:2304.01483)
- Towards Fine-tuning Pre-trained Language Models with Integer Forward and Backward Propagation (arXiv:2209.09815)
- Learning Low-Rank Representations for Model Compression (arXiv:2211.11397)
- Ada-QPacknet -- adaptive pruning with bit width reduction as an efficient continual learning method without forgetting (arXiv:2308.07939)
- Efficient Storage of Fine-Tuned Models via Low-Rank Approximation of Weight Residuals (arXiv:2305.18425)
- BitDelta: Your Fine-Tune May Only Be Worth One Bit (arXiv:2402.10193)
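
Several entries in this collection (QLoRA, LoftQ, QA-LoRA, LQ-LoRA) combine low-bit weight quantization with low-rank adapters. As a point of reference, the minimal sketch below shows the 4-bit NF4 loading plus LoRA setup popularized by QLoRA (arXiv:2305.14314) via the Hugging Face transformers/peft/bitsandbytes stack. The model id and LoRA hyperparameters are illustrative assumptions, not values taken from any paper listed here.

```python
# Minimal sketch, assuming the transformers + peft + bitsandbytes stack:
# load a causal LM with 4-bit NF4 weights and double quantization (the
# QLoRA recipe), then attach LoRA adapters so only the low-rank matrices
# are trained while the base weights stay frozen and quantized.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],    # illustrative target modules
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only adapter weights are trainable
```

Papers such as LoftQ and LQ-LoRA refine this recipe by choosing the quantized weights and adapter initialization jointly, rather than quantizing first and initializing the adapters at zero.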