Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Abstract
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves competitive performance to full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU, a feat infeasible for previous methods, and yields up to a 2× throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS .
Community
The paper introduces Grass (GRAdient Structured Sparsification), a novel method for training large language models (LLMs) that uses sparse projections to manage GPU memory constraints more efficiently than previous projection-based optimization methods. By converting gradients into structured sparse updates, Grass substantially lowers the memory required for optimizer states, reduces the gradient memory footprint, and cuts down on computation and communication overheads. This approach enables more efficient model training, as demonstrated by the ability to pretrain a 13B parameter model on a single GPU with notable improvements in throughput. Grass shows competitive performance compared to full-rank training and traditional projection methods in both pretraining and finetuning tasks.
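To make the core idea concrete, here is a minimal NumPy sketch of how a structured sparse projection can shrink optimizer-state memory. This is an illustrative simplification, not the paper's exact algorithm: the row-selection indices here are chosen at random, and all dimensions (`m`, `n`, `k`) are hypothetical.

```python
import numpy as np

# Hedged sketch: a row-selection (structured sparse) projection.
# Conceptually, the projection matrix P is m x k with one nonzero per
# column, so P.T @ G reduces to row indexing -- no dense matmul needed.
m, n, k = 1024, 512, 64           # full gradient is m x n; keep k rows
rng = np.random.default_rng(0)

G = rng.standard_normal((m, n))   # dense gradient of a weight matrix

# Select k of the m rows (random here; the actual method may choose
# indices by a more principled rule).
idx = rng.choice(m, size=k, replace=False)
G_proj = G[idx]                   # k x n compressed gradient

# Optimizer states (e.g., Adam moments) live in the small space.
m_state = np.zeros_like(G_proj)   # k x n instead of m x n
v_state = np.zeros_like(G_proj)

# Projecting back yields an update that is row-sparse in the full space.
update = np.zeros_like(G)
update[idx] = G_proj              # only k rows are nonzero

print(G_proj.size / G.size)       # memory ratio: 64/1024 = 0.0625
```

Because the projection is a row selection rather than a dense matrix, both the projection and its transpose cost only indexing operations, which is the source of the compute and communication savings the summary describes.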
Hello! Very excited about this; I think it should get a lot more popularity. If you don't mind answering, what is the ETA for releasing the code?