arxiv:2407.09252

Context Embeddings for Efficient Answer Generation in RAG

Published on Jul 12

Authors: David Rau, Shuai Wang, Hervé Déjean

Abstract

Retrieval-Augmented Generation (RAG) overcomes the limited knowledge of LLMs by extending the input with external information. As a consequence, the contextual inputs to the model become much longer, which slows down decoding and directly increases the time a user has to wait for an answer. We address this challenge with COCOM, an effective context compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin. Our method supports different compression rates, trading off decoding time against answer quality. Compared to earlier methods, COCOM handles multiple contexts more effectively, significantly reducing decoding time for long inputs. Our method demonstrates a speed-up of up to 5.69 times while achieving higher performance than existing efficient context compression methods.
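To make the compression-rate trade-off concrete, the following is a minimal arithmetic sketch, not the paper's implementation. It assumes that a context of n tokens is replaced by ceil(n / rate) Context Embeddings and that the question is passed through uncompressed; the function name `compressed_input_length`, the context sizes, and the rates are all hypothetical and chosen only for illustration.

```python
import math


def compressed_input_length(context_lengths, compression_rate, question_length):
    """Estimate the decoder input length after compressing each context.

    Assumption (illustrative, not taken from the abstract): a context of n
    tokens is replaced by ceil(n / compression_rate) context embeddings,
    and the question tokens are left uncompressed.
    """
    if compression_rate < 1:
        raise ValueError("compression_rate must be >= 1")
    n_embeddings = sum(math.ceil(n / compression_rate) for n in context_lengths)
    return n_embeddings + question_length


# Hypothetical setup: five retrieved contexts of 128 tokens, a 20-token question.
contexts = [128] * 5
question = 20

for rate in (1, 4, 16, 128):
    length = compressed_input_length(contexts, rate, question)
    print(f"compression rate {rate:>3}: decoder input length {length}")
```

Under these assumptions, a higher rate shrinks the input the decoder must attend to, which is where the generation speed-up would come from; the cost is that each embedding has to summarize more text, which is the answer-quality side of the trade-off.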

Community

dmrau

Paper author Jul 16

We propose a compression method that reduces long contexts to only a handful of Context Embeddings, speeding up generation by a large margin while maintaining high performance.

