Abstract
Spreadsheets, with their extensive two-dimensional grids, various layouts, and diverse formatting options, present notable challenges for large language models (LLMs). In response, we introduce SpreadsheetLLM, pioneering an efficient encoding method designed to unleash and optimize LLMs' powerful understanding and reasoning capabilities on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach is limited by LLMs' token constraints, making it impractical for most applications. To tackle this challenge, we develop SheetCompressor, an innovative encoding framework that compresses spreadsheets effectively for LLMs. It comprises three modules: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance on the spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT-4's in-context learning setting. Moreover, an LLM fine-tuned with SheetCompressor attains an average compression ratio of 25x while achieving a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%. Finally, we propose Chain of Spreadsheet for downstream spreadsheet understanding tasks and validate it in a new and demanding spreadsheet QA task. We methodically leverage the inherent layout and structure of spreadsheets, demonstrating that SpreadsheetLLM is highly effective across a variety of spreadsheet tasks.
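For illustration only (the paper's code is not released), here is a minimal sketch of what a vanilla cell-by-cell serialization and an inverse-index-style rewrite could look like; the function names and field layout are assumptions, not the authors' implementation.

```python
# Hypothetical sketch, not the authors' released code: a vanilla cell-by-cell
# serialization and an inverse-index-style rewrite that groups identical values
# (one plausible reading of the "inverse index translation" module).

def vanilla_serialize(cells):
    """cells: dict mapping 'A1'-style addresses to (value, format) tuples."""
    return ", ".join(f"{addr}|{val}|{fmt}" for addr, (val, fmt) in cells.items())

def inverse_index(cells):
    """Group addresses by value: repeated values cost one entry, empty cells cost none."""
    index = {}
    for addr, (val, _fmt) in cells.items():
        if val != "":                      # drop empty cells entirely
            index.setdefault(val, []).append(addr)
    return {val: ",".join(addrs) for val, addrs in index.items()}

cells = {
    "A1": ("Year", "text"),   "B1": ("Revenue", "text"),
    "A2": ("2023", "number"), "B2": ("1.2M", "currency"),
    "A3": ("2024", "number"), "B3": ("1.5M", "currency"),
    "C1": ("", "general"),    "C2": ("", "general"),
}
print(vanilla_serialize(cells))   # long, token-hungry encoding
print(inverse_index(cells))       # shorter value -> addresses mapping
```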
Community
[Disclaimer: I don't claim to be an expert, I just want to have an insightful discussion with domain experts]
Formidable work! I learned a lot reading this article! As I was reading it, a question came to mind.
In the introduction, you say that "However, spreadsheets pose unique challenges for LLMs due to their expansive grids that usually exceed the token limitations of popular LLMs, as well as their inherent two-dimensional layouts and structures, which are poorly suited to linear and sequential input."
This sentence sparked an idea: yes, LLMs struggle to comprehend the 2D structure of tabular data, but is it possible to chunk the data into "sub-arrays" the same way Dosovitskiy et al. did in their ViT paper (arXiv:2010.11929)? I remember that they split their input matrices into smaller patches to reduce the cost of self-attention.
So I was wondering: is it possible to take this idea from matrices-as-images to matrices-as-spreadsheets? Would adapting this technique be relevant for enhancing LLMs' comprehension of tabular data?
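To make the analogy concrete, here is a rough, illustrative sketch of splitting a spreadsheet-like grid into fixed-size sub-grids, in the spirit of ViT patches; the patch size and the toy grid are arbitrary choices, not anything from the paper.

```python
# Illustrative only: split a spreadsheet-like 2D grid into fixed-size "patches",
# analogous to ViT's image patching. The patch size is an arbitrary choice here,
# not something proposed in the paper.
import numpy as np

def to_patches(grid, patch_rows=2, patch_cols=2):
    """Split a 2D array of cell values into (top_row, left_col, patch) tuples."""
    n_rows, n_cols = grid.shape
    patches = []
    for r in range(0, n_rows, patch_rows):
        for c in range(0, n_cols, patch_cols):
            patches.append((r, c, grid[r:r + patch_rows, c:c + patch_cols]))
    return patches

grid = np.array([
    ["Year", "Q1", "Q2", "Q3"],
    ["2023", "10", "12", "14"],
    ["2024", "11", "13", "15"],
    ["2025", "12", "14", "16"],
])
for r, c, patch in to_patches(grid):
    print(f"patch at ({r},{c}):", patch.tolist())
```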
I had expected exploration of modified positional encoding schemes in two dimensions for this problem. Was that considered at all?
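For reference, one common construction from the vision literature (not from this paper) builds a 2D positional encoding by concatenating 1D sinusoidal encodings of the row and column indices; the sketch below is purely illustrative.

```python
# One common construction from the vision literature (not from this paper):
# a 2D positional encoding formed by concatenating 1D sinusoidal encodings
# of the row index and the column index.
import numpy as np

def sincos_1d(positions, dim):
    """Standard 1D sinusoidal encoding; returns an array of shape (len(positions), dim)."""
    pos = np.asarray(positions, dtype=np.float64)[:, None]
    i = np.arange(dim // 2, dtype=np.float64)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def sincos_2d(n_rows, n_cols, dim):
    """Each cell (r, c) gets [row_encoding(r); col_encoding(c)], dim values in total."""
    row_enc = sincos_1d(np.arange(n_rows), dim // 2)   # (n_rows, dim // 2)
    col_enc = sincos_1d(np.arange(n_cols), dim // 2)   # (n_cols, dim // 2)
    grid = np.zeros((n_rows, n_cols, dim))
    grid[:, :, : dim // 2] = row_enc[:, None, :]       # broadcast rows across columns
    grid[:, :, dim // 2 :] = col_enc[None, :, :]       # broadcast columns across rows
    return grid

pe = sincos_2d(n_rows=4, n_cols=3, dim=8)   # one 8-dim vector per cell
print(pe.shape)                             # (4, 3, 8)
```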
They mention supplementary material in the paper, but I have no idea where it is.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs (2024)
- CHESS: Contextual Harnessing for Efficient SQL Synthesis (2024)
- One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models (2024)
- SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation (2024)
- QuickLLaMA: Query-aware Inference Acceleration for Large Language Models (2024)
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`
Yeah, where is their model? Do they even publish models?
Kudos to Yuzhang and team. I've featured this paper in my AI research newsletter: https://www.aitidbits.ai/p/july-18th-2024
Looking forward to more novel papers and methods.
The supplementary materials mentioned in the paper are nowhere to be found, which makes it hard for anyone to verify all the claims. The paper says it used the same datasets as the earlier TableSense paper (WebSheet10K and WebSheet400), but those datasets cannot be found anywhere either. It feels like a black hole of research.
Has anyone found a SpreadsheetLLM implementation or code yet? Or would anyone be interested in trying to figure it out ourselves, or would that be impossible?
Yeah, I'm also waiting for an implementation.
I'm wondering what approaches people have taken to understand sheets deeply. I know converting each cell to JSON can certainly help.
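As a rough illustration of the cell-to-JSON idea, here is a minimal sketch; the field names (address, value, format) are just one plausible choice, not a standard schema.

```python
# Minimal sketch of the "each cell as JSON" idea; the field names
# (address, value, format) are one plausible choice, not a standard schema.
import json

def cells_to_json(cells):
    """cells: dict of 'A1'-style address -> (value, format). Returns a JSON string."""
    records = [
        {"address": addr, "value": val, "format": fmt}
        for addr, (val, fmt) in cells.items()
        if val != ""                       # skip empty cells to save tokens
    ]
    return json.dumps(records, indent=2)

print(cells_to_json({"A1": ("Year", "text"), "A2": ("2023", "number")}))
```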
3 months later, no code. Guess we'll just have to take their word for how awesome it is.