How to turbocharge LLMs for spreadsheet tasks


This article is part of our coverage of the latest in AI research.

Spreadsheets are ubiquitous, but their unique structure and features make them difficult for large language models (LLMs) to process. To address these challenges, researchers at Microsoft have introduced SpreadsheetLLM, a new framework that enables LLMs to process spreadsheets more effectively.

SpreadsheetLLM uses a novel encoding technique that compresses spreadsheet data into a format that is more suitable for LLMs. This approach significantly reduces the token consumption of spreadsheet data and improves the performance of LLMs on various spreadsheet-related tasks. 

SpreadsheetLLM

LLMs were not designed to handle spreadsheets. Spreadsheets are organized in two-dimensional grids that can span thousands of rows and columns, often exceeding the token limits of even the largest LLMs. Cell addresses, formats, and formulas further complicate parsing and understanding for models built to process linear text.

SpreadsheetLLM enables LLMs to overcome the limitations of processing spreadsheets through an efficient encoding method that compresses and converts spreadsheet data into a format suitable for LLMs.

The researchers first explored a simple encoding method that serializes spreadsheet data into Markdown format while preserving important information such as cell addresses and formats. While helpful for smaller spreadsheets, this approach struggles with larger files that exceed the token limits of current LLMs. 
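As a rough illustration, the snippet below serializes a toy sheet into a Markdown table that preserves addresses and formats. The per-cell (address, value, format) layout and the helper names are my own assumptions for the sketch, not the paper's exact encoding:

```python
# Minimal sketch of Markdown-style serialization. Encoding every cell as
# an (address, value, format) row is an illustrative assumption, not the
# paper's exact scheme.

def to_markdown(cells: dict[str, tuple[str, str]]) -> str:
    """`cells` maps addresses like 'A1' to (value, number format) pairs."""
    lines = ["|Address|Value|Format|", "|---|---|---|"]
    for addr, (value, fmt) in cells.items():
        lines.append(f"|{addr}|{value}|{fmt}|")
    return "\n".join(lines)

sheet = {"A1": ("Year", "text"), "B1": ("Revenue", "text"),
         "A2": ("2023", "0"), "B2": ("1200000", "#,##0")}
print(to_markdown(sheet))
```

The weakness is visible even here: every cell costs a full row of tokens, so the encoding grows linearly with the sheet.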

To address this challenge, the researchers developed SheetCompressor, a novel encoding framework composed of three modules that work together to compress spreadsheets effectively.

Structural Anchors for Efficient Layout Understanding: This method identifies “structural anchors,” the borders of table areas within a spreadsheet. It then removes rows and columns that are not in close proximity to the structural anchors. This technique maintains the essential information about the layout of the spreadsheet while significantly reducing its size. 
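A minimal sketch of the pruning idea, assuming a simple boundary test: a row whose emptiness pattern differs from the row above is treated as a candidate table boundary, and rows far from every anchor are dropped. The radius K and the boundary heuristic are simplifying assumptions, not the paper's exact method:

```python
# Heuristic sketch of anchor-based pruning: rows whose emptiness pattern
# changes relative to the previous row are treated as candidate table
# boundaries ("structural anchors"); rows farther than K from every
# anchor are dropped. K and the boundary test are assumed values.

K = 2  # neighborhood radius kept around each anchor (assumed value)

def prune_rows(grid: list[list[str]]) -> list[list[str]]:
    filled = [[bool(cell.strip()) for cell in row] for row in grid]
    anchors = {i for i in range(len(grid))
               if i == 0 or filled[i] != filled[i - 1]}
    return [row for i, row in enumerate(grid)
            if any(abs(i - a) <= K for a in anchors)]
```

The same pass would run over columns, so large uniform regions far from any table boundary are discarded in both dimensions.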

Inverted-Index Translation for Token Efficiency: This technique leverages the repetitive nature of spreadsheet data. Instead of storing the spreadsheet in a matrix, it creates a dictionary where each unique cell value is stored once along with the ranges of cells that contain it. Empty cells are removed to further reduce token consumption.
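In code, the core of this idea is a value-to-cells dictionary. The sketch below shows the inversion step; merging cell addresses into compact ranges (e.g. 'A1:A2') is omitted for brevity:

```python
# Sketch of inverted-index translation: each unique, non-empty value maps
# to the cells that contain it, so repeated values cost one entry and
# empty cells cost nothing.

from collections import defaultdict

def invert(cells: dict[str, str]) -> dict[str, list[str]]:
    index = defaultdict(list)
    for addr, value in cells.items():
        if value.strip():  # empty cells are simply dropped
            index[value].append(addr)
    return dict(index)

print(invert({"A1": "Q1", "A2": "Q1", "A3": "", "B1": "Q2"}))
# {'Q1': ['A1', 'A2'], 'Q2': ['B1']}
```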

Data Format Aggregation for Numerical Cells: This technique further compresses the spreadsheet by grouping cells with similar data formats. Since neighboring cells often have identical formats, this technique uses clustering to group and represent them more efficiently and with fewer tokens.
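The sketch below collapses a column into runs of cells that share a detected format, so a thousand dates can be represented by one entry instead of a thousand values. The format detector and the run-based grouping are simplified assumptions standing in for the paper's clustering:

```python
# Sketch of format aggregation: contiguous cells sharing a number format
# are collapsed into one (format, start_row, end_row) entry. The format
# detector and run merging are simplifying assumptions.

import re

def detect_format(value: str) -> str:
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "date"
    if value.endswith("%"):
        return "percentage"
    if re.fullmatch(r"[\d,]+(\.\d+)?", value):
        return "number"
    return "text"

def aggregate(column: list[str]) -> list[tuple[str, int, int]]:
    """Collapse a column into (format, start_row, end_row) runs."""
    runs, start = [], 0
    for i in range(1, len(column) + 1):
        if i == len(column) or detect_format(column[i]) != detect_format(column[start]):
            runs.append((detect_format(column[start]), start + 1, i))
            start = i
    return runs

print(aggregate(["2024-01-01", "2024-01-02", "3.5%", "4.1%"]))
# [('date', 1, 2), ('percentage', 3, 4)]
```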

SpreadsheetLLM (source: arXiv)

To enable LLMs to perform reasoning tasks over spreadsheets, the researchers also introduce Chain of Spreadsheet (CoS), an approach inspired by the Chain-of-Thought (CoT) prompting technique. 

CoS first identifies the table relevant to the user query and then selects the rows and columns needed for the input prompt. The LLM then processes the query together with the extracted table data to generate the response. Shrinking the input cuts processing costs and lowers the risk of hallucinations.
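A minimal sketch of the two-stage flow follows. Here `llm` stands in for any chat-completion call, `extract_range` is a hypothetical helper, and the prompt wording is my assumption rather than the paper's exact prompts:

```python
# Sketch of the two-stage Chain of Spreadsheet flow. `llm` is any
# text-in/text-out model call and `extract_range` is a hypothetical
# helper; prompts are illustrative assumptions.

def chain_of_spreadsheet(llm, extract_range, sheet: str, question: str) -> str:
    # Stage 1: locate the table region relevant to the question.
    region = llm(
        f"Spreadsheet:\n{sheet}\n\nQuestion: {question}\n"
        "Return only the cell range of the table needed to answer."
    )
    # Stage 2: answer from the extracted region alone, which shrinks the
    # prompt and the surface area for hallucination.
    table = extract_range(sheet, region)
    return llm(f"Table:\n{table}\n\nAnswer the question: {question}")
```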

SpreadsheetLLM in action

The researchers evaluated SpreadsheetLLM using several LLMs, including GPT-4, GPT-3.5, Llama-2, Llama-3, Phi, and Mistral-v2. They also created a new Spreadsheet QA dataset tailored to the challenges of multi-table spreadsheets.

The experiments show that SpreadsheetLLM reduces token usage for spreadsheets by 96%, a roughly 25x compression ratio that puts spreadsheet data within reach of even small LLMs. The compressed encoding also improved the models' in-context learning, enabling them to perform well with fewer examples.

SpreadsheetLLM also achieved state-of-the-art results in spreadsheet table detection, a foundational task for spreadsheet understanding. 

Furthermore, the Chain of Spreadsheet (CoS) approach significantly outperformed existing methods on a spreadsheet question-answering task, achieving a 22% accuracy improvement over the baseline GPT-4 model. CoS proved particularly helpful for larger spreadsheets where providing the entire file to the LLM would exceed its token limit.

This paper is interesting because it addresses a practical problem with clear value for enterprise applications. So much important information lives in spreadsheets that being able to query them reliably through conversational interfaces could unlock a wide range of use cases.
