How to turbocharge LLMs for spreadsheet tasks


This article is part of our coverage of the latest in AI research.

Spreadsheets are ubiquitous, but their unique structure and features make them difficult for large language models (LLMs) to process. To address these challenges, researchers at Microsoft have introduced SpreadsheetLLM, a new framework that enables LLMs to process spreadsheets more effectively.

SpreadsheetLLM uses a novel encoding technique that compresses spreadsheet data into a format that is more suitable for LLMs. This approach significantly reduces the token consumption of spreadsheet data and improves the performance of LLMs on various spreadsheet-related tasks. 

SpreadsheetLLM

LLMs were not designed to handle spreadsheets. Spreadsheets are organized in two-dimensional grids that can span thousands of rows and columns, often exceeding the token limits of even the largest LLMs. Cell addresses, formats, and formulas further complicate parsing and understanding for models built to process linear text.

SpreadsheetLLM enables LLMs to overcome the limitations of processing spreadsheets through an efficient encoding method that compresses and converts spreadsheet data into a format suitable for LLMs.

The researchers first explored a simple encoding method that serializes spreadsheet data into Markdown format while preserving important information such as cell addresses and formats. While helpful for smaller spreadsheets, this approach struggles with larger files that exceed the token limits of current LLMs. 
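As a rough illustration, the snippet below serializes a toy sheet into a Markdown table that preserves addresses and formats. The per-cell (address, value, format) layout and the helper names are my own assumptions for the sketch, not the paper's exact encoding:

```python
# Minimal sketch of Markdown-style serialization. Encoding every cell as
# an (address, value, format) row is an illustrative assumption, not the
# paper's exact scheme.

def to_markdown(cells: dict[str, tuple[str, str]]) -> str:
    """`cells` maps addresses like 'A1' to (value, number format) pairs."""
    lines = ["|Address|Value|Format|", "|---|---|---|"]
    for addr, (value, fmt) in cells.items():
        lines.append(f"|{addr}|{value}|{fmt}|")
    return "\n".join(lines)

sheet = {"A1": ("Year", "text"), "B1": ("Revenue", "text"),
         "A2": ("2023", "0"), "B2": ("1200000", "#,##0")}
print(to_markdown(sheet))
```

The weakness is visible even here: every cell costs a full row of tokens, so the encoding grows linearly with the sheet.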

To address this challenge, the researchers developed SheetCompressor, a novel encoding framework composed of three modules that work together to compress spreadsheets effectively.

Structural Anchors for Efficient Layout Understanding: This method identifies “structural anchors,” the borders of table areas within a spreadsheet. It then removes rows and columns that are not in close proximity to the structural anchors. This technique maintains the essential information about the layout of the spreadsheet while significantly reducing its size. 
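A minimal sketch of the pruning idea, assuming a simple boundary test: a row whose emptiness pattern differs from the row above is treated as a candidate table boundary, and rows far from every anchor are dropped. The radius K and the boundary heuristic are simplifying assumptions, not the paper's exact method:

```python
# Heuristic sketch of anchor-based pruning: rows whose emptiness pattern
# changes relative to the previous row are treated as candidate table
# boundaries ("structural anchors"); rows farther than K from every
# anchor are dropped. K and the boundary test are assumed values.

K = 2  # neighborhood radius kept around each anchor (assumed value)

def prune_rows(grid: list[list[str]]) -> list[list[str]]:
    filled = [[bool(cell.strip()) for cell in row] for row in grid]
    anchors = {i for i in range(len(grid))
               if i == 0 or filled[i] != filled[i - 1]}
    return [row for i, row in enumerate(grid)
            if any(abs(i - a) <= K for a in anchors)]
```

The same pass would run over columns, so large uniform regions far from any table boundary are discarded in both dimensions.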

Inverted-Index Translation for Token Efficiency: This technique leverages the repetitive nature of spreadsheet data. Instead of storing the spreadsheet in a matrix, it creates a dictionary where each unique cell value is stored once along with the ranges of cells that contain it. Empty cells are removed to further reduce token consumption.
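In code, the core of this idea is a value-to-cells dictionary. The sketch below shows the inversion step; merging cell addresses into compact ranges (e.g. 'A1:A2') is omitted for brevity:

```python
# Sketch of inverted-index translation: each unique, non-empty value maps
# to the cells that contain it, so repeated values cost one entry and
# empty cells cost nothing.

from collections import defaultdict

def invert(cells: dict[str, str]) -> dict[str, list[str]]:
    index = defaultdict(list)
    for addr, value in cells.items():
        if value.strip():  # empty cells are simply dropped
            index[value].append(addr)
    return dict(index)

print(invert({"A1": "Q1", "A2": "Q1", "A3": "", "B1": "Q2"}))
# {'Q1': ['A1', 'A2'], 'Q2': ['B1']}
```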

Data Format Aggregation for Numerical Cells: This technique further compresses the spreadsheet by grouping cells with similar data formats. Since neighboring cells often have identical formats, this technique uses clustering to group and represent them more efficiently and with fewer tokens.
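The sketch below collapses a column into runs of cells that share a detected format, so a thousand dates can be represented by one entry instead of a thousand values. The format detector and the run-based grouping are simplified assumptions standing in for the paper's clustering:

```python
# Sketch of format aggregation: contiguous cells sharing a number format
# are collapsed into one (format, start_row, end_row) entry. The format
# detector and run merging are simplifying assumptions.

import re

def detect_format(value: str) -> str:
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "date"
    if value.endswith("%"):
        return "percentage"
    if re.fullmatch(r"[\d,]+(\.\d+)?", value):
        return "number"
    return "text"

def aggregate(column: list[str]) -> list[tuple[str, int, int]]:
    """Collapse a column into (format, start_row, end_row) runs."""
    runs, start = [], 0
    for i in range(1, len(column) + 1):
        if i == len(column) or detect_format(column[i]) != detect_format(column[start]):
            runs.append((detect_format(column[start]), start + 1, i))
            start = i
    return runs

print(aggregate(["2024-01-01", "2024-01-02", "3.5%", "4.1%"]))
# [('date', 1, 2), ('percentage', 3, 4)]
```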

SpreadsheetLLM (source: arXiv)

To enable LLMs to perform reasoning tasks over spreadsheets, the researchers also introduce Chain of Spreadsheet (CoS), an approach inspired by the Chain-of-Thought (CoT) prompting technique. 

CoS first identifies the table relevant to the user query and then selects the rows and columns needed for the input prompt. The LLM then processes the query together with the extracted table data to generate the response. Shrinking the input cuts processing costs and lowers the risk of hallucinations.
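A minimal sketch of the two-stage flow follows. Here `llm` stands in for any chat-completion call, `extract_range` is a hypothetical helper, and the prompt wording is my assumption rather than the paper's exact prompts:

```python
# Sketch of the two-stage Chain of Spreadsheet flow. `llm` is any
# text-in/text-out model call and `extract_range` is a hypothetical
# helper; prompts are illustrative assumptions.

def chain_of_spreadsheet(llm, extract_range, sheet: str, question: str) -> str:
    # Stage 1: locate the table region relevant to the question.
    region = llm(
        f"Spreadsheet:\n{sheet}\n\nQuestion: {question}\n"
        "Return only the cell range of the table needed to answer."
    )
    # Stage 2: answer from the extracted region alone, which shrinks the
    # prompt and the surface area for hallucination.
    table = extract_range(sheet, region)
    return llm(f"Table:\n{table}\n\nAnswer the question: {question}")
```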

SpreadsheetLLM in action

The researchers evaluated SpreadsheetLLM using several LLMs, including GPT-4, GPT-3.5, Llama-2, Llama-3, Phi, and Mistral-v2. They also created a new Spreadsheet QA dataset tailored to the challenges of multi-table spreadsheets.

The experiments show that SpreadsheetLLM reduces token usage for spreadsheets by 96%, a roughly 25x compression ratio that puts spreadsheet data within reach of even small LLMs. The compressed encoding also improved the models' in-context learning, enabling them to perform well with fewer examples.

SpreadsheetLLM also achieved state-of-the-art results in spreadsheet table detection, a foundational task for spreadsheet understanding. 

Furthermore, the Chain of Spreadsheet (CoS) approach significantly outperformed existing methods on a spreadsheet question-answering task, achieving a 22% accuracy improvement over the baseline GPT-4 model. CoS proved particularly helpful for larger spreadsheets where providing the entire file to the LLM would exceed its token limit.

This paper is interesting because it addresses a practical problem with clear value for enterprise applications. So much important information lives in spreadsheets that being able to query them reliably through conversational interfaces could unlock a wide range of use cases.
