A Python package for encoding spreadsheets for Large Language Models, implementing the SpreadsheetLLM research framework
Project description
SheetWise
A Python package for encoding spreadsheets for Large Language Models, implementing the SpreadsheetLLM research framework.
Overview
SheetWise is a Python package that implements the key components from Microsoft Research's SpreadsheetLLM paper for efficiently encoding spreadsheets for use with Large Language Models. The package provides:
- SheetCompressor: Efficient encoding framework with three compression modules
- Chain of Spreadsheet: Multi-step reasoning approach for spreadsheet analysis
- Vanilla Encoding: Traditional cell-by-cell encoding methods
- Token Optimization: Significant reduction in token usage
- Formula Analysis: Extract and simplify Excel formulas
- Multi-Sheet Support: Process entire workbooks with cross-sheet references
- Visualization Tools: Generate visual reports of compression results
Key Features
- Intelligent Compression: Up to 96% reduction in token usage while preserving semantic information
- Auto-Configuration: Automatically optimizes compression settings based on spreadsheet characteristics
- Multi-Table Support: Handles complex spreadsheets with multiple tables and regions
- Structural Analysis: Identifies and preserves important structural elements
- Format-Aware: Preserves data type and formatting information
- Enhanced Algorithms: Improved range detection and contiguous cell grouping
- Easy Integration: Simple API for immediate use
Token Budgeting
Unsure if your spreadsheet fits in the context window? Use encode_to_token_limit:
from sheetwise import SpreadsheetLLM
import pandas as pd
df = pd.read_excel("large_file.xlsx")
sllm = SpreadsheetLLM()
# Automatically adjust compression to fit within 4000 tokens
encoded_text = sllm.encode_to_token_limit(df, max_tokens=4000)
Memory Optimization (Large Files)
For very large spreadsheets (100MB+), you can use inplace=True to significantly reduce RAM usage during compression.
# Standard Compression (Creates a copy in RAM)
result = compressor.compress(df)
# Memory-Efficient Compression (Modifies DataFrame in-place or minimizes copies)
# Use this when working with datasets close to your RAM limit.
result = compressor.compress(df, inplace=True)
Installation
Using pip
pip install sheetwise
Using Poetry
poetry add sheetwise
Development Installation
git clone https://github.com/yourusername/sheetwise.git
cd sheetwise
poetry install
Quick Start
Basic Usage
import pandas as pd
from sheetwise import SpreadsheetLLM
# Initialize the framework
sllm = SpreadsheetLLM()
# Load your spreadsheet
df = pd.read_excel("your_spreadsheet.xlsx")
# Compress and encode for LLM use
llm_ready_text = sllm.compress_and_encode_for_llm(df)
# Copy and paste this text directly into ChatGPT/Claude
print(llm_ready_text)
Advanced Usage
from sheetwise import SpreadsheetLLM, SheetCompressor
# Auto-configuration
sllm = SpreadsheetLLM(enable_logging=True)
auto_compressed = sllm.compress_with_auto_config(df) # Automatically optimizes settings
# Manual configuration
compressor = SheetCompressor(
k=2, # Structural anchor neighborhood size
use_extraction=True,
use_translation=True,
use_aggregation=True
)
# Compress the spreadsheet
compressed_result = compressor.compress(df)
print(f"Compression ratio: {compressed_result['compression_ratio']:.1f}x")
print(f"Compressed shape: {compressed_result['compressed_df'].shape}")
# Or use with SpreadsheetLLM for full pipeline
sllm = SpreadsheetLLM(compression_params={
'k': 2,
'use_extraction': True,
'use_translation': True,
'use_aggregation': True
})
# Get detailed statistics
stats = sllm.get_encoding_stats(df)
print(f"Token reduction: {stats['token_reduction_ratio']:.1f}x")
# Process QA queries
result = sllm.process_qa_query(df, "What was the total revenue in 2023?")
Enhanced Features Usage (v2.0+)
from sheetwise import (
SpreadsheetLLM,
FormulaParser,
WorkbookManager,
CompressionVisualizer,
SmartTableDetector
)
# Formula extraction and analysis
formula_parser = FormulaParser()
formulas = formula_parser.extract_formulas("your_spreadsheet.xlsx")
formula_parser.build_dependency_graph()
impact = formula_parser.get_formula_impact("Sheet1!A1")
formula_text = formula_parser.encode_formulas_for_llm()
# Multi-sheet support
workbook = WorkbookManager()
sheets = workbook.load_workbook("your_workbook.xlsx")
cross_refs = workbook.detect_cross_sheet_references()
sllm = SpreadsheetLLM()
compressed = workbook.compress_workbook(sllm.compressor)
encoded = workbook.encode_workbook_for_llm(compressed)
# Visualization
visualizer = CompressionVisualizer()
df = sllm.load_from_file("your_spreadsheet.xlsx")
compressed_result = sllm.compress_spreadsheet(df)
fig = visualizer.create_data_density_heatmap(df)
fig.savefig("heatmap.png")
html_report = visualizer.generate_html_report(df, compressed_result)
# Advanced table detection
detector = SmartTableDetector()
tables = detector.detect_tables(df)
extracted_tables = detector.extract_tables_to_dataframes(df)
Command Line Interface
# Basic usage
sheetwise input.xlsx -o output.txt --stats
# Auto-configure compression
sheetwise input.xlsx --auto-config --verbose
# Run demo with sample data
sheetwise --demo --auto-config
# Use vanilla encoding instead of compression
sheetwise input.xlsx --vanilla
# Output in JSON format
sheetwise input.xlsx --format json
Enhanced CLI Features (v2.0+)
# Extract and analyze formulas
sheetwise your_spreadsheet.xlsx --extract-formulas
# Process all sheets in a workbook
sheetwise your_workbook.xlsx --multi-sheet
# Generate visualizations
sheetwise your_spreadsheet.xlsx --visualize
# Detect and extract tables
sheetwise your_spreadsheet.xlsx --detect-tables
# Generate an HTML report
sheetwise your_spreadsheet.xlsx --format html
Benchmarks & Visualization
SheetWise includes a benchmarking script to evaluate compression, speed, and memory usage across spreadsheets. This helps you understand performance and compare results visually.
Running Benchmarks
- Place your sample spreadsheets in
benchmarks/samples/(supports .xlsx and .csv). - Run the benchmark script:
python scripts/generate_benchmarks.py
- Results and charts will be saved in
benchmarks/results/andbenchmarks/charts/.
Core Components
1. SheetCompressor
The main compression framework with three modules:
- Structural Anchor Extraction: Identifies and preserves structurally important rows/columns
- Inverted Index Translation: Creates efficient value-to-location mappings
- Data Format Aggregation: Groups cells by data type and format
2. Chain of Spreadsheet
Multi-step reasoning approach:
- Table Identification: Automatically detects table regions
- Compression: Applies SheetCompressor to reduce size
- Query Processing: Identifies relevant regions for specific queries
3. Enhanced Modules (v2.0+)
- FormulaParser: Extracts and analyzes Excel formulas
- WorkbookManager: Handles multi-sheet workbooks and cross-references
- CompressionVisualizer: Generates visualizations and reports
- SmartTableDetector: Advanced table detection and classification
Examples
Working with Financial Data
from sheetwise import SpreadsheetLLM
from sheetwise.utils import create_realistic_spreadsheet
# Create sample financial spreadsheet
df = create_realistic_spreadsheet()
sllm = SpreadsheetLLM()
# Analyze the data
stats = sllm.get_encoding_stats(df)
print(f"Original size: {stats['original_shape']}")
print(f"Sparsity: {stats['sparsity_percentage']:.1f}% empty cells")
print(f"Compression: {stats['compression_ratio']:.1f}x smaller")
# Generate LLM-ready output
encoded = sllm.compress_and_encode_for_llm(df)
print("\nReady for LLM:")
print(encoded[:300] + "...")
Visualizing Compression
from sheetwise import SpreadsheetLLM, CompressionVisualizer
import pandas as pd
# Load your data
df = pd.read_excel("complex_spreadsheet.xlsx")
# Compress the data
sllm = SpreadsheetLLM()
compressed_result = sllm.compress_spreadsheet(df)
# Create visualizations
visualizer = CompressionVisualizer()
# Generate heatmap of data density
fig1 = visualizer.create_data_density_heatmap(df)
fig1.savefig("density_heatmap.png")
# Compare original vs compressed
fig2 = visualizer.compare_original_vs_compressed(df, compressed_result)
fig2.savefig("compression_comparison.png")
# Generate HTML report with all visualizations
html_report = visualizer.generate_html_report(df, compressed_result)
with open("compression_report.html", "w") as f:
f.write(html_report)
# Compare different compression strategies
configs = [
{"name": "Extraction Only", "use_translation": False, "use_aggregation": False},
{"name": "Translation Only", "use_extraction": False, "use_aggregation": False},
{"name": "All Modules", "use_extraction": True, "use_translation": True, "use_aggregation": True}
]
for config in configs:
compressor = SheetCompressor(**{k: v for k, v in config.items() if k != "name"})
result = compressor.compress(df)
print(f"{config['name']}: {result['compression_ratio']:.1f}x compression")
Performance
SpreadsheetLLM achieves significant improvements over vanilla encoding:
| Metric | Vanilla | SpreadsheetLLM | Improvement |
|---|---|---|---|
| Token Count | ~25,000 | ~1,200 | 96% reduction |
| Sparsity Handling | Poor | Excellent | Removes empty regions |
| Multi-Table Support | Limited | Native | Preserves structure |
| Format Preservation | Basic | Advanced | Type-aware grouping |
API Reference
SpreadsheetLLM Class
The main interface for the framework.
Methods
compress_and_encode_for_llm(df): One-step compression and encodingcompress_spreadsheet(df): Apply compression pipelineencode_vanilla(df): Traditional encodingget_encoding_stats(df): Detailed compression statisticsprocess_qa_query(df, query): Chain of Spreadsheet reasoningload_from_file(filepath): Load spreadsheet from file
SheetCompressor Class
Core compression framework.
Parameters
k: Structural anchor neighborhood size (default: 4)use_extraction: Enable structural extraction (default: True)use_translation: Enable inverted index translation (default: True)use_aggregation: Enable format aggregation (default: True)
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Setup
# Clone the repository
git clone https://github.com/yourusername/sheetwise.git
cd sheetwise
# Install development dependencies
poetry install
# Run tests
poetry run pytest
# Run linting
poetry run black src tests
poetry run isort src tests
poetry run flake8 src tests
Running Tests
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=src
# Run specific test file
poetry run pytest tests/test_core.py
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use SpreadsheetLLM in your research, please cite:
@article{spreadsheetllm2024,
title={SpreadsheetLLM: Encoding Spreadsheets for Large Language Models},
author={Microsoft Research Team},
journal={arXiv preprint},
year={2024}
}
Support
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sheetwise-2.8.0.tar.gz.
File metadata
- Download URL: sheetwise-2.8.0.tar.gz
- Upload date:
- Size: 267.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f8161f717cdd85c30af1a035925449a9b5af93188c9dea36ae3c2d144edbcaec
|
|
| MD5 |
9ebe811b0da6ffaa64049c070ae9dbed
|
|
| BLAKE2b-256 |
d6e2134959df9e0b7bc28e52c9d11bbc2aab69c8acb780041e963f0b1e6c5262
|
Provenance
The following attestation bundles were made for sheetwise-2.8.0.tar.gz:
Publisher:
release.yml on Khushiyant/sheetwise
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
sheetwise-2.8.0.tar.gz -
Subject digest:
f8161f717cdd85c30af1a035925449a9b5af93188c9dea36ae3c2d144edbcaec - Sigstore transparency entry: 790574845
- Sigstore integration time:
-
Permalink:
Khushiyant/sheetwise@bb23767868142f5cdc9350b3678af29e9456bf72 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Khushiyant
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@bb23767868142f5cdc9350b3678af29e9456bf72 -
Trigger Event:
push
-
Statement type:
File details
Details for the file sheetwise-2.8.0-py3-none-any.whl.
File metadata
- Download URL: sheetwise-2.8.0-py3-none-any.whl
- Upload date:
- Size: 47.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
28090a54be221852c5acc2dff9cd0e52bb4d8fff03fe83f4175b958506c4db94
|
|
| MD5 |
4f8b807e850c60afa972d3ee617dcbb5
|
|
| BLAKE2b-256 |
8a4958f312020fbdf76987f1975c70560f44199f9fd12c201b053fd5e1a5ba16
|
Provenance
The following attestation bundles were made for sheetwise-2.8.0-py3-none-any.whl:
Publisher:
release.yml on Khushiyant/sheetwise
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
sheetwise-2.8.0-py3-none-any.whl -
Subject digest:
28090a54be221852c5acc2dff9cd0e52bb4d8fff03fe83f4175b958506c4db94 - Sigstore transparency entry: 790574855
- Sigstore integration time:
-
Permalink:
Khushiyant/sheetwise@bb23767868142f5cdc9350b3678af29e9456bf72 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/Khushiyant
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@bb23767868142f5cdc9350b3678af29e9456bf72 -
Trigger Event:
push
-
Statement type: