# GridGulp

Simplified intelligent spreadsheet ingestion framework with automatic table detection.

Automatically detect and extract tables from Excel, CSV, and text files.

## What is GridGulp?
GridGulp finds tables in your spreadsheets, even when:

- there are multiple tables on one sheet
- tables don't start at cell A1
- the file extension doesn't match the actual file type
- the file encoding is unknown or ambiguous

Supported formats: `.xlsx`, `.xls`, `.xlsm`, `.csv`, `.tsv`, `.txt`
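Because GridGulp doesn't trust the extension, it identifies formats from file content (the `Method: magika` in the sample output below refers to content-based detection). Here is a minimal stdlib sketch of that idea using magic bytes, for illustration only; it is not GridGulp's actual implementation:

```python
def sniff_format(data: bytes) -> str:
    """Guess a spreadsheet format from leading bytes, ignoring the extension."""
    if data.startswith(b"PK\x03\x04"):
        return "xlsx"  # .xlsx/.xlsm files are ZIP containers
    if data.startswith(b"\xd0\xcf\x11\xe0"):
        return "xls"   # legacy .xls is an OLE2 compound document
    return "text"      # fall back to CSV/TSV/TXT handling

print(sniff_format(b"PK\x03\x04rest-of-zip"))  # xlsx
```

Real detectors like Magika go far beyond magic bytes, but the principle is the same: classify by content, then pick the right reader.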
## Installation

```bash
pip install gridgulp
```
## Quick Start

### Trying GridGulp Out

To quickly try GridGulp on some spreadsheets, clone the repo, place example spreadsheets in the `examples/` directory, and run:

```bash
python scripts/test_example_files.py
```

The output lists the detected table ranges, for example:
```text
📁 tests/manual
  ✓ sample.xlsx | Tables: 1 | Time: 1.099s | Size: 122.6KB | Method: magika
    📄 Sheet: Sheet
      └─ A1:CV203 | 203×100 | Conf: 70%
```
## Table Ranges vs DataFrames

GridGulp provides two ways to work with detected tables:

1. **Table ranges** - JSON metadata describing where tables are located (e.g., `"A1:E100"`)
   - Fast and memory-efficient
   - Well suited for agent tools - mapping table locations or visualizing spreadsheet structure
   - No actual data is loaded into memory
2. **DataFrames** - the actual data extracted from those ranges as pandas DataFrames
   - Contains the full data with proper types
   - Ready for analysis, transformation, or export

Choose based on your needs:

- Use ranges when you only need to know where tables are and want to pass those locations on to other tasks, for example a downstream process that infers a table's purpose from its contents
- Use DataFrames when you need to analyze or transform the actual data
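For the range-only path, the metadata an agent consumes might be serialized like this. The field names below are illustrative, not GridGulp's actual schema; the real objects come from `result.sheets[*].tables`:

```python
import json

# Hand-built metadata standing in for detected table ranges.
tables = [
    {"sheet": "Sheet1", "range": "A1:E100", "rows": 100, "cols": 5, "confidence": 0.92},
    {"sheet": "Sheet1", "range": "G1:J20", "rows": 20, "cols": 4, "confidence": 0.85},
]
payload = json.dumps({"file": "sales_report.xlsx", "tables": tables}, indent=2)
print(payload)
```

A payload like this is cheap to produce and ship between tools, since no cell data is ever loaded.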
### Getting Table Ranges Only

```python
from gridgulp import GridGulp

# Detect tables in a file (lightweight - just finds locations)
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Process results - no data loaded yet, just locations
for sheet in result.sheets:
    print(f"{sheet.name}: {len(sheet.tables)} tables found")
    for table in sheet.tables:
        print(f"  - Table at {table.range.excel_range}")
        print(f"    Size: {table.shape[0]} rows × {table.shape[1]} columns")
        print(f"    Confidence: {table.confidence:.1%}")
```
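Note that `detect_tables` is used with `await`, so it is a coroutine. In a plain script (outside Jupyter or another running event loop) you would wrap it with `asyncio.run`; a sketch of that pattern with a stand-in coroutine, so it runs anywhere:

```python
import asyncio

async def detect(path: str) -> int:
    # Stand-in for `await GridGulp().detect_tables(path)`;
    # returns a dummy table count just to make the pattern runnable.
    return 3

async def main() -> None:
    count = await detect("sales_report.xlsx")
    print(f"{count} tables found")

asyncio.run(main())
```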
### Getting DataFrames

To extract the actual data as pandas DataFrames:

```python
from gridgulp import GridGulp
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Step 1: Detect table locations
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Step 2: Read the file data
reader = get_reader("sales_report.xlsx")
file_data = reader.read_sync()

# Step 3: Extract DataFrames from detected ranges
extractor = DataFrameExtractor()
for sheet_result in result.sheets:
    sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)
    for table in sheet_result.tables:
        df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
        if df is not None:
            print(f"\n📊 Extracted table from {table.range.excel_range}")
            print(f"  Shape: {df.shape} | Quality: {quality:.1%}")
            print(f"  Headers: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
            # Now you have a pandas DataFrame to work with
            print(df.head())
```
## Processing Multiple Files

GridGulp makes it easy to process entire directories of spreadsheets:

```python
from gridgulp import GridGulp

# Create GridGulp instance
gg = GridGulp()

# Process all spreadsheets in a directory (recursive by default)
results = await gg.detect_tables_in_directory("~/data")

# Summary statistics
total_tables = sum(r.total_tables for r in results.values())
print(f"Found {total_tables} tables across {len(results)} files")

# Process each file's results
for file_path, result in results.items():
    if result.total_tables > 0:
        print(f"\n{file_path.name}:")
        for sheet in result.sheets:
            for table in sheet.tables:
                print(f"  - {sheet.name}: {table.range.excel_range}")
```
### Directory Processing Options

```python
# Process only Excel files
results = await gg.detect_tables_in_directory(
    "~/reports",
    patterns=["*.xlsx", "*.xls"]
)

# Non-recursive (single directory only)
results = await gg.detect_tables_in_directory(
    "~/data",
    recursive=False
)

# With progress tracking for large directories
def show_progress(current, total):
    print(f"Processing file {current} of {total}...")

results = await gg.detect_tables_in_directory(
    "~/large_dataset",
    progress_callback=show_progress
)

# Sync version for Jupyter notebooks
results = gg.detect_tables_in_directory_sync("~/data")
```
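The `patterns` argument above uses shell-style globs. For intuition, here is how such patterns select filenames, sketched with the stdlib `fnmatch` module (GridGulp's own matching may differ in details such as case sensitivity):

```python
from fnmatch import fnmatch

files = ["q1.xlsx", "q2.xls", "notes.txt", "raw.csv", "old.xlsx.bak"]
patterns = ["*.xlsx", "*.xls"]

# Keep a file if it matches any of the glob patterns.
matched = [f for f in files if any(fnmatch(f, p) for p in patterns)]
print(matched)  # ['q1.xlsx', 'q2.xls']
```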
### Extracting DataFrames from Multiple Files

```python
import pandas as pd

from gridgulp import GridGulp
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Process a directory and extract all tables as DataFrames
gg = GridGulp()
extractor = DataFrameExtractor()
results = gg.detect_tables_in_directory_sync("~/sales_data")

all_dataframes = []
for file_path, detection_result in results.items():
    if detection_result.total_tables > 0:
        # Read the file
        reader = get_reader(file_path)
        file_data = reader.read_sync()
        # Extract each table
        for sheet_result in detection_result.sheets:
            sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)
            for table in sheet_result.tables:
                df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
                if df is not None:
                    # Add source info to the dataframe
                    df["_source_file"] = file_path.name
                    df["_source_sheet"] = sheet_result.name
                    all_dataframes.append(df)

print(f"Extracted {len(all_dataframes)} tables from {len(results)} files")

# Combine all tables if they have the same structure
if all_dataframes:
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    print(f"Combined dataset: {combined_df.shape}")
```
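The `_source_file` / `_source_sheet` tagging keeps every row traceable after the tables are concatenated. The same provenance pattern, sketched with plain dicts so it runs without pandas:

```python
# Rows as dicts, standing in for per-table DataFrames.
table_a = [{"region": "EU", "sales": 100}]
table_b = [{"region": "US", "sales": 250}]

combined = []
for rows, src_file, src_sheet in [
    (table_a, "q1.xlsx", "Sheet1"),
    (table_b, "q2.xlsx", "Sheet1"),
]:
    for row in rows:
        # Stamp each row with its origin before merging.
        combined.append({**row, "_source_file": src_file, "_source_sheet": src_sheet})

print(len(combined), combined[0]["_source_file"])  # 2 q1.xlsx
```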
## Key Features

- **Automatic Detection** - finds all tables with sensible defaults
- **Fully Configurable** - customize detection thresholds and behavior
- **Smart Headers** - detects single- and multi-row headers automatically
- **Multiple Tables** - handles sheets with multiple separate tables
- **Quality Scoring** - confidence scores for each detected table
- **Fast** - processes 1M+ cells/second for simple tables, 100K+ cells/second for complex tables
## Documentation

- Full Usage Guide - detailed examples and configuration
- API Reference - complete API documentation
- Architecture - how GridGulp works internally
- Testing Guide - test spreadsheets in bulk with the unified test script
## License

MIT License - see the LICENSE file.
## File details

Details for the file `gridgulp-0.3.4.tar.gz`.

### File metadata

- Download URL: gridgulp-0.3.4.tar.gz
- Size: 341.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `aad7fda1744bc270f2a39de07dbaa81aef6c4ff70175423ad5abb5f0aa913ca3` |
| MD5 | `73d7dd0967df5ae7db78e49b64b2c4ba` |
| BLAKE2b-256 | `3f3c277de1053bed8462440b906490ac7354b3ebf5fb56fc5b30f338b8ade319` |
### Provenance

The following attestation bundles were made for `gridgulp-0.3.4.tar.gz`:

Publisher: release.yml on Ganymede-Bio/gridgulp

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gridgulp-0.3.4.tar.gz
- Subject digest: aad7fda1744bc270f2a39de07dbaa81aef6c4ff70175423ad5abb5f0aa913ca3
- Sigstore transparency entry: 329166916
- Permalink: Ganymede-Bio/gridgulp@07561b4e0dd3850fe0bba03acce9ba8ac4ef8cdf
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Ganymede-Bio
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@07561b4e0dd3850fe0bba03acce9ba8ac4ef8cdf
- Trigger Event: workflow_dispatch
## File details

Details for the file `gridgulp-0.3.4-py3-none-any.whl`.

### File metadata

- Download URL: gridgulp-0.3.4-py3-none-any.whl
- Size: 125.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `e5bd88c6fe399432d460014a4af41212ecf6c7f1743370c18ff3daf338b95a13` |
| MD5 | `c425c48ac9203b6225cf7465ed23a49d` |
| BLAKE2b-256 | `c476c2aebdc3d48f5f6325cd31947f77508a1203082473fe19e8ef4f5dd48731` |
### Provenance

The following attestation bundles were made for `gridgulp-0.3.4-py3-none-any.whl`:

Publisher: release.yml on Ganymede-Bio/gridgulp

Statement:

- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: gridgulp-0.3.4-py3-none-any.whl
- Subject digest: e5bd88c6fe399432d460014a4af41212ecf6c7f1743370c18ff3daf338b95a13
- Sigstore transparency entry: 329166935
- Permalink: Ganymede-Bio/gridgulp@07561b4e0dd3850fe0bba03acce9ba8ac4ef8cdf
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Ganymede-Bio
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@07561b4e0dd3850fe0bba03acce9ba8ac4ef8cdf
- Trigger Event: workflow_dispatch