Simplified intelligent spreadsheet ingestion framework with automatic table detection


GridGulp


Automatically detect and extract tables from Excel, CSV, and text files.

What is GridGulp?

GridGulp finds tables in your spreadsheets, even when

  • there are multiple tables on one sheet
  • tables don't start at cell A1
  • the file extension doesn't match the actual file type
  • the file encoding is unknown

Supported formats: .xlsx, .xls, .xlsm, .csv, .tsv, .txt

Installation

pip install gridgulp

Quick Start

Trying GridGulp Out

To quickly try GridGulp on some spreadsheets, clone the repo, place example spreadsheets in the examples/ directory, and run

python scripts/test_example_files.py

The script prints the identified table ranges in output like this:

📁 tests/manual

✓ sample.xlsx | Tables: 1 | Time: 1.099s | Size: 122.6KB | Method: magika
  📄 Sheet: Sheet
     └─ A1:CV203 | 203×100 | Conf: 70%

Table Ranges vs DataFrames

GridGulp provides two ways to work with detected tables:

  1. Table Ranges - JSON metadata about where tables are located (e.g., "A1:E100")

    • Fast and memory-efficient
    • Well suited as agent tools - mapping table locations or visualizing spreadsheet structure
    • No actual data is loaded into memory
  2. DataFrames - The actual data extracted from those ranges as pandas DataFrames

    • Contains the full data with proper types
    • Ready for analysis, transformation, or export

Choose based on your needs:

  • Use ranges when you only need to know where tables are - for example, to hand locations to a downstream process that infers a table's purpose from its contents
  • Use DataFrames when you need to analyze or transform the actual data
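
The range strings are plain A1-style references, so a table's shape can be recovered without loading any data. The helper below is an illustrative sketch, not part of GridGulp (which already exposes `table.shape`); it just shows what the range metadata encodes:

```python
import re

def range_shape(excel_range: str) -> tuple[int, int]:
    """Return (rows, cols) for an A1-style range like 'A1:CV203'."""
    def col_num(letters: str) -> int:
        # Column letters are base-26 digits: A=1, ..., Z=26, AA=27, ...
        n = 0
        for ch in letters:
            n = n * 26 + (ord(ch.upper()) - ord("A") + 1)
        return n

    m = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", excel_range)
    if not m:
        raise ValueError(f"not an A1-style range: {excel_range!r}")
    c1, r1, c2, r2 = m.groups()
    rows = int(r2) - int(r1) + 1
    cols = col_num(c2) - col_num(c1) + 1
    return rows, cols

print(range_shape("A1:CV203"))  # → (203, 100)
```

The result matches the `203×100` shown in the sample detection output above.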

Getting Table Ranges Only

from gridgulp import GridGulp

# Detect tables in a file (lightweight - just finds locations)
# Note: detect_tables is a coroutine; top-level await works in Jupyter
# or an async function, but not in a plain script
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Process results - no data loaded yet, just locations
for sheet in result.sheets:
    print(f"{sheet.name}: {len(sheet.tables)} tables found")
    for table in sheet.tables:
        print(f"  - Table at {table.range.excel_range}")
        print(f"    Size: {table.shape[0]} rows × {table.shape[1]} columns")
        print(f"    Confidence: {table.confidence:.1%}")
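
In a plain script, the coroutine has to be driven by an event loop rather than awaited at the top level. A minimal sketch of the pattern, with a stub coroutine standing in for `porter.detect_tables` (the stub and its return value are illustrative, not GridGulp's API):

```python
import asyncio

# Stub coroutine standing in for porter.detect_tables (illustrative only)
async def detect_tables(path: str) -> str:
    return f"tables detected in {path}"

async def main() -> None:
    # await is legal here because main() is a coroutine
    result = await detect_tables("sales_report.xlsx")
    print(result)  # prints "tables detected in sales_report.xlsx"

# asyncio.run() creates the event loop and runs main() to completion
asyncio.run(main())
```

The same pattern applies to any of the `await` examples in this README when run outside a notebook.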

Getting DataFrames

To extract the actual data as pandas DataFrames:

from gridgulp import GridGulp
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Step 1: Detect table locations
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Step 2: Read the file data
reader = get_reader("sales_report.xlsx")
file_data = reader.read_sync()

# Step 3: Extract DataFrames from detected ranges
extractor = DataFrameExtractor()
for sheet_result in result.sheets:
    sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

    for table in sheet_result.tables:
        df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
        if df is not None:
            print(f"\n📊 Extracted table from {table.range.excel_range}")
            print(f"   Shape: {df.shape} | Quality: {quality:.1%}")
            print(f"   Headers: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
            # Now you have a pandas DataFrame to work with
            print(df.head())

Processing Multiple Files

GridGulp makes it easy to process entire directories of spreadsheets:

from gridgulp import GridGulp

# Create GridGulp instance
gg = GridGulp()

# Process all spreadsheets in a directory (recursively by default)
results = await gg.detect_tables_in_directory("~/data")

# Summary statistics
total_tables = sum(r.total_tables for r in results.values())
print(f"Found {total_tables} tables across {len(results)} files")

# Process each file's results
for file_path, result in results.items():
    if result.total_tables > 0:
        print(f"\n{file_path.name}:")
        for sheet in result.sheets:
            for table in sheet.tables:
                print(f"  - {sheet.name}: {table.range.excel_range}")
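
Since each table carries a confidence score, directory results can be filtered before further processing. A sketch of such a filter, using the attribute names from the examples above (`sheets`, `tables`, `range.excel_range`, `confidence`); the helper itself and its threshold are illustrative, not part of GridGulp:

```python
from types import SimpleNamespace as NS

MIN_CONFIDENCE = 0.8  # illustrative cutoff

def high_confidence_tables(result, min_conf=MIN_CONFIDENCE):
    """Flatten a detection result into (sheet name, range) pairs above a cutoff."""
    return [
        (sheet.name, table.range.excel_range)
        for sheet in result.sheets
        for table in sheet.tables
        if table.confidence >= min_conf
    ]

# Demo with stand-in objects shaped like a detection result
fake = NS(sheets=[NS(name="Sheet1", tables=[
    NS(range=NS(excel_range="A1:E100"), confidence=0.95),
    NS(range=NS(excel_range="G1:H3"), confidence=0.40),
])])
print(high_confidence_tables(fake))  # → [('Sheet1', 'A1:E100')]
```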

Directory Processing Options

# Process only Excel files
results = await gg.detect_tables_in_directory(
    "~/reports",
    patterns=["*.xlsx", "*.xls"]
)

# Non-recursive (single directory only)
results = await gg.detect_tables_in_directory(
    "~/data",
    recursive=False
)

# With progress tracking for large directories
def show_progress(current, total):
    print(f"Processing file {current} of {total}...")

results = await gg.detect_tables_in_directory(
    "~/large_dataset",
    progress_callback=show_progress
)

# Sync version for Jupyter notebooks
results = gg.detect_tables_in_directory_sync("~/data")

Extracting DataFrames from Multiple Files

from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Process directory and extract all tables as DataFrames
gg = GridGulp()
extractor = DataFrameExtractor()

results = gg.detect_tables_in_directory_sync("~/sales_data")

all_dataframes = []
for file_path, detection_result in results.items():
    if detection_result.total_tables > 0:
        # Read the file
        reader = get_reader(file_path)
        file_data = reader.read_sync()

        # Extract each table
        for sheet_result in detection_result.sheets:
            sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

            for table in sheet_result.tables:
                df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
                if df is not None:
                    # Add source info to the dataframe
                    df['_source_file'] = file_path.name
                    df['_source_sheet'] = sheet_result.name
                    all_dataframes.append(df)

print(f"Extracted {len(all_dataframes)} tables from {len(results)} files")

# Combine all tables if they have the same structure
import pandas as pd
if all_dataframes:
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    print(f"Combined dataset: {combined_df.shape}")
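
Note that `pd.concat` aligns on column names: tables with different headers still combine, but columns missing from a table are filled with NaN. A small self-contained illustration (the `df_a`/`df_b` frames are made-up stand-ins for extracted tables):

```python
import pandas as pd

df_a = pd.DataFrame({"region": ["N", "S"], "sales": [10, 20]})
df_b = pd.DataFrame({"region": ["E"], "sales": [5], "returns": [1]})

# Columns are aligned by name; df_a rows get NaN in the 'returns' column
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.shape)  # → (3, 3)
```

If the tables are truly heterogeneous, group them by header signature first and concatenate each group separately.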

Key Features

  • Automatic Detection - Finds all tables with sensible defaults
  • Fully Configurable - Customize detection thresholds and behavior
  • Smart Headers - Detects single and multi-row headers automatically
  • Multiple Tables - Handles sheets with multiple separate tables
  • Quality Scoring - Confidence scores for each detected table
  • Fast - Processes 1M+ cells/second for simple tables, 100K+ cells/second for complex tables

License

MIT License - see LICENSE file.
