Simplified intelligent spreadsheet ingestion framework with automatic table detection


GridGulp


Automatically detect and extract tables from Excel, CSV, and text files.

What is GridGulp?

GridGulp finds tables in your spreadsheets, even when

  • there are multiple tables on one sheet
  • tables don't start at cell A1
  • the file extension doesn't match the actual file type
  • the file encoding is unknown

Supported formats: .xlsx, .xls, .xlsm, .csv, .tsv, .txt

Installation

pip install gridgulp

Quick Start

Trying GridGulp Out

To quickly try GridGulp on some spreadsheets, clone the repo, place example spreadsheets in the examples/ directory, and run

python scripts/test_example_files.py

The script prints the identified table ranges in output like this:

📁 tests/manual

✓ sample.xlsx | Tables: 1 | Time: 1.099s | Size: 122.6KB | Method: magika
  📄 Sheet: Sheet
     └─ A1:CV203 | 203×100 | Conf: 70%

Table Ranges vs DataFrames

GridGulp provides two ways to work with detected tables:

  1. Table Ranges - JSON metadata about where tables are located (e.g., "A1:E100")

    • Fast and memory-efficient
    • Well suited as agent tools - mapping table locations or visualizing spreadsheet structure
    • No actual data is loaded into memory
  2. DataFrames - The actual data extracted from those ranges as pandas DataFrames

    • Contains the full data with proper types
    • Ready for analysis, transformation, or export

Choose based on your needs:

  • Use ranges when you only need to know where tables are - for example, to hand locations to a downstream process that infers a table's purpose from its contents
  • Use DataFrames when you need to analyze or transform the actual data
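
The range strings are plain A1-style references, so a table's shape can be recovered without loading any data. The helper below is an illustrative sketch, not part of GridGulp (which already exposes `table.shape`); it just shows what the range metadata encodes:

```python
import re

def range_shape(excel_range: str) -> tuple[int, int]:
    """Return (rows, cols) for an A1-style range like 'A1:CV203'."""
    def col_num(letters: str) -> int:
        # Column letters are base-26 digits: A=1, ..., Z=26, AA=27, ...
        n = 0
        for ch in letters:
            n = n * 26 + (ord(ch.upper()) - ord("A") + 1)
        return n

    m = re.fullmatch(r"([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)", excel_range)
    if not m:
        raise ValueError(f"not an A1-style range: {excel_range!r}")
    c1, r1, c2, r2 = m.groups()
    rows = int(r2) - int(r1) + 1
    cols = col_num(c2) - col_num(c1) + 1
    return rows, cols

print(range_shape("A1:CV203"))  # → (203, 100)
```

The result matches the `203×100` shown in the sample detection output above.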

Getting Table Ranges Only

from gridgulp import GridGulp

# Detect tables in a file (lightweight - just finds locations)
# Note: detect_tables is a coroutine; top-level await works in Jupyter
# or an async function, but not in a plain script
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Process results - no data loaded yet, just locations
for sheet in result.sheets:
    print(f"{sheet.name}: {len(sheet.tables)} tables found")
    for table in sheet.tables:
        print(f"  - Table at {table.range.excel_range}")
        print(f"    Size: {table.shape[0]} rows × {table.shape[1]} columns")
        print(f"    Confidence: {table.confidence:.1%}")
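
In a plain script, the coroutine has to be driven by an event loop rather than awaited at the top level. A minimal sketch of the pattern, with a stub coroutine standing in for `porter.detect_tables` (the stub and its return value are illustrative, not GridGulp's API):

```python
import asyncio

# Stub coroutine standing in for porter.detect_tables (illustrative only)
async def detect_tables(path: str) -> str:
    return f"tables detected in {path}"

async def main() -> None:
    # await is legal here because main() is a coroutine
    result = await detect_tables("sales_report.xlsx")
    print(result)  # prints "tables detected in sales_report.xlsx"

# asyncio.run() creates the event loop and runs main() to completion
asyncio.run(main())
```

The same pattern applies to any of the `await` examples in this README when run outside a notebook.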

Getting DataFrames

To extract the actual data as pandas DataFrames:

from gridgulp import GridGulp
from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Step 1: Detect table locations
porter = GridGulp()
result = await porter.detect_tables("sales_report.xlsx")

# Step 2: Read the file data
reader = get_reader("sales_report.xlsx")
file_data = reader.read_sync()

# Step 3: Extract DataFrames from detected ranges
extractor = DataFrameExtractor()
for sheet_result in result.sheets:
    sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

    for table in sheet_result.tables:
        df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
        if df is not None:
            print(f"\n📊 Extracted table from {table.range.excel_range}")
            print(f"   Shape: {df.shape} | Quality: {quality:.1%}")
            print(f"   Headers: {', '.join(df.columns[:5])}{'...' if len(df.columns) > 5 else ''}")
            # Now you have a pandas DataFrame to work with
            print(df.head())

Processing Multiple Files

GridGulp makes it easy to process entire directories of spreadsheets:

from gridgulp import GridGulp

# Create GridGulp instance
gg = GridGulp()

# Process all spreadsheets in a directory (recursively by default)
results = await gg.detect_tables_in_directory("~/data")

# Summary statistics
total_tables = sum(r.total_tables for r in results.values())
print(f"Found {total_tables} tables across {len(results)} files")

# Process each file's results
for file_path, result in results.items():
    if result.total_tables > 0:
        print(f"\n{file_path.name}:")
        for sheet in result.sheets:
            for table in sheet.tables:
                print(f"  - {sheet.name}: {table.range.excel_range}")
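
Since each table carries a confidence score, directory results can be filtered before further processing. A sketch of such a filter, using the attribute names from the examples above (`sheets`, `tables`, `range.excel_range`, `confidence`); the helper itself and its threshold are illustrative, not part of GridGulp:

```python
from types import SimpleNamespace as NS

MIN_CONFIDENCE = 0.8  # illustrative cutoff

def high_confidence_tables(result, min_conf=MIN_CONFIDENCE):
    """Flatten a detection result into (sheet name, range) pairs above a cutoff."""
    return [
        (sheet.name, table.range.excel_range)
        for sheet in result.sheets
        for table in sheet.tables
        if table.confidence >= min_conf
    ]

# Demo with stand-in objects shaped like a detection result
fake = NS(sheets=[NS(name="Sheet1", tables=[
    NS(range=NS(excel_range="A1:E100"), confidence=0.95),
    NS(range=NS(excel_range="G1:H3"), confidence=0.40),
])])
print(high_confidence_tables(fake))  # → [('Sheet1', 'A1:E100')]
```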

Directory Processing Options

# Process only Excel files
results = await gg.detect_tables_in_directory(
    "~/reports",
    patterns=["*.xlsx", "*.xls"]
)

# Non-recursive (single directory only)
results = await gg.detect_tables_in_directory(
    "~/data",
    recursive=False
)

# With progress tracking for large directories
def show_progress(current, total):
    print(f"Processing file {current} of {total}...")

results = await gg.detect_tables_in_directory(
    "~/large_dataset",
    progress_callback=show_progress
)

# Sync version for Jupyter notebooks
results = gg.detect_tables_in_directory_sync("~/data")

Extracting DataFrames from Multiple Files

from gridgulp.extractors import DataFrameExtractor
from gridgulp.readers import get_reader

# Process directory and extract all tables as DataFrames
gg = GridGulp()
extractor = DataFrameExtractor()

results = gg.detect_tables_in_directory_sync("~/sales_data")

all_dataframes = []
for file_path, detection_result in results.items():
    if detection_result.total_tables > 0:
        # Read the file
        reader = get_reader(file_path)
        file_data = reader.read_sync()

        # Extract each table
        for sheet_result in detection_result.sheets:
            sheet_data = next(s for s in file_data.sheets if s.name == sheet_result.name)

            for table in sheet_result.tables:
                df, metadata, quality = extractor.extract_dataframe(sheet_data, table.range)
                if df is not None:
                    # Add source info to the dataframe
                    df['_source_file'] = file_path.name
                    df['_source_sheet'] = sheet_result.name
                    all_dataframes.append(df)

print(f"Extracted {len(all_dataframes)} tables from {len(results)} files")

# Combine all tables if they have the same structure
import pandas as pd
if all_dataframes:
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    print(f"Combined dataset: {combined_df.shape}")
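
Note that `pd.concat` aligns on column names: tables with different headers still combine, but columns missing from a table are filled with NaN. A small self-contained illustration (the `df_a`/`df_b` frames are made-up stand-ins for extracted tables):

```python
import pandas as pd

df_a = pd.DataFrame({"region": ["N", "S"], "sales": [10, 20]})
df_b = pd.DataFrame({"region": ["E"], "sales": [5], "returns": [1]})

# Columns are aligned by name; df_a rows get NaN in the 'returns' column
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.shape)  # → (3, 3)
```

If the tables are truly heterogeneous, group them by header signature first and concatenate each group separately.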

Key Features

  • Automatic Detection - Finds all tables with sensible defaults
  • Fully Configurable - Customize detection thresholds and behavior
  • Smart Headers - Detects single and multi-row headers automatically
  • Multiple Tables - Handles sheets with multiple separate tables
  • Quality Scoring - Confidence scores for each detected table
  • Fast - Processes 1M+ cells/second for simple tables, 100K+ cells/second for complex tables

License

MIT License - see LICENSE file.
