Skip to main content

A fast Rust-powered tool for extracting data from Excel forms into JSON

Project description

sheetio -- Fast Rust-Powered Excel Form Data Extraction

PyPI version CI Python 3.10+ License: MIT

Extract structured data from Excel forms (.xlsx, .xlsm) into JSON. Built in Rust with Python bindings via PyO3 for fast parallel processing of hundreds of files. Designed for standardized Excel forms that don't fit the typical CSV format -- government reports, engineering forms, financial templates, and survey spreadsheets.

Why sheetio?

  • Fast -- Rust core with async parallel file processing. Process hundreds of Excel forms in seconds.
  • Flexible -- Three extraction modes (single cells, row patterns, dataframes) handle any form layout.
  • Simple -- One function, JSON config, JSON output. No complex API to learn.
  • Robust -- Handles missing data, duplicate keys, wildcard sheet matching, and composite identifiers.

Quick Start

pip install sheetio
import sheetio
import json

files = ["form_001.xlsx", "form_002.xlsx", "form_003.xlsx"]

config = [
    {
        "sheets": ["Sheet1"],
        "extractions": [
            {
                "function": "single_cells",
                "label": "header",
                "instructions": {
                    "title": "b2",
                    "date": "d4",
                    "author": "b6"
                }
            },
            {
                "function": "multirow_patterns",
                "label": "items",
                "instructions": {
                    "row_range": [10, 100],
                    "unique_id": "A",
                    "stop_if_empty": "A",
                    "columns": {
                        "ID": "A",
                        "Description": "B",
                        "Value": "C"
                    }
                }
            }
        ]
    }
]

result = json.loads(sheetio.excel_extract(files, config, 5))

Config Builder

Use ExtractionConfig to build configs iteratively with method chaining, instead of writing raw dicts:

from sheetio import ExtractionConfig

config = ExtractionConfig()

# Add extractions with a fluent API
config.add_sheets(["Sheet1"]) \
    .single_cells("header", title="B2", date="D4", author="B6") \
    .multirow("items",
        row_range=(10, 100),
        unique_id="A",
        stop_if_empty="A",
        columns={"ID": "A", "Description": "B", "Value": "C"})

# Extract directly — returns parsed Python dicts (no json.loads needed)
result = config.extract(["form_001.xlsx", "form_002.xlsx"], workers=5)

# Save / load configs as JSON files
config.to_json("my_config.json")
config = ExtractionConfig.from_json("my_config.json")

# Inspect what you've built
config.summary()

# Or get the raw config list for manual use
raw = config.build()

All three extraction types are supported: .single_cells(), .multirow(), .dataframe(). See the Extraction Types Reference below for all available options.

Key Features

Feature Description
Parallel processing Process multiple files simultaneously with configurable worker count
Wildcard sheets Match sheets by pattern: "School_*" matches School_A, School_B, etc.
Composite keys Combine multiple columns as unique identifiers: ["Project", "Year"]
Gap tolerance Continue extraction through empty rows with stop_consecutive
Duplicate handling Automatic _1, _2 suffixes for duplicate keys
Multi-column merge Extract arrays from multiple columns per field: ["X", "Y", "Z"]
Multi-row headers Concatenate header rows for dataframe extraction

Requirements

  • Python 3.10 or higher
  • Supported platforms: Windows, macOS, Linux

Extraction Types Reference

Configuration Structure

The extraction_details parameter is a list of dictionaries that define the extraction rules for each Excel sheet. Each dictionary contains:

  • sheets: A list of sheet names to extract data from. Accepts patterns with *. Example: "School_*" will loop through sheets like School_A, School_B, etc.
  • skip_sheets: An optional list of sheet names to skip. Can be useful when using patterns in the list of sheets.
  • extractions: A list of extraction rules that will be applied to the sheets listed.

Each extraction rule contains:

  • function: Type of extraction function: single_cells, multirow_patterns, or dataframe.
  • label: Optional key string to store results under. If not specified the extracted key value pairs will be stored directly under the sheet name.
  • break_if_null: An optional check to skip sheet if specified cell is null.
  • instructions: Instructions for the extraction function. See details for each function type below.

Single Cells Extraction

Extracts individual cells from the Excel sheet.

Instructions:

  • instructions: A dictionary where the keys are the reference name and the values are the cell references (e.g., "a1", "b2").

Example:

{
    "sheets": ["Sheet1"],
    "extractions": [
        {
            "function": "single_cells",
            "label": "single",
            "break_if_null": "c3",
            "instructions": {
                "Value 1": "a1",
                "Value 2": "b2",
                "Value 3": "c3",
                "Date": "d4",
                "Datetime": "e5"
            }
        }
    ]
}

Multirow Patterns Extraction

Extracts data from multiple rows based on a pattern.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • unique_id (optional): The column(s) to use as a unique identifier. Can be either:
    • A single column as a string: "B"
    • Multiple columns as an array: ["B", "C"] for composite keys
    • When using composite keys, if ANY column contains null/empty values, the row is skipped
    • If omitted: Results are returned as an array/list instead of a dictionary
  • unique_id_separator (optional): The separator to use when joining multiple columns for composite keys. Defaults to "_".
  • columns: A dictionary where the keys are the column names and the values are the column letters (e.g., "B", "C").
  • stop_if_empty (optional): Controls when to stop processing rows. Can be:
    • A column string: "A" - Stop when this column is empty
    • An array of columns: ["A", "B"] - Stop when ALL specified columns are empty
    • The string "row" - Stop when the entire data row is empty
    • An object with detailed configuration:
      {"column": "A", "consecutive": 2}
      
      or
      {"mode": "row", "consecutive": 1}
      
  • stop_consecutive (optional): Used with simple stop_if_empty syntax to specify how many consecutive empty rows trigger a stop. Defaults to 1.

Example with single unique_id:

{
    "sheets": ["Sheet 1", "Sheet 2"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "deposits",
            "instructions": {
                "row_range": [1, 10],
                "unique_id": "B",
                "columns": {
                    "Title": "B",
                    "Description": "C",
                    "Estimate": "D",
                    "Chance": "E",
                }
            }
        }
    ]
}

Example with composite unique_id:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "projects",
            "instructions": {
                "row_range": [1, 50],
                "unique_id": ["B", "C"],
                "unique_id_separator": "-",
                "columns": {
                    "Project": "B",
                    "Year": "C",
                    "Budget": "D",
                    "Status": "E"
                }
            }
        }
    ]
}

Example without unique_id (returns array/list):

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "items",
            "instructions": {
                "row_range": [1, 1000],
                "stop_if_empty": "A",
                "columns": {
                    "Name": "A",
                    "Value": "B",
                    "Description": "C"
                }
            }
        }
    ]
}
# Returns: {"items": [{"Name": "...", "Value": ...}, {"Name": "...", "Value": ...}]}

Example with stop_if_empty and gap tolerance:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "data",
            "instructions": {
                "row_range": [1, 100],
                "stop_if_empty": "A",
                "stop_consecutive": 3,
                "columns": {
                    "ID": "A",
                    "Value": "B"
                }
            }
        }
    ]
}

Example with row-based empty detection:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "records",
            "instructions": {
                "row_range": [1, 500],
                "stop_if_empty": {
                    "mode": "row",
                    "consecutive": 2
                },
                "columns": {
                    "Field1": "A",
                    "Field2": "B",
                    "Field3": "C"
                }
            }
        }
    ]
}

Example with multiple column monitoring:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "transactions",
            "instructions": {
                "row_range": [1, 1000],
                "unique_id": "A",
                "stop_if_empty": ["A", "B"],
                "columns": {
                    "ID": "A",
                    "Date": "B",
                    "Amount": "C"
                }
            }
        }
    ]
}

Dataframe Extraction

Extracts tabular data with headers, returning JSON that can easily be converted to a Pandas DataFrame.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • column_range: A list of column letters to extract.
  • header_row: A list of row numbers to use as the header.
  • separator: Optional separator to use when combining header cells (default " ").

Example:

{
    "sheets": ["School_*"],
    "extractions": [
        {
            "function": "dataframe",
            "label": "DataFrame",
            "instructions": {
                "row_range": [5, 15],
                "column_range": ["B", "F"],
                "header_row": [2, 3, 4],
                "separator": " ",
            }
        }
    ]
}

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

# Development setup
pip install maturin pytest openpyxl ruff
maturin develop --release
make lint   # check formatting
make test   # run tests

License

sheetio is released under the MIT License. See the LICENSE file for more details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sheetio-0.3.7-cp313-cp313-win_amd64.whl (833.9 kB view details)

Uploaded CPython 3.13Windows x86-64

sheetio-0.3.7-cp313-cp313-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

sheetio-0.3.7-cp313-cp313-macosx_10_12_x86_64.whl (955.9 kB view details)

Uploaded CPython 3.13macOS 10.12+ x86-64

sheetio-0.3.7-cp312-cp312-win_amd64.whl (833.9 kB view details)

Uploaded CPython 3.12Windows x86-64

sheetio-0.3.7-cp312-cp312-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

sheetio-0.3.7-cp312-cp312-macosx_10_12_x86_64.whl (955.8 kB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

sheetio-0.3.7-cp311-cp311-win_amd64.whl (833.6 kB view details)

Uploaded CPython 3.11Windows x86-64

sheetio-0.3.7-cp311-cp311-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

sheetio-0.3.7-cp311-cp311-macosx_10_12_x86_64.whl (957.1 kB view details)

Uploaded CPython 3.11macOS 10.12+ x86-64

sheetio-0.3.7-cp310-cp310-win_amd64.whl (833.6 kB view details)

Uploaded CPython 3.10Windows x86-64

sheetio-0.3.7-cp310-cp310-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

sheetio-0.3.7-cp310-cp310-macosx_10_12_x86_64.whl (957.4 kB view details)

Uploaded CPython 3.10macOS 10.12+ x86-64

File details

Details for the file sheetio-0.3.7-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.7-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 833.9 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.7-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 aa79dd657e53a2fa55dd852d96571908b74743ad04d0e152b04c65db2503319b
MD5 8bd3bfb0fef1ade7010db7768acf7a5d
BLAKE2b-256 96d6e37a53ec72e220ba67cdbdd91090ca87a6534b8289999ee1b286753a79e5

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp313-cp313-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 8dcf90fde9b2c4199756f4f3fe9ae2d28fd74f350ad2a958be835aa21bbd1636
MD5 a845d2f4e248ca9c3a5b2a667602f253
BLAKE2b-256 649fef0e871bc4edc7ba35a9abd70724c33636642fac58b75b26e19171e74700

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp313-cp313-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp313-cp313-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 fb3a8ed10ed5e55a5bafe22e33501a87ebae054d0b66014bfd9d5a0d0f6fb949
MD5 bbe87f11c9bd7dc47e384e5a4b77e097
BLAKE2b-256 08c353890679aaace1a4ae3a72958185d8d88d6aae8b24476002ac367aae8922

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp313-cp313-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.7-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 833.9 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.7-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 af806add0927962ad9e5be45f5bd33996115ff0753b9b67e7ac1a20e3dd6667d
MD5 064a01996180c2298f13f88466c8b7d5
BLAKE2b-256 c512cfca61074041b568fc026ca37254f105304093e744a14694b0c6411ba462

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp312-cp312-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 7a13a7a48a3bc9ba50d47a25b521cdb7c8ce7af88ffb94334f5fc8f996a8f293
MD5 083ae75210a62cb670f673d1741c53b1
BLAKE2b-256 f078320e358885442a30782ccfcbf295d0c5c5d3f73c72bfbc030df23aed2201

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 8d35717fb505df3ce7cc396da1845ab8b093d5311b0b80a4dda391ceeb74e49c
MD5 fa2ab0042e4b6a80bba2005f5f0523f3
BLAKE2b-256 f48fe99a2bd01e652b978245d5e35cf0c4490bbf5d62c320b411c38e967673ea

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.7-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 833.6 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.7-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 f4148dfc832bb34f6235e40fab98b13e8b5b1a7e2f47d530ad01aae9df42267a
MD5 cc231fa8b3d297dd484a0479d433a8f1
BLAKE2b-256 04784e0aba64acfaaa99d3b3ccb1fa28110069dacacd51ceb9e24bd7fa6625c8

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp311-cp311-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 89e7f7a30b5c869e17843748482d68640d5af6f743f0d5341eb2ed306976aeb9
MD5 4d665cd9b20ca4a2992831e563b016af
BLAKE2b-256 283ce6def20b1f606c976272930104bb0b4990358f7107588daa0da7745a8c82

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp311-cp311-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6805f8657de514cf4838fd655ec7bff0363e9e92c3f614a188590e256f22e2f4
MD5 a1f849b34be0ac57d1de3c753c120d53
BLAKE2b-256 2ee924eadeb6420d656598490fc634c06fec7a001d512eee5cb8d0eb77b9b356

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp311-cp311-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.7-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 833.6 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.7-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 7d3c3f4424302b7a276d0d091749943d574ebd1778640fb52ca392bda2b1bffc
MD5 0ad7a4900717e8c7685151427727331c
BLAKE2b-256 9ae63dc70dbd94096598bfe28971491c7fd8c27d239f1d6cbc658bf11ce48038

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp310-cp310-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 d92a0ae7884cacdf0ff37ed3f6aaf52ff2a08c23a13ac0a7236d9c4365ce5598
MD5 4f9c94f11eeffaa971933d027c24041d
BLAKE2b-256 419462fa7b43fabd54eb402a85626863ebcce5e8b06f9e57f4c4f7e3164f27bf

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp310-cp310-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.7-cp310-cp310-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.7-cp310-cp310-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 46df48ceb94cc0274bee66a0ca5c04946ff8ab6e712bd66e3af193ecfda5e12a
MD5 087a4472a00c67e2251cf90c8e8d414c
BLAKE2b-256 280c440a92b4966094f65a653835737ee08b5ce331e138dae30cc1c078c8c3de

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.7-cp310-cp310-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page