Skip to main content

A fast Rust-powered tool for extracting data from Excel forms into JSON

Project description

sheetio -- Fast Rust-Powered Excel Form Data Extraction

PyPI version CI Python 3.10+ License: MIT

Extract structured data from Excel forms (.xlsx, .xlsm) into JSON. Built in Rust with Python bindings via PyO3 for fast parallel processing of hundreds of files. Designed for standardized Excel forms that don't fit the typical CSV format -- government reports, engineering forms, financial templates, and survey spreadsheets.

Why sheetio?

  • Fast -- Rust core with async parallel file processing. Process hundreds of Excel forms in seconds.
  • Flexible -- Three extraction modes (single cells, row patterns, dataframes) handle any form layout.
  • Simple -- One function, JSON config, JSON output. No complex API to learn.
  • Robust -- Handles missing data, duplicate keys, wildcard sheet matching, and composite identifiers.

Quick Start

pip install sheetio
import sheetio
import json

files = ["form_001.xlsx", "form_002.xlsx", "form_003.xlsx"]

config = [
    {
        "sheets": ["Sheet1"],
        "extractions": [
            {
                "function": "single_cells",
                "label": "header",
                "instructions": {
                    "title": "b2",
                    "date": "d4",
                    "author": "b6"
                }
            },
            {
                "function": "multirow_patterns",
                "label": "items",
                "instructions": {
                    "row_range": [10, 100],
                    "unique_id": "A",
                    "stop_if_empty": "A",
                    "columns": {
                        "ID": "A",
                        "Description": "B",
                        "Value": "C"
                    }
                }
            }
        ]
    }
]

result = json.loads(sheetio.excel_extract(files, config, 5))

Config Builder

Use ExtractionConfig to build configs iteratively with method chaining, instead of writing raw dicts:

from sheetio import ExtractionConfig

config = ExtractionConfig()

# Add extractions with a fluent API
config.add_sheets(["Sheet1"]) \
    .single_cells("header", title="B2", date="D4", author="B6") \
    .multirow("items",
        row_range=(10, 100),
        unique_id="A",
        stop_if_empty="A",
        columns={"ID": "A", "Description": "B", "Value": "C"})

# Extract directly — returns parsed Python dicts (no json.loads needed)
result = config.extract(["form_001.xlsx", "form_002.xlsx"], workers=5)

# Save / load configs as JSON files
config.to_json("my_config.json")
config = ExtractionConfig.from_json("my_config.json")

# Inspect what you've built
config.summary()

# Or get the raw config list for manual use
raw = config.build()

All three extraction types are supported: .single_cells(), .multirow(), .dataframe(). See the Extraction Types Reference below for all available options.

Key Features

Feature Description
Parallel processing Process multiple files simultaneously with configurable worker count
Wildcard sheets Match sheets by pattern: "School_*" matches School_A, School_B, etc.
Composite keys Combine multiple columns as unique identifiers: ["Project", "Year"]
Gap tolerance Continue extraction through empty rows with stop_consecutive
Duplicate handling Automatic _1, _2 suffixes for duplicate keys
Multi-column merge Extract arrays from multiple columns per field: ["X", "Y", "Z"]
Multi-row headers Concatenate header rows for dataframe extraction

Requirements

  • Python 3.10 or higher
  • Supported platforms: Windows, macOS, Linux

Extraction Types Reference

Configuration Structure

The extraction_details parameter is a list of dictionaries that define the extraction rules for each Excel sheet. Each dictionary contains:

  • sheets: A list of sheet names to extract data from. Accepts patterns with *. Example: "School_*" will loop through sheets like School_A, School_B, etc.
  • skip_sheets: An optional list of sheet names to skip. Can be useful when using patterns in the list of sheets.
  • extractions: A list of extraction rules that will be applied to the sheets listed.

Each extraction rule contains:

  • function: Type of extraction function: single_cells, multirow_patterns, or dataframe.
  • label: Optional key string to store results under. If not specified the extracted key value pairs will be stored directly under the sheet name.
  • break_if_null: An optional check to skip sheet if specified cell is null.
  • instructions: Instructions for the extraction function. See details for each function type below.

Single Cells Extraction

Extracts individual cells from the Excel sheet.

Instructions:

  • instructions: A dictionary where the keys are the reference name and the values are the cell references (e.g., "a1", "b2").

Example:

{
    "sheets": ["Sheet1"],
    "extractions": [
        {
            "function": "single_cells",
            "label": "single",
            "break_if_null": "c3",
            "instructions": {
                "Value 1": "a1",
                "Value 2": "b2",
                "Value 3": "c3",
                "Date": "d4",
                "Datetime": "e5"
            }
        }
    ]
}

Multirow Patterns Extraction

Extracts data from multiple rows based on a pattern.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • unique_id (optional): The column(s) to use as a unique identifier. Can be either:
    • A single column as a string: "B"
    • Multiple columns as an array: ["B", "C"] for composite keys
    • When using composite keys, if ANY column contains null/empty values, the row is skipped
    • If omitted: Results are returned as an array/list instead of a dictionary
  • unique_id_separator (optional): The separator to use when joining multiple columns for composite keys. Defaults to "_".
  • columns: A dictionary where the keys are the column names and the values are the column letters (e.g., "B", "C").
  • stop_if_empty (optional): Controls when to stop processing rows. Can be:
    • A column string: "A" - Stop when this column is empty
    • An array of columns: ["A", "B"] - Stop when ALL specified columns are empty
    • The string "row" - Stop when the entire data row is empty
    • An object with detailed configuration:
      {"column": "A", "consecutive": 2}
      
      or
      {"mode": "row", "consecutive": 1}
      
  • stop_consecutive (optional): Used with simple stop_if_empty syntax to specify how many consecutive empty rows trigger a stop. Defaults to 1.

Example with single unique_id:

{
    "sheets": ["Sheet 1", "Sheet 2"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "deposits",
            "instructions": {
                "row_range": [1, 10],
                "unique_id": "B",
                "columns": {
                    "Title": "B",
                    "Description": "C",
                    "Estimate": "D",
                    "Chance": "E",
                }
            }
        }
    ]
}

Example with composite unique_id:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "projects",
            "instructions": {
                "row_range": [1, 50],
                "unique_id": ["B", "C"],
                "unique_id_separator": "-",
                "columns": {
                    "Project": "B",
                    "Year": "C",
                    "Budget": "D",
                    "Status": "E"
                }
            }
        }
    ]
}

Example without unique_id (returns array/list):

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "items",
            "instructions": {
                "row_range": [1, 1000],
                "stop_if_empty": "A",
                "columns": {
                    "Name": "A",
                    "Value": "B",
                    "Description": "C"
                }
            }
        }
    ]
}
# Returns: {"items": [{"Name": "...", "Value": ...}, {"Name": "...", "Value": ...}]}

Example with stop_if_empty and gap tolerance:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "data",
            "instructions": {
                "row_range": [1, 100],
                "stop_if_empty": "A",
                "stop_consecutive": 3,
                "columns": {
                    "ID": "A",
                    "Value": "B"
                }
            }
        }
    ]
}

Example with row-based empty detection:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "records",
            "instructions": {
                "row_range": [1, 500],
                "stop_if_empty": {
                    "mode": "row",
                    "consecutive": 2
                },
                "columns": {
                    "Field1": "A",
                    "Field2": "B",
                    "Field3": "C"
                }
            }
        }
    ]
}

Example with multiple column monitoring:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "transactions",
            "instructions": {
                "row_range": [1, 1000],
                "unique_id": "A",
                "stop_if_empty": ["A", "B"],
                "columns": {
                    "ID": "A",
                    "Date": "B",
                    "Amount": "C"
                }
            }
        }
    ]
}

Dataframe Extraction

Extracts tabular data with headers, returning JSON that can easily be converted to a Pandas DataFrame.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • column_range: A list of column letters to extract.
  • header_row: A list of row numbers to use as the header.
  • separator: Optional separator to use when combining header cells (default " ").

Example:

{
    "sheets": ["School_*"],
    "extractions": [
        {
            "function": "dataframe",
            "label": "DataFrame",
            "instructions": {
                "row_range": [5, 15],
                "column_range": ["B", "F"],
                "header_row": [2, 3, 4],
                "separator": " ",
            }
        }
    ]
}

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

# Development setup
pip install maturin pytest openpyxl ruff
maturin develop --release
make lint   # check formatting
make test   # run tests

License

sheetio is released under the MIT License. See the LICENSE file for more details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sheetio-0.3.4-cp313-cp313-win_amd64.whl (793.1 kB view details)

Uploaded CPython 3.13Windows x86-64

sheetio-0.3.4-cp313-cp313-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

sheetio-0.3.4-cp313-cp313-macosx_10_12_x86_64.whl (940.6 kB view details)

Uploaded CPython 3.13macOS 10.12+ x86-64

sheetio-0.3.4-cp312-cp312-win_amd64.whl (793.1 kB view details)

Uploaded CPython 3.12Windows x86-64

sheetio-0.3.4-cp312-cp312-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

sheetio-0.3.4-cp312-cp312-macosx_10_12_x86_64.whl (940.6 kB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

sheetio-0.3.4-cp311-cp311-win_amd64.whl (791.2 kB view details)

Uploaded CPython 3.11Windows x86-64

sheetio-0.3.4-cp311-cp311-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

sheetio-0.3.4-cp311-cp311-macosx_10_12_x86_64.whl (940.3 kB view details)

Uploaded CPython 3.11macOS 10.12+ x86-64

sheetio-0.3.4-cp310-cp310-win_amd64.whl (791.1 kB view details)

Uploaded CPython 3.10Windows x86-64

sheetio-0.3.4-cp310-cp310-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

sheetio-0.3.4-cp310-cp310-macosx_10_12_x86_64.whl (940.7 kB view details)

Uploaded CPython 3.10macOS 10.12+ x86-64

File details

Details for the file sheetio-0.3.4-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.4-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 793.1 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.4-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 5a161465007c2b85a274a516b62c276483cb455d9b8af7aa6ce9b3c9fdd32e8a
MD5 33c631b8e0da9478b4302b41e807c4af
BLAKE2b-256 9c62317da2408338c8b713000312e99599d3b08481ad198520e0621b11c316ac

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp313-cp313-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 21258df591c510dbf29a33a33a4cd3667711160fff660a891191e39804028f94
MD5 57f657ab3295aa3a13c76dad98e82ee6
BLAKE2b-256 e897c48a22a2503f35291ac9e890da6e32d10b6eb1d3a2149cf195a217dd13f8

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp313-cp313-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp313-cp313-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 a45ef7d0417ef2c4b048f854edd74985b4ca1c48f562989ab1df9c5e309e3846
MD5 dc95402436fd727c75e0eb8075e80eb1
BLAKE2b-256 b35dbb3ae32a46d7e96d5d00877fac13fecbcf3a7fc98153c2da43d18b5b9d98

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp313-cp313-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.4-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 793.1 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.4-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 4b24f7bbc783ccb92820a3e20bdbfacf3a9d0934f5682533be0bf44f7c55df72
MD5 42cea14fc27f1642c330fcadf3104698
BLAKE2b-256 2b19b48cda5ab998d3c7ecbb19622c920912a43af27282aebc1daba6c787c47a

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp312-cp312-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 4ecfcf049c202183cf4965addd0505eaa3eccfabe88c43fd9fb85dcbbda90efc
MD5 6fd08d6af252b7856c61a1c9df76021b
BLAKE2b-256 ffb6ed7d751797aa2c9f7fe8d413967c4d37b624d5540e28a40cf4259a999487

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 55fffbb9a60359e4481c7f11946555406e258a8e7a354804ca91207073ed8db5
MD5 5c1a645cf6fa2a1006afc7629c730345
BLAKE2b-256 32311c47c4f9798fcf7e625824a71a7b9aceeb480e94f9f350f4d11e069b3c2b

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.4-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 791.2 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.4-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 499ca9a203de565d44eb532d5daa56bdd80598ec6c3ea98d8d00f4887aca99c0
MD5 b0a3d9428cbf4999a6139d3d726e0ffb
BLAKE2b-256 7b56ceb268d4c58dd19ba136b7254a0819cb41e96078bf175fa9b88fb32fb95a

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp311-cp311-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 fabbd99a6f0245134d5d9554bc78cb949855a65332b51aee7d6aa6818f9a1e2b
MD5 e2690b3ac62257dda14e452032428807
BLAKE2b-256 ad2be16113e167d3d0dab3fb62acb5449c0bba3a60ec632829343b2cb820561c

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp311-cp311-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 5a8ae9701613db9e49645c072a90e95862a1a79993ec9d1eddf82b10e62c96eb
MD5 a8962121d820838c45f50f8581e9fd52
BLAKE2b-256 0d3e0817465492cbd7fa1987d6ba38b26aea2e86d81b7d80df56ced4d0f47769

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp311-cp311-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.4-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 791.1 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.4-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 19e2a46c73c0cd17e8a6e895a5c0aecfc60f30ec415803421775f7ff894a848c
MD5 944cde01146e116c99b12c82c04c0cb9
BLAKE2b-256 c15faed335f43288144173453b169d2eaf0c170f5b4a29979f31fa32317ceb74

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp310-cp310-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 ee8bd7b6b181e934ea39143537814416d7463fa1e78051e679aecb34f085f6d1
MD5 47703283e2a2af9689027764e2191275
BLAKE2b-256 d18d61efa4abb2205759d16d3b189fd21bb5c88d45aed58a3385c71c0f21e2cf

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp310-cp310-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.4-cp310-cp310-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.4-cp310-cp310-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 c87fae425c446fc1b28a9e94736382b20ac1ac929ea8f9dcd09058d7cdaa55ce
MD5 b74ee2aa4b604d615fac43308de726e2
BLAKE2b-256 f809c70708c2ee376c53c6cc6c8c6c9570997fa39a702303545bb20872aaf483

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.4-cp310-cp310-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page