Skip to main content

A fast Rust-powered tool for extracting data from Excel forms into JSON

Project description

sheetio -- Fast Rust-Powered Excel Form Data Extraction

PyPI version CI Python 3.10+ License: MIT

Extract structured data from Excel forms (.xlsx, .xlsm) into JSON. Built in Rust with Python bindings via PyO3 for fast parallel processing of hundreds of files. Designed for standardized Excel forms that don't fit the typical CSV format -- government reports, engineering forms, financial templates, and survey spreadsheets.

Why sheetio?

  • Fast -- Rust core with async parallel file processing. Process hundreds of Excel forms in seconds.
  • Flexible -- Three extraction modes (single cells, row patterns, dataframes) handle any form layout.
  • Simple -- One function, JSON config, JSON output. No complex API to learn.
  • Robust -- Handles missing data, duplicate keys, wildcard sheet matching, and composite identifiers.

Quick Start

pip install sheetio
import sheetio
import json

files = ["form_001.xlsx", "form_002.xlsx", "form_003.xlsx"]

config = [
    {
        "sheets": ["Sheet1"],
        "extractions": [
            {
                "function": "single_cells",
                "label": "header",
                "instructions": {
                    "title": "b2",
                    "date": "d4",
                    "author": "b6"
                }
            },
            {
                "function": "multirow_patterns",
                "label": "items",
                "instructions": {
                    "row_range": [10, 100],
                    "unique_id": "A",
                    "stop_if_empty": "A",
                    "columns": {
                        "ID": "A",
                        "Description": "B",
                        "Value": "C"
                    }
                }
            }
        ]
    }
]

result = json.loads(sheetio.excel_extract(files, config, 5))

Config Builder

Use ExtractionConfig to build configs iteratively with method chaining, instead of writing raw dicts:

from sheetio import ExtractionConfig

config = ExtractionConfig()

# Add extractions with a fluent API
config.add_sheets(["Sheet1"]) \
    .single_cells("header", title="B2", date="D4", author="B6") \
    .multirow("items",
        row_range=(10, 100),
        unique_id="A",
        stop_if_empty="A",
        columns={"ID": "A", "Description": "B", "Value": "C"})

# Extract directly — returns parsed Python dicts (no json.loads needed)
result = config.extract(["form_001.xlsx", "form_002.xlsx"], workers=5)

# Save / load configs as JSON files
config.to_json("my_config.json")
config = ExtractionConfig.from_json("my_config.json")

# Inspect what you've built
config.summary()

# Or get the raw config list for manual use
raw = config.build()

All three extraction types are supported: .single_cells(), .multirow(), .dataframe(). See the Extraction Types Reference below for all available options.

Key Features

Feature Description
Parallel processing Process multiple files simultaneously with configurable worker count
Wildcard sheets Match sheets by pattern: "School_*" matches School_A, School_B, etc.
Composite keys Combine multiple columns as unique identifiers: ["Project", "Year"]
Gap tolerance Continue extraction through empty rows with stop_consecutive
Duplicate handling Automatic _1, _2 suffixes for duplicate keys
Multi-column merge Extract arrays from multiple columns per field: ["X", "Y", "Z"]
Multi-row headers Concatenate header rows for dataframe extraction

Requirements

  • Python 3.10 or higher
  • Supported platforms: Windows, macOS, Linux

Extraction Types Reference

Configuration Structure

The extraction_details parameter is a list of dictionaries that define the extraction rules for each Excel sheet. Each dictionary contains:

  • sheets: A list of sheet names to extract data from. Accepts patterns with *. Example: "School_*" will loop through sheets like School_A, School_B, etc.
  • skip_sheets: An optional list of sheet names to skip. Can be useful when using patterns in the list of sheets.
  • extractions: A list of extraction rules that will be applied to the sheets listed.

Each extraction rule contains:

  • function: Type of extraction function: single_cells, multirow_patterns, or dataframe.
  • label: Optional key string to store results under. If not specified the extracted key value pairs will be stored directly under the sheet name.
  • break_if_null: An optional check to skip sheet if specified cell is null.
  • instructions: Instructions for the extraction function. See details for each function type below.

Single Cells Extraction

Extracts individual cells from the Excel sheet.

Instructions:

  • instructions: A dictionary where the keys are the reference name and the values are the cell references (e.g., "a1", "b2").

Example:

{
    "sheets": ["Sheet1"],
    "extractions": [
        {
            "function": "single_cells",
            "label": "single",
            "break_if_null": "c3",
            "instructions": {
                "Value 1": "a1",
                "Value 2": "b2",
                "Value 3": "c3",
                "Date": "d4",
                "Datetime": "e5"
            }
        }
    ]
}

Multirow Patterns Extraction

Extracts data from multiple rows based on a pattern.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • unique_id (optional): The column(s) to use as a unique identifier. Can be either:
    • A single column as a string: "B"
    • Multiple columns as an array: ["B", "C"] for composite keys
    • When using composite keys, if ANY column contains null/empty values, the row is skipped
    • If omitted: Results are returned as an array/list instead of a dictionary
  • unique_id_separator (optional): The separator to use when joining multiple columns for composite keys. Defaults to "_".
  • columns: A dictionary where the keys are the column names and the values are the column letters (e.g., "B", "C").
  • stop_if_empty (optional): Controls when to stop processing rows. Can be:
    • A column string: "A" - Stop when this column is empty
    • An array of columns: ["A", "B"] - Stop when ALL specified columns are empty
    • The string "row" - Stop when the entire data row is empty
    • An object with detailed configuration:
      {"column": "A", "consecutive": 2}
      
      or
      {"mode": "row", "consecutive": 1}
      
  • stop_consecutive (optional): Used with simple stop_if_empty syntax to specify how many consecutive empty rows trigger a stop. Defaults to 1.

Example with single unique_id:

{
    "sheets": ["Sheet 1", "Sheet 2"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "deposits",
            "instructions": {
                "row_range": [1, 10],
                "unique_id": "B",
                "columns": {
                    "Title": "B",
                    "Description": "C",
                    "Estimate": "D",
                    "Chance": "E",
                }
            }
        }
    ]
}

Example with composite unique_id:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "projects",
            "instructions": {
                "row_range": [1, 50],
                "unique_id": ["B", "C"],
                "unique_id_separator": "-",
                "columns": {
                    "Project": "B",
                    "Year": "C",
                    "Budget": "D",
                    "Status": "E"
                }
            }
        }
    ]
}

Example without unique_id (returns array/list):

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "items",
            "instructions": {
                "row_range": [1, 1000],
                "stop_if_empty": "A",
                "columns": {
                    "Name": "A",
                    "Value": "B",
                    "Description": "C"
                }
            }
        }
    ]
}
# Returns: {"items": [{"Name": "...", "Value": ...}, {"Name": "...", "Value": ...}]}

Example with stop_if_empty and gap tolerance:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "data",
            "instructions": {
                "row_range": [1, 100],
                "stop_if_empty": "A",
                "stop_consecutive": 3,
                "columns": {
                    "ID": "A",
                    "Value": "B"
                }
            }
        }
    ]
}

Example with row-based empty detection:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "records",
            "instructions": {
                "row_range": [1, 500],
                "stop_if_empty": {
                    "mode": "row",
                    "consecutive": 2
                },
                "columns": {
                    "Field1": "A",
                    "Field2": "B",
                    "Field3": "C"
                }
            }
        }
    ]
}

Example with multiple column monitoring:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "transactions",
            "instructions": {
                "row_range": [1, 1000],
                "unique_id": "A",
                "stop_if_empty": ["A", "B"],
                "columns": {
                    "ID": "A",
                    "Date": "B",
                    "Amount": "C"
                }
            }
        }
    ]
}

Dataframe Extraction

Extracts tabular data with headers, returning JSON that can easily be converted to a Pandas DataFrame.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • column_range: A list of column letters to extract.
  • header_row: A list of row numbers to use as the header.
  • separator: Optional separator to use when combining header cells (default " ").

Example:

{
    "sheets": ["School_*"],
    "extractions": [
        {
            "function": "dataframe",
            "label": "DataFrame",
            "instructions": {
                "row_range": [5, 15],
                "column_range": ["B", "F"],
                "header_row": [2, 3, 4],
                "separator": " ",
            }
        }
    ]
}

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

# Development setup
pip install maturin pytest openpyxl ruff
maturin develop --release
make lint   # check formatting
make test   # run tests

License

sheetio is released under the MIT License. See the LICENSE file for more details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sheetio-0.3.3-cp313-cp313-win_amd64.whl (793.1 kB view details)

Uploaded CPython 3.13Windows x86-64

sheetio-0.3.3-cp313-cp313-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

sheetio-0.3.3-cp313-cp313-macosx_10_12_x86_64.whl (940.7 kB view details)

Uploaded CPython 3.13macOS 10.12+ x86-64

sheetio-0.3.3-cp312-cp312-win_amd64.whl (793.1 kB view details)

Uploaded CPython 3.12Windows x86-64

sheetio-0.3.3-cp312-cp312-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

sheetio-0.3.3-cp312-cp312-macosx_10_12_x86_64.whl (940.6 kB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

sheetio-0.3.3-cp311-cp311-win_amd64.whl (791.3 kB view details)

Uploaded CPython 3.11Windows x86-64

sheetio-0.3.3-cp311-cp311-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

sheetio-0.3.3-cp311-cp311-macosx_10_12_x86_64.whl (940.3 kB view details)

Uploaded CPython 3.11macOS 10.12+ x86-64

sheetio-0.3.3-cp310-cp310-win_amd64.whl (791.1 kB view details)

Uploaded CPython 3.10Windows x86-64

sheetio-0.3.3-cp310-cp310-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

sheetio-0.3.3-cp310-cp310-macosx_10_12_x86_64.whl (940.6 kB view details)

Uploaded CPython 3.10macOS 10.12+ x86-64

File details

Details for the file sheetio-0.3.3-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.3-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 793.1 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.3-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 0f1547793eaf15ab295780bd04e88b93f7a2520e0438b8229a7ef66298f1b5c1
MD5 e6590728072269eb3d8fe1daa48dce9a
BLAKE2b-256 3b5a5fd4652e1b5ea43d235ec1e137e80ced52fab7a7ea95789e7fc478ce8407

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp313-cp313-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 010946ae20f22319a0a98b963989186a17a150eb8075877e08895f35099375a1
MD5 7895df67607fb204939883ea865d0b49
BLAKE2b-256 c9e93c9b3ee730bac252ceb09fcb6650e58e9cfd319c96329091d1cceaaf1995

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp313-cp313-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp313-cp313-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 2cce1ab3db025e11cf7f29900f54434f6a7ab5542461d93365c4cd9885be3585
MD5 e2da49ba35b99993c8caaa747c26a09d
BLAKE2b-256 51d3769e6fea52659b42ff516782baf632425bb0bb1226806fa4a1166ce31e47

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp313-cp313-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.3-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 793.1 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.3-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 f7fb502affba8a8509601c51fc7b7470fe6f3be527143b8f6d44e690a4a64023
MD5 74feccddde53a2b6a972892fe25c5a82
BLAKE2b-256 d6b2dfa86d23ffb97e5fae660f02895d59e436cd745deee90d1b243c795dc442

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp312-cp312-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 c12d23a3cf01e867cd14cdf30e54607e46c528921d8ddf73c76829c3a7eca64e
MD5 b0b4c58d0ad75e6ccc4cdfcf4a45cbcc
BLAKE2b-256 e08705e09dbc7e2f12ed4601150e17d728b35131b85d96c4d20369cd8b6a9af6

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 3bff9cb37d4d560a115361e6374a605e54eb4162e85925e1cfa7b10581757c7d
MD5 917150bff08dcb895512076a984c63d6
BLAKE2b-256 7e815596ce3a900ccb9b2cd48e9c5a5a341b34291ad16a26c019d916ea41bda6

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.3-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 791.3 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.3-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 bfbd2dc444ef738a3d936f21472c446b013433369dc9f90e5a0df9aa089ec835
MD5 69959177e86edd0ec7ed9d9a2ed7b727
BLAKE2b-256 37c6d449388b9ffa0ee45bcd94ac1a3f21a20b5409a68d6969031cf5b245be1e

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp311-cp311-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 74b53fa677ae975c5f1433ac9cd33ba060d161e348dbd94e70b37b47d56ed086
MD5 de98dee304925e37fcb5aa67c2a5d7e5
BLAKE2b-256 36f9be24fd6f313bab787aa49c7490753c244859769894463ef5717bd1a9deeb

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp311-cp311-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 872ff04fb0f3199ebac80fd0106f788267d4ce90ee9590b33fefd1af1500c10b
MD5 fe56ebde1a71ac42396ffbc092ea3965
BLAKE2b-256 13de4e8f2af194f6daa7d9819272b90ea0c6f9b46b6a1f7f306f708687cfc22c

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp311-cp311-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.3-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 791.1 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.3-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 6c71e39a061e474d716e868ca769e47d3438f20b112267f3483db097d32d0372
MD5 1fda42d659e055f576b281477556ef9f
BLAKE2b-256 8172e4ddd16b73fd4699910d4b49a6ac7c5a2e2f7a66768993989fedcefb6fe9

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp310-cp310-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 325c85cc9c54322242d1ca33cbc90adf1e3787b5470d9d76b8de2d341564713c
MD5 b62a68944c1fb3e565173fe5c0a8b5f7
BLAKE2b-256 aab8e0b58503bf0627428dada39134cc3496e670ed952605156da7d2e3f7977c

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp310-cp310-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.3-cp310-cp310-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.3-cp310-cp310-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 ed6447d52ffdfb05cdec75e14530bf4c0025cbe51d8ac193d9dc61b0ea8b825d
MD5 18aec0502bbaabb7eaedd251f86ef8f9
BLAKE2b-256 f63be02a224197d224205eda4c64e928146dc5548161cd6f83ad08d4b8fdca31

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.3-cp310-cp310-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page