Skip to main content

A fast Rust-powered tool for extracting data from Excel forms into JSON

Project description

sheetio -- Fast Rust-Powered Excel Form Data Extraction

PyPI version CI Python 3.10+ License: MIT

Extract structured data from Excel forms (.xlsx, .xlsm) into JSON. Built in Rust with Python bindings via PyO3 for fast parallel processing of hundreds of files. Designed for standardized Excel forms that don't fit the typical CSV format -- government reports, engineering forms, financial templates, and survey spreadsheets.

Why sheetio?

  • Fast -- Rust core with async parallel file processing. Process hundreds of Excel forms in seconds.
  • Flexible -- Three extraction modes (single cells, row patterns, dataframes) handle any form layout.
  • Simple -- One function, JSON config, JSON output. No complex API to learn.
  • Robust -- Handles missing data, duplicate keys, wildcard sheet matching, and composite identifiers.

Quick Start

pip install sheetio
import sheetio
import json

files = ["form_001.xlsx", "form_002.xlsx", "form_003.xlsx"]

config = [
    {
        "sheets": ["Sheet1"],
        "extractions": [
            {
                "function": "single_cells",
                "label": "header",
                "instructions": {
                    "title": "b2",
                    "date": "d4",
                    "author": "b6"
                }
            },
            {
                "function": "multirow_patterns",
                "label": "items",
                "instructions": {
                    "row_range": [10, 100],
                    "unique_id": "A",
                    "stop_if_empty": "A",
                    "columns": {
                        "ID": "A",
                        "Description": "B",
                        "Value": "C"
                    }
                }
            }
        ]
    }
]

result = json.loads(sheetio.excel_extract(files, config, 5))

Config Builder

Use ExtractionConfig to build configs iteratively with method chaining, instead of writing raw dicts:

from sheetio import ExtractionConfig

config = ExtractionConfig()

# Add extractions with a fluent API
config.add_sheets(["Sheet1"]) \
    .single_cells("header", title="B2", date="D4", author="B6") \
    .multirow("items",
        row_range=(10, 100),
        unique_id="A",
        stop_if_empty="A",
        columns={"ID": "A", "Description": "B", "Value": "C"})

# Extract directly — returns parsed Python dicts (no json.loads needed)
result = config.extract(["form_001.xlsx", "form_002.xlsx"], workers=5)

# Save / load configs as JSON files
config.to_json("my_config.json")
config = ExtractionConfig.from_json("my_config.json")

# Inspect what you've built
config.summary()

# Or get the raw config list for manual use
raw = config.build()

All three extraction types are supported: .single_cells(), .multirow(), .dataframe(). See the Extraction Types Reference below for all available options.

Key Features

Feature Description
Parallel processing Process multiple files simultaneously with configurable worker count
Wildcard sheets Match sheets by pattern: "School_*" matches School_A, School_B, etc.
Composite keys Combine multiple columns as unique identifiers: ["Project", "Year"]
Gap tolerance Continue extraction through empty rows with stop_consecutive
Duplicate handling Automatic _1, _2 suffixes for duplicate keys
Multi-column merge Extract arrays from multiple columns per field: ["X", "Y", "Z"]
Multi-row headers Concatenate header rows for dataframe extraction

Requirements

  • Python 3.10 or higher
  • Supported platforms: Windows, macOS, Linux

Extraction Types Reference

Configuration Structure

The extraction_details parameter is a list of dictionaries that define the extraction rules for each Excel sheet. Each dictionary contains:

  • sheets: A list of sheet names to extract data from. Accepts patterns with *. Example: "School_*" will loop through sheets like School_A, School_B, etc.
  • skip_sheets: An optional list of sheet names to skip. Can be useful when using patterns in the list of sheets.
  • extractions: A list of extraction rules that will be applied to the sheets listed.

Each extraction rule contains:

  • function: Type of extraction function: single_cells, multirow_patterns, or dataframe.
  • label: Optional key string to store results under. If not specified the extracted key value pairs will be stored directly under the sheet name.
  • break_if_null: An optional check to skip sheet if specified cell is null.
  • instructions: Instructions for the extraction function. See details for each function type below.

Single Cells Extraction

Extracts individual cells from the Excel sheet.

Instructions:

  • instructions: A dictionary where the keys are the reference name and the values are the cell references (e.g., "a1", "b2").

Example:

{
    "sheets": ["Sheet1"],
    "extractions": [
        {
            "function": "single_cells",
            "label": "single",
            "break_if_null": "c3",
            "instructions": {
                "Value 1": "a1",
                "Value 2": "b2",
                "Value 3": "c3",
                "Date": "d4",
                "Datetime": "e5"
            }
        }
    ]
}

Multirow Patterns Extraction

Extracts data from multiple rows based on a pattern.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • unique_id (optional): The column(s) to use as a unique identifier. Can be either:
    • A single column as a string: "B"
    • Multiple columns as an array: ["B", "C"] for composite keys
    • When using composite keys, if ANY column contains null/empty values, the row is skipped
    • If omitted: Results are returned as an array/list instead of a dictionary
  • unique_id_separator (optional): The separator to use when joining multiple columns for composite keys. Defaults to "_".
  • columns: A dictionary where the keys are the column names and the values are the column letters (e.g., "B", "C").
  • stop_if_empty (optional): Controls when to stop processing rows. Can be:
    • A column string: "A" - Stop when this column is empty
    • An array of columns: ["A", "B"] - Stop when ALL specified columns are empty
    • The string "row" - Stop when the entire data row is empty
    • An object with detailed configuration:
      {"column": "A", "consecutive": 2}
      
      or
      {"mode": "row", "consecutive": 1}
      
  • stop_consecutive (optional): Used with simple stop_if_empty syntax to specify how many consecutive empty rows trigger a stop. Defaults to 1.

Example with single unique_id:

{
    "sheets": ["Sheet 1", "Sheet 2"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "deposits",
            "instructions": {
                "row_range": [1, 10],
                "unique_id": "B",
                "columns": {
                    "Title": "B",
                    "Description": "C",
                    "Estimate": "D",
                    "Chance": "E",
                }
            }
        }
    ]
}

Example with composite unique_id:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "projects",
            "instructions": {
                "row_range": [1, 50],
                "unique_id": ["B", "C"],
                "unique_id_separator": "-",
                "columns": {
                    "Project": "B",
                    "Year": "C",
                    "Budget": "D",
                    "Status": "E"
                }
            }
        }
    ]
}

Example without unique_id (returns array/list):

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "items",
            "instructions": {
                "row_range": [1, 1000],
                "stop_if_empty": "A",
                "columns": {
                    "Name": "A",
                    "Value": "B",
                    "Description": "C"
                }
            }
        }
    ]
}
# Returns: {"items": [{"Name": "...", "Value": ...}, {"Name": "...", "Value": ...}]}

Example with stop_if_empty and gap tolerance:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "data",
            "instructions": {
                "row_range": [1, 100],
                "stop_if_empty": "A",
                "stop_consecutive": 3,
                "columns": {
                    "ID": "A",
                    "Value": "B"
                }
            }
        }
    ]
}

Example with row-based empty detection:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "records",
            "instructions": {
                "row_range": [1, 500],
                "stop_if_empty": {
                    "mode": "row",
                    "consecutive": 2
                },
                "columns": {
                    "Field1": "A",
                    "Field2": "B",
                    "Field3": "C"
                }
            }
        }
    ]
}

Example with multiple column monitoring:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "transactions",
            "instructions": {
                "row_range": [1, 1000],
                "unique_id": "A",
                "stop_if_empty": ["A", "B"],
                "columns": {
                    "ID": "A",
                    "Date": "B",
                    "Amount": "C"
                }
            }
        }
    ]
}

Dataframe Extraction

Extracts tabular data with headers, returning JSON that can easily be converted to a Pandas DataFrame.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • column_range: A list of column letters to extract.
  • header_row: A list of row numbers to use as the header.
  • separator: Optional separator to use when combining header cells (default " ").

Example:

{
    "sheets": ["School_*"],
    "extractions": [
        {
            "function": "dataframe",
            "label": "DataFrame",
            "instructions": {
                "row_range": [5, 15],
                "column_range": ["B", "F"],
                "header_row": [2, 3, 4],
                "separator": " ",
            }
        }
    ]
}

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

# Development setup
pip install maturin pytest openpyxl ruff
maturin develop --release
make lint   # check formatting
make test   # run tests

License

sheetio is released under the MIT License. See the LICENSE file for more details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sheetio-0.3.5-cp313-cp313-win_amd64.whl (833.7 kB view details)

Uploaded CPython 3.13Windows x86-64

sheetio-0.3.5-cp313-cp313-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

sheetio-0.3.5-cp313-cp313-macosx_10_12_x86_64.whl (955.7 kB view details)

Uploaded CPython 3.13macOS 10.12+ x86-64

sheetio-0.3.5-cp312-cp312-win_amd64.whl (833.8 kB view details)

Uploaded CPython 3.12Windows x86-64

sheetio-0.3.5-cp312-cp312-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

sheetio-0.3.5-cp312-cp312-macosx_10_12_x86_64.whl (955.6 kB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

sheetio-0.3.5-cp311-cp311-win_amd64.whl (833.7 kB view details)

Uploaded CPython 3.11Windows x86-64

sheetio-0.3.5-cp311-cp311-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

sheetio-0.3.5-cp311-cp311-macosx_10_12_x86_64.whl (957.0 kB view details)

Uploaded CPython 3.11macOS 10.12+ x86-64

sheetio-0.3.5-cp310-cp310-win_amd64.whl (833.6 kB view details)

Uploaded CPython 3.10Windows x86-64

sheetio-0.3.5-cp310-cp310-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

sheetio-0.3.5-cp310-cp310-macosx_10_12_x86_64.whl (957.2 kB view details)

Uploaded CPython 3.10macOS 10.12+ x86-64

File details

Details for the file sheetio-0.3.5-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.5-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 833.7 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.5-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 b644799ecc4829d9e263a905dc0dc41cd1681b183217d56d3c93b3c572778fd2
MD5 9a6c487143f9d9f9801c3ad79cf1b1b3
BLAKE2b-256 626d691c5a26067b8a0f1f862f649af56b4566f62daf7b28e06160216d35a292

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp313-cp313-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 f200201b99365b7c69014500dd46be722d6715fe72a89f50c59cf83c88df690e
MD5 26bc516e6fd99bd1dd4dd9f919c071a6
BLAKE2b-256 7eccbf509ec789e31d6d6d04be96cb2078dbdd36466cb486104efa9bb84dc2be

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp313-cp313-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp313-cp313-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 b4f6541d750418e322e53962a92fba764f5bf4f70f9ccf0ab79646fb7ee4d8d4
MD5 7579f7fb6edb1a6a38f85d93d004623e
BLAKE2b-256 7c2f24533ff5389ca470dea954c448ce9a759ea3fda65e96742e089410773ec3

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp313-cp313-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.5-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 833.8 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.5-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 28b35f4c090de051449739593b0701c4d2d0f6bcecd8d0ae4bb1ef8607f49859
MD5 d45263fcb86b67de6c1a447d84661ad5
BLAKE2b-256 5879aa91ed1c49a9300bd89c091fd78997241d3583899ae0ccd3db37f2ceef43

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp312-cp312-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 7e0ddb88b0dd5f1e0bb4adf703f95b9eb9e28d42e72d9baa515d47b950efa926
MD5 0ae9632a51f1e8a851d27c4b92628882
BLAKE2b-256 3e33dbd2c39f03266326bd16be73986692423394715eb5fc80f0f553ae78cef2

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 206b946c618406fe2461b205559f7fdfbbc71c2e9774b3f5ae9a9ed3d4190c2f
MD5 d90d8fd82a515df1613ea155bd4f3fed
BLAKE2b-256 239f6a6d063e0eca3124bf2835573361e46e89a8c8d8783a3f9047ad1af0d683

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.5-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 833.7 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.5-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 57df664570ee85f3109e8a87d8d725eada87908781e066a304556c4b63b2cbe3
MD5 7ccad3c5e577cc1018e51d2a5868bb3c
BLAKE2b-256 18f5987329a8d2c8b90c93eb7a2f35b3e67f41bc5a7be27986bd24de1789aa70

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp311-cp311-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 f96945cae9ea27df5f3a3118412edc26c44c66be4d2f43b2f7a121ba1b2d68df
MD5 39cfba7c93fb6eff526663e1283c4e01
BLAKE2b-256 c25c1d091835bcac5d41f81abdd10f0e6930aaf9ee12bb7847e2ec3478639050

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp311-cp311-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 e1e01ac568041c360a8ee82a9bdac7c3d4d7d3d9f08d769a4f39361753654ef1
MD5 6ce3e09bb72dd78319bdd1db45e76d5a
BLAKE2b-256 4b7d142adea9a2818ee36d5cbdc93f4c2f6685182973f17bab0f615ef5cfa0d4

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp311-cp311-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.5-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 833.6 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.5-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 3e2f567b76bef3237092169450eb720076b450d169c71949e9bc2723ff4b723f
MD5 08291a63f48be9e30efe78502424df85
BLAKE2b-256 303d80899e8f211851671f9808da72eba50ddffecff0e50dd497b7ad445c1b1b

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp310-cp310-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 5abc9186cd7772447fab8a61be4f1972c1fc10a7e6ed190853e69cb9df42d282
MD5 711ef17ba9d5c5f8246f119211780e0e
BLAKE2b-256 02b8bca1b77b65de350f9b555c25b9d412aa53c472dd4efab8344136aeabac50

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp310-cp310-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.5-cp310-cp310-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.5-cp310-cp310-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 b77582bdd7c989c4cf91bd3a9b43d9c026203c6886da999e7b319d958d2dda8b
MD5 d0b42996b39125999a5acb10f78d097e
BLAKE2b-256 c6b3c71f357227d8fd81555d41a7f68dd58cb7f6c5c38436a4f639f7c98f339f

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.5-cp310-cp310-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page