Skip to main content

A fast Rust-powered tool for extracting data from Excel forms into JSON

Project description

sheetio -- Fast Rust-Powered Excel Form Data Extraction

PyPI version CI Python 3.10+ License: MIT

Extract structured data from Excel forms (.xlsx, .xlsm) into JSON. Built in Rust with Python bindings via PyO3 for fast parallel processing of hundreds of files. Designed for standardized Excel forms that don't fit the typical CSV format -- government reports, engineering forms, financial templates, and survey spreadsheets.

Why sheetio?

  • Fast -- Rust core with async parallel file processing. Process hundreds of Excel forms in seconds.
  • Flexible -- Three extraction modes (single cells, row patterns, dataframes) handle any form layout.
  • Simple -- One function, JSON config, JSON output. No complex API to learn.
  • Robust -- Handles missing data, duplicate keys, wildcard sheet matching, and composite identifiers.

Quick Start

pip install sheetio
import sheetio
import json

files = ["form_001.xlsx", "form_002.xlsx", "form_003.xlsx"]

config = [
    {
        "sheets": ["Sheet1"],
        "extractions": [
            {
                "function": "single_cells",
                "label": "header",
                "instructions": {
                    "title": "b2",
                    "date": "d4",
                    "author": "b6"
                }
            },
            {
                "function": "multirow_patterns",
                "label": "items",
                "instructions": {
                    "row_range": [10, 100],
                    "unique_id": "A",
                    "stop_if_empty": "A",
                    "columns": {
                        "ID": "A",
                        "Description": "B",
                        "Value": "C"
                    }
                }
            }
        ]
    }
]

result = json.loads(sheetio.excel_extract(files, config, 5))

Config Builder

Use ExtractionConfig to build configs iteratively with method chaining, instead of writing raw dicts:

from sheetio import ExtractionConfig

config = ExtractionConfig()

# Add extractions with a fluent API
config.add_sheets(["Sheet1"]) \
    .single_cells("header", title="B2", date="D4", author="B6") \
    .multirow("items",
        row_range=(10, 100),
        unique_id="A",
        stop_if_empty="A",
        columns={"ID": "A", "Description": "B", "Value": "C"})

# Extract directly — returns parsed Python dicts (no json.loads needed)
result = config.extract(["form_001.xlsx", "form_002.xlsx"], workers=5)

# Save / load configs as JSON files
config.to_json("my_config.json")
config = ExtractionConfig.from_json("my_config.json")

# Inspect what you've built
config.summary()

# Or get the raw config list for manual use
raw = config.build()

All three extraction types are supported: .single_cells(), .multirow(), .dataframe(). See the Extraction Types Reference below for all available options.

Key Features

Feature Description
Parallel processing Process multiple files simultaneously with configurable worker count
Wildcard sheets Match sheets by pattern: "School_*" matches School_A, School_B, etc.
Composite keys Combine multiple columns as unique identifiers: ["Project", "Year"]
Gap tolerance Continue extraction through empty rows with stop_consecutive
Duplicate handling Automatic _1, _2 suffixes for duplicate keys
Multi-column merge Extract arrays from multiple columns per field: ["X", "Y", "Z"]
Multi-row headers Concatenate header rows for dataframe extraction

Requirements

  • Python 3.10 or higher
  • Supported platforms: Windows, macOS, Linux

Extraction Types Reference

Configuration Structure

The extraction_details parameter is a list of dictionaries that define the extraction rules for each Excel sheet. Each dictionary contains:

  • sheets: A list of sheet names to extract data from. Accepts patterns with *. Example: "School_*" will loop through sheets like School_A, School_B, etc.
  • skip_sheets: An optional list of sheet names to skip. Can be useful when using patterns in the list of sheets.
  • extractions: A list of extraction rules that will be applied to the sheets listed.

Each extraction rule contains:

  • function: Type of extraction function: single_cells, multirow_patterns, or dataframe.
  • label: Optional key string to store results under. If not specified the extracted key value pairs will be stored directly under the sheet name.
  • break_if_null: An optional check to skip sheet if specified cell is null.
  • instructions: Instructions for the extraction function. See details for each function type below.

Single Cells Extraction

Extracts individual cells from the Excel sheet.

Instructions:

  • instructions: A dictionary where the keys are the reference name and the values are the cell references (e.g., "a1", "b2").

Example:

{
    "sheets": ["Sheet1"],
    "extractions": [
        {
            "function": "single_cells",
            "label": "single",
            "break_if_null": "c3",
            "instructions": {
                "Value 1": "a1",
                "Value 2": "b2",
                "Value 3": "c3",
                "Date": "d4",
                "Datetime": "e5"
            }
        }
    ]
}

Multirow Patterns Extraction

Extracts data from multiple rows based on a pattern.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • unique_id (optional): The column(s) to use as a unique identifier. Can be either:
    • A single column as a string: "B"
    • Multiple columns as an array: ["B", "C"] for composite keys
    • When using composite keys, if ANY column contains null/empty values, the row is skipped
    • If omitted: Results are returned as an array/list instead of a dictionary
  • unique_id_separator (optional): The separator to use when joining multiple columns for composite keys. Defaults to "_".
  • columns: A dictionary where the keys are the column names and the values are the column letters (e.g., "B", "C").
  • stop_if_empty (optional): Controls when to stop processing rows. Can be:
    • A column string: "A" - Stop when this column is empty
    • An array of columns: ["A", "B"] - Stop when ALL specified columns are empty
    • The string "row" - Stop when the entire data row is empty
    • An object with detailed configuration:
      {"column": "A", "consecutive": 2}
      
      or
      {"mode": "row", "consecutive": 1}
      
  • stop_consecutive (optional): Used with simple stop_if_empty syntax to specify how many consecutive empty rows trigger a stop. Defaults to 1.

Example with single unique_id:

{
    "sheets": ["Sheet 1", "Sheet 2"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "deposits",
            "instructions": {
                "row_range": [1, 10],
                "unique_id": "B",
                "columns": {
                    "Title": "B",
                    "Description": "C",
                    "Estimate": "D",
                    "Chance": "E",
                }
            }
        }
    ]
}

Example with composite unique_id:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "projects",
            "instructions": {
                "row_range": [1, 50],
                "unique_id": ["B", "C"],
                "unique_id_separator": "-",
                "columns": {
                    "Project": "B",
                    "Year": "C",
                    "Budget": "D",
                    "Status": "E"
                }
            }
        }
    ]
}

Example without unique_id (returns array/list):

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "items",
            "instructions": {
                "row_range": [1, 1000],
                "stop_if_empty": "A",
                "columns": {
                    "Name": "A",
                    "Value": "B",
                    "Description": "C"
                }
            }
        }
    ]
}
# Returns: {"items": [{"Name": "...", "Value": ...}, {"Name": "...", "Value": ...}]}

Example with stop_if_empty and gap tolerance:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "data",
            "instructions": {
                "row_range": [1, 100],
                "stop_if_empty": "A",
                "stop_consecutive": 3,
                "columns": {
                    "ID": "A",
                    "Value": "B"
                }
            }
        }
    ]
}

Example with row-based empty detection:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "records",
            "instructions": {
                "row_range": [1, 500],
                "stop_if_empty": {
                    "mode": "row",
                    "consecutive": 2
                },
                "columns": {
                    "Field1": "A",
                    "Field2": "B",
                    "Field3": "C"
                }
            }
        }
    ]
}

Example with multiple column monitoring:

{
    "sheets": ["Sheet 1"],
    "extractions": [
        {
            "function": "multirow_patterns",
            "label": "transactions",
            "instructions": {
                "row_range": [1, 1000],
                "unique_id": "A",
                "stop_if_empty": ["A", "B"],
                "columns": {
                    "ID": "A",
                    "Date": "B",
                    "Amount": "C"
                }
            }
        }
    ]
}

Dataframe Extraction

Extracts tabular data with headers, returning JSON that can easily be converted to a Pandas DataFrame.

Instructions:

  • row_range: A list of two integers defining the row range to extract.
  • column_range: A list of column letters to extract.
  • header_row: A list of row numbers to use as the header.
  • separator: Optional separator to use when combining header cells (default " ").

Example:

{
    "sheets": ["School_*"],
    "extractions": [
        {
            "function": "dataframe",
            "label": "DataFrame",
            "instructions": {
                "row_range": [5, 15],
                "column_range": ["B", "F"],
                "header_row": [2, 3, 4],
                "separator": " ",
            }
        }
    ]
}

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

# Development setup
pip install maturin pytest openpyxl ruff
maturin develop --release
make lint   # check formatting
make test   # run tests

License

sheetio is released under the MIT License. See the LICENSE file for more details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

sheetio-0.3.6-cp313-cp313-win_amd64.whl (833.8 kB view details)

Uploaded CPython 3.13Windows x86-64

sheetio-0.3.6-cp313-cp313-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.34+ x86-64

sheetio-0.3.6-cp313-cp313-macosx_10_12_x86_64.whl (955.8 kB view details)

Uploaded CPython 3.13macOS 10.12+ x86-64

sheetio-0.3.6-cp312-cp312-win_amd64.whl (833.8 kB view details)

Uploaded CPython 3.12Windows x86-64

sheetio-0.3.6-cp312-cp312-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.34+ x86-64

sheetio-0.3.6-cp312-cp312-macosx_10_12_x86_64.whl (955.6 kB view details)

Uploaded CPython 3.12macOS 10.12+ x86-64

sheetio-0.3.6-cp311-cp311-win_amd64.whl (833.6 kB view details)

Uploaded CPython 3.11Windows x86-64

sheetio-0.3.6-cp311-cp311-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.34+ x86-64

sheetio-0.3.6-cp311-cp311-macosx_10_12_x86_64.whl (957.2 kB view details)

Uploaded CPython 3.11macOS 10.12+ x86-64

sheetio-0.3.6-cp310-cp310-win_amd64.whl (833.5 kB view details)

Uploaded CPython 3.10Windows x86-64

sheetio-0.3.6-cp310-cp310-manylinux_2_34_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.34+ x86-64

sheetio-0.3.6-cp310-cp310-macosx_10_12_x86_64.whl (957.4 kB view details)

Uploaded CPython 3.10macOS 10.12+ x86-64

File details

Details for the file sheetio-0.3.6-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.6-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 833.8 kB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.6-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 8f3ef76e46fe2e30eb2d85054977a774fd7d144bb4da04a40dad0f2b3774f7c9
MD5 16cdaf795109870213ccc27165984469
BLAKE2b-256 4d9dfccc77f407382728c228ccaa62067a8699eb783b080b6c0b1de106df6702

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp313-cp313-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp313-cp313-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp313-cp313-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 2dbc43a176c3db2727f1d4594b9414bddb2aa1435fe15b13d8ed450151e441e8
MD5 ff498647d475001e2050e09198927288
BLAKE2b-256 a2f13fbbcf132df0742eb957ca8226e64014857159213b817debed9f787ca729

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp313-cp313-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp313-cp313-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 22f17119cf3a996a3ac4497402e915d1171c1ffe8f5ce4cccb53198a90ee971d
MD5 c9731cb406bda3a5441a3e3c32f66959
BLAKE2b-256 bef4e4877d096dac670453f4eeb0b8971aced5836654d18444539713dfad74c5

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp313-cp313-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.6-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 833.8 kB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.6-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 2b3ba3cfba0040007987a047398e57dbda9d9e8133170a93aaae67cf6b54957b
MD5 68f328e5b6e482ce1788bd369f33faa0
BLAKE2b-256 e8ea6d87b014f8b684ac61eccd3dd769102675b5f7bb6a327ec7be7b5d3d1e04

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp312-cp312-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp312-cp312-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp312-cp312-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 5539a6f916e2947fd2475c2bbdbf29de4d3f18c037ae2b1fbd57df7652335d89
MD5 ad65a9909852e5cb4498d5e4e192e600
BLAKE2b-256 d691f9b1256b24ce1a65cf088204bcba2d8d0e96c70f170e29ef18ae4a62e3e9

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp312-cp312-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp312-cp312-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 c6ee010a3460a8d37aa25bb6b7935d33157eddd7152dc4d38f77ab8962249435
MD5 bcc0df4f2a6c2e4388954b104de17d90
BLAKE2b-256 305c694493f57cdb4661dc768e4286908e4170685c56878b01b1a43eeb855dc7

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp312-cp312-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.6-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 833.6 kB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.6-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 e9d1a7ffd9fd001245f783a9865b45f560338e9c6ac42e5c27d7ff45ec92ac7e
MD5 c8bd748f5f9e81f80862aadf08a8ed1a
BLAKE2b-256 6bedac9fa051049dbdf310ddc7b437a1b88303ffc4e1301520e79a7a338015ba

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp311-cp311-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp311-cp311-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp311-cp311-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 37479eea9d12c20ea5da516874dea9ad519aed6aa7f18add55621b6a39b1bea6
MD5 08fe1e4926050edc6eb6c5dd9ed309cd
BLAKE2b-256 2c18b9ca77df240373a62293867df919a152eb9993e52ea936ceaabfb70ae47f

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp311-cp311-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp311-cp311-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 923591fc67237ecd52f292d253c01bdfa444499697250bd34e8c67674b806671
MD5 2c4f363d5c012c5f0fcb633a307daadb
BLAKE2b-256 23f95304d8353bf12efda6379d9c896790226dfdfa21ca44361d0c8f642607ed

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp311-cp311-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: sheetio-0.3.6-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 833.5 kB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for sheetio-0.3.6-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 ad4a17b3b4e1cc1cf39033eb47672bd164ff3e43c7d059f9ed59bb64ac1d574c
MD5 4ad5953869c91b4bb44c4dd6ede174ad
BLAKE2b-256 52e43cf07ceeaf22d57b5521a4c9f4c54f299a4817c450af4984905f55ccab94

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp310-cp310-win_amd64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp310-cp310-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp310-cp310-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 7dbd57ccfb897de12808e928781040fc2f2aabb9aac973a11ca0cded0eb8adf2
MD5 8d46481a7861dafbc8510e197793fa36
BLAKE2b-256 40668e8322e3cfd154ec757a9e9c694f18b89cdd3a1d7a02dbfe4cb01a6bbb72

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp310-cp310-manylinux_2_34_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file sheetio-0.3.6-cp310-cp310-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for sheetio-0.3.6-cp310-cp310-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 532067fbfa3d14398002cc8da3fcb7e7833f5fd8ff14df1ed00bf01e39893087
MD5 80edd974c96a67aeb6fb2d727843deb7
BLAKE2b-256 6eae85cb58f89e0df1be508f90f323c24a420b49913a73d33d137fdd2d9d84e9

See more details on using hashes here.

Provenance

The following attestation bundles were made for sheetio-0.3.6-cp310-cp310-macosx_10_12_x86_64.whl:

Publisher: build_wheels.yml on kkollsga/sheetio

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page