Skip to main content

Declarative Excel-to-JSON extraction and validation. Apply a compact YAML template to .xlsx files to produce canonical JSON and structured per-cell validation errors.

Project description

crease

Declarative Excel-to-JSON extraction and validation. Apply a compact YAML template to an .xlsx file, get canonical JSON out + structured per-cell errors. No spreadsheet-specific code in your pipeline.

📖 Docs: dev360.github.io/crease

pip install crease

Status: alpha. API may shift before 1.0. See ROADMAP.md.


Why this exists

Excel parsing has well-documented failure modes that quietly cost money and time. Crease is designed to make these visible and structured rather than silent:

Failure mode Cost when it happens How Crease handles it
Excel autoconverts SEPT2 (gene) to 2-Sep (date) — affects ~20% of genomics papers Wrong data downstream, no warning treat_as_text on the field; validator emits wrong_type with likely_cause: excel_autoconvert
Public Health England loses 15,706 COVID cases to silent row-overflow 8-day contact-tracing gap during a pandemic Per-template min_data_rows + column_count_mismatch detection; nothing fails silently
JPMorgan loses $6B because a VaR model required manual copy-paste between sheets The whole loss Canonical JSON flows from xlsx to downstream pipelines — the copy step disappears
Operator gets #N/A from VLOOKUP because "Acme Corp" had a trailing space Hours of debugging Always-on header normalization; per-field normalize: trim
N/A, TBD, - in cells trigger wrong_type everywhere False positives bury real issues Layered null_tokens — library defaults handle the common ones, templates and fields tighten or loosen
Headers move down a row when the customer adds a title line Every subsequent file fails locate.header_anchor: "Order ID" instead of header_row: 3
Operator hides "soft-deleted" rows; downstream consumers still process them Cf. Lehman/Barclays' 179 unwanted trading contracts (2008) locate.skip_hidden_rows: true

The full catalog of patterns and supporting sources lives in ANECDOTES.md. The design philosophy: fail loudly with row and field coordinates rather than swallowing the failure into the canonical output.


Quick start

A template describes where the data lives and what the fields mean. The same template drives both extraction (cells → canonical JSON) and validation (constraints → structured errors).

The API splits into three composable steps:

import crease

template = crease.Template.load("templates/orders.crease.yml")

# 1. Extract — turn cells into canonical JSON
result = crease.extract("incoming.xlsx", template)
result.canonical["orders"][0]
# {"order_id": "ORD-1001", "customer_email": "a@acme.com",
#  "order_date": "2025-01-15", "quantity": 10, "unit_price": 25.50}

# 2. Validate — independent inspection step
report = crease.validate(result, template)
report.is_valid              # bool — true iff zero errors
report.errors()              # list[Error] — pydantic-shaped

# Or do extract + validate together
result, report = crease.check("incoming.xlsx", template)

Template paths are plain relative paths resolved from your working directory — no implicit "same folder as the xlsx" convention.

The template that produced the output above (note: pattern:, minimum: etc. are both coercion hints for extraction and constraints for validation):

# templates/orders.crease.yml
template_id: orders
description: Order export from acme.

entities:
  - name: order
    cardinality: many
    locate:
      tab: Orders
      orientation: flat
      header_row: 0
    fields:
      - { name: order_id,       source_column: order_id,       type: string,  pattern: ^ORD-\d{4}$ }
      - { name: customer_email, source_column: customer_email, type: email }
      - { name: order_date,     source_column: order_date,     type: date }
      - { name: quantity,       source_column: quantity,       type: integer, minimum: 1 }
      - { name: unit_price,     source_column: unit_price,     type: number,  minimum: 0 }

Getting your data out

result.canonical is a plain dict — no extra dependencies, no opinions. When you want something richer, opt in:

# Iterate as dicts
for order in result.iter("order"):
    pipeline.send(order)

# Project into a Pydantic model. Field matching is opportunistic by attribute
# name: fields the model doesn't declare are dropped silently; type mismatches
# raise crease.ValidationError.
from pydantic import BaseModel

class Order(BaseModel):
    order_id: str
    quantity: int      # the model can be a subset of the template's fields

orders: list[Order] = result.to_pydantic("order", model=Order)

# Project into a pandas DataFrame
df = result.to_pandas("order")

By default, every projection method halts if extraction produced any errors — the library's whole pitch is "fail loudly with coordinates." To opportunistically recover and keep the rows that did map cleanly:

orders = result.to_pydantic("order", model=Order, allow_partial=True)
# rows that didn't validate are absent from `orders`.
# they're listed in result.report.errors() with row/field coordinates.

For cardinality: one entities, use result.get("company") / result.get("company", model=Company) instead — iteration over a single record is a category error.


Streaming large files

For multi-hundred-thousand-row files, stream instead of materializing. Streaming takes the same model= and allow_partial= arguments as the materialized projections, so the shape stays symmetric:

# Yields dicts
for order in crease.stream("big.xlsx", template, entity="order"):
    pipeline.send(order)

# Yields validated Pydantic instances
for order in crease.stream("big.xlsx", template, entity="order", model=Order):
    pipeline.send(order)

Memory stays bounded (~10MB) regardless of file size. Errors accumulate on the session report rather than being yielded inline — the iterator returns the happy path; the report owns the sad path.


Multi-entity files

When one file has multiple shapes (cover sheet + per-region data tabs + totals), declare each as its own entity:

entities:
  - name: company                       # one record from the cover tab
    cardinality: one
    locate: { tab: Cover, orientation: property_sheet, label_col: 0, value_col: 1 }
    fields:
      - { name: company_name,  source_label: Company,       type: string }
      - { name: period,        source_label: Period,        type: string, pattern: ^Q[1-4]\s\d{4}$ }
      - { name: contact_email, source_label: Contact,       type: email }

  - name: order                         # many records from every "Region - X" tab
    cardinality: many
    locate:
      tab_pattern: ^Region - (.+)$
      orientation: flat
      header_row: 3
    fields:
      - { name: order_id, source_column: Order ID, type: string,  pattern: ^ORD-\d{4}$ }
      - { name: customer, source_column: Customer, type: string }
      - { name: total,    source_column: Total,    type: number,  minimum: 0 }
    enrich:
      - { field: region, source: tab_name_regex_group, group: 1 }

ignore_tabs: [Notes]

Use a session when you want both eager and streaming reads against the same file:

with crease.open("incoming.xlsx", template) as session:
    company = session.get("company")                     # cardinality: one (eager)
    for order in session.stream("order", model=Order):   # cardinality: many (streaming)
        pipeline.send({**order.model_dump(), "_company": company["company_name"]})

    if not session.report().is_valid:
        log.warning(session.report().errors())

Scattered metadata (anchored layout)

Some cover sheets sprinkle properties at irregular positions. Anchor each field by the label text near it:

entities:
  - name: report
    cardinality: one
    locate: { tab: Cover, orientation: anchored }
    fields:
      - name: period
        type: string
        anchor: { label_match: "Reporting Period", value_at: right, offset: 1 }
      - name: contact_email
        type: email
        anchor: { label_match: "Contact", value_at: right, offset: 1 }
      - name: submitted_on
        type: date
        anchor: { label_match: "Date sent", value_at: right, offset: 1 }

Survives the customer adding or removing rows between properties.


Field types and constraints

Type Notes
string Free text. Add pattern: for regex enforcement
integer Coerced from int or float-with-no-fractional
number int or float
boolean Customize with true_values: [Yes, Y, 1], false_values: [No, N, 0]
date Use date_format: "%m/%d/%Y" for ambiguous formats
datetime Same
email Built-in regex
uuid Built-in regex
url Built-in regex

Per-field options:

fields:
  - name: customer_email
    source_column: Email
    type: email
    nullable: true                              # blanks allowed
    null_tokens: [N/A, TBD, "-"]                # also treat these strings as null
    normalize: trim                             # trim | lower | trim_lower

null_tokens is layered: library defaults (N/A, TBD, -, , (blank), n/a, NaN) → template-level → field-level. Override any layer, including setting null_tokens: [] to disable.


CLI

# Extract to JSON
crease extract incoming.xlsx --template templates/orders.crease.yml > out.json

# Validate only (exit 0 if valid, 1 if cell-level errors, 2 if structural; tune with --fail-on)
crease validate incoming.xlsx --template templates/orders.crease.yml

# Extract + validate together
crease check incoming.xlsx --template templates/orders.crease.yml --json

# Stream a single entity to JSONL (true streaming, low memory)
crease stream incoming.xlsx --template templates/orders.crease.yml --entity order > orders.jsonl

# Batch over a folder — emits per-file JSON outputs plus an error report
crease batch ./inbox/ --template templates/orders.crease.yml \
  --out ./extracted/ --report ./report.csv

# Run the test corpus (developer command)
crease test test_cases/

Installation

pip install crease                  # core: extract + validate, returns dicts
pip install crease[pandas]          # adds result.to_pandas() (also pulls Pandera)

Core deps: openpyxl, pydantic, pyyaml, python-calamine. Pandas and Pandera are optional extras — if you only use extract and to_pydantic, you don't pay for pandas. No LLM, no network calls at runtime.

Local development

The repo uses Poetry and the src/ layout.

# 1. Clone
git clone git@github.com:dev360/crease.git
cd crease

# 2. Install (core + extras + test deps) into a Poetry-managed venv
poetry install --with test --all-extras

# 3. Run the corpus
poetry run pytest

# 4. Optional: build the docs site locally
poetry run mkdocs serve     # http://localhost:8000

# 5. Hook up pre-commit (runs ruff + conventional-commit on every commit)
poetry run pre-commit install
poetry run pre-commit run --all-files

If you're not using Poetry, plain pip works fine against the venv of your choice:

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[pandas]"           # editable install with the pandas extra
pip install pytest faker             # test deps
pytest

Template authoring (by hand, by an LLM tool you build, by import from another schema language) is out of scope for this library.


Errors and validation

Errors are pydantic-shaped — the same vocabulary anyone using Pydantic already knows. Every constraint declared on a field is enforced at validation time, with row and field coordinates attached.

report = crease.validate(result, template)

report.is_valid                   # bool — true iff zero errors
report.error_count()              # int
report.errors()                   # list[Error]

err = report.errors()[0]
err.type        # "wrong_type" — stable machine code, safe to route on
err.loc         # ("order", 47, "customer_email") — (entity, row, field)
err.msg         # human-readable
err.input       # the offending value
err.ctx         # extra context, e.g. {"likely_cause": "excel_autoconvert"}
err.severity    # "cell" | "structural"

The halt-by-default projection methods raise crease.ValidationError, which carries the same data:

try:
    orders = result.to_pydantic("order", model=Order)
except crease.ValidationError as e:
    e.errors()         # same list as report.errors() would have produced
    e.error_count()

Severity

Severity Meaning What you typically do
structural The template can't even map the file (missing tab, header mapping failed, column count mismatch). Bounce back to sender — the file is unusable as-is.
cell Per-row problem (missing value, wrong type, constraint violation). Send to a human review queue with bad rows highlighted, or recover with allow_partial=True.

Error type codes

Cell-level (severity: "cell"):

error.type Triggers when
missing_required A non-nullable field has a blank value (after null_tokens collapse)
wrong_type Value can't coerce to the declared type. Includes ctx.likely_cause: excel_autoconvert when applicable
pattern_mismatch String doesn't match pattern:
enum_violation Value not in declared enum:
below_minimum, above_maximum Numeric range violation
empty_row Mid-data blank row
duplicate_row Row identical to a previous one
anchor_not_found Anchored field's label text not present in tab
boolean_alias_unknown Value didn't match true_values/false_values
model_field_missing_in_canonical A Pydantic model passed to to_pydantic requires a field the template doesn't produce
model_type_mismatch A Pydantic model's field type doesn't match the canonical value's type

Structural (severity: "structural"):

error.type Triggers when
missing_tab Template's tab: doesn't exist
tab_pattern_no_match tab_pattern: matched zero tabs
column_count_mismatch Header row has wrong number of columns
header_mapping_failed source_column/source_label not found
entity_missing Locate found nothing
multiple_rows_for_cardinality_one cardinality: one entity found >1 row

Documentation

License

BSD 3-Clause. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

crease-1.0.0.tar.gz (33.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

crease-1.0.0-py3-none-any.whl (31.8 kB view details)

Uploaded Python 3

File details

Details for the file crease-1.0.0.tar.gz.

File metadata

  • Download URL: crease-1.0.0.tar.gz
  • Upload date:
  • Size: 33.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.0 CPython/3.12.13 Linux/6.17.0-1013-azure

File hashes

Hashes for crease-1.0.0.tar.gz
Algorithm Hash digest
SHA256 694defe3ce683f2ea78a393506258595451f05b481fa89d3b6c06281d98645f5
MD5 af4536341cc12ec5605fdb52267902d8
BLAKE2b-256 c6d669a0c8f5271b4bbd6cb4c90ede40bde92bdc8c49b0be57c0cffe55fe8d94

See more details on using hashes here.

File details

Details for the file crease-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: crease-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 31.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.0 CPython/3.12.13 Linux/6.17.0-1013-azure

File hashes

Hashes for crease-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 655fdb3861ae43c4689161752cd3164a897b758a597d1e8b5f66874baf32bde9
MD5 0d5bc32d970d6c8181f37c41779eaf94
BLAKE2b-256 fc10c1ff77a0e3f5e00d582ff4f4dd07bb7e3f0c27dc94b592295cc4f43365d0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page