Declarative Excel-to-JSON extraction and validation. Apply a compact YAML template to .xlsx files to produce canonical JSON and structured per-cell validation errors.
Project description
crease
Declarative Excel-to-JSON extraction and validation. Apply a compact YAML
template to a spreadsheet file — .xlsx, .xls, .xlsb, or .ods — get
canonical JSON out plus structured per-cell errors. No spreadsheet-specific
code in your pipeline.
📖 Docs: dev360.github.io/crease
pip install crease
Status: 1.0 — published to PyPI as
crease. API surface is stable; breaking changes will be markedfeat!:and bump the major version. See ROADMAP.md for what's next.
Why this exists
Excel parsing has well-documented failure modes that quietly cost money and time. Crease is designed to make these visible and structured rather than silent:
| Failure mode | Cost when it happens | How Crease handles it |
|---|---|---|
Excel autoconverts SEPT2 (gene) to 2-Sep (date) — affects ~20% of genomics papers |
Wrong data downstream, no warning | treat_as_text on the field; validator emits wrong_type with likely_cause: excel_autoconvert |
| Public Health England loses 15,706 COVID cases to silent row-overflow | 8-day contact-tracing gap during a pandemic | Per-template min_data_rows + column_count_mismatch detection; nothing fails silently |
| JPMorgan loses $6B because a VaR model required manual copy-paste between sheets | The whole loss | Canonical JSON flows from xlsx to downstream pipelines — the copy step disappears |
Operator gets #N/A from VLOOKUP because "Acme Corp" had a trailing space |
Hours of debugging | Always-on header normalization; per-field normalize: trim |
N/A, TBD, - in cells trigger wrong_type everywhere |
False positives bury real issues | Layered null_tokens — library defaults handle the common ones, templates and fields tighten or loosen |
| Headers move down a row when the customer adds a title line | Every subsequent file fails | locate.header_anchor: "Order ID" instead of header_row: 3 |
| Operator hides "soft-deleted" rows; downstream consumers still process them | Cf. Lehman/Barclays' 179 unwanted trading contracts (2008) | locate.skip_hidden_rows: true |
Customer sends .xls / .xlsb / .ods instead of .xlsx |
Pipeline rejects the file; manual re-export | All four formats read out of the box — calamine handles the legacy ones |
The full catalog of patterns and supporting sources lives in ANECDOTES.md. The design philosophy: fail loudly with row and field coordinates rather than swallowing the failure into the canonical output.
Quick start
A template describes where the data lives and what the fields mean. The same template drives both extraction (cells → canonical JSON) and validation (constraints → structured errors).
The API splits into three composable steps:
import crease
template = crease.Template.load("templates/orders.crease.yml")
# 1. Extract — turn cells into canonical JSON
result = crease.extract("incoming.xlsx", template)
result.canonical["orders"][0]
# {"order_id": "ORD-1001", "customer_email": "a@acme.com",
# "order_date": "2025-01-15", "quantity": 10, "unit_price": 25.50}
# 2. Validate — independent inspection step
report = crease.validate(result, template)
report.is_valid # bool — true iff zero errors
report.errors() # list[Error] — pydantic-shaped
# Or do extract + validate together
result, report = crease.check("incoming.xlsx", template)
Template paths are plain relative paths resolved from your working directory — no implicit "same folder as the xlsx" convention.
The template that produced the output above (note: pattern:, minimum: etc.
are both coercion hints for extraction and constraints for validation):
# templates/orders.crease.yml
template_id: orders
description: Order export from acme.
entities:
- name: order
cardinality: many
locate:
tab: Orders
orientation: flat
header_row: 0
fields:
- { name: order_id, source_column: order_id, type: string, pattern: ^ORD-\d{4}$ }
- { name: customer_email, source_column: customer_email, type: email }
- { name: order_date, source_column: order_date, type: date }
- { name: quantity, source_column: quantity, type: integer, minimum: 1 }
- { name: unit_price, source_column: unit_price, type: number, minimum: 0 }
Getting your data out
result.canonical is a plain dict — no extra dependencies, no opinions. When
you want something richer, opt in:
# Iterate as dicts
for order in result.iter("order"):
pipeline.send(order)
# Project into a Pydantic model. Field matching is opportunistic by attribute
# name: fields the model doesn't declare are dropped silently; type mismatches
# raise crease.ValidationError.
from pydantic import BaseModel
class Order(BaseModel):
order_id: str
quantity: int # the model can be a subset of the template's fields
orders: list[Order] = result.to_pydantic("order", model=Order)
# Project into a pandas DataFrame
df = result.to_pandas("order")
By default, every projection method halts if extraction produced any errors — the library's whole pitch is "fail loudly with coordinates." To opportunistically recover and keep the rows that did map cleanly:
orders = result.to_pydantic("order", model=Order, allow_partial=True)
# rows that didn't validate are absent from `orders`.
# they're listed in result.report.errors() with row/field coordinates.
For cardinality: one entities, use result.get("company") /
result.get("company", model=Company) instead — iteration over a single
record is a category error.
Streaming large files
For multi-hundred-thousand-row files, stream instead of materializing.
Streaming takes the same model= and allow_partial= arguments as the
materialized projections, so the shape stays symmetric:
# Yields dicts
for order in crease.stream("big.xlsx", template, entity="order"):
pipeline.send(order)
# Yields validated Pydantic instances
for order in crease.stream("big.xlsx", template, entity="order", model=Order):
pipeline.send(order)
Memory stays bounded (~10MB) regardless of file size. Errors accumulate on the session report rather than being yielded inline — the iterator returns the happy path; the report owns the sad path.
Multi-entity files
When one file has multiple shapes (cover sheet + per-region data tabs + totals), declare each as its own entity:
entities:
- name: company # one record from the cover tab
cardinality: one
locate: { tab: Cover, orientation: property_sheet, label_col: 0, value_col: 1 }
fields:
- { name: company_name, source_label: Company, type: string }
- { name: period, source_label: Period, type: string, pattern: ^Q[1-4]\s\d{4}$ }
- { name: contact_email, source_label: Contact, type: email }
- name: order # many records from every "Region - X" tab
cardinality: many
locate:
tab_pattern: ^Region - (.+)$
orientation: flat
header_row: 3
fields:
- { name: order_id, source_column: Order ID, type: string, pattern: ^ORD-\d{4}$ }
- { name: customer, source_column: Customer, type: string }
- { name: total, source_column: Total, type: number, minimum: 0 }
enrich:
- { field: region, source: tab_name_regex_group, group: 1 }
ignore_tabs: [Notes]
Use a session when you want both eager and streaming reads against the same file:
with crease.open("incoming.xlsx", template) as session:
company = session.get("company") # cardinality: one (eager)
for order in session.stream("order", model=Order): # cardinality: many (streaming)
pipeline.send({**order.model_dump(), "_company": company["company_name"]})
if not session.report().is_valid:
log.warning(session.report().errors())
Repeating sections within one tab
Some reports pack multiple sub-sections into a single tab — a weekly
schedule with one sub-table per day, separated by a ===== row or a
recurring title. The blocks: grammar (template version: 2) lets you
declare the repeating region once, anchor each instance with start /
end patterns, and capture per-section metadata that gets merged onto
every row in that section:
template_id: weekly_orders
version: 2
blocks:
- name: daily_section
tab_pattern: ^W-\d+$
starts_at: { column: D, cell_pattern: ^ORDER SCHEDULE$ }
ends_at: { column: A, cell_pattern: ^={3,}$ }
captures:
- field: order_date
from: { column: D, cell_pattern: ^DAY (\d+-\d+-\d+)$, regex_group: 1 }
type: date
date_formats: ['%m-%d-%Y']
entities:
- name: order
block: daily_section # ← scope this entity to each block instance
cardinality: many
locate:
orientation: flat
header_anchor: { text: ORDER_ID, match_mode: exact }
fields:
- { name: order_id, source_column: ORDER_ID, type: string, pattern: ^ORD-\d{4}$ }
- { name: customer, source_column: CUSTOMER, type: string }
- { name: quantity, source_column: QUANTITY, type: integer, minimum: 1 }
Output is flat — order_date from each section's DAY-row is merged
onto every order row from that section. See
Repeating sections for the full grammar.
Scattered metadata (anchored layout)
Some cover sheets sprinkle properties at irregular positions. Anchor each field by the label text near it:
entities:
- name: report
cardinality: one
locate: { tab: Cover, orientation: anchored }
fields:
- name: period
type: string
anchor: { label_match: "Reporting Period", value_at: right, offset: 1 }
- name: contact_email
type: email
anchor: { label_match: "Contact", value_at: right, offset: 1 }
- name: submitted_on
type: date
anchor: { label_match: "Date sent", value_at: right, offset: 1 }
Survives the customer adding or removing rows between properties.
Field types and constraints
| Type | Notes |
|---|---|
string |
Free text. Add pattern: for regex enforcement |
integer |
Coerced from int or float-with-no-fractional |
number |
int or float |
boolean |
Customize with true_values: [Yes, Y, 1], false_values: [No, N, 0] |
date |
Use date_format: "%m/%d/%Y" for ambiguous formats |
datetime |
Same |
email |
Built-in regex |
uuid |
Built-in regex |
url |
Built-in regex |
Per-field options:
fields:
- name: customer_email
source_column: Email
type: email
nullable: true # blanks allowed
null_tokens: [N/A, TBD, "-"] # also treat these strings as null
normalize: trim # trim | lower | trim_lower
null_tokens is layered: library defaults (N/A, TBD, -, —, (blank),
n/a, NaN) → template-level → field-level. Override any layer, including
setting null_tokens: [] to disable.
CLI
# Extract to JSON
crease extract incoming.xlsx --template templates/orders.crease.yml > out.json
# Validate only (exit 0 if valid, 1 if cell-level errors, 2 if structural; tune with --fail-on)
crease validate incoming.xlsx --template templates/orders.crease.yml
# Extract + validate together
crease check incoming.xlsx --template templates/orders.crease.yml --json
# Stream a single entity to JSONL (true streaming, low memory)
crease stream incoming.xlsx --template templates/orders.crease.yml --entity order > orders.jsonl
# Batch over a folder — emits per-file JSON outputs plus an error report
crease batch ./inbox/ --template templates/orders.crease.yml \
--out ./extracted/ --report ./report.csv
# Run the test corpus (developer command)
crease test test_cases/
Installation
pip install crease # core: extract + validate, returns dicts
pip install crease[pandas] # adds result.to_pandas()
Core deps: python-calamine, openpyxl, pydantic, pyyaml. Pandas is an
optional extra — if you only use extract and to_pydantic,
you don't pay for pandas. No LLM, no network calls at runtime.
Read backends
Crease reads spreadsheets through two interchangeable backends:
| Backend | Formats | When it's used |
|---|---|---|
| calamine (default) | .xlsx, .xls, .xlsb, .ods |
Picked automatically. Fast (Rust under the hood) and GIL-releasing, so a ThreadPoolExecutor parallelizes multi-file reads. |
| openpyxl | .xlsx only |
Picked automatically when the template declares locate.skip_hidden_rows: true — only openpyxl exposes row-hidden cell metadata. |
Override the auto-selection with engine="calamine" or engine="openpyxl"
on extract, get, stream, check, and crease.open. Forcing calamine
on a skip_hidden_rows template emits a UserWarning and silently
degrades that feature to a no-op (calamine can't see the flag); use this
only when you're reading a non-xlsx file and have already verified hidden
rows aren't present.
Local development
The repo uses uv and the src/ layout.
# 1. Clone
git clone git@github.com:dev360/crease.git
cd crease
# 2. Install (core + extras + test deps) into a uv-managed venv
uv sync --all-extras --group test
# 3. Run the corpus
uv run pytest
# 4. Optional: build the docs site locally
uv run mkdocs serve # http://localhost:8000
# 5. Hook up pre-commit (runs ruff + conventional-commit on every commit)
uv run pre-commit install
uv run pre-commit run --all-files
If you're not using uv, plain pip works fine against the venv of
your choice. PEP 735 [dependency-groups] requires pip ≥ 25.1, so we
install the dev/test tools by name instead:
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[pandas]" # editable install with the pandas extra
pip install pytest faker pre-commit ruff # test + dev tools
pytest
Template authoring (by hand, by an LLM tool you build, by import from another schema language) is out of scope for this library.
Errors and validation
Errors are pydantic-shaped — the same vocabulary anyone using Pydantic already knows. Every constraint declared on a field is enforced at validation time, with row and field coordinates attached.
report = crease.validate(result, template)
report.is_valid # bool — true iff zero errors
report.error_count() # int
report.errors() # list[Error]
err = report.errors()[0]
err.type # "wrong_type" — stable machine code, safe to route on
err.loc # ("order", 47, "customer_email") — (entity, row, field)
err.msg # human-readable
err.input # the offending value
err.ctx # extra context, e.g. {"likely_cause": "excel_autoconvert"}
err.severity # "cell" | "structural"
The halt-by-default projection methods raise crease.ValidationError, which
carries the same data:
try:
orders = result.to_pydantic("order", model=Order)
except crease.ValidationError as e:
e.errors() # same list as report.errors() would have produced
e.error_count()
Severity
| Severity | Meaning | What you typically do |
|---|---|---|
structural |
The template can't even map the file (missing tab, header mapping failed, column count mismatch). | Bounce back to sender — the file is unusable as-is. |
cell |
Per-row problem (missing value, wrong type, constraint violation). | Send to a human review queue with bad rows highlighted, or recover with allow_partial=True. |
Error type codes
Cell-level (severity: "cell"):
error.type |
Triggers when |
|---|---|
missing_required |
A non-nullable field has a blank value (after null_tokens collapse) |
wrong_type |
Value can't coerce to the declared type. Includes ctx.likely_cause: excel_autoconvert when applicable |
pattern_mismatch |
String doesn't match pattern: |
enum_violation |
Value not in declared enum: |
below_minimum, above_maximum |
Numeric range violation |
empty_row |
Mid-data blank row |
duplicate_row |
Row identical to a previous one |
anchor_not_found |
Anchored field's label text not present in tab |
header_duplicated |
source_column matches multiple header cells in the same row; bind picked the first. Set source_column_index: on the field to choose a specific occurrence (0-indexed across the matches). |
header_above_nonblank |
The row immediately above header_row has non-blank text in a column that also has a header. Surfaces the case where the operator pointed at the bottom of a two-row header. The column geometry is in ctx.columns. |
boolean_alias_unknown |
Value didn't match true_values/false_values |
model_field_missing_in_canonical |
A Pydantic model passed to to_pydantic requires a field the template doesn't produce |
model_type_mismatch |
A Pydantic model's field type doesn't match the canonical value's type |
Structural (severity: "structural"):
error.type |
Triggers when |
|---|---|
missing_tab |
Template's tab: doesn't exist |
tab_pattern_no_match |
tab_pattern: matched zero tabs |
column_count_mismatch |
Header row has wrong number of columns |
header_mapping_failed |
source_column/source_label not found |
entity_missing |
Locate found nothing |
multiple_rows_for_cardinality_one |
cardinality: one entity found >1 row |
Documentation
ROADMAP.md— what's in v1, what's deferredCOVERAGE.md— layouts and validation errors supportedCONVENTIONS.md— Excel patterns we handle, with examplestest_cases/— labeled fixtures that double as the spec
License
BSD 3-Clause. See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file crease-1.6.0.tar.gz.
File metadata
- Download URL: crease-1.6.0.tar.gz
- Upload date:
- Size: 277.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.16 {"installer":{"name":"uv","version":"0.11.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
29cf1d0bd583bf286bdfa029ba0a6214eff8d53f4033cc47c253e7c5a1779df1
|
|
| MD5 |
cb78c8faa4b43c58fd4c77f0a4e27c15
|
|
| BLAKE2b-256 |
88fedfbe9f268027ace8b6cd72140f28d7a0d14ac7f409abd761586e0537c8a2
|
File details
Details for the file crease-1.6.0-py3-none-any.whl.
File metadata
- Download URL: crease-1.6.0-py3-none-any.whl
- Upload date:
- Size: 45.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.16 {"installer":{"name":"uv","version":"0.11.16","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
dc315d2fae980c869ed05bb22a3c15c6c23d9d7b058ada38b18e410d74668d79
|
|
| MD5 |
71156129a8323677159be1a0d7b152fb
|
|
| BLAKE2b-256 |
1b66dc52ebd887867622466ff01a78d4f4e606ab1f3028cabd1c5932ba2ac0cd
|