Composable, expression-first preprocessing for polars DataFrames: small atomic transforms, declarative recipes, and one function that composes them onto a column.

Maintainers

andrewjordan3

Project description

framesmith

A preprocessing library for cleaning messy data in polars DataFrames. Composable atomic transforms, declarative recipes, expression-first design.

Real-world data arrives dirty: smart quotes, currency symbols, accounting parentheses for negatives, mainframe trailing-minus signs, fullwidth digits, inconsistent whitespace, and placeholder strings that quietly mean "missing". framesmith cleans it up and replaces the one-off cleaning code that scatters across notebooks and silently disagrees about edge cases. It gives you three things: small, single-purpose transforms; named recipes that bundle them into reusable pipelines as plain data; and one composition function that turns a column name and a recipe into a polars expression. Every transform returns a pl.Expr and never mutates your frame. You apply it with df.with_columns(...) or df.filter(...), so the same code runs eagerly or lazily.

Quick example

import polars as pl
import framesmith as fs

raw = pl.DataFrame({
    'customer_name': ['  ACME® Corp  ', "O'Brien & Co.", '   '],
    'amount':        ['($1,234.56)',    '$2,500-',       'N/A'],
})

cleaned = raw.with_columns(
    fs.compose_column('customer_name', fs.TEXT_NORMALIZE),
    fs.compose_column('amount',        fs.NUMERIC_STRING_TO_FLOAT),
)
print(cleaned)
# shape: (3, 2)
# ┌───────────────┬──────────┐
# │ customer_name ┆ amount   │
# │ ---           ┆ ---      │
# │ str           ┆ f64      │
# ╞═══════════════╪══════════╡
# │ ACME Corp     ┆ -1234.56 │
# │ OBrien and Co ┆ -2500.0  │
# │ null          ┆ null     │
# └───────────────┴──────────┘

Recipes are plain tuples of type tuple[ExpressionTransform, ...], so you extend one by splicing it into a new tuple:

from framesmith.transforms import to_snake_case

normalize_and_snake = (*fs.TEXT_NORMALIZE, to_snake_case)
df_snake = raw.with_columns(
    fs.compose_column('customer_name', normalize_and_snake)
)
# 'OBrien and Co' becomes 'obrien_and_co', etc.
# (this exact pipeline also ships ready-made as fs.TEXT_NORMALIZE_TO_SNAKE_CASE)

Installation

pip install framesmith

Or install from source:

git clone https://github.com/andrewjordan3/framesmith.git
cd framesmith
uv sync --group dev   # or: pip install -e '.[dev]'

framesmith's only runtime dependency is polars.

Key concepts

The library is organized into three tiers plus two supporting patterns.

Transforms

A transform is a pure pl.Expr → pl.Expr function. Each does exactly one thing — collapse_whitespace collapses interior whitespace runs; strip_whitespace trims the ends; normalize_unicode_nfkc applies NFKC. Transforms never call pl.col(...) themselves and never call .alias(...); the composition layer owns those boundaries, so the same transform composes into any pipeline without ceremony.

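import polars as pl
from framesmith import compose_column
from framesmith.transforms import collapse_whitespace

# Illustrative frame; any string column works the same way.
df = pl.DataFrame({'description': ['two   spaces']})
df.with_columns(compose_column('description', [collapse_whitespace]))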

The full set of transforms lives in framesmith.transforms — see the reference for every one, grouped by domain.

Recipes

A recipe is an ordered tuple of transforms: tuple[ExpressionTransform, ...]. All recipes live in framesmith.recipes and are re-exported at the top level, so from framesmith import TEXT_NORMALIZE works directly. Recipe names follow a convention, so each name states what the recipe does:

<INPUT>_CANONICALIZE — meaning-preserving representation cleanup (whitespace, case, Unicode form).
<INPUT>_NORMALIZE — domain-aware cleanup that interprets the value (an address, a name, a number).
<INPUT>_TO_<FORM> — a conversion whose output form or dtype differs (TO_FLOAT, TO_SNAKE_CASE, TO_TITLE, …).

Because recipes are plain tuples, they compose by splicing:

my_recipe = (*fs.TEXT_NORMALIZE, to_snake_case)

And a recipe can include another recipe the same way — TEXT_NORMALIZE builds on TEXT_CANONICALIZE, which itself splices UNICODE_TO_ASCII, so the canonicalization order has exactly one source of truth.

`compose_column`

The single entry point that turns a column name and a recipe into an expression. Signature:

def compose_column(
    source_column_name: str,
    expression_transforms: Sequence[ExpressionTransform],
    output_column_name: str | None = None,
) -> pl.Expr: ...

It builds pl.col(source_column_name), applies each transform in order, and aliases the result back to the source column name (or to output_column_name if given). An empty transform sequence raises ValueError immediately — silent no-ops hide bugs.

Factories (configured transforms)

When configuration is genuinely data-dependent — for example, which strings count as "missing" varies by source — a transform factory takes the configuration and returns a configured ExpressionTransform. Validation and any precomputation happen once, in the factory body, so the per-call work stays cheap. Several transforms are factories — nullify_sentinels (configurable missing-value tokens), map_categories (a label remap), pad_left (fixed-width padding), and the address standardizers among them; the reference shows which transforms take configuration.

from framesmith.transforms import DEFAULT_MISSING_SENTINELS, nullify_sentinels

recipe = (*fs.TEXT_NORMALIZE, nullify_sentinels(DEFAULT_MISSING_SENTINELS))

Sentinel handling is opt-in by design and never appears in a default recipe — defaulting it on would silently null valid values (e.g. 'NA' as Namibia).

Filters (row selection)

Row selection follows the same expression-returning shape as column transforms, but you apply the expression with df.filter(...):

from framesmith.filters import within_complete_month

monthly = df.filter(within_complete_month('transaction_date'))

Filters compose with other boolean expressions through the usual & and | — no framesmith abstraction is needed for that.

Reference

The complete reference — every transform, recipe, filter, and frame-level helper, with its signature, a short description, and an example — lives in docs/reference.md. It is organized by package: recipes first, then transforms grouped by domain (whitespace, case, unicode, numeric, names, addresses, dates, outliers, categorical, and more), then the filter, combine, group, validate, schema, and canonicalize helpers.

What's under consideration

Areas the library may grow into. None of these are commitments.

A polars/pandas interop layer for bridging legacy pandas pipelines.
Frame-level transforms beyond filters (column renaming, schema standardization, multi-column conditionals).
"Plans" — a layer above recipes that handles multi-column pipelines as units, so a single object can describe an entire frame's preprocessing.
Declarative YAML configuration for pipelines.
Additional filter families (null-pattern filters, numeric range filters, categorical inclusion).

Development

Engineering conventions live in CLAUDE.md. The repo uses uv for environment management.

uv run pytest                 # full suite, including src doctests
uv run ruff check src/ tests/
uv run mypy src/

The test suite covers the atomic transforms, recipes, factories, filters, the composition layer, and the regex / pattern primitives, with positive and negative cases, plus the docstring examples run as doctests.

License

Apache 2.0. See LICENSE.

Release history

This version

0.1.0

Jun 1, 2026

Download files

Source Distribution

framesmith-0.1.0.tar.gz (72.0 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

framesmith-0.1.0-py3-none-any.whl (77.6 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file framesmith-0.1.0.tar.gz.

File metadata

Download URL: framesmith-0.1.0.tar.gz
Upload date: Jun 1, 2026
Size: 72.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.17 {"installer":{"name":"uv","version":"0.11.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for framesmith-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e94bbd6e4f361953b95b17513f7eadfd7a2c8f39383dc833e1c5219db845848a`
MD5	`22a2a7853e4c57268d96e77c7cf33bf3`
BLAKE2b-256	`5f6232deb3757dbf414e164b46fdbb814f7820d00561c1aa5d9986d4d45d5618`

File details

Details for the file framesmith-0.1.0-py3-none-any.whl.

File metadata

Download URL: framesmith-0.1.0-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 77.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.17 {"installer":{"name":"uv","version":"0.11.17","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for framesmith-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b84746fbd160c590247576698b9594a8f9e3b9c304b6208644976d1d3e2603d4`
MD5	`a251c13e72d60a906d926dc716a725a0`
BLAKE2b-256	`d0fbc85826d6689005040a4591bfda0ec0b4b9b6ea9f752a8a8f2cbf78c34105`
