Capture the authored dataframe transformation pipeline (Narwhals + PySpark) and profile data at each step to produce lineage diagrams and data-quality docs.
Conformare
Live demo and reports: explore real interactive reports generated by the examples, and read the docs.
Tie data-pipeline governance to the code that implements it. Conformare captures the authored transformation pipeline of your dataframe code, profiles the data at each step, and records the risks, mitigations, owners and business definitions behind it, then renders the whole thing as one self-contained, interactive HTML report.
Works with Narwhals, PySpark, and native pandas.
What it does
Conformare does two jobs for a data-processing pipeline:
- Governance : track the risks, mitigations and owners, and the business definitions implemented by each step.
- Process & data : document the end-to-end process, flag PII / sensitive data, and profile the data (counts, distributions, null rates, outliers, expectations) at each step.
It exists to support the development, diagnostics, and governance of pipeline implementations: not the data platform as a whole, but the specific code that does the work.
Mission
Governance usually lives away from the code. Risk reviews raise every assumption and implementation consideration, and each one must be owned, approved and tracked. That work is almost always done after the fact, in documents that quietly drift out of sync with the implementation.
Conformare's mission is to bring that governance into the code: author risks, mitigations, owners and definitions next to the logic they describe, so the documentation is generated from the implementation and stays current with it.
Two ways to use it
- Integrated (intrusive) : add profilers and describe() / risk() context in your code. Unlocks the full feature set: profiling, lineage, sensitivity, expectations and governance.
- Non-intrusive : get governance and process documentation without rewriting your pipeline, via docstring tagging (risk / mitigation / owner / purpose declared in docstrings) or bootstrapping (instrument an unmodified script from a separate entry point).
See Choosing an integration style for the trade-offs.
Install
pip install conformare # core: Narwhals + executing
pip install "conformare[spark]" # + PySpark
pip install "conformare[gx]" # + Great Expectations (optional validation profiler)
Quick start
import narwhals as nw, pandas as pd
import conformare as cf
cf.trackNarwhals()
cf.set_profiles({"*": [cf.rowCount, cf.dataSize, cf.histogram(columns="all")]})
with cf.describe("Clean customers", purpose="Keep UK adults only",
                 definition_owner="data-governance",
                 risks=cf.risk("privacy.pii_exposure", "compliance.gdpr",
                               mitigation="Drop email before export", owner="data-governance")):
    customers = nw.from_native(pd.read_csv("customers.csv"))
    adults = customers.filter(nw.col("age") >= 18)
cf.to_html("report.html", title="Customer pipeline") # open in any browser
Existing PySpark or native pandas code uses the same API: call cf.trackSpark() or
cf.trackPandas() and run your pipeline unchanged:
cf.trackSpark() # the only line you add
active = df.filter(df.status == "active") # tracked automatically
enriched = active.join(orders, on="id") # tracked, two parents
Choosing an integration style
| | Integrated (explicit) | Docstring tagging | Bootstrapping |
|---|---|---|---|
| Change to pipeline code | profilers + describe/risk inline | docstrings only | none (separate entry point) |
| What you get | everything | risk / mitigation / owner / purpose docs | process tracking + grouping + profiling |
| Best for | new code, deep diagnostics, full report | adding governance with little code change | documenting a script you cannot or will not edit |
- Explicit integration gives the richest result: every profiler, the complete lineage and column-level detail, data-quality checkpoints, and governance, all in one report. Best when you own the code and want both diagnostics and a full governance artifact.
- Indirect integration trades feature coverage for zero or low intrusion:
  - Docstrings keep governance (risks, owners, definitions) literally inside the function that implements the concept, so it cannot drift and needs no imports in hot paths. With track_functions() on, a Conformare: docstring block is applied automatically; see the docstring tagging example.
  - Bootstrapping documents and profiles an unmodified production script from the outside, ideal for legacy or third-party pipelines and for audits.
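The idea behind docstring tagging can be sketched without the library. The exact syntax of the Conformare: block is not shown in this README, so the snippet below assumes a simple indented `key: value` layout purely for illustration; `parse_governance` and `keep_adults` are hypothetical names, and the real parser may differ.

```python
import inspect


def parse_governance(func, marker="Conformare:"):
    """Collect key/value tags listed under `marker` in a function's docstring.

    Assumed layout (an illustration, not the library's documented format):
    the marker on its own line, then `key: value` lines until a blank line.
    """
    doc = inspect.getdoc(func) or ""
    tags = {}
    in_block = False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == marker:
            in_block = True
            continue
        if in_block:
            if not stripped:          # a blank line ends the block
                break
            key, sep, value = stripped.partition(":")
            if sep:
                tags[key.strip()] = value.strip()
    return tags


def keep_adults(rows):
    """Keep customers aged 18 or over.

    Conformare:
        purpose: Keep UK adults only
        risk: privacy.pii_exposure
        mitigation: Drop email before export
        owner: data-governance
    """
    return [r for r in rows if r["age"] >= 18]


tags = parse_governance(keep_adults)
print(tags)
```

Because the tags live in the docstring, the function body needs no extra imports or decorators, which is the low-intrusion property described above.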
Features at a glance
- Process map & lineage : a diagram of the pipeline as authored (no engine plan required), with column-level lineage, a created-column catalog, and each node's operation logic shown inline.
- Per-step profiling : row/column counts, data size, histograms, null fractions and IQR outliers at each step, plus a distribution follower to watch a column evolve.
- Data-quality checkpoints : drop Great Expectations in at any step and see exactly where a contract starts failing (with severities).
- PII / sensitivity : name-based heuristics flag candidate PII, and the report shows whether each sensitive column reaches a written output.
- Governance : risks, mitigations, owners, business definition owners, Markdown context details, a process-wide description, and a governance ranking (owned means low-concern). Surfaced as a risk register and a context register.
- Self-contained HTML report : one interactive page (diagram, column highlighter, KPIs, dark mode); no CDN, no build step.
- Formal risk checklist : export the risk register as a sign-off-ready Markdown document (to_risk_checklist) with blank columns and a sign-off block for a governance team to review, comment, and date, keeping an auditable trail.
- Three backends, one report : Narwhals (new code), PySpark and native pandas (existing code, tracked in place).
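As an illustration of the PII / sensitivity feature, name-based flagging and the "does it reach a written output" check can be sketched in a few lines. The patterns and the function names `classify` and `sensitive_in_output` are assumptions made for this example, not Conformare's built-in catalogue or API.

```python
import re

# Hypothetical patterns; a real catalogue would be broader and configurable.
PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile|msisdn", re.IGNORECASE),
    "name": re.compile(r"(^|_)(first|last|full|sur)_?name($|_)", re.IGNORECASE),
    "birth_date": re.compile(r"(^|_)(dob|birth)", re.IGNORECASE),
}


def classify(column):
    """Return a candidate PII label for a column name, or None."""
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(column):
            return label
    return None


def sensitive_in_output(input_columns, output_columns):
    """Map each flagged input column to whether it survives into the output."""
    written = set(output_columns)
    return {c: c in written for c in input_columns if classify(c) is not None}


source = ["customer_id", "email", "first_name", "age", "date_of_birth", "phone_number"]
exported = ["customer_id", "age", "first_name"]
report = sensitive_in_output(source, exported)
print(report)
```

Name heuristics only produce candidates: a column can be flagged wrongly or missed, which is why the API table below also lists manual tagging.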
Backends
- trackNarwhals(): for new, dataframe-agnostic code on Narwhals; patches the nw.from_native chokepoint.
- trackSpark(): for existing PySpark code, tracked in place with zero changes.
- trackPandas(): for existing native pandas code; tracks idiomatic indexing like df[df.col == 1], df[["a","b"]], query, merge, groupby.
- trackAll(): adapters plus automatic function-boundary tracking, for mixed codebases.
Public API
| Call | Purpose |
|---|---|
| trackNarwhals() / trackSpark() / trackPandas() / trackAll() | Start tracking the chosen backend(s). |
| set_profiles({...}) / with profile(...) | Op-to-profilers registry / scoped overlay. |
| with force_profile(..., cache=) | Opt-in profiling at a chosen point (optionally cache on Spark). |
| with describe(...) / with risk(...) | Annotate code with purpose / governance risks (owner, mitigation, definition owner, Markdown details). |
| describe_process(description, risks=...) | Process-wide description and risks. |
| register_risk(...) | Extend the built-in risk catalog. |
| mark_sensitive() / classify_column() | Manual / heuristic sensitivity tagging. |
| @opaque / opaque_module(*prefixes) | Record a function/library call as one node, suppressing its internals (pyspark.ml opaque by default). |
| @track_step / track_functions() | Function-boundary tracking (explicit / automatic, including docstring tagging). |
| environment() / in_notebook() / mark_user_packages(*names) | Detect the runtime (Databricks / Jupyter / IPython / Python); opt your own pip-installed pipeline code into user-code tracking. |
| bootstrap(run, docs=[doc(...)], ...) | Instrument an unmodified script from the outside, run it, write a report. |
| to_mermaid() / to_json() / to_html() | Export the lineage (Mermaid / JSON / interactive HTML). |
| to_risk_checklist(path, process=, reviewers=) | Export the risk register as a formal, sign-off-ready Markdown checklist for a governance team. |
| restore() | Unpatch everything (captured lineage is kept). |
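For readers unfamiliar with the export formats, the sketch below shows how a small lineage graph maps onto Mermaid flowchart text and onto JSON. It is a standalone illustration: `lineage_to_mermaid`, the node ids and the JSON shape are hypothetical, and the output of Conformare's own to_mermaid() / to_json() may be structured differently.

```python
import json


def lineage_to_mermaid(nodes, edges):
    """Render {node_id: label} and (parent, child) pairs as a Mermaid flowchart."""
    lines = ["flowchart TD"]
    for node_id, label in nodes.items():
        safe = label.replace('"', "'")    # double quotes delimit Mermaid labels
        lines.append(f'    {node_id}["{safe}"]')
    for parent, child in edges:
        lines.append(f"    {parent} --> {child}")
    return "\n".join(lines)


nodes = {
    "n1": "read customers",
    "n2": "filter status == 'active'",
    "n3": "read orders",
    "n4": "join on id",
}
edges = [("n1", "n2"), ("n2", "n4"), ("n3", "n4")]

mermaid_text = lineage_to_mermaid(nodes, edges)
json_text = json.dumps({"nodes": nodes, "edges": edges}, indent=2)
print(mermaid_text)
```

The Mermaid text can be pasted into any Mermaid renderer; the join node `n4` has two incoming edges, one per parent.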
Profilers
A profiler measures something about the data at a step (for example a row count or a column's distribution) and attaches the result to that node in the report. You choose which profilers run on which operations, and they execute as the pipeline runs.
Built-in profilers:
- rowCount: number of rows.
- columnCount: number of columns.
- dataSize: approximate in-memory size (with the column count).
- histogram(columns=...): per-column distribution (numeric bins, or top values for categorical columns).
- nullFraction(columns=...): fraction of nulls per column.
- iqrOutliers(columns="all", k=1.5): flags values outside Tukey's IQR fences and summarises the outliers per column.
- greatExpectations(*expectations, hard_severities=()): runs Great Expectations checks as a validation checkpoint at a step, showing which pass or fail (with severities). Accepts native GX objects or portable dicts. Optional dependency (pip install "conformare[gx]"); degrades to a status note if absent.
- whylogs(columns=...): optional whylogs profile summary (requires whylogs).
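Two of these measurements are easy to work through by hand. The sketch below computes a null fraction and Tukey's IQR fences with k=1.5 using only the standard library. It is an illustration of the technique, not Conformare's code: the function names are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since libraries differ on how quartiles are interpolated.

```python
import statistics


def null_fraction(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def iqr_outliers(values, k=1.5):
    """Flag values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    present = sorted(v for v in values if v is not None)
    if len(present) < 2:              # quartiles need at least two points
        return {"low": None, "high": None, "outliers": []}
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {
        "low": low,
        "high": high,
        "outliers": [v for v in present if v < low or v > high],
    }


amounts = [10, 12, None, 11, 13, 12, 11, 95]
nulls = null_fraction(amounts)        # 1 of 8 entries is missing
summary = iqr_outliers(amounts)
print(nulls, summary)
```

Here Q1 = 11 and Q3 = 12.5, so the IQR is 1.5 and the fences are 8.75 and 14.75; only the value 95 falls outside them.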
On Spark, counts and aggregates are full jobs, so prefer profiling chosen steps with force_profile:
cf.set_profiles({}) # profile nothing by default
with cf.force_profile(cf.rowCount, cf.histogram("amount"), cache=True, only="last"):
    enriched = adults.join(orders, on="id") # only this step is profiled (and cached)
Under the hood: a profiler is a callable (frame, backend) -> value. Conditions
(contains_columns, schema_has, min_rows) compose with & | ~. Configuration is
layered: an upfront set_profiles registry can be overridden by scoped with profile(...)
overlays. Counts never sample; distribution profilers default to a 10,000-row sample.
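A minimal sketch of that callable shape and of condition composition, assuming nothing about Conformare's internals: the `Condition` class, the dict-based frame and the helper bodies below are hypothetical stand-ins that only reuse the names mentioned above to show how `&`, `|` and `~` can be made to work through operator overloading.

```python
class Condition:
    """A predicate over a frame that composes with &, | and ~."""

    def __init__(self, check):
        self.check = check

    def __call__(self, frame):
        return bool(self.check(frame))

    def __and__(self, other):
        return Condition(lambda f: self(f) and other(f))

    def __or__(self, other):
        return Condition(lambda f: self(f) or other(f))

    def __invert__(self):
        return Condition(lambda f: not self(f))


def contains_columns(*names):
    return Condition(lambda f: all(n in f["columns"] for n in names))


def min_rows(n):
    return Condition(lambda f: len(f["rows"]) >= n)


def row_count(frame, backend):
    """A profiler in the (frame, backend) -> value shape."""
    return len(frame["rows"])


# A plain dict stands in for a dataframe in this sketch.
frame = {"columns": ["id", "amount"], "rows": [(1, 9.5), (2, 12.0), (3, 7.25)]}

small_with_amount = contains_columns("amount") & ~min_rows(1000)
either = contains_columns("email") | min_rows(1)

result = row_count(frame, "python") if small_with_amount(frame) else None
print(small_with_amount(frame), either(frame), result)
```

Each combinator returns a new `Condition`, so rules can be built once and reused across operations without evaluating anything until a frame is supplied.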
Related projects
Conformare overlaps with the data-lineage / governance ecosystem but occupies a different niche: it maps the inside of a specific process implementation and binds governance documentation to the code, rather than cataloguing datasets across a platform.
- OpenLineage / Marquez : an open standard and service for emitting run / dataset / job lineage across a stack (Airflow, Spark, dbt). It answers "which datasets and jobs feed which" at the platform level. Conformare instead documents the authored steps inside one pipeline and the governance behind them, as a single self-contained artifact rather than a metadata service.
- dbt : model-level lineage, docs, contracts and access governance for SQL/warehouse transformations. Conformare targets imperative dataframe code (Narwhals / pandas / Spark) and centres on risk / owner / definition governance rather than SQL model graphs.
- Spline : captures Spark execution-plan lineage (column-level) from logical plans. Conformare captures the pipeline as authored (no engine plan, so it also works for pandas/Narwhals) and adds the governance layer Spline does not.
- DataHub / OpenMetadata : enterprise metadata catalogs: ownership, glossaries, column lineage, policies and data contracts, served centrally. Conformare is lightweight and code-proximate: governance authored alongside the implementation and rendered per run, not a central catalog populated separately.
- Great Expectations / Soda / whylogs : data validation / profiling. Conformare uses Great Expectations as an optional checkpoint profiler; its own contribution is placing those checks (and profiles) on the process map, next to the governance.
In short: tools like OpenLineage and DataHub focus on higher-order data lineage and central cataloging across a platform; Conformare zooms in on one pipeline's implementation and ties its governance documentation to the code. Complementary, not competing.
Examples & tests
Browse the examples gallery, where each example shows its code next to the live report it produces. Highlights:
- example_streaming.py / example_streaming_spark.py: a full pipeline, Narwhals vs PySpark.
- example_pandas.py: native-pandas idiomatic tracking (df[df.col==1], merge, groupby).
- example_great_expectations.py / _spark.py: validation checkpoints that pinpoint where data breaks its contract.
- example_docstring_tagging.py: governance declared purely in docstrings (no decorators).
- bootstrap/: instrument a pure, unmodified PySpark script from the outside.
python -m pytest # Spark tests skip automatically if no JVM is available
Versioning
Conformare follows Semantic Versioning. While in 0.x, the API
may still change: breaking changes can land in a minor release (0.1 to 0.2),
patch releases are bug fixes, and 1.0.0 will mark a commitment to backward
compatibility. The installed version is available as conformare.__version__, and
changes are recorded in the changelog.
License
Conformare is licensed under the PolyForm Noncommercial License 1.0.0: free to use for any noncommercial purpose (personal, research, education, non-profits, public-sector and similar). It is source-available, not open-source.
Commercial use requires a separate license. For commercial licensing, contact Kaelon Lloyd at kaelonlloyd@gmail.com.