
fastrecon

A focused, high-performance reconciliation engine for comparing SQL tables, SQL queries, CSV files, and Parquet files at scale. Built on DuckDB, Polars, and Apache Arrow.

fastrecon is not a pandas replacement. It is a reconciliation engine — built specifically for proving that two datasets are (or aren't) the same.

Why fastrecon

Most data teams hand-roll reconciliation with pandas, ad-hoc SQL, or shell scripts. None scale. fastrecon gives you one consistent API across every common combination:

Left              | Right
------------------|------------------
SQL table         | SQL table
SQL table         | SQL query
SQL query         | SQL query
SQL table/query   | CSV / Parquet
CSV / Parquet     | CSV / Parquet

Everything is normalized into a single internal relation (a DuckDB view), then compared with pushdown-friendly SQL — no whole-dataset materialization in Python.
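
As a rough picture of what that means, here is a sketch using plain DuckDB (illustrative only, not fastrecon's internal code):

import duckdb

# Illustrative sketch: each side is exposed to DuckDB as a view, so checks are
# single SQL statements and Python never holds either full dataset.
con = duckdb.connect()
con.sql("CREATE VIEW left_side  AS SELECT * FROM read_csv_auto('orders_left.csv')")
con.sql("CREATE VIEW right_side AS SELECT * FROM read_parquet('orders_right.parquet')")

# A row-count check runs entirely inside DuckDB; Python only sees two integers.
left_n  = con.sql("SELECT count(*) FROM left_side").fetchone()[0]
right_n = con.sql("SELECT count(*) FROM right_side").fetchone()[0]
print(left_n == right_n)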

Install

pip install fastrecon                 # core
pip install "fastrecon[postgres]"     # + psycopg
pip install "fastrecon[mysql]"        # + pymysql

Requires Python 3.9+.

Quick start

from fastrecon import compare, SqlTable, ParquetFile

result = compare(
    left=SqlTable(conn="postgresql://user:pw@host/db", table="public.orders"),
    right=ParquetFile(path="orders.parquet"),
    keys=["order_id"],
    compare_mode="keyed",
    exclude_columns=["load_ts"],
    tolerances={"amount": 0.01},
)

print(result.summary())
print(result.to_json(indent=True))

Sample output:

status               : MISMATCH
compare_mode         : keyed
row_count_left       : 1,000,001
row_count_right      : 1,000,000
schema_match         : True
data_match           : False
missing_in_left      : 0
missing_in_right     : 1
changed_rows         : 4
duplicate_keys_left  : 0
duplicate_keys_right : 0
elapsed_sec          : 1.842
engine               : duckdb+polars

Compare modes

Mode     | What it does
---------|----------------------------------------------------------------
schema   | Column names, types, missing/extra columns
rowcount | Schema + row counts on both sides
keyed    | Schema + counts + key-based diff (missing / changed / dup keys)
profile  | Schema + counts + per-column null/distinct/min/max

keyed mode is the default and supports partition-wise execution for big-data workloads — see below.
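
For example (a sketch, assuming left and right are the sources from the quick start; only compare_mode changes between calls):

schema_only = compare(left, right, keys=["order_id"], compare_mode="schema")
profiled    = compare(left, right, keys=["order_id"], compare_mode="profile")

# profile mode fills result.column_stats with per-column null/distinct/min/max
print(profiled.column_stats)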

Partition-wise compare (big data)

Joining 100M+ rows in one shot is dangerous. fastrecon can split a keyed compare into independent partitions and aggregate the results. Each partition runs as its own filtered SQL job inside DuckDB, so memory stays bounded by the partition size, not the dataset size.

from fastrecon import compare, SqlTable, ParquetFile, PartitionSpec

# Partition by a low-cardinality column (e.g. country, status, load_date)
result = compare(
    left=SqlTable(conn=SRC, table="orders"),
    right=ParquetFile("orders/*.parquet"),
    keys=["order_id"],
    partition=PartitionSpec(column="region", strategy="value"),
)

# Or hash-bucket any column (works for high-cardinality keys too)
result = compare(
    left=..., right=..., keys=["order_id"],
    partition=PartitionSpec(column="order_id", strategy="hash", buckets=64),
)

# Or explicit ranges (great for dates / sequential ids)
result = compare(
    left=..., right=..., keys=["order_id"],
    partition=PartitionSpec(
        column="order_dt", strategy="range",
        boundaries=[("2026-01-01", "2026-02-01"),
                    ("2026-02-01", "2026-03-01"),
                    ("2026-03-01", "2026-04-01")],
    ),
)

print(result.summary())
for p in result.column_stats["partitions"]:
    print(p)   # per-partition counts + match flag

Strategies

Strategy | Best for                                                    | Notes
---------|-------------------------------------------------------------|------------------------------------------------------------------------------------------
value    | Low-cardinality partition keys (region, status, load_date)  | Auto-discovers distinct values from both sides; capped by max_partitions (default 1000)
hash     | Any column, especially high-cardinality keys                | buckets=N controls partition count and memory footprint
range    | Ordered columns (dates, sequential ids)                     | Half-open [lo, hi) boundaries; you supply them
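
To make the hash strategy concrete, here is roughly what one bucket's filtered job could look like in DuckDB (illustrative only; fastrecon's generated SQL may differ):

import duckdb

con = duckdb.connect()
con.sql("CREATE VIEW left_side  AS SELECT * FROM read_parquet('left/*.parquet')")
con.sql("CREATE VIEW right_side AS SELECT * FROM read_parquet('right/*.parquet')")

# One of buckets=64 independent jobs: only rows whose key hashes into bucket 17
# are scanned and joined, so memory is bounded by roughly 1/64th of the data.
bucket, buckets = 17, 64
missing_in_right = con.sql(f"""
    SELECT l.order_id
    FROM left_side l
    WHERE hash(l.order_id) % {buckets} = {bucket}
      AND NOT EXISTS (
          SELECT 1 FROM right_side r
          WHERE r.order_id = l.order_id
            AND hash(r.order_id) % {buckets} = {bucket}
      )
""").fetchall()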

What you get back

When you pass partition=..., the result includes a per-partition breakdown under column_stats:

result.column_stats["partitioned_by"]
# {"column": "region", "strategy": "value", "n_partitions": 5}

result.column_stats["partitions"]
# [
#   {"partition": "EU", "row_count_left": 312_054, "row_count_right": 312_054,
#    "missing_in_left": 0, "missing_in_right": 0, "changed_rows": 2,
#    "duplicate_keys_left": 0, "duplicate_keys_right": 0, "match": False},
#   ...
# ]

Top-level counts (missing_in_left, changed_rows, etc.) are aggregated across partitions; sample_mismatches is a globally capped sample drawn from any partition.
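
A quick way to pull out only the failing partitions, based on the per-partition dicts shown above:

bad = [p for p in result.column_stats["partitions"] if not p["match"]]
for p in bad:
    print(p["partition"], p["missing_in_right"], p["changed_rows"])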

Choosing a strategy

  • You know the data has natural partitions (load_date, region, tenant_id) → use value.
  • You don't, and just want bounded memory → use hash with buckets ≈ dataset_rows / 5_000_000 (see the sizing sketch after this list).
  • The data is time-series and you want to reconcile per window → use range with date boundaries.
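
A minimal sizing sketch for that rule of thumb (dataset_rows is whatever you know or estimate about the larger side):

# Aim for roughly 5M rows per hash bucket.
dataset_rows = 400_000_000
buckets = max(1, dataset_rows // 5_000_000)   # -> 80 buckets of ~5M rows each

result = compare(
    left=..., right=..., keys=["order_id"],
    partition=PartitionSpec(column="order_id", strategy="hash", buckets=buckets),
)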

Configuration & normalization

Reconciliation is mostly about handling the messy reality of "the same" data:

from fastrecon import ReconConfig, compare

cfg = ReconConfig(
    trim_strings=True,
    case_sensitive=False,
    null_equals_empty=True,
    decimal_scale=2,
    timestamp_tz="UTC",
    column_mapping={"orderId": "order_id"},   # left -> right rename
    exclude_columns=["load_ts", "etl_batch"],
    tolerances={"amount": 0.01, "tax": 0.01},
    sample_limit=200,
)

result = compare(left, right, keys=["order_id"], config=cfg)
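
As a pure-Python restatement of what a few of these settings mean for a single pair of values (illustrative only; this helper is not part of the API, and fastrecon applies the equivalent logic in SQL inside DuckDB):

def values_equal(left_val, right_val, tol=None):
    # trim_strings=True, case_sensitive=False
    if isinstance(left_val, str) and isinstance(right_val, str):
        left_val, right_val = left_val.strip().lower(), right_val.strip().lower()
    # null_equals_empty=True: NULL and "" compare equal
    if left_val in (None, "") and right_val in (None, ""):
        return True
    # tolerances, e.g. {"amount": 0.01}: numeric columns compare within tol
    if tol is not None and left_val is not None and right_val is not None:
        return abs(left_val - right_val) <= tol
    return left_val == right_val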

Result object

compare() returns a ReconResult with:

  • status — MATCH / MISMATCH / ERROR
  • row_count_left, row_count_right
  • schema_match, data_match, schema_diff
  • missing_in_left, missing_in_right, changed_rows
  • duplicate_keys_left, duplicate_keys_right
  • sample_mismatches — sample rows for each mismatch class
  • column_stats — populated in profile mode
  • execution_metrics — elapsed_sec, engine

Use result.summary() for a printable report or result.to_json() / result.to_dict() to ship it to a logger, dashboard, or CI gate.
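
For example, a minimal CI gate (a sketch, assuming status compares as the plain string shown in the sample output):

result = compare(left, right, keys=["order_id"])
print(result.summary())

if result.status != "MATCH":
    # ship the full report to the build artifacts, then fail the job
    with open("recon_report.json", "w") as fh:
        fh.write(result.to_json(indent=True))
    raise SystemExit(1)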

Sources

from fastrecon import SqlTable, SqlQuery, CsvFile, ParquetFile

SqlTable(conn="postgresql://...", table="schema.orders")
SqlQuery(conn="postgresql://...", query="SELECT * FROM orders WHERE dt >= '2026-01-01'")
CsvFile("/path/to/orders.csv", options={"delim": ","})
ParquetFile("/path/to/orders.parquet")        # also supports DuckDB globs: 'data/*.parquet'

Architecture

fastrecon/
├── api.py                  # public compare()
├── config.py               # ReconConfig
├── sources/                # SqlTable / SqlQuery / CsvFile / ParquetFile
├── engines/                # DuckDB execution engine
├── compare/                # schema / rowcount / keyed / profile
├── output/                 # ReconResult (summary, to_dict, to_json)
└── utils/                  # normalization, logging

Internally:

  1. Each source is registered into an in-memory DuckDB connection as a view (zero-copy from Arrow when possible).
  2. Schema is read with DESCRIBE.
  3. Row counts, anti-joins, and inner joins run in DuckDB — no full Python materialization.
  4. Mismatch samples are pulled lazily, capped by sample_limit.
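
A sketch of what the keyed diff in step 3 amounts to (illustrative SQL; fastrecon's generated statements will differ):

import duckdb

con = duckdb.connect()
con.sql("CREATE VIEW left_side  AS SELECT * FROM read_parquet('left.parquet')")
con.sql("CREATE VIEW right_side AS SELECT * FROM read_parquet('right.parquet')")

# Changed rows: inner join on the key, then per-column inequality checks.
changed_rows = con.sql("""
    SELECT l.order_id
    FROM left_side l
    JOIN right_side r USING (order_id)
    WHERE abs(l.amount - r.amount) > 0.01          -- tolerance column
       OR l.status IS DISTINCT FROM r.status       -- exact-match column
""").fetchall()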

Roadmap

  • ✅ MVP: package, sources, schema/rowcount/keyed/profile compare, JSON result, tests
  • ✅ Partition-wise compare (value / hash / range strategies)
  • ⏳ HTML and JSON report generators
  • ⏳ Parallel partition execution (thread pool)
  • ⏳ Rust extension (PyO3) for hashing / normalization hot path
  • ⏳ Distributed mode (S3 + Spark connector)

License

MIT
