# fastrecon
A focused, high-performance reconciliation engine for comparing SQL tables, SQL queries, CSV files, and Parquet files at scale. Built on DuckDB, Polars, and Apache Arrow.
fastrecon is not a pandas replacement. It is a reconciliation engine — built specifically for proving that two datasets are (or aren't) the same.
## Why fastrecon

Most data teams hand-roll reconciliation with pandas, ad-hoc SQL, or shell scripts. None of these scale. fastrecon gives you one consistent API across every common combination:
| Left | Right |
|---|---|
| SQL table | SQL table |
| SQL table | SQL query |
| SQL query | SQL query |
| SQL table/query | CSV / Parquet |
| CSV / Parquet | CSV / Parquet |
Everything is normalized into a single internal relation (a DuckDB view), then compared with pushdown-friendly SQL — no whole-dataset materialization in Python.
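As a rough mental model (illustrative raw DuckDB, not fastrecon's internal code; file names are hypothetical), the per-side views and a pushdown diff look like this:

```python
# A rough mental model, not fastrecon's actual code: each side becomes a
# DuckDB view, and the diff is ordinary SQL that DuckDB can push down.
import duckdb

con = duckdb.connect()
con.execute("CREATE VIEW left_rel  AS SELECT * FROM read_parquet('left.parquet')")
con.execute("CREATE VIEW right_rel AS SELECT * FROM read_csv_auto('right.csv')")

# Keys present on the left but absent on the right, computed entirely
# inside DuckDB rather than as Python objects.
missing_in_right = con.execute("""
    SELECT l.order_id
    FROM left_rel l
    WHERE NOT EXISTS (SELECT 1 FROM right_rel r WHERE r.order_id = l.order_id)
""").fetchall()
```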
## Install

```bash
pip install fastrecon               # core
pip install "fastrecon[postgres]"   # + psycopg
pip install "fastrecon[mysql]"      # + pymysql
```
Requires Python 3.9+.
## Quick start

```python
from fastrecon import compare, SqlTable, ParquetFile

result = compare(
    left=SqlTable(conn="postgresql://user:pw@host/db", table="public.orders"),
    right=ParquetFile(path="orders.parquet"),
    keys=["order_id"],
    compare_mode="keyed",
    exclude_columns=["load_ts"],
    tolerances={"amount": 0.01},
)

print(result.summary())
print(result.to_json(indent=True))
```
Sample output:
```text
status               : MISMATCH
compare_mode         : keyed
row_count_left       : 1,000,001
row_count_right      : 1,000,000
schema_match         : True
data_match           : False
missing_in_left      : 0
missing_in_right     : 1
changed_rows         : 4
duplicate_keys_left  : 0
duplicate_keys_right : 0
elapsed_sec          : 1.842
engine               : duckdb+polars
```
## Compare modes

| Mode | What it does |
|---|---|
| `schema` | Column names, types, missing/extra columns |
| `rowcount` | Schema + row counts on both sides |
| `keyed` | Schema + counts + key-based diff (missing / changed / dup keys) |
| `profile` | Schema + counts + per-column null/distinct/min/max |
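A common pattern is a cheap structural check before the full diff. A sketch, reusing the hypothetical `left` and `right` sources from the quick start:

```python
from fastrecon import compare

# Cheap pre-flight: compare structure only.
pre = compare(left, right, compare_mode="schema")

# Escalate to the full key-based diff only when the shapes already agree.
if pre.schema_match:
    result = compare(left, right, keys=["order_id"], compare_mode="keyed")
    print(result.summary())
```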
## Configuration & normalization
Reconciliation is mostly about handling the messy reality of "the same" data:
```python
from fastrecon import ReconConfig, compare

cfg = ReconConfig(
    trim_strings=True,
    case_sensitive=False,
    null_equals_empty=True,
    decimal_scale=2,
    timestamp_tz="UTC",
    column_mapping={"orderId": "order_id"},  # left -> right rename
    exclude_columns=["load_ts", "etl_batch"],
    tolerances={"amount": 0.01, "tax": 0.01},
    sample_limit=200,
)

result = compare(left, right, keys=["order_id"], config=cfg)
```
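One plausible reading of `tolerances={"amount": 0.01}` is an absolute-difference rule: two values compare equal when they differ by at most the threshold. A tiny illustration of that rule in plain Python (fastrecon evaluates the predicate inside DuckDB, and its exact semantics may differ):

```python
# Illustration of an absolute-difference tolerance rule only; fastrecon
# runs the comparison inside DuckDB, not row by row in Python like this.
def within_tolerance(left: float, right: float, tol: float) -> bool:
    return abs(left - right) <= tol

assert within_tolerance(100.0, 100.005, tol=0.01)      # counts as a match
assert not within_tolerance(100.0, 100.02, tol=0.01)   # counts as changed
```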
## Result object

`compare()` returns a `ReconResult` with:

- `status` — MATCH / MISMATCH / ERROR
- `row_count_left`, `row_count_right`
- `schema_match`, `data_match`, `schema_diff`
- `missing_in_left`, `missing_in_right`, `changed_rows`
- `duplicate_keys_left`, `duplicate_keys_right`
- `sample_mismatches` — sample rows for each mismatch class
- `column_stats` — populated in `profile` mode
- `execution_metrics` — `elapsed_sec`, `engine`

Use `result.summary()` for a printable report or `result.to_json()` / `result.to_dict()` to ship it to a logger, dashboard, or CI gate.
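For example, a minimal CI gate built only on the documented fields (assuming `result` from the quick start):

```python
import json
import sys

# Print the human-readable report, then fail the build on any mismatch.
print(result.summary())
if result.status != "MATCH":
    json.dump(result.to_dict(), sys.stderr, indent=2)
    sys.exit(1)
```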
## Sources

```python
from fastrecon import SqlTable, SqlQuery, CsvFile, ParquetFile

SqlTable(conn="postgresql://...", table="schema.orders")
SqlQuery(conn="postgresql://...", query="SELECT * FROM orders WHERE dt >= '2026-01-01'")
CsvFile("/path/to/orders.csv", options={"delim": ","})
ParquetFile("/path/to/orders.parquet")  # also supports DuckDB globs: 'data/*.parquet'
```
## Architecture

```text
fastrecon/
├── api.py      # public compare()
├── config.py   # ReconConfig
├── sources/    # SqlTable / SqlQuery / CsvFile / ParquetFile
├── engines/    # DuckDB execution engine
├── compare/    # schema / rowcount / keyed / profile
├── output/     # ReconResult (summary, to_dict, to_json)
└── utils/      # normalization, logging
```
Internally:

- Each source is registered into an in-memory DuckDB connection as a view (zero-copy from Arrow when possible).
- Schema is read with `DESCRIBE`.
- Row counts, anti-joins, and inner joins run in DuckDB — no full Python materialization.
- Mismatch samples are pulled lazily, capped by `sample_limit`.
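The zero-copy Arrow registration is standard DuckDB behavior; a minimal sketch outside fastrecon, with a hypothetical file name:

```python
# Standard DuckDB/Arrow interop, shown outside fastrecon: a registered
# Arrow table is scanned in place, without copying into DuckDB.
import duckdb
import pyarrow.parquet as pq

con = duckdb.connect()
arrow_table = pq.read_table("orders.parquet")
con.register("right_rel", arrow_table)  # view over Arrow memory, no copy

print(con.execute("DESCRIBE right_rel").fetchall())              # schema
print(con.execute("SELECT count(*) FROM right_rel").fetchone())  # row count
```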
## Roadmap
- ✅ MVP: package, sources, schema/rowcount/keyed/profile compare, JSON result, tests
- ⏳ Partition-wise compare (date / id / hash buckets)
- ⏳ HTML and JSON report generators
- ⏳ Rust extension (PyO3) for hashing / normalization hot path
- ⏳ Distributed mode (S3 + Spark connector)
## License
MIT