fastrecon

A focused, high-performance reconciliation engine for comparing SQL tables, SQL queries, CSV files, and Parquet files at scale. Built on DuckDB, Polars, and Apache Arrow.

fastrecon is not a pandas replacement. It is a reconciliation engine — built specifically for proving that two datasets are (or aren't) the same.

Why fastrecon

Most data teams hand-roll reconciliation with pandas, ad-hoc SQL, or shell scripts, and none of those approaches scales. fastrecon gives you one consistent API across every common combination:

Left               Right
SQL table          SQL table
SQL table          SQL query
SQL query          SQL query
SQL table/query    CSV / Parquet
CSV / Parquet      CSV / Parquet

Everything is normalized into a single internal relation (a DuckDB view), then compared with pushdown-friendly SQL — no whole-dataset materialization in Python.
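
Conceptually, the engine works like the sketch below: each side becomes a DuckDB view, and the heavy lifting happens in SQL. This is an illustration of the idea, not fastrecon's actual internals; the view names left_v and right_v are invented.

import duckdb

con = duckdb.connect()  # in-memory engine

# Register each side as a view; DuckDB reads the files lazily.
con.sql("CREATE VIEW left_v  AS SELECT * FROM read_parquet('left.parquet')")
con.sql("CREATE VIEW right_v AS SELECT * FROM read_csv_auto('right.csv')")

# Counts and diffs execute inside DuckDB; Python only sees small results.
print(con.sql("""
    SELECT (SELECT count(*) FROM left_v)  AS left_rows,
           (SELECT count(*) FROM right_v) AS right_rows
""").fetchone())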

Install

pip install fastrecon                 # core
pip install "fastrecon[postgres]"     # + psycopg
pip install "fastrecon[mysql]"        # + pymysql

Requires Python 3.9+.

Quick start

from fastrecon import compare, SqlTable, ParquetFile

result = compare(
    left=SqlTable(conn="postgresql://user:pw@host/db", table="public.orders"),
    right=ParquetFile(path="orders.parquet"),
    keys=["order_id"],
    compare_mode="keyed",
    exclude_columns=["load_ts"],
    tolerances={"amount": 0.01},
)

print(result.summary())
print(result.to_json(indent=True))

Sample output:

status               : MISMATCH
compare_mode         : keyed
row_count_left       : 1,000,001
row_count_right      : 1,000,000
schema_match         : True
data_match           : False
missing_in_left      : 0
missing_in_right     : 1
changed_rows         : 4
duplicate_keys_left  : 0
duplicate_keys_right : 0
elapsed_sec          : 1.842
engine               : duckdb+polars

Compare modes

Mode       What it does
schema     Column names, types, missing/extra columns
rowcount   Schema + row counts on both sides
keyed      Schema + counts + key-based diff (missing / changed / dup keys)
profile    Schema + counts + per-column null/distinct/min/max
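
A common pattern is to run a cheap mode first and escalate only on success. The sketch below uses hypothetical files and assumes keys is only required for the keyed mode:

from fastrecon import compare, ParquetFile

# Cheap structural check first: schema + row counts only.
quick = compare(
    left=ParquetFile("a.parquet"),
    right=ParquetFile("b.parquet"),
    compare_mode="rowcount",
)

# Escalate to a full key-based diff only if the cheap check passes.
if quick.status == "MATCH":
    full = compare(
        left=ParquetFile("a.parquet"),
        right=ParquetFile("b.parquet"),
        keys=["order_id"],
        compare_mode="keyed",
    )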

Configuration & normalization

Reconciliation is mostly about handling the messy reality of "the same" data:

from fastrecon import ReconConfig, compare

cfg = ReconConfig(
    trim_strings=True,
    case_sensitive=False,
    null_equals_empty=True,
    decimal_scale=2,
    timestamp_tz="UTC",
    column_mapping={"orderId": "order_id"},   # left -> right rename
    exclude_columns=["load_ts", "etl_batch"],
    tolerances={"amount": 0.01, "tax": 0.01},
    sample_limit=200,
)

result = compare(left, right, keys=["order_id"], config=cfg)
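
To make those flags concrete, here is roughly the kind of equality test this configuration implies, runnable on its own in DuckDB. It illustrates the semantics only, not the query fastrecon generates; the customer column and toy values are invented.

import duckdb

con = duckdb.connect()
# Toy stand-ins for the registered left/right views.
con.sql("CREATE VIEW left_v  AS SELECT 1 AS order_id, ' Ann ' AS customer, 10.004 AS amount")
con.sql("CREATE VIEW right_v AS SELECT 1 AS order_id, 'ann' AS customer, 10.01 AS amount")

changed = con.sql("""
    SELECT count(*) AS changed_rows
    FROM left_v l JOIN right_v r USING (order_id)
    WHERE lower(trim(l.customer)) <> lower(trim(r.customer))   -- trim_strings, case_sensitive=False
       OR abs(round(l.amount, 2) - round(r.amount, 2)) > 0.01  -- decimal_scale, tolerances
""").fetchone()[0]
print(changed)  # 0: equal after trimming/lowercasing, and within the 0.01 tolerance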

Result object

compare() returns a ReconResult with:

  • status: MATCH / MISMATCH / ERROR
  • row_count_left, row_count_right
  • schema_match, data_match, schema_diff
  • missing_in_left, missing_in_right, changed_rows
  • duplicate_keys_left, duplicate_keys_right
  • sample_mismatches: sample rows for each mismatch class
  • column_stats: populated in profile mode
  • execution_metrics: elapsed_sec, engine

Use result.summary() for a printable report or result.to_json() / result.to_dict() to ship it to a logger, dashboard, or CI gate.
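
For example, a minimal CI gate (a sketch; result is the ReconResult from the quick start above):

import sys

print(result.summary())       # human-readable report in the build log
if result.status != "MATCH":  # MATCH / MISMATCH / ERROR
    sys.exit(1)               # fail the pipeline on any discrepancy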

Sources

from fastrecon import SqlTable, SqlQuery, CsvFile, ParquetFile

SqlTable(conn="postgresql://...", table="schema.orders")
SqlQuery(conn="postgresql://...", query="SELECT * FROM orders WHERE dt >= '2026-01-01'")
CsvFile("/path/to/orders.csv", options={"delim": ","})
ParquetFile("/path/to/orders.parquet")        # also supports DuckDB globs: 'data/*.parquet'

Architecture

fastrecon/
├── api.py                  # public compare()
├── config.py               # ReconConfig
├── sources/                # SqlTable / SqlQuery / CsvFile / ParquetFile
├── engines/                # DuckDB execution engine
├── compare/                # schema / rowcount / keyed / profile
├── output/                 # ReconResult (summary, to_dict, to_json)
└── utils/                  # normalization, logging

Internally:

  1. Each source is registered into an in-memory DuckDB connection as a view (zero-copy from Arrow when possible).
  2. Schema is read with DESCRIBE.
  3. Row counts, anti-joins, and inner joins run in DuckDB — no full Python materialization.
  4. Mismatch samples are pulled lazily, capped by sample_limit.
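
Step 3's anti-join is easy to picture in isolation. The following standalone sketch (toy data, invented view names) shows how a missing_in_right count falls out of a DuckDB ANTI JOIN:

import duckdb

con = duckdb.connect()
con.sql("CREATE VIEW left_v  AS SELECT * FROM (VALUES (1), (2), (3)) t(order_id)")
con.sql("CREATE VIEW right_v AS SELECT * FROM (VALUES (1), (2)) t(order_id)")

# Keys present on the left but absent on the right -> missing_in_right.
missing = con.sql("""
    SELECT l.order_id
    FROM left_v l
    ANTI JOIN right_v r USING (order_id)
""").fetchall()
print(missing)  # [(3,)]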

Roadmap

  • ✅ MVP: package, sources, schema/rowcount/keyed/profile compare, JSON result, tests
  • ⏳ Partition-wise compare (date / id / hash buckets)
  • ⏳ HTML and JSON report generators
  • ⏳ Rust extension (PyO3) for hashing / normalization hot path
  • ⏳ Distributed mode (S3 + Spark connector)

License

MIT
