Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.

These details have not been verified by PyPI

Project links

Project description

sas-parquet

Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.

Built for working with large SAS datasets (.sas7bdat) that don't fit in memory. Streams through 100+ GB files on a 16 GB laptop without breaking a sweat.

Why

If you've ever tried to:

Convert a 50 GB .sas7bdat file to Parquet and watched your machine OOM
Compare a SAS-exported Parquet against a "ground truth" Parquet and not know which column is the primary key
Validate a SAS-to-Python pipeline migration row-by-row without loading both datasets into memory

…this tool is for you.

What's inside

Three composable command-line tools and a Streamlit web UI:

Command	What it does
`sas-parquet convert`	Stream a `.sas7bdat`, `.csv`, `.csv.gz`, or `.parquet` file into one or more Parquet chunks. Smart chunk sizing aims for a target MB per file.
`sas-parquet detect-keys`	Find the smallest combinations of columns whose values are unique. Useful when you don't know the primary key of a dataset.
`sas-parquet compare`	Compare two Parquet datasets chunk-by-chunk on a composite key. Reports row counts, schema differences, per-column value mismatches with sample rows.
`sas-parquet ui`	Launch the four-tab Streamlit UI: Convert · Detect Keys · Compare · Full Pipeline.

Install

pip install sas-parquet              # core CLIs
pip install "sas-parquet[ui]"        # adds the Streamlit web UI
pip install "sas-parquet[ui,dev]"    # everything (ruff, pytest, etc.)

Requires Python 3.10+.

Quick start

Convert SAS → Parquet

# SAS → chunked Parquet (~256 MB per chunk)
sas-parquet convert data/big.sas7bdat --mb-size 256

# SAS → single Parquet file
sas-parquet convert data/big.sas7bdat --single-file

# CSV → chunked Parquet
sas-parquet convert data/big.csv --mb-size 256

# Pick specific columns
sas-parquet convert data/big.sas7bdat --columns loan_id period balance

Detect unique composite keys

Don't know which columns form the primary key of your dataset? Let the tool find them:

sas-parquet detect-keys ./data/loans_parquet/ --first-only

[detect] Total rows: 4,562,891
[detect] Auto-selected 12 candidate(s):
   - loan_id, period, balance, status, origination_date, ...
[key-detect] Searching up to size 4 across 12 candidate column(s)...
--- size 1: 12 combo(s)
  ['loan_id'] -> 1,140,723 dup keys
--- size 2: 66 combo(s)
  ['loan_id', 'period'] -> UNIQUE
[detect] Found 1 minimal unique key(s):
   1. ['loan_id', 'period']

Compare two Parquet datasets row-by-row

sas-parquet compare \
    --source-dir ./data/sas_chunks/ \
    --target-path ./data/prod.parquet \
    --keys loan_id period \
    --output-dir ./reports/

A report file is written with:

Source and target row counts (from Parquet metadata — milliseconds even on TB-scale files)
Columns present only on one side
Data type differences
Per-column value mismatch counts, sorted by severity
Sample mismatched rows with key + source value + target value
Per-chunk timing

Launch the web UI

sas-parquet ui
# Opens http://localhost:8501

Four tabs let you run any single step or chain them end-to-end:

Convert — point at a .sas7bdat / .csv / .csv.gz / .parquet file, set a target MB per chunk, click run.
Detect Keys — point at a Parquet file or directory, see the minimal unique composite key(s).
Compare — pick source and target Parquet, multiselect the composite key, get a per-column mismatch report with downloadable text output.
Full Pipeline — chain all three: SAS → Parquet → key detection → comparison, with an optional key override if you'd rather skip auto-detection.

Runnable examples

The three examples below are self-contained: each one generates its own data, runs the relevant command(s), and shows the expected output. All you need is pip install sas-parquet.

Example 1 — Convert CSV → Parquet

mkdir -p /tmp/sp_demo && cd /tmp/sp_demo

# Generate a 1,000-row CSV
python -c "
import polars as pl
pl.DataFrame({
    'loan_id': list(range(1000)),
    'period':  ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
    'balance': [round(i * 1.5, 2) for i in range(1000)],
    'status':  ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"

sas-parquet convert loans.csv --mb-size 5

Expected output:

[convert] Input  : /tmp/sp_demo/loans.csv  (csv)
[convert] Output : /tmp/sp_demo/loans  (chunked)
[convert] Sampling for chunk-size estimate...
[convert] Chunk size: ~500,000 rows per chunk
  chunk 000 -> loans_000.parquet (1,000 rows, 0.0 MB)
[convert] Done. 1,000 rows in 1 chunk(s) -> 0.0 MB in 0.2s

Example 2 — Detect a composite key

Generate a dataset where neither loan_id nor period alone is unique, but the pair (loan_id, period) is. The tool should figure that out without hints.

mkdir -p /tmp/sp_demo2 && cd /tmp/sp_demo2

python -c "
import polars as pl
pl.DataFrame({
    'loan_id': [i // 2 for i in range(1000)],                                  # each loan appears in 2 periods
    'period':  ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
    'balance': [round(((i // 2) % 100 + 1) * 100.0, 2) for i in range(1000)],  # cycles every 100 loans
    'status':  ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"

sas-parquet convert loans.csv --mb-size 5
sas-parquet detect-keys loans/

Expected detect-keys output (abridged):

[detect] Auto-selected 4 candidate(s):
   - loan_id, balance, period, status

[key-detect] Searching up to size 4 across 4 candidate column(s)...
--- size 1: 4 combo(s)
  ['loan_id'] -> 500 dup keys
  ['balance'] -> 100 dup keys
  ['period']  -> 2 dup keys
  ['status']  -> 2 dup keys
--- size 2: 6 combo(s)
  ['loan_id', 'period'] -> UNIQUE
  ...

[detect] Found 1 minimal unique key(s):
   1. ['loan_id', 'period']

Example 3 — Compare two datasets with known mismatches

Mutate three rows in the target and verify the tool catches exactly those mismatches.

cd /tmp/sp_demo2   # reusing the dataset from example 2

# Build a target with 6 deliberate mismatches (3 loans × 2 periods)
python -c "
import polars as pl
df = pl.read_parquet('loans/loans_000.parquet')
df = df.with_columns(
    pl.when(pl.col('loan_id').is_in([10, 20, 30]))
      .then(pl.col('balance') + 0.99)
      .otherwise(pl.col('balance'))
      .alias('balance')
)
df.write_parquet('target.parquet')
"

sas-parquet compare \
    --source-dir loans/ \
    --target-path target.parquet \
    --keys loan_id --keys period \
    --output-dir reports/

Expected output:

[compare] Source chunks : 1 (1,000 rows total)
[compare] Target rows   : 1,000
[compare] Columns to compare : 2
[compare] Chunk 1/1: loans_000.parquet
   rows=1,000  matched=1,000  source_only=0  (0.2s)
[compare] Matched: 1,000 / 1,000
[compare] 1 column(s) have value mismatches.
[compare] Report: /tmp/sp_demo2/reports/compare_loans_<timestamp>.txt

The report file shows the 6 mismatched rows with their key, source value, and target value:

[balance] — first 6 mismatch(es):
   {'loan_id': 10, 'period': 'Q1', 'balance__source': 1100.0, 'balance__target': 1100.99}
   {'loan_id': 10, 'period': 'Q2', 'balance__source': 1100.0, 'balance__target': 1100.99}
   {'loan_id': 20, 'period': 'Q1', 'balance__source': 2100.0, 'balance__target': 2100.99}
   {'loan_id': 20, 'period': 'Q2', 'balance__source': 2100.0, 'balance__target': 2100.99}
   {'loan_id': 30, 'period': 'Q1', 'balance__source': 3100.0, 'balance__target': 3100.99}
   {'loan_id': 30, 'period': 'Q2', 'balance__source': 3100.0, 'balance__target': 3100.99}

Design notes

Smart chunk sizing — samples 10K rows, writes a probe Parquet, measures compressed bytes/row, then derives a chunk size targeting your --mb-size (default 50 MB).
Streaming everywhere — uses pyreadstat.read_file_in_chunks, polars.scan_* lazy frames, and pyarrow.ParquetWriter for incremental writes. No data ever fully materializes in memory.
Chunk-by-chunk comparison — loads one source chunk, semi-joins target down to matching keys (streaming), compares with null-safe equality, frees memory, repeats. Works on multi-TB datasets.
SAS empty-string handling — SAS exports missing character values as "". The compare tool normalizes these to null before comparing strings, so they don't show up as false mismatches against true null in Parquet.
Anti-join source-only detection — keys present in source but missing from target are counted and surfaced separately.
Metadata-only row counts — pyarrow.parquet.read_metadata gives exact row counts in milliseconds, even on TB-scale datasets, without scanning data.
Schema-drift-aware reading — chunked datasets with column dtype drift (one chunk has a column as Null, another as Float64) are unified at read time.

What this is not

A full SAS replacement. It reads .sas7bdat. It does not run PROC SQL, macros, or formats. Use SAS Viya or pysas if you need those.
Spark or Dask. Single-machine tool optimized for "fits on a beefy laptop or VM" workloads. If you have a cluster and 10+ TB of data, use Spark.
A diff tool for arbitrary files. Source and target must share a composite key. Schema-drift handling exists but is intentionally minimal.

Limitations and known issues

SAS numerics are Float64. SAS stores all numbers as 8-byte floats. After conversion, columns that are conceptually integers (IDs, counts) arrive as f64. A future release will auto-downcast int-like floats.
Float equality is exact. Comparisons use null-safe ==. No fuzzy tolerance for floats yet — 1.0 vs 1.0000001 is a mismatch. --rel-tol is planned for 0.2.0.
pyreadstat quirks on Windows. Some .sas7bdat files have out-of-range file timestamps that crash pyreadstat 1.3.4's internal datetime.fromtimestamp call. We patch around this transparently in the SAS reader.
Conversion is single-threaded. Polars parallelizes the column-level work, but the chunk loop itself is serial. For 100+ GB files this is usually I/O bound anyway.

Development

git clone https://github.com/kulbshar/sas-parquet.git
cd sas-parquet
python -m venv .venv
source .venv/bin/activate           # or  .venv\Scripts\activate  on Windows
pip install -e ".[ui,dev]"

pytest                               # 105 tests
ruff check src tests

The test suite covers every module end-to-end with synthetic Parquet fixtures, including the Streamlit UI (driven via streamlit.testing.v1.AppTest).

Changelog

See CHANGELOG.md.

License

MIT — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

May 22, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sas_parquet-0.1.0.tar.gz (37.8 kB view details)

Uploaded May 22, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

sas_parquet-0.1.0-py3-none-any.whl (32.4 kB view details)

Uploaded May 22, 2026 Python 3

File details

Details for the file sas_parquet-0.1.0.tar.gz.

File metadata

Download URL: sas_parquet-0.1.0.tar.gz
Upload date: May 22, 2026
Size: 37.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for sas_parquet-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`a685cf7e46ab1f87ddead3d8c94ec5db89c627223ec16d4993da4fb7140f1b3f`
MD5	`3075968d669a952dbac6c3671f730de8`
BLAKE2b-256	`363a09490c8c6f7b49af3875519045520f4d8b271d0dededc6586c8d61fd3608`

See more details on using hashes here.

File details

Details for the file sas_parquet-0.1.0-py3-none-any.whl.

File metadata

Download URL: sas_parquet-0.1.0-py3-none-any.whl
Upload date: May 22, 2026
Size: 32.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for sas_parquet-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`32f6830b8882a4abeb9bfebaf175eaee2c74d28c5fae16cf7708585309ad08d1`
MD5	`3aa43c8723a8680763378fa44312710f`
BLAKE2b-256	`b99787a9377a229efaae5850c739f7b6e1f542acb7681d85cb89670f5f8e6f9d`

See more details on using hashes here.

sas-parquet 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

sas-parquet

Why

What's inside

Install

Quick start

Convert SAS → Parquet

Detect unique composite keys

Compare two Parquet datasets row-by-row

Launch the web UI

Runnable examples

Example 1 — Convert CSV → Parquet

Example 2 — Detect a composite key

Example 3 — Compare two datasets with known mismatches

Design notes

What this is not

Limitations and known issues

Development

Changelog

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes