Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.
Project description
sas-parquet
Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.
Built for working with large SAS datasets (.sas7bdat) that don't fit in memory. Streams through 100+ GB files on a 16 GB laptop without breaking a sweat.
Why
If you've ever tried to:
- Convert a 50 GB
.sas7bdatfile to Parquet and watched your machine OOM - Compare a SAS-exported Parquet against a "ground truth" Parquet and not know which column is the primary key
- Validate a SAS-to-Python pipeline migration row-by-row without loading both datasets into memory
…this tool is for you.
What's inside
Three composable command-line tools and a Streamlit web UI:
| Command | What it does |
|---|---|
sas-parquet convert |
Stream a .sas7bdat, .csv, .csv.gz, or .parquet file into one or more Parquet chunks. Smart chunk sizing aims for a target MB per file. |
sas-parquet detect-keys |
Find the smallest combinations of columns whose values are unique. Useful when you don't know the primary key of a dataset. |
sas-parquet compare |
Compare two Parquet datasets chunk-by-chunk on a composite key. Reports row counts, schema differences, per-column value mismatches with sample rows. |
sas-parquet ui |
Launch the four-tab Streamlit UI: Convert · Detect Keys · Compare · Full Pipeline. |
Install
pip install sas-parquet # core CLIs
pip install "sas-parquet[ui]" # adds the Streamlit web UI
pip install "sas-parquet[ui,dev]" # everything (ruff, pytest, etc.)
Requires Python 3.10+.
Quick start
Convert SAS → Parquet
# SAS → chunked Parquet (~256 MB per chunk)
sas-parquet convert data/big.sas7bdat --mb-size 256
# SAS → single Parquet file
sas-parquet convert data/big.sas7bdat --single-file
# CSV → chunked Parquet
sas-parquet convert data/big.csv --mb-size 256
# Pick specific columns
sas-parquet convert data/big.sas7bdat --columns loan_id period balance
Detect unique composite keys
Don't know which columns form the primary key of your dataset? Let the tool find them:
sas-parquet detect-keys ./data/loans_parquet/ --first-only
[detect] Total rows: 4,562,891
[detect] Auto-selected 12 candidate(s):
- loan_id, period, balance, status, origination_date, ...
[key-detect] Searching up to size 4 across 12 candidate column(s)...
--- size 1: 12 combo(s)
['loan_id'] -> 1,140,723 dup keys
--- size 2: 66 combo(s)
['loan_id', 'period'] -> UNIQUE
[detect] Found 1 minimal unique key(s):
1. ['loan_id', 'period']
Compare two Parquet datasets row-by-row
sas-parquet compare \
--source-dir ./data/sas_chunks/ \
--target-path ./data/prod.parquet \
--keys loan_id period \
--output-dir ./reports/
A report file is written with:
- Source and target row counts (from Parquet metadata — milliseconds even on TB-scale files)
- Columns present only on one side
- Data type differences
- Per-column value mismatch counts, sorted by severity
- Sample mismatched rows with key + source value + target value
- Per-chunk timing
Launch the web UI
sas-parquet ui
# Opens http://localhost:8501
Four tabs let you run any single step or chain them end-to-end:
- Convert — point at a
.sas7bdat/.csv/.csv.gz/.parquetfile, set a target MB per chunk, click run. - Detect Keys — point at a Parquet file or directory, see the minimal unique composite key(s).
- Compare — pick source and target Parquet, multiselect the composite key, get a per-column mismatch report with downloadable text output.
- Full Pipeline — chain all three: SAS → Parquet → key detection → comparison, with an optional key override if you'd rather skip auto-detection.
Runnable examples
The three examples below are self-contained: each one generates its own data, runs the relevant command(s), and shows the expected output. All you need is pip install sas-parquet.
Example 1 — Convert CSV → Parquet
mkdir -p /tmp/sp_demo && cd /tmp/sp_demo
# Generate a 1,000-row CSV
python -c "
import polars as pl
pl.DataFrame({
'loan_id': list(range(1000)),
'period': ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
'balance': [round(i * 1.5, 2) for i in range(1000)],
'status': ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"
sas-parquet convert loans.csv --mb-size 5
Expected output:
[convert] Input : /tmp/sp_demo/loans.csv (csv)
[convert] Output : /tmp/sp_demo/loans (chunked)
[convert] Sampling for chunk-size estimate...
[convert] Chunk size: ~500,000 rows per chunk
chunk 000 -> loans_000.parquet (1,000 rows, 0.0 MB)
[convert] Done. 1,000 rows in 1 chunk(s) -> 0.0 MB in 0.2s
Example 2 — Detect a composite key
Generate a dataset where neither loan_id nor period alone is unique, but the pair (loan_id, period) is. The tool should figure that out without hints.
mkdir -p /tmp/sp_demo2 && cd /tmp/sp_demo2
python -c "
import polars as pl
pl.DataFrame({
'loan_id': [i // 2 for i in range(1000)], # each loan appears in 2 periods
'period': ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
'balance': [round(((i // 2) % 100 + 1) * 100.0, 2) for i in range(1000)], # cycles every 100 loans
'status': ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"
sas-parquet convert loans.csv --mb-size 5
sas-parquet detect-keys loans/
Expected detect-keys output (abridged):
[detect] Auto-selected 4 candidate(s):
- loan_id, balance, period, status
[key-detect] Searching up to size 4 across 4 candidate column(s)...
--- size 1: 4 combo(s)
['loan_id'] -> 500 dup keys
['balance'] -> 100 dup keys
['period'] -> 2 dup keys
['status'] -> 2 dup keys
--- size 2: 6 combo(s)
['loan_id', 'period'] -> UNIQUE
...
[detect] Found 1 minimal unique key(s):
1. ['loan_id', 'period']
Example 3 — Compare two datasets with known mismatches
Mutate three rows in the target and verify the tool catches exactly those mismatches.
cd /tmp/sp_demo2 # reusing the dataset from example 2
# Build a target with 6 deliberate mismatches (3 loans × 2 periods)
python -c "
import polars as pl
df = pl.read_parquet('loans/loans_000.parquet')
df = df.with_columns(
pl.when(pl.col('loan_id').is_in([10, 20, 30]))
.then(pl.col('balance') + 0.99)
.otherwise(pl.col('balance'))
.alias('balance')
)
df.write_parquet('target.parquet')
"
sas-parquet compare \
--source-dir loans/ \
--target-path target.parquet \
--keys loan_id --keys period \
--output-dir reports/
Expected output:
[compare] Source chunks : 1 (1,000 rows total)
[compare] Target rows : 1,000
[compare] Columns to compare : 2
[compare] Chunk 1/1: loans_000.parquet
rows=1,000 matched=1,000 source_only=0 (0.2s)
[compare] Matched: 1,000 / 1,000
[compare] 1 column(s) have value mismatches.
[compare] Report: /tmp/sp_demo2/reports/compare_loans_<timestamp>.txt
The report file shows the 6 mismatched rows with their key, source value, and target value:
[balance] — first 6 mismatch(es):
{'loan_id': 10, 'period': 'Q1', 'balance__source': 1100.0, 'balance__target': 1100.99}
{'loan_id': 10, 'period': 'Q2', 'balance__source': 1100.0, 'balance__target': 1100.99}
{'loan_id': 20, 'period': 'Q1', 'balance__source': 2100.0, 'balance__target': 2100.99}
{'loan_id': 20, 'period': 'Q2', 'balance__source': 2100.0, 'balance__target': 2100.99}
{'loan_id': 30, 'period': 'Q1', 'balance__source': 3100.0, 'balance__target': 3100.99}
{'loan_id': 30, 'period': 'Q2', 'balance__source': 3100.0, 'balance__target': 3100.99}
Design notes
- Smart chunk sizing — samples 10K rows, writes a probe Parquet, measures compressed bytes/row, then derives a chunk size targeting your
--mb-size(default 50 MB). - Streaming everywhere — uses
pyreadstat.read_file_in_chunks,polars.scan_*lazy frames, andpyarrow.ParquetWriterfor incremental writes. No data ever fully materializes in memory. - Chunk-by-chunk comparison — loads one source chunk, semi-joins target down to matching keys (streaming), compares with null-safe equality, frees memory, repeats. Works on multi-TB datasets.
- SAS empty-string handling — SAS exports missing character values as
"". The compare tool normalizes these tonullbefore comparing strings, so they don't show up as false mismatches against truenullin Parquet. - Anti-join source-only detection — keys present in source but missing from target are counted and surfaced separately.
- Metadata-only row counts —
pyarrow.parquet.read_metadatagives exact row counts in milliseconds, even on TB-scale datasets, without scanning data. - Schema-drift-aware reading — chunked datasets with column dtype drift (one chunk has a column as
Null, another asFloat64) are unified at read time.
What this is not
- A full SAS replacement. It reads
.sas7bdat. It does not run PROC SQL, macros, or formats. Use SAS Viya or pysas if you need those. - Spark or Dask. Single-machine tool optimized for "fits on a beefy laptop or VM" workloads. If you have a cluster and 10+ TB of data, use Spark.
- A diff tool for arbitrary files. Source and target must share a composite key. Schema-drift handling exists but is intentionally minimal.
Limitations and known issues
- SAS numerics are Float64. SAS stores all numbers as 8-byte floats. After conversion, columns that are conceptually integers (IDs, counts) arrive as
f64. A future release will auto-downcast int-like floats. - Float equality is exact. Comparisons use null-safe
==. No fuzzy tolerance for floats yet —1.0vs1.0000001is a mismatch.--rel-tolis planned for 0.2.0. pyreadstatquirks on Windows. Some.sas7bdatfiles have out-of-range file timestamps that crashpyreadstat1.3.4's internaldatetime.fromtimestampcall. We patch around this transparently in the SAS reader.- Conversion is single-threaded. Polars parallelizes the column-level work, but the chunk loop itself is serial. For 100+ GB files this is usually I/O bound anyway.
Development
git clone https://github.com/kulbshar/sas-parquet.git
cd sas-parquet
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
pip install -e ".[ui,dev]"
pytest # 105 tests
ruff check src tests
The test suite covers every module end-to-end with synthetic Parquet fixtures, including the Streamlit UI (driven via streamlit.testing.v1.AppTest).
Changelog
See CHANGELOG.md.
License
MIT — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sas_parquet-0.1.0.tar.gz.
File metadata
- Download URL: sas_parquet-0.1.0.tar.gz
- Upload date:
- Size: 37.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a685cf7e46ab1f87ddead3d8c94ec5db89c627223ec16d4993da4fb7140f1b3f
|
|
| MD5 |
3075968d669a952dbac6c3671f730de8
|
|
| BLAKE2b-256 |
363a09490c8c6f7b49af3875519045520f4d8b271d0dededc6586c8d61fd3608
|
File details
Details for the file sas_parquet-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sas_parquet-0.1.0-py3-none-any.whl
- Upload date:
- Size: 32.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
32f6830b8882a4abeb9bfebaf175eaee2c74d28c5fae16cf7708585309ad08d1
|
|
| MD5 |
3aa43c8723a8680763378fa44312710f
|
|
| BLAKE2b-256 |
b99787a9377a229efaae5850c739f7b6e1f542acb7681d85cb89670f5f8e6f9d
|