Skip to main content

Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.

Project description

sas-parquet

PyPI Python License: MIT Tests

Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.

Built for working with large SAS datasets (.sas7bdat) that don't fit in memory. Streams through 100+ GB files on a 16 GB laptop without breaking a sweat.


Why

If you've ever tried to:

  • Convert a 50 GB .sas7bdat file to Parquet and watched your machine OOM
  • Compare a SAS-exported Parquet against a "ground truth" Parquet and not know which column is the primary key
  • Validate a SAS-to-Python pipeline migration row-by-row without loading both datasets into memory

…this tool is for you.

What's inside

Three composable command-line tools and a Streamlit web UI:

Command What it does
sas-parquet convert Stream a .sas7bdat, .csv, .csv.gz, or .parquet file into one or more Parquet chunks. Smart chunk sizing aims for a target MB per file.
sas-parquet detect-keys Find the smallest combinations of columns whose values are unique. Useful when you don't know the primary key of a dataset.
sas-parquet compare Compare two Parquet datasets chunk-by-chunk on a composite key. Reports row counts, schema differences, per-column value mismatches with sample rows.
sas-parquet ui Launch the four-tab Streamlit UI: Convert · Detect Keys · Compare · Full Pipeline.

Install

pip install sas-parquet              # core CLIs
pip install "sas-parquet[ui]"        # adds the Streamlit web UI
pip install "sas-parquet[ui,dev]"    # everything (ruff, pytest, etc.)

Requires Python 3.10+.

Quick start

Convert SAS → Parquet

# SAS → chunked Parquet (~256 MB per chunk)
sas-parquet convert data/big.sas7bdat --mb-size 256

# SAS → single Parquet file
sas-parquet convert data/big.sas7bdat --single-file

# CSV → chunked Parquet
sas-parquet convert data/big.csv --mb-size 256

# Pick specific columns
sas-parquet convert data/big.sas7bdat --columns loan_id period balance

Detect unique composite keys

Don't know which columns form the primary key of your dataset? Let the tool find them:

sas-parquet detect-keys ./data/loans_parquet/ --first-only
[detect] Total rows: 4,562,891
[detect] Auto-selected 12 candidate(s):
   - loan_id, period, balance, status, origination_date, ...
[key-detect] Searching up to size 4 across 12 candidate column(s)...
--- size 1: 12 combo(s)
  ['loan_id'] -> 1,140,723 dup keys
--- size 2: 66 combo(s)
  ['loan_id', 'period'] -> UNIQUE
[detect] Found 1 minimal unique key(s):
   1. ['loan_id', 'period']

Compare two Parquet datasets row-by-row

sas-parquet compare \
    --source-dir ./data/sas_chunks/ \
    --target-path ./data/prod.parquet \
    --keys loan_id period \
    --output-dir ./reports/

A report file is written with:

  • Source and target row counts (from Parquet metadata — milliseconds even on TB-scale files)
  • Columns present only on one side
  • Data type differences
  • Per-column value mismatch counts, sorted by severity
  • Sample mismatched rows with key + source value + target value
  • Per-chunk timing

Launch the web UI

sas-parquet ui
# Opens http://localhost:8501

Four tabs let you run any single step or chain them end-to-end:

  • Convert — point at a .sas7bdat / .csv / .csv.gz / .parquet file, set a target MB per chunk, click run.
  • Detect Keys — point at a Parquet file or directory, see the minimal unique composite key(s).
  • Compare — pick source and target Parquet, multiselect the composite key, get a per-column mismatch report with downloadable text output.
  • Full Pipeline — chain all three: SAS → Parquet → key detection → comparison, with an optional key override if you'd rather skip auto-detection.

Runnable examples

The three examples below are self-contained: each one generates its own data, runs the relevant command(s), and shows the expected output. All you need is pip install sas-parquet.

Example 1 — Convert CSV → Parquet

mkdir -p /tmp/sp_demo && cd /tmp/sp_demo

# Generate a 1,000-row CSV
python -c "
import polars as pl
pl.DataFrame({
    'loan_id': list(range(1000)),
    'period':  ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
    'balance': [round(i * 1.5, 2) for i in range(1000)],
    'status':  ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"

sas-parquet convert loans.csv --mb-size 5

Expected output:

[convert] Input  : /tmp/sp_demo/loans.csv  (csv)
[convert] Output : /tmp/sp_demo/loans  (chunked)
[convert] Sampling for chunk-size estimate...
[convert] Chunk size: ~500,000 rows per chunk
  chunk 000 -> loans_000.parquet (1,000 rows, 0.0 MB)
[convert] Done. 1,000 rows in 1 chunk(s) -> 0.0 MB in 0.2s

Example 2 — Detect a composite key

Generate a dataset where neither loan_id nor period alone is unique, but the pair (loan_id, period) is. The tool should figure that out without hints.

mkdir -p /tmp/sp_demo2 && cd /tmp/sp_demo2

python -c "
import polars as pl
pl.DataFrame({
    'loan_id': [i // 2 for i in range(1000)],                                  # each loan appears in 2 periods
    'period':  ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
    'balance': [round(((i // 2) % 100 + 1) * 100.0, 2) for i in range(1000)],  # cycles every 100 loans
    'status':  ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"

sas-parquet convert loans.csv --mb-size 5
sas-parquet detect-keys loans/

Expected detect-keys output (abridged):

[detect] Auto-selected 4 candidate(s):
   - loan_id, balance, period, status

[key-detect] Searching up to size 4 across 4 candidate column(s)...
--- size 1: 4 combo(s)
  ['loan_id'] -> 500 dup keys
  ['balance'] -> 100 dup keys
  ['period']  -> 2 dup keys
  ['status']  -> 2 dup keys
--- size 2: 6 combo(s)
  ['loan_id', 'period'] -> UNIQUE
  ...

[detect] Found 1 minimal unique key(s):
   1. ['loan_id', 'period']

Example 3 — Compare two datasets with known mismatches

Mutate three rows in the target and verify the tool catches exactly those mismatches.

cd /tmp/sp_demo2   # reusing the dataset from example 2

# Build a target with 6 deliberate mismatches (3 loans × 2 periods)
python -c "
import polars as pl
df = pl.read_parquet('loans/loans_000.parquet')
df = df.with_columns(
    pl.when(pl.col('loan_id').is_in([10, 20, 30]))
      .then(pl.col('balance') + 0.99)
      .otherwise(pl.col('balance'))
      .alias('balance')
)
df.write_parquet('target.parquet')
"

sas-parquet compare \
    --source-dir loans/ \
    --target-path target.parquet \
    --keys loan_id --keys period \
    --output-dir reports/

Expected output:

[compare] Source chunks : 1 (1,000 rows total)
[compare] Target rows   : 1,000
[compare] Columns to compare : 2
[compare] Chunk 1/1: loans_000.parquet
   rows=1,000  matched=1,000  source_only=0  (0.2s)
[compare] Matched: 1,000 / 1,000
[compare] 1 column(s) have value mismatches.
[compare] Report: /tmp/sp_demo2/reports/compare_loans_<timestamp>.txt

The report file shows the 6 mismatched rows with their key, source value, and target value:

[balance] — first 6 mismatch(es):
   {'loan_id': 10, 'period': 'Q1', 'balance__source': 1100.0, 'balance__target': 1100.99}
   {'loan_id': 10, 'period': 'Q2', 'balance__source': 1100.0, 'balance__target': 1100.99}
   {'loan_id': 20, 'period': 'Q1', 'balance__source': 2100.0, 'balance__target': 2100.99}
   {'loan_id': 20, 'period': 'Q2', 'balance__source': 2100.0, 'balance__target': 2100.99}
   {'loan_id': 30, 'period': 'Q1', 'balance__source': 3100.0, 'balance__target': 3100.99}
   {'loan_id': 30, 'period': 'Q2', 'balance__source': 3100.0, 'balance__target': 3100.99}

Design notes

  • Smart chunk sizing — samples 10K rows, writes a probe Parquet, measures compressed bytes/row, then derives a chunk size targeting your --mb-size (default 50 MB).
  • Streaming everywhere — uses pyreadstat.read_file_in_chunks, polars.scan_* lazy frames, and pyarrow.ParquetWriter for incremental writes. No data ever fully materializes in memory.
  • Chunk-by-chunk comparison — loads one source chunk, semi-joins target down to matching keys (streaming), compares with null-safe equality, frees memory, repeats. Works on multi-TB datasets.
  • SAS empty-string handling — SAS exports missing character values as "". The compare tool normalizes these to null before comparing strings, so they don't show up as false mismatches against true null in Parquet.
  • Anti-join source-only detection — keys present in source but missing from target are counted and surfaced separately.
  • Metadata-only row countspyarrow.parquet.read_metadata gives exact row counts in milliseconds, even on TB-scale datasets, without scanning data.
  • Schema-drift-aware reading — chunked datasets with column dtype drift (one chunk has a column as Null, another as Float64) are unified at read time.

What this is not

  • A full SAS replacement. It reads .sas7bdat. It does not run PROC SQL, macros, or formats. Use SAS Viya or pysas if you need those.
  • Spark or Dask. Single-machine tool optimized for "fits on a beefy laptop or VM" workloads. If you have a cluster and 10+ TB of data, use Spark.
  • A diff tool for arbitrary files. Source and target must share a composite key. Schema-drift handling exists but is intentionally minimal.

Limitations and known issues

  • SAS numerics are Float64. SAS stores all numbers as 8-byte floats. After conversion, columns that are conceptually integers (IDs, counts) arrive as f64. A future release will auto-downcast int-like floats.
  • Float equality is exact. Comparisons use null-safe ==. No fuzzy tolerance for floats yet — 1.0 vs 1.0000001 is a mismatch. --rel-tol is planned for 0.2.0.
  • pyreadstat quirks on Windows. Some .sas7bdat files have out-of-range file timestamps that crash pyreadstat 1.3.4's internal datetime.fromtimestamp call. We patch around this transparently in the SAS reader.
  • Conversion is single-threaded. Polars parallelizes the column-level work, but the chunk loop itself is serial. For 100+ GB files this is usually I/O bound anyway.

Development

git clone https://github.com/kulbshar/sas-parquet.git
cd sas-parquet
python -m venv .venv
source .venv/bin/activate           # or  .venv\Scripts\activate  on Windows
pip install -e ".[ui,dev]"

pytest                               # 105 tests
ruff check src tests

The test suite covers every module end-to-end with synthetic Parquet fixtures, including the Streamlit UI (driven via streamlit.testing.v1.AppTest).

Changelog

See CHANGELOG.md.

License

MIT — see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sas_parquet-0.1.0.tar.gz (37.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sas_parquet-0.1.0-py3-none-any.whl (32.4 kB view details)

Uploaded Python 3

File details

Details for the file sas_parquet-0.1.0.tar.gz.

File metadata

  • Download URL: sas_parquet-0.1.0.tar.gz
  • Upload date:
  • Size: 37.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for sas_parquet-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a685cf7e46ab1f87ddead3d8c94ec5db89c627223ec16d4993da4fb7140f1b3f
MD5 3075968d669a952dbac6c3671f730de8
BLAKE2b-256 363a09490c8c6f7b49af3875519045520f4d8b271d0dededc6586c8d61fd3608

See more details on using hashes here.

File details

Details for the file sas_parquet-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sas_parquet-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 32.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.13

File hashes

Hashes for sas_parquet-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 32f6830b8882a4abeb9bfebaf175eaee2c74d28c5fae16cf7708585309ad08d1
MD5 3aa43c8723a8680763378fa44312710f
BLAKE2b-256 b99787a9377a229efaae5850c739f7b6e1f542acb7681d85cb89670f5f8e6f9d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page