Tools for mapping out SAS7BDAT files

These details have not been verified by PyPI

Project description

sas7bdat-cli

Utilities for converting and profiling SAS7BDAT data.

The workspace contains:

the sas7bdat-simd library crate
the sas7bdat-cli package, which ships the sas7bdat-convert, sas7bdat-inspect, and profiling binaries
the sas7bdat-polars Python extension package

The package ships command-line executables including:

sas7bdat-corpus-catalog
sas7bdat-corpus-profile
sas7bdat-fixture-profile
sas7bdat-fixture-string-profile

Typical usage after installation:

sas7bdat-corpus-profile <root> --format csv --out corpus_profile.csv

Current progress

The workspace now has a split benchmark for the Python plugin and the raw Rust path, with cold-start and warm steady-state measurements separated.

Recent results on fixtures/ahs2013n.sas7bdat:

batch_reader warm steady-state is close to the raw Rust scan path, at about 1.1x the raw average.
scan_sas warm steady-state is still slower, at about 1.5x the raw average.
Cold-start timings are still dominated by dataset open and first-scan setup, which is expected for the lazy descriptor cache.

This means the remaining optimization work is in the Polars/DataFrame conversion and Python handoff path, not the core SAS page scan.
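The cold/warm split described above can be illustrated with a minimal timing sketch. It does not use the plugin's real API: `FakeReader` and `fake_scan` are hypothetical stand-ins for "open the dataset" and "scan it once", and the helper only shows how the two measurements are kept apart.

```python
import time
from statistics import mean


def time_cold_and_warm(make_reader, scan, warm_iters=5):
    """Return (cold_seconds, warm_mean_seconds)."""
    # Cold start: dataset open plus the first scan are timed together.
    t0 = time.perf_counter()
    reader = make_reader()
    scan(reader)
    cold = time.perf_counter() - t0

    # Warm steady state: reuse the already-open reader for repeated scans.
    warm = []
    for _ in range(warm_iters):
        t0 = time.perf_counter()
        scan(reader)
        warm.append(time.perf_counter() - t0)
    return cold, mean(warm)


class FakeReader:
    """Hypothetical stand-in; building the list simulates open/setup cost."""

    def __init__(self):
        self.rows = list(range(200_000))


def fake_scan(reader):
    # Hypothetical stand-in for a full scan over the open dataset.
    return sum(reader.rows)


cold, warm = time_cold_and_warm(FakeReader, fake_scan)
print(f"cold={cold * 1e3:.2f} ms, warm mean={warm * 1e3:.2f} ms")
```

Reporting the two numbers separately keeps one-time setup cost from being averaged into the steady-state figure.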

Roadmap: see docs/ROADMAP.md for the repo-wide plan and docs/PLUGIN_ROADMAP.md for the plugin execution plan and benchmark matrix.

Historical notes and superseded reports are kept in docs/archive/README.md.

For full-scan profiling:

sas7bdat-corpus-profile <root> --mode typed_batches --projection full --out corpus_profile.csv

For local development from the workspace root:

cargo run -p sas7bdat-cli --bin sas7bdat-corpus-profile -- <root> --format csv --out corpus_profile.csv

Testing

Use the just targets to keep the Rust core and the Python plugin tests separate:

just test-core
just test-polars-plugin-rust
just test-polars-plugin
just test

just test-core runs the core Rust workspace tests with cargo nextest.
just test-polars-plugin-rust runs the plugin crate's Rust tests with the PyO3 extension-module feature disabled.
just test-polars-plugin builds the extension module and runs the Python smoke tests.
just test runs the full sequence.

Standalone CLI

The workspace also ships small sas7bdat-convert and sas7bdat-inspect commands in the sas7bdat-cli package.

Examples:

cargo run -p sas7bdat-cli --bin sas7bdat-convert -- fixtures/ahs2013n.sas7bdat --sink parquet --out /tmp/out.parquet
cargo run -p sas7bdat-cli --bin sas7bdat-inspect -- fixtures/ahs2013n.sas7bdat --json
just convert fixtures/ahs2013n.sas7bdat --sink parquet --out /tmp/out.parquet
just inspect fixtures/ahs2013n.sas7bdat --json

The convert command supports directory recursion, column projection, row limits, and Parquet, CSV, or TSV output. An optional .sas7bcat companion can be passed via --catalog so that Parquet output carries value-label metadata.
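For scripted conversions, the command line can be assembled in Python. This is a sketch that uses only the flags shown in this README (--sink, --out, --catalog). It assumes sas7bdat-convert is on PATH and that --catalog takes a file path; `fixtures/formats.sas7bcat` is a hypothetical file name.

```python
import shlex
from typing import List, Optional


def convert_argv(src: str, out: str, sink: str = "parquet",
                 catalog: Optional[str] = None) -> List[str]:
    """Build an argument list for sas7bdat-convert from README-documented flags."""
    argv = ["sas7bdat-convert", src, "--sink", sink, "--out", out]
    if catalog is not None:
        # Assumption: --catalog is followed by the path to a .sas7bcat file.
        argv += ["--catalog", catalog]
    return argv


argv = convert_argv(
    "fixtures/ahs2013n.sas7bdat",
    "/tmp/out.parquet",
    catalog="fixtures/formats.sas7bcat",  # hypothetical companion catalog
)
print(shlex.join(argv))
# To execute it: subprocess.run(argv, check=True)
```

Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.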

Production corpus target profile

Based on Transfer_708245_310326/corpus_profile_nosizebytes.csv and local corpus_*.csv analysis:

Files discovered: 1242
Profiled: 1019
Historical failures: 223 (predominantly RLE compression decode errors)
Encoding on all profiled production files: WINDOWS-1252 (legacy)
Compression mix (profiled files): 554 uncompressed, 465 compressed
Row-weighted workload mix: 53.45% uncompressed, 46.55% compressed
Row-weighted content mix: 49.68% string-heavy, 41.26% mixed, 9.06% numeric-heavy
Row-weighted width mix: 41.62% medium, 39.33% narrow, 19.05% wide
Row-weighted size mix: 74.54% huge
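"Row-weighted" means each file counts in proportion to its row count rather than once per file. A minimal sketch of that calculation follows. The column names (`rows`, `compressed`) and the sample data are hypothetical; the actual schema of corpus_profile.csv is not documented here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample; the real CSV columns may differ.
SAMPLE = """path,rows,compressed
a.sas7bdat,1000,false
b.sas7bdat,3000,true
c.sas7bdat,6000,false
"""


def row_weighted_mix(records, key):
    """Return {category: percent of total rows} for the given column."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += int(rec["rows"])
    grand = sum(totals.values())
    if grand == 0:
        return {}
    return {k: 100.0 * v / grand for k, v in totals.items()}


mix = row_weighted_mix(csv.DictReader(io.StringIO(SAMPLE)), "compressed")
for label, pct in sorted(mix.items()):
    print(f"compressed={label}: {pct:.2f}%")
```

In this sample two of three files are uncompressed, but they hold 70% of the rows, which is why file counts and row-weighted shares can differ.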

Throughput from the production corpus profiling run (profiled files only):

Uncompressed: ~9.71M rows/s
Compressed: ~3.33M rows/s

This gap means compressed-path performance remains the highest-impact optimization target.
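A rough way to see the impact is to combine the two throughput figures with the row-weighted workload mix. The sketch below assumes the aggregate rates apply uniformly to every row in each class, which is a simplification.

```python
# Figures quoted in this README.
UNCOMPRESSED_RATE = 9.71e6   # rows/s
COMPRESSED_RATE = 3.33e6     # rows/s
UNCOMPRESSED_SHARE = 0.5345  # row-weighted share of the workload
COMPRESSED_SHARE = 0.4655

# Seconds spent per row of overall workload, split by class.
t_unc = UNCOMPRESSED_SHARE / UNCOMPRESSED_RATE
t_cmp = COMPRESSED_SHARE / COMPRESSED_RATE

speed_ratio = UNCOMPRESSED_RATE / COMPRESSED_RATE
blended_rate = 1.0 / (t_unc + t_cmp)
compressed_time_share = t_cmp / (t_unc + t_cmp)

print(f"uncompressed/compressed speed ratio: {speed_ratio:.2f}x")
print(f"blended throughput: {blended_rate / 1e6:.2f}M rows/s")
print(f"share of scan time on compressed rows: {compressed_time_share:.1%}")
```

Under these assumptions, compressed rows are less than half of the workload but take well over half of the total scan time.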

Current failed-only re-run status from corpus_failed_only.csv:

Previously failing files re-run: 223
Now profiling successfully: 214
Remaining failures: 9
Remaining errors are all unsupported subheader compression modes (79, 88, 89, 105, 118, 167)

Optimization priority for this project:

Compressed WINDOWS-1252 string-heavy and mixed huge files.
Medium/narrow width hot paths first, with wide-path support kept performant.
Correctness support for the remaining compression modes (last 9 files).
Uncompressed macro-fixtures (for example ahs2013n) kept as secondary validation targets.

Top3 target benchmark snapshot

Command used:

cargo bench --bench compression_matrix -- 'top3_target/'

filename	runtime	throughput	commit-id
`nysdoh_brfss_surveydata_2018_ad5548ba`	48.75 ms [48.64 ms; 48.87 ms]	733.65 Kelem/s [731.85 Kelem/s; 735.34 Kelem/s]	`4ca487e`
`nyyts_2000_2018_publicuse_aec3d115`	154.19 ms [154.05 ms; 154.34 ms]	764.20 Kelem/s [763.47 Kelem/s; 764.89 Kelem/s]	`4ca487e`
`nyyts_2000_2020_publicuse_c85e9144`	179.69 ms [179.55 ms; 179.84 ms]	677.43 Kelem/s [676.89 Kelem/s; 677.96 Kelem/s]	`4ca487e`
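Since throughput is elements divided by runtime, the point estimates in the snapshot imply an element count per benchmark iteration. The sketch below recovers it. Whether one element corresponds to one row is an assumption; the snapshot does not say.

```python
# (runtime in ms, throughput in Kelem/s) point estimates from the snapshot.
SNAPSHOT = {
    "nysdoh_brfss_surveydata_2018_ad5548ba": (48.75, 733.65),
    "nyyts_2000_2018_publicuse_aec3d115": (154.19, 764.20),
    "nyyts_2000_2020_publicuse_c85e9144": (179.69, 677.43),
}


def implied_elements(runtime_ms: float, kelem_per_s: float) -> float:
    """elements = runtime [s] * throughput [elem/s]."""
    return (runtime_ms / 1000.0) * (kelem_per_s * 1000.0)


implied = {name: implied_elements(*vals) for name, vals in SNAPSHOT.items()}
for name, count in implied.items():
    print(f"{name}: ~{count:,.0f} elements per iteration")
```

Comparing the implied counts against known row counts is a quick sanity check that a benchmark is measuring the intended amount of work.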

Typed-batches hotpath profiling

Store hotpath output next to Criterion artifacts:

just hotpath-typed-batches-target

This writes a JSON profile to:

target/criterion/hotpath/typed_batches_target.json

Override defaults with environment variables, for example:

MAX_FILES=3 BATCH_ROWS=256 HOTPATH_OUTPUT_PATH=target/criterion/hotpath/custom.json just hotpath-typed-batches-target
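The override pattern can be sketched as follows. This only illustrates "environment variable wins over the default"; how the just recipe actually reads these variables is not shown in this README, and the defaults for MAX_FILES and BATCH_ROWS are unknown, so they are left as None here.

```python
import os

# Default output path as stated in this README.
DEFAULT_OUTPUT = "target/criterion/hotpath/typed_batches_target.json"


def hotpath_settings(env=None):
    """Resolve hotpath settings, letting environment variables override defaults."""
    env = os.environ if env is None else env

    def opt_int(name):
        # None means "use the tool's own default" (actual value not documented).
        return int(env[name]) if name in env else None

    return {
        "max_files": opt_int("MAX_FILES"),
        "batch_rows": opt_int("BATCH_ROWS"),
        "output_path": env.get("HOTPATH_OUTPUT_PATH", DEFAULT_OUTPUT),
    }


custom = hotpath_settings({
    "MAX_FILES": "3",
    "BATCH_ROWS": "256",
    "HOTPATH_OUTPUT_PATH": "target/criterion/hotpath/custom.json",
})
print(custom)
```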

For a quick pass over only the 3 largest target files:

just hotpath-typed-batches-top3

Project details


Operating System
- Microsoft :: Windows

This version

0.1.0

May 6, 2026


Source Distributions

No source distribution files available for this release.

Built Distribution


sas7bdat_dir_mapper-0.1.0-py3-none-win_amd64.whl (5.2 MB)

Uploaded May 6, 2026; Python 3, Windows x86-64

File details

Details for the file sas7bdat_dir_mapper-0.1.0-py3-none-win_amd64.whl.

File metadata

Download URL: sas7bdat_dir_mapper-0.1.0-py3-none-win_amd64.whl
Upload date: May 6, 2026
Size: 5.2 MB
Tags: Python 3, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.6

File hashes

Hashes for sas7bdat_dir_mapper-0.1.0-py3-none-win_amd64.whl
Algorithm	Hash digest
SHA256	`4c67a49ad7fd552289ad46b031030fb6e39e65cc24ff9ae12161cf048a218293`
MD5	`f236712a7fc87f6b8d88d1d48f70b9fe`
BLAKE2b-256	`baa1fc6011442cedfe3d241456b15275a088225cd319682148c01dc5f5b2fea8`


sas7bdat-dir-mapper 0.1.0
