Tools for profiling SAS7BDAT corpora
sas7bdat-profiler
Utilities for profiling SAS7BDAT corpora and scan behavior.
The workspace contains:
- the `sas7bdat-simd` library crate
- the `sas7bdat-profiler` binary package
The profiler package ships Windows command-line executables including:
- `corpus_profile`
- `fixture_catalog`
- `fixture_profile`
- `fixture_string_profile`
Typical usage after installation:
corpus_profile <root> --format csv --out corpus_profile.csv
For full-scan profiling:
corpus_profile <root> --mode typed_batches --projection full --out corpus_profile.csv
For local development from the workspace root:
cargo run -p sas7bdat-profiler --bin corpus_profile -- <root> --format csv --out corpus_profile.csv
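When the profiler is driven from another Rust program, the two invocations above can be assembled with `std::process::Command`. The following is a minimal sketch, not part of the package: the helper name `corpus_profile_command` is hypothetical, it assumes `corpus_profile` is on `PATH`, and it uses only the flags shown in the usage lines above.

```rust
use std::process::Command;

/// Hypothetical helper: builds a `corpus_profile` invocation from the flags
/// documented above. `full_scan` selects the typed_batches/full-projection form.
fn corpus_profile_command(root: &str, out: &str, full_scan: bool) -> Command {
    // Assumes the installed executable is reachable via PATH.
    let mut cmd = Command::new("corpus_profile");
    cmd.arg(root);
    if full_scan {
        cmd.args(["--mode", "typed_batches", "--projection", "full"]);
    } else {
        cmd.args(["--format", "csv"]);
    }
    cmd.args(["--out", out]);
    cmd
}
```

Calling `.status()` on the returned `Command` would run the profiler and report its exit status.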
Production corpus target profile
Based on Transfer_708245_310326/corpus_profile_nosizebytes.csv and local corpus_*.csv analysis:
- Files discovered: 1242
- Profiled: 1019
- Historical failures: 223 (dominant classes were RLE compression decode errors)
- Encoding on all profiled production files: WINDOWS-1252 (legacy)
- Compression mix (profiled files): 554 uncompressed, 465 compressed
- Row-weighted workload mix: 53.45% uncompressed, 46.55% compressed
- Row-weighted content mix: 49.68% string-heavy, 41.26% mixed, 9.06% numeric-heavy
- Row-weighted width mix: 41.62% medium, 39.33% narrow, 19.05% wide
- Row-weighted size mix: 74.54% huge
Throughput from the production corpus profiling run (profiled files only):
- Uncompressed: ~9.71M rows/s
- Compressed: ~3.33M rows/s
This roughly 2.9x gap means compressed-path performance remains the highest-impact optimization target.
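To make that priority concrete, the row-weighted workload mix and the two throughput figures can be combined into an estimate of where scan time goes. This is a back-of-the-envelope sketch: the function names are illustrative, and it assumes the aggregate throughputs apply uniformly to every row on each path, which the profiling run does not guarantee.

```rust
/// Fraction of total scan time spent on one path, given each path's share of
/// rows and its throughput in rows per second.
fn time_share(row_share: f64, rows_per_sec: f64, other_share: f64, other_rows_per_sec: f64) -> f64 {
    // Time is proportional to rows divided by throughput.
    let t = row_share / rows_per_sec;
    let other = other_share / other_rows_per_sec;
    t / (t + other)
}

/// Estimated share of scan time on the compressed path, using the figures
/// reported above (46.55% of rows at ~3.33M rows/s vs 53.45% at ~9.71M rows/s).
fn compressed_time_share() -> f64 {
    time_share(0.4655, 3.33e6, 0.5345, 9.71e6)
}
```

Under these assumptions, `compressed_time_share()` comes out at about 0.72, so compressed files would account for a little under half of the rows but roughly 72% of the scan time.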
Current failed-only re-run status from corpus_failed_only.csv:
- Previously failing files re-run: 223
- Now profiling successfully: 214
- Remaining failures: 9
- Remaining errors are all unsupported subheader compression modes (79, 88, 89, 105, 118, 167)
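The re-run counts translate into coverage percentages with simple arithmetic. The sketch below assumes the 214 recovered files add to the 1019 originally profiled files, which is consistent with 1242 - 1019 = 223 historical failures; the helper name `percent` is illustrative.

```rust
/// Percentage that `part` represents of `total`.
fn percent(part: u32, total: u32) -> f64 {
    100.0 * f64::from(part) / f64::from(total)
}

/// Share of previously failing files that now profile successfully.
fn rerun_recovery_pct() -> f64 {
    percent(214, 223)
}

/// Share of all discovered files that profile successfully after the re-run,
/// assuming the recovered files add to the original 1019.
fn overall_coverage_pct() -> f64 {
    percent(1019 + 214, 1242)
}
```

With these inputs, about 96% of the previously failing files are recovered, and overall coverage rises to roughly 99.3% of the 1242 discovered files.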
Optimization priority for this project:
- Compressed WINDOWS-1252 string-heavy and mixed huge files.
- Medium/narrow width hot paths first, with wide-path support kept performant.
- Correctness support for the remaining compression modes (last 9 files).
- Uncompressed macro-fixtures (for example `ahs2013n`) kept as secondary validation targets.
Built Distribution
File details
Details for the file marys_cdef_sas7bdat_profiler_tkragholm-0.1.0-py3-none-win_amd64.whl.
File metadata
- Download URL: marys_cdef_sas7bdat_profiler_tkragholm-0.1.0-py3-none-win_amd64.whl
- Upload date:
- Size: 1.7 MB
- Tags: Python 3, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 054ed3132332b5b91760672b4cdf7feb81e8515367efd471d8f92a58741a2bf9 |
| MD5 | 0ebed1ee506e68cb7dc4eadcb4f0bc4b |
| BLAKE2b-256 | 066aadee6ddbb7aa19226684bc596c9d0261c8665b93524097ef36bccd186726 |