Tools for profiling SAS7BDAT corpora

Project description

sas7bdat-profiler

Utilities for profiling SAS7BDAT corpora and scan behavior.

The workspace contains:

  • the sas7bdat-simd library crate
  • the sas7bdat-profiler binary package

The profiler package ships Windows command-line executables including:

  • corpus_profile
  • fixture_catalog
  • fixture_profile
  • fixture_string_profile

Typical usage after installation:

corpus_profile <root> --format csv --out corpus_profile.csv

For full-scan profiling:

corpus_profile <root> --mode typed_batches --projection full --out corpus_profile.csv

For local development from the workspace root:

cargo run -p sas7bdat-profiler --bin corpus_profile -- <root> --format csv --out corpus_profile.csv
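When the profiler is driven from a script, the two invocations above can be assembled programmatically. The following is a minimal sketch, not part of the package: the helper name `build_corpus_profile_args` is hypothetical, it uses only the flags shown in this README, and it assumes `corpus_profile` is on `PATH` after installation.

```python
import subprocess


def build_corpus_profile_args(root, out, full_scan=False):
    """Build an argv list for corpus_profile (hypothetical helper).

    Mirrors the two command lines shown in the README exactly; it does not
    assume that --format may be combined with --mode/--projection.
    """
    args = ["corpus_profile", str(root)]
    if full_scan:
        # Full-scan profiling, as in the README example.
        args += ["--mode", "typed_batches", "--projection", "full"]
    else:
        # Default usage, as in the README example.
        args += ["--format", "csv"]
    args += ["--out", str(out)]
    return args


def run_corpus_profile(root, out, full_scan=False):
    """Run the profiler and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_corpus_profile_args(root, out, full_scan), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the corpus root contains spaces.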

Production corpus target profile

Based on Transfer_708245_310326/corpus_profile_nosizebytes.csv and local corpus_*.csv analysis:

  • Files discovered: 1242
  • Profiled: 1019
  • Historical failures: 223 (dominant classes were RLE compression decode errors)
  • Encoding on all profiled production files: WINDOWS-1252 (legacy)
  • Compression mix (profiled files): 554 uncompressed, 465 compressed
  • Row-weighted workload mix: 53.45% uncompressed, 46.55% compressed
  • Row-weighted content mix: 49.68% string-heavy, 41.26% mixed, 9.06% numeric-heavy
  • Row-weighted width mix: 41.62% medium, 39.33% narrow, 19.05% wide
  • Row-weighted size mix: 74.54% huge

Throughput from the production corpus profiling run (profiled files only):

  • Uncompressed: ~9.71M rows/s
  • Compressed: ~3.33M rows/s

This gap means compressed-path performance remains the highest-impact optimization target.
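The size of the gap can be made concrete with a rough estimate of where scan time goes. The sketch below assumes that the row-weighted workload mix and the throughput figures describe the same set of profiled files, and that per-row cost is uniform within each class; both are simplifications, and the function name is illustrative.

```python
def time_share(row_share, rows_per_sec):
    """Estimate each class's share of total scan time.

    Time spent on a class is proportional to (rows in class) / (rows per second).
    """
    cost = {name: row_share[name] / rows_per_sec[name] for name in row_share}
    total = sum(cost.values())
    return {name: value / total for name, value in cost.items()}


# Figures from the production corpus profile above.
ROW_SHARE = {"uncompressed": 0.5345, "compressed": 0.4655}
ROWS_PER_SEC = {"uncompressed": 9.71e6, "compressed": 3.33e6}

shares = time_share(ROW_SHARE, ROWS_PER_SEC)
speed_ratio = ROWS_PER_SEC["uncompressed"] / ROWS_PER_SEC["compressed"]

if __name__ == "__main__":
    print(f"uncompressed/compressed throughput ratio: {speed_ratio:.2f}x")
    for name, share in shares.items():
        print(f"{name}: {share:.1%} of estimated scan time")
```

Under these assumptions the uncompressed path is roughly 2.9x faster per row, and compressed files account for roughly 72% of estimated scan time despite holding under half of the rows.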

Current failed-only re-run status from corpus_failed_only.csv:

  • Previously failing files re-run: 223
  • Now profiling successfully: 214
  • Remaining failures: 9
  • Remaining errors are all unsupported subheader compression modes (79, 88, 89, 105, 118, 167)
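The re-run counts can be folded back into overall corpus coverage. This is a small worked calculation that assumes the 223 re-run files are exactly the files that were not profiled in the original run (1242 discovered minus 1019 profiled); the variable names are illustrative.

```python
# Counts taken from the two status lists above.
discovered = 1242
profiled_initially = 1019
rerun_total = 223
rerun_ok = 214

remaining_failures = rerun_total - rerun_ok
profiled_now = profiled_initially + rerun_ok

rerun_success_rate = rerun_ok / rerun_total
overall_coverage = profiled_now / discovered

if __name__ == "__main__":
    print(f"remaining failures: {remaining_failures}")
    print(f"re-run success rate: {rerun_success_rate:.2%}")
    print(f"overall coverage: {profiled_now}/{discovered} = {overall_coverage:.2%}")
```

Under that assumption, about 96% of the previously failing files now profile successfully, and overall coverage is 1233 of 1242 files, a little over 99%.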

Optimization priority for this project:

  1. Compressed WINDOWS-1252 string-heavy and mixed huge files.
  2. Medium/narrow width hot paths first, with wide-path support kept performant.
  3. Correctness support for the remaining compression modes (last 9 files).
  4. Uncompressed macro-fixtures (for example ahs2013n) kept as secondary validation targets.


Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

marys_cdef_sas7bdat_profiler_tkragholm-0.1.0-py3-none-win_amd64.whl (1.7 MB)

Uploaded for Python 3, Windows x86-64

File details

Details for the file marys_cdef_sas7bdat_profiler_tkragholm-0.1.0-py3-none-win_amd64.whl.

File hashes

Hashes for marys_cdef_sas7bdat_profiler_tkragholm-0.1.0-py3-none-win_amd64.whl:

  Algorithm     Hash digest
  SHA256        054ed3132332b5b91760672b4cdf7feb81e8515367efd471d8f92a58741a2bf9
  MD5           0ebed1ee506e68cb7dc4eadcb4f0bc4b
  BLAKE2b-256   066aadee6ddbb7aa19226684bc596c9d0261c8665b93524097ef36bccd186726
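
A downloaded wheel can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch: the helper names are illustrative, and the local path to the wheel is an assumption that depends on where the file was saved.

```python
import hashlib

# SHA256 digest published for the 0.1.0 win_amd64 wheel (from the list above).
EXPECTED_SHA256 = "054ed3132332b5b91760672b4cdf7feb81e8515367efd471d8f92a58741a2bf9"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's SHA256 digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("marys_cdef_sas7bdat_profiler_tkragholm-0.1.0-py3-none-win_amd64.whl")` should return `True` for an intact download in the current directory.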
