ambers

Pure Rust SPSS .sav/.zsav reader and writer — Arrow-native, zero C dependencies.

Features

  • Blazing fast read and write for SPSS .sav (bytecode) and .zsav (zlib) files
  • Rich metadata: variable labels, value labels, missing values, MR sets, measure levels, and more
  • Lazy reader via scan_sav() — Polars LazyFrame with projection and row limit pushdown
  • Pure Rust with a native Python API — native Arrow integration, no C dependencies
  • Benchmarked 1.1–3.6x faster reads than polars_readstat, and 6–16x faster reads / 6–41x faster writes than pyreadstat (see Performance below)

Installation

Python:

uv add ambers

Rust:

cargo add ambers

Python

import ambers as am
import polars as pl

# Eager read — returns SavFile with .data and .meta
sav = am.read_sav("survey.sav")
df, meta = sav.data, sav.meta

# Lazy read — .data is a Polars LazyFrame
sav = am.scan_sav("survey.sav")
lf, meta = sav.data, sav.meta
df = lf.select(["Q1", "Q2", "age"]).head(1000).collect()

# Explore metadata
meta.summary()
meta.describe("Q1")
meta.value("Q1")

# Read metadata only (fast, skips data)
meta = am.read_sav_meta("survey.sav")

# Write back — roundtrip with full metadata
sav = am.read_sav("input.sav")
df, meta = sav.data, sav.meta
df = df.filter(pl.col("age") > 18)
am.write_sav(df, "filtered.sav", meta=meta)                        # bytecode (default for .sav)
am.write_sav(df, "compressed.zsav", meta=meta)                     # zlib (default for .zsav)
am.write_sav(df, "raw.sav", meta=meta, compression="uncompressed") # no compression
am.write_sav(df, "fast.zsav", meta=meta, compression_level=1)      # fast zlib

# From scratch — metadata is optional, inferred from DataFrame schema
am.write_sav(df, "new.sav")

# Apply value labels — replace codes with labels for export/analysis
df, meta = sav.data, sav.meta
labeled = am.apply_labels(df, meta)                          # Enum dtype (ordered, strict)
labeled.write_excel("survey.xlsx")                            # Enum auto-casts to String
labeled = am.apply_labels(df, meta, output="string")          # String dtype for export
labeled = am.apply_labels(df, meta, output="enum_null")       # Enum, unmapped → null
labeled = am.apply_labels(df, meta, exclude=["weight", "id"])  # skip specific columns

# Apply missing values — nullify SPSS user-defined missing codes
clean = am.apply_missing(df, meta)                             # all columns with specs
clean = am.apply_missing(df, meta, columns=["Q1", "Q2"])       # specific columns only
clean = am.apply_missing(df, meta, exclude=["age"])            # skip specific columns

# Validate — check value label quality before analysis
report = am.validate(df, meta)
print(report)                                                   # box-drawing summary
report.is_valid                                                 # True if no errors
report.raise_if_invalid()                                       # raise if errors found
report.to_frame()                                               # DataFrame for export

.sav uses bytecode compression by default, .zsav uses zlib. Pass compression= to override ("uncompressed", "bytecode", "zlib"). Pass meta= to preserve all metadata from a prior read_sav(), or omit it to infer formats from the DataFrame.

SavFile

read_sav() and scan_sav() return a SavFile object with file-level metadata alongside the data:

>>> sav = am.read_sav("survey_2025.sav")
>>> sav
┌─ SavFile ──────────────────────────┐
│ Data        DataFrame (polars)     │
│ Shape       22,070 rows x 677 cols │
│ Source      survey_2025.sav        │
│ File size   146.5 MB, bytecode     │
│ Read time   0.286s                 │
└────────────────────────────────────┘

| Attribute | Type | Description |
|---|---|---|
| sav.data | DataFrame or LazyFrame | The data (eager from read_sav, lazy from scan_sav) |
| sav.meta | SpssMetadata | All variable metadata (labels, formats, value labels, etc.) |
| sav.source | str \| None | Source file path |
| sav.shape | tuple[int, int] \| None | (n_rows, n_cols) |
| sav.file_size | int \| None | File size in bytes |
| sav.read_time | float \| None | Wall-clock read time in seconds |
| sav.compression | str | "uncompressed", "bytecode", or "zlib" |

For scan_sav(), read_time measures metadata/schema reading only (not lazy collection).
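
A small sketch that logs basic file facts using the attributes above (eager read, so shape and read_time are populated; continuing the imports from the quickstart):

sav = am.read_sav("survey.sav")
rows, cols = sav.shape
print(f"{sav.source}: {rows:,} rows x {cols} cols "
      f"({sav.compression}), read in {sav.read_time:.3f}s")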

apply_labels

Replace numeric/string codes with their SPSS value labels. By default produces Polars Enum columns that preserve SPSS definition order — crucial for Likert scales and survey analysis.

sav = am.read_sav("survey.sav")
df, meta = sav.data, sav.meta

# Default: Enum output, strict validation
labeled = am.apply_labels(df, meta)
labeled.group_by("satisfaction").agg(pl.len())  # sorted by definition order
labeled.write_excel("survey.xlsx")              # Enum auto-casts to String

# String output for quick export
labeled = am.apply_labels(df, meta, output="string")

# Enum output with unmapped values as null
labeled = am.apply_labels(df, meta, output="enum_null")

| output= | Dtype | Unmapped values | Best for |
|---|---|---|---|
| "enum" (default) | pl.Enum (ordered) | Error | Analysis — strict, validated categories |
| "string" | pl.String | Stringify (3.0 → "3") | Export — readable text for Excel/CSV |
| "enum_null" | pl.Enum (ordered) | Null | Analysis — exclude unknowns from base |

Numeric columns without value labels are skipped. String columns always pass through unmapped text. See apply_labels.md for full documentation.
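
Because Enum categories keep SPSS definition order, sorting follows the scale rather than the alphabet. A minimal sketch, continuing the session above (the "satisfaction" column and its labels are illustrative):

labeled = am.apply_labels(df, meta)
(
    labeled
    .group_by("satisfaction")
    .agg(pl.len().alias("n"))
    .sort("satisfaction")  # Enum order: "Very low" … "Very high", not alphabetical
)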

validate

Check value label quality before analysis — catch unlabeled values and duplicate labels upfront.

sav = am.read_sav("survey.sav")
df, meta = sav.data, sav.meta

report = am.validate(df, meta)
print(report)           # box-drawing summary
report.is_valid         # True if no errors (warnings OK)
report.raise_if_invalid()  # raise ValueError if errors

# Programmatic access
for error in report.errors:
    print(f"{error.column}: {error.details['unlabeled_values']}")

# Export as DataFrame
report.to_frame().write_csv("validation_issues.csv")
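
One common pattern is a CI-style gate: export the issue table for review, then fail hard on errors. A sketch using only the calls shown above:

report = am.validate(df, meta)
if not report.is_valid:
    report.to_frame().write_csv("validation_issues.csv")  # keep a record for review
report.raise_if_invalid()  # stops the pipeline if any errors were found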

See validate.md for full documentation.

Rust

use ambers::{read_sav, read_sav_metadata};

// Read data + metadata
let (batch, meta) = read_sav("survey.sav")?;
println!("{} rows, {} cols", batch.num_rows(), meta.number_columns);

// Read metadata only
let meta = read_sav_metadata("survey.sav")?;
println!("{}", meta.label("Q1").unwrap_or("(no label)"));

Metadata API (Python)

| Method | Description |
|---|---|
| meta.summary() | Formatted overview: file info, type distribution, annotations |
| meta.describe("Q1") | Deep-dive into a single variable (or list of variables) |
| meta.diff(other) | Compare two metadata objects, returns MetaDiff |
| meta.label("Q1") | Variable label |
| meta.value("Q1") | Value labels dict |
| meta.format("Q1") | SPSS format string (e.g. "F8.2", "A50") |
| meta.measure("Q1") | Measurement level ("nominal", "ordinal", "scale") |
| meta.role("Q1") | Variable role ("input", "target", "both", "none", "partition", "split") |
| meta.attribute("Q1", "CustomNote") | Custom attribute values (list[str] or None) |
| meta.schema | Full metadata as a nested Python dict |

All variable-name methods raise KeyError for unknown variables.
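
Because lookups raise instead of silently returning None, a defensive scan looks like this (sketch; "Q99" is a deliberately missing variable):

for name in ["Q1", "Q99"]:
    try:
        print(name, meta.label(name))
    except KeyError:
        print(name, "not in this file")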

Metadata Fields

All fields returned by the reader. Fields marked Write are preserved when passed via meta= to write_sav(). Read-only fields are set automatically (encoding, timestamps, row/column counts, etc.).

Note: This is a first pass — field names and behavior may change without warning in future releases.

| Field | Read | Write | Type |
|---|---|---|---|
| file_label | yes | yes | str |
| file_format | yes | | str |
| file_encoding | yes | | str |
| creation_time | yes | | str |
| compression | yes | | str |
| number_columns | yes | | int |
| number_rows | yes | | int \| None |
| weight_variable | yes | yes | str \| None |
| notes | yes | yes | list[str] |
| variable_names | yes | | list[str] |
| variable_labels | yes | yes | dict[str, str] |
| variable_value_labels | yes | yes | dict[str, dict[float\|str, str]] |
| variable_formats | yes | yes | dict[str, str] |
| variable_measures | yes | yes | dict[str, str] |
| variable_alignments | yes | yes | dict[str, str] |
| variable_storage_widths | yes | | dict[str, int] |
| variable_display_widths | yes | yes | dict[str, int] |
| variable_roles | yes | yes | dict[str, str] |
| variable_missing_values | yes | yes | dict[str, dict] |
| variable_attributes | yes | yes | dict[str, dict[str, list[str]]] |
| mr_sets | yes | yes | dict[str, dict] |
| arrow_data_types | yes | | dict[str, str] |

Creating metadata from scratch:

meta = am.SpssMetadata(
    file_label="Customer Survey 2026",
    variable_labels={"Q1": "Satisfaction", "Q2": "Loyalty"},
    variable_value_labels={"Q1": {1: "Low", 5: "High"}},
    variable_measures={"Q1": "ordinal", "Q2": "nominal"},
)
am.write_sav(df, "output.sav", meta=meta)

Modifying existing metadata (from read_sav() or a previously created SpssMetadata):

# .update() — bulk update multiple fields at once, merges dicts, replaces scalars
meta2 = meta.update(
    file_label="Updated Survey",
    variable_labels={"Q3": "NPS"},        # Q1/Q2 labels preserved, Q3 added
    variable_measures={"Q3": "scale"},
)

# .with_*() — chainable single-field setters, with full IDE autocomplete and type hints
meta3 = (meta
    .with_file_label("Updated Survey")
    .with_variable_labels({"Q3": "NPS"})
    .with_variable_measures({"Q3": "scale"})
)

Immutability: SpssMetadata is immutable. .update() and .with_*() always return a new instance — the original is never modified. Assign to a new variable if you need to keep both copies.

Update logic:

  • Dict fields (labels, formats, measures, etc.) merge as an overlay — new keys are added, existing keys are overwritten, all other keys are preserved. Pass {key: None} to remove a key (see the sketch after this list).
  • Scalar fields (file_label, weight_variable) and notes are replaced entirely.
  • Column renames are not tracked. If you rename "Q1" to "Q1a" in your DataFrame, metadata for "Q1" does not carry over — you must explicitly provide metadata for "Q1a".
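
A short sketch of the overlay rules: one key added, one removed, the rest untouched:

# Q1's label survives, Q2's is removed, Q3's is added
meta2 = meta.update(variable_labels={"Q2": None, "Q3": "NPS"})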

See metadata.md for the full API reference including update logic details, missing values, MR sets, and validation rules.

SPSS tip: Custom variable attributes are not shown in SPSS's Variable View by default. Go to View > Customize Variable View and click OK, or run DISPLAY ATTRIBUTES in SPSS syntax.

Streaming Reader (Rust)

let mut scanner = ambers::scan_sav("survey.sav")?;
scanner.select(&["age", "gender"])?;
scanner.limit(1000);

while let Some(batch) = scanner.next_batch()? {
    println!("Batch: {} rows", batch.num_rows());
}

Performance

Eager Read

All benchmarks measure reading into a Polars DataFrame. Best of 3–5 runs (with warmup) on Windows 11, Python 3.13, Intel Core Ultra 9 275HX (24C), 64 GB RAM (6400 MT/s).

| File | Size | Rows | Cols | ambers | polars_readstat | pyreadstat | vs prs | vs pyreadstat |
|---|---|---|---|---|---|---|---|---|
| test_1 (bytecode) | 0.2 MB | 1,500 | 75 | < 0.01s | < 0.01s | 0.011s | | |
| test_2 (bytecode) | 147 MB | 22,070 | 677 | 0.286s | 0.897s | 3.524s | 3.1x | 12x |
| test_3 (uncompressed) | 1.1 GB | 79,066 | 915 | 0.322s | 1.150s | 4.918s | 3.6x | 15x |
| test_4 (uncompressed) | 0.6 MB | 201 | 158 | 0.002s | 0.003s | 0.012s | 1.5x | 6x |
| test_5 (uncompressed) | 0.6 MB | 203 | 136 | 0.002s | 0.003s | 0.016s | 1.5x | 8x |
| test_6 (uncompressed) | 5.4 GB | 395,330 | 916 | 1.600s | 1.752s | 25.214s | 1.1x | 16x |
  • Faster than polars_readstat on all tested files — 1.1–3.6x faster
  • 6–16x faster than pyreadstat across all file sizes
  • No PyArrow dependency — uses the Arrow PyCapsule Interface for zero-copy transfer (sketched below)
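
In practice, the capsule interface means any Arrow-aware consumer can ingest the frame directly. A sketch, assuming a recent pyarrow (≥ 14) that consumes the Arrow C stream capsule Polars exposes:

import pyarrow as pa

sav = am.read_sav("survey.sav")
tbl = pa.table(sav.data)  # hands off via __arrow_c_stream__, no intermediate copy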

Lazy Read with Pushdown

scan_sav() returns a Polars LazyFrame. Unlike eager reads, it only reads the data you ask for:

| File (size) | Full collect | Select 5 cols | Head 1000 rows | Select 5 + head 1000 |
|---|---|---|---|---|
| test_2 (147 MB, 22K × 677) | 0.903s | 0.363s (2.5x) | 0.181s (5.0x) | 0.157s (5.7x) |
| test_3 (1.1 GB, 79K × 915) | 0.700s | 0.554s (1.3x) | 0.020s (35x) | 0.012s (58x) |
| test_6 (5.4 GB, 395K × 916) | 3.062s | 2.343s (1.3x) | 0.022s (139x) | 0.013s (236x) |

On the 5.4 GB file, selecting 5 columns and 1000 rows completes in 13ms — 236x faster than reading the full dataset.

Write

write_sav() writes a Polars DataFrame + metadata back to .sav (bytecode) or .zsav (zlib). Best of 5 runs on the same machine.

| File | Size | Rows | Cols | Mode | ambers | pyreadstat | Speedup |
|---|---|---|---|---|---|---|---|
| test_1 (bytecode) | 0.2 MB | 1,500 | 75 | .sav | 0.001s | 0.019s | 13x |
| | | | | .zsav | 0.004s | 0.025s | 6x |
| test_2 (bytecode) | 147 MB | 22,070 | 677 | .sav | 0.539s | 3.622s | 7x |
| | | | | .zsav | 0.386s | 4.174s | 11x |
| test_3 (uncompressed) | 1.1 GB | 79,066 | 915 | .sav | 0.439s | 13.963s | 32x |
| | | | | .zsav | 0.436s | 17.991s | 41x |
| test_4 (uncompressed) | 0.6 MB | 201 | 158 | .sav | 0.002s | 0.027s | 16x |
| | | | | .zsav | 0.004s | 0.035s | 9x |
| test_5 (uncompressed) | 0.6 MB | 203 | 136 | .sav | 0.001s | 0.023s | 17x |
| | | | | .zsav | 0.003s | 0.027s | 9x |
| test_6 (uncompressed) | 5.4 GB | 395,330 | 916 | .sav | 2.511s | 84.836s | 34x |
| | | | | .zsav | 2.255s | 90.499s | 40x |
  • 6–41x faster than pyreadstat on writes across all files and compression modes
  • Full metadata roundtrip: variable labels, value labels, missing values, MR sets, display properties
  • Bytecode (.sav) and zlib (.zsav) compression

Roadmap

  • apply_missing_values() — apply SPSS missing value definitions to DataFrames
  • meta.validate(df) — validate metadata against a DataFrame
  • Codebook export — generate variable documentation from metadata
  • Continued I/O performance optimization
  • Currently Polars-only — pandas/other DataFrame libraries may be added later

License

MIT
