⚡ ultrasav

An 'Ultra-powerful' Python package for preparing production-ready SPSS/SAV files using a two-track architecture that separates data and metadata operations.

💡 Motivation

ultrasav is built as a thoughtful wrapper around the excellent pyreadstat package. We're not here to reinvent the wheel for reading and writing SAV files - pyreadstat already does that brilliantly!

Instead, ultrasav provides additional transformation tools for the tasks that come up constantly when you work with SAV files:

  • ๐Ÿท๏ธ Rename variables - Change variable names in batch with clean methodology
  • ๐Ÿ”„ Recode values - Transform codes across multiple variables with clean syntax
  • ๐Ÿท๏ธ Update labels - Batch update variable labels and value labels without losing track
  • ๐Ÿ“Š Reorganize columns - Move variables to specific positions for standardized layouts
  • ๐Ÿ“€ Merge files intelligently - Stack survey data while preserving all metadata
  • ๐ŸŽฏ Handle missing values - Consistent missing value definitions across datasets
  • ๐Ÿฆธ Inspect & report metadata - Generate datamaps and validation reports with metaman

🎯 Core Philosophy

ultrasav follows a simple but powerful principle: Data and Metadata are two independent layers that only come together at read/write time.

┌─────────────┐         ┌─────────────┐
│   DATA      │         │  METADATA   │
│  DataFrame  │         │   Labels    │
│  Operations │         │   Formats   │
└─────────────┘         └─────────────┘
      │                         │
      └────────┬────────────────┘
               │
               ▼
         ┌─────────────┐
         │  WRITE SAV  │
         └─────────────┘

The Common Problems

If you work with SPSS files in Python, you've probably asked yourself:

  • How do I bulk update variable labels and value labels?
  • How do I quickly relocate variables to ideal positions?
  • How do I merge datasets - and more specifically, how are the labels being merged?
  • How can I see a comprehensive datamap of my data?
  • Most importantly: How do I prepare a tidy SPSS file with clean labels and metadata that is production-ready?

ultrasav answers all of these.

The ultrasav Way

import ultrasav as ul

# Read → splits into two independent tracks
df, meta = ul.read_sav("survey.sav")

# Track 1 - Data: Transform data freely
data = ul.Data(df) # Wrap df into our Data class
df = (
    data
    .move(first=['id'])
    .rename({'Q1': 'satisfaction'})
    .replace({'satisfaction': {6: 99}})
    .to_native()
)

# Track 2 - Metadata: Update metadata independently (immutable - returns NEW object)
meta = ul.Metadata(meta) # Wrap meta into our Metadata class
meta = meta.update(
    column_labels={'satisfaction': 'Overall satisfaction'},
    variable_value_labels={'recommend': {0: 'No', 1: 'Yes'}}
)

# Convergence: Reunite at write time
ul.write_sav(df, meta, "clean_survey.sav")

The goal is to provide you with a clean and easy-to-understand way to transform your SPSS data that you can use in real production workflows with minimal tweaking.

🚀 DataFrame-Agnostic Design

One of ultrasav's superpowers is being dataframe-agnostic: it works seamlessly with both Polars and pandas thanks to narwhals under the hood:

  • 🐻‍❄️ Polars by default - Blazing fast performance out of the box
  • 🐼 Pandas fully supported - Use output_format="pandas" when needed
  • 🔄 Switch freely - Convert between pandas and Polars anytime
  • 🔧 Future-proof - Ready for whatever dataframe library comes next

Default output format: Polars - All operations return Polars DataFrames by default. Pandas is fully supported via the output_format="pandas" parameter.

import ultrasav as ul

# Polars by default
df_pl, meta = ul.read_sav("survey.sav", output_format="polars")

# Or explicitly request pandas
df_pd, meta = ul.read_sav("survey.sav", output_format="pandas")

# The Data class works with either
data = ul.Data(df_pl)  # Works with both Polars and pandas!

# Transform using ultrasav's consistent API
data = data.rename({"Q1": "satisfaction"}).replace({'satisfaction': {6: 99}})
df_native = data.to_native()  # Get back your polars DataFrame

Who Is This For?

  • 📊 Market Researchers - Merge waves, standardize labels, prepare deliverables
  • 🔬 Data Scientists - Clean survey data, prepare features, maintain metadata
  • 🏭 Data Engineers - Build robust pipelines that preserve SPSS metadata
  • 🎓 Academic Researchers - Manage longitudinal studies, harmonize datasets
  • 📈 Anyone working with SPSS - If you use SAV files regularly, this is for you!

🚀 Installation

# Using uv
uv add ultrasav

# Or using pip
pip install ultrasav

📚 Quick Start

Basic Usage

import ultrasav as ul

# Read SPSS file - automatically splits into data and metadata
df, meta = ul.read_sav("survey.sav")
# Note: You can also use pyreadstat directly - our classes work with pyreadstat meta objects too

# Track 1: Process data independently
data = ul.Data(df)  # Wrap in Data class for transformations
data = data.move(first=["ID", "Date"])  # Reorder columns
data = data.rename({"Q1": "Satisfaction"})  # Rename columns
data = data.replace({"Satisfaction": {99: None}})  # Replace values
df = data.to_native()  # Back to native DataFrame

# Track 2: Process metadata independently (immutable updates)
meta = ul.Metadata(meta)
meta = meta.update(
    column_labels={"Satisfaction": "Customer Satisfaction Score"},
    variable_value_labels={
        "Satisfaction": {1: "Very Dissatisfied", 5: "Very Satisfied"}
    },
    variable_measure={
        'Satisfaction': 'ordinal',
        'Gender': 'nominal',
        'Age': 'scale',
    }
)

# Convergence: Write both tracks to SPSS
ul.write_sav(df, meta, "cleaned_survey.sav")

Merging Files

import ultrasav as ul

# Merge multiple files vertically with automatic metadata handling
df, meta = ul.add_cases([
    "wave1.sav",
    "wave2.sav", 
    "wave3.sav"
])

# Metadata is automatically preserved from top to bottom.
# A source-tracking column is automatically added to show each row's origin.
# Example: mrgsrc: ["wave1.sav", "wave2.sav", "wave3.sav"]

ul.write_sav(df, meta, "merged_output.sav")
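The source-tracking behavior can be pictured with a tiny pure-Python sketch. This is illustrative only: plain dicts stand in for DataFrame rows, and stack_with_source is a hypothetical helper, not part of ultrasav.

```python
def stack_with_source(batches, source_col="mrgsrc"):
    """Stack row batches vertically, tagging each row with its source file."""
    merged = []
    for source_name, rows in batches:
        for row in rows:
            tagged = dict(row)          # copy so the inputs stay untouched
            tagged[source_col] = source_name
            merged.append(tagged)
    return merged

rows = stack_with_source([
    ("wave1.sav", [{"id": 1}, {"id": 2}]),
    ("wave2.sav", [{"id": 3}]),
])
# Every row now carries a "mrgsrc" entry naming the file it came from.
```

The same idea, applied to DataFrames, is what lets you filter or group the merged output by wave afterwards.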

Advanced Merging

import ultrasav as ul

# Use specific metadata template for all files
standard_meta = ul.Metadata()  # Create an empty meta object
standard_meta = standard_meta.update(
    column_labels={"Q1": "Satisfaction", "Q2": "Loyalty"},
    variable_value_labels={
        "Satisfaction": {1: "Very Dissatisfied", 5: "Very Satisfied"}
    }
)

data, meta = ul.add_cases(
    inputs=["file1.sav", "file2.sav", "file3.csv"],
    meta=standard_meta,  # Single metadata - no list wrapper needed
    source_col="mrgsrc",  # Auto append column 'mrgsrc' to track source files
    output_format="polars"  # Explicit format (polars is default)
)

# For multiple metadata objects, use a list
data, meta = ul.add_cases(
    inputs=["survey_v1.sav", "survey_v2.sav"],
    meta=[meta_v1, meta_v2],  # Merge these metadata objects
    meta_strategy="first"  # First metadata wins for conflicts
)

Writing Back

# Read SPSS file
df, meta = ul.read_sav("huge_survey.sav")

# All ultrasav operations work the same
df = ul.Data(df).rename({"Q1": "satisfaction"}).drop(["unused_var"]).to_native()

# Efficient write-back
# Simply provide the 'meta' object; labels and formats are applied automatically.
# Compatible with both ultrasav and pyreadstat meta objects.
ul.write_sav(df, meta, "processed_data.sav")

# For compressed output, use .zsav extension with compress=True
meta = ul.Metadata(meta).update(compress=True)
ul.write_sav(df, meta, "compressed_data.zsav")

🦸 Metaman: The Metadata Submodule

ultrasav includes metaman, a powerful submodule for metadata inspection, extraction, and reporting. All metaman functions are accessible directly from the top-level ul namespace.

Generate Validation Datamaps

Create comprehensive datamaps showing variable types, value distributions, and data quality metrics:

import ultrasav as ul
import polars as pl

df, meta = ul.read_sav("survey.sav")

# Create a validation datamap (with metadata)
datamap = ul.make_datamap(df, meta)

# Or create datamap from DataFrame only (no metadata required)
df_csv = pl.read_csv("survey.csv")
datamap = ul.make_datamap(df_csv)  # Works without meta!

# Export to beautifully formatted Excel
# This function supports polars only at the moment
ul.map_to_excel(datamap, "validation_report.xlsx")

# Use custom color schemes
ul.map_to_excel(
    datamap, 
    "validation_report.xlsx",
    alternating_group_formats=ul.get_color_scheme("pastel_blue")
)

The datamap includes:

  • Variable names and labels
  • Variable types (categorical, numeric, text, date)
  • Value codes and labels
  • Value counts and percentages
  • Missing data flags
  • Missing value label detection

With include_all=True, also includes:

  • Variable measure (scale, nominal, ordinal from SPSS metadata)
  • Variable format (SPSS format string, e.g., "F8.2", "A50")
  • Readstat type (low-level storage type, e.g., "double", "string")

Note: Variable types are detected using a two-phase approach: first from DataFrame dtypes, then refined with metadata when available. In the final datamap output, single-select and multi-select are consolidated into "categorical" for simplicity.
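As a rough illustration of that two-phase idea, here is a hypothetical sketch (not ultrasav's actual detection code): a dtype-based guess first, refined to "categorical" whenever metadata supplies value labels.

```python
def sketch_detect_type(dtype, value_labels=None):
    """Toy two-phase type detection: dtype first, metadata refinement second."""
    # Phase 2 refinement: value labels in the metadata mark a variable categorical.
    if value_labels:
        return "categorical"
    # Phase 1: fall back on the DataFrame dtype alone.
    if dtype in ("int64", "float64"):
        return "numeric"
    if dtype in ("date", "datetime"):
        return "date"
    return "text"

# A labeled int column is categorical; an unlabeled one stays numeric.
sketch_detect_type("int64", {1: "Yes", 0: "No"})  # "categorical"
sketch_detect_type("float64")                     # "numeric"
```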

Extract Metadata to Python Files

Save existing metadata from a SAV file as importable Python dictionaries for reuse across projects:

import ultrasav as ul

df, meta = ul.read_sav("survey.sav")

# Extract metadata (labels) to in-memory python object
meta_dict = ul.get_meta(meta)

# Extract and save ALL metadata to a Python file
meta_dict = ul.get_meta(meta, include_all=True, output_path="survey_labels.py")

Create Labels from Excel Templates

Build label dictionaries from scratch using Excel templates - perfect for translating surveys or standardizing labels:

import ultrasav as ul

# Excel file with 'col_label' and 'value_label' sheets
col_labels, val_labels = ul.make_labels(
    input_path="label_template.xlsx",
    output_path="translated_labels.py"  # optional
)

Excel Structure:

Your Excel file should have two sheets:

  1. Column Labels Sheet (default sheet name: "col_label"):

     variable   label
     age        Age of respondent
     gender     Gender
     income     Annual household income

  2. Value Labels Sheet (default sheet name: "value_label"):

     variable   value   label
     gender     1       Male
     gender     2       Female
     income     1       Under $25k
     income     2       $25k-50k
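The two sheets map naturally onto the dictionaries make_labels returns. A minimal pure-Python sketch of that shaping, with row tuples standing in for the parsed sheet rows (the actual Excel parsing is omitted):

```python
# Rows as they would appear in the two template sheets.
col_rows = [
    ("age", "Age of respondent"),
    ("gender", "Gender"),
    ("income", "Annual household income"),
]
val_rows = [
    ("gender", 1, "Male"),
    ("gender", 2, "Female"),
    ("income", 1, "Under $25k"),
    ("income", 2, "$25k-50k"),
]

# Flat mapping: variable -> label
col_labels = {var: label for var, label in col_rows}

# Nested mapping: variable -> {value: label}
val_labels = {}
for var, value, label in val_rows:
    val_labels.setdefault(var, {})[value] = label
```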

📖 API Reference

Core Functions

read_sav(filepath, output_format="polars")

Read an SPSS file and return separated data and metadata. This is a wrapper around pyreadstat.read_sav with additional encoding handling.

df, meta = ul.read_sav("survey.sav")

write_sav(data, meta, filepath, **overrides)

Write data and metadata to SPSS file.

ul.write_sav(df, meta, "processed_data.sav")

# With compression (must use .zsav extension)
meta_compressed = ul.Metadata(meta).update(compress=True)
ul.write_sav(df, meta_compressed, "compressed_data.zsav")

Compression Validation: When compress=True in metadata, the destination file must have a .zsav extension. A ValueError is raised if you attempt to write a compressed file with a .sav extension.

# This will raise ValueError
meta = ul.Metadata().update(compress=True)
ul.write_sav(df, meta, "output.sav")  # ❌ Wrong extension!
# ValueError: Metadata has compress=True but destination file 'output.sav'
# has extension '.sav'. Compressed SPSS files must use the '.zsav' extension.

# Correct usage
ul.write_sav(df, meta, "output.zsav")  # ✅ Correct
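The validation rule can be sketched in a few lines of plain Python. check_compress_extension is a hypothetical stand-in, not ultrasav's internal check:

```python
from pathlib import Path

def check_compress_extension(path, compress):
    """Toy version of the rule above: compress=True demands a .zsav file."""
    suffix = Path(path).suffix
    if compress and suffix != ".zsav":
        raise ValueError(
            f"compress=True requires a '.zsav' extension, got '{suffix}'"
        )

check_compress_extension("output.zsav", True)   # OK: compressed, .zsav
check_compress_extension("output.sav", False)   # OK: uncompressed, .sav
```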

add_cases(inputs, meta=None, output_format="polars", source_col="mrgsrc", meta_strategy="first")

Merge multiple files/dataframes vertically with metadata handling. Returns merged data and metadata.

Parameters:

  • inputs: List of file paths, DataFrames, or (DataFrame, Metadata) tuples
  • meta: Single metadata object or a list of metadata objects. When provided, metadata embedded in the SAV files is ignored.
  • output_format: Output format - "polars" (default), "pandas", or "narwhals"
  • source_col: Name of provenance column (default: "mrgsrc")
  • meta_strategy: "first" (default) or "last" - determines which metadata wins for conflicts

# Basic usage - metadata auto-extracted from SAV files
df_merged, meta_merged = ul.add_cases(["wave1.sav", "wave2.sav", "wave3.sav"])

# With single metadata (no list wrapper needed)
df_merged, meta_merged = ul.add_cases(files, supermeta)

# With multiple metadata objects
df_merged, meta_merged = ul.add_cases(files, [meta1, meta2], meta_strategy="last")

Classes

Data

Handles all dataframe operations while maintaining compatibility with both Polars and pandas.

import ultrasav as ul

df, meta = ul.read_sav("survey.sav")  # Returns a Polars DataFrame and meta object

# Convert polars or pandas df into our ul.Data() class
data = ul.Data(df)

# Data Class Methods
# move - to relocate columns
data = data.move(
    first=['respondent_id'],
    last=['timestamp'],
    before={'age': 'gender'},  # place 'age' column before 'gender'
    after={'wave': ['age', 'gender', 'income']}  # place demographic columns after 'wave'
)

# rename - to rename columns
data = data.rename({"old": "new"})

# replace - to replace/recode values
data = data.replace({"col": {1: 100}})

# select - to select columns
data = data.select(['age', 'gender'])

# drop - to drop columns
data = data.drop(['id', 'language'])

# to_native - to return ul.Data(df) back to its native dataframe
df = data.to_native()  # Get back Polars/pandas DataFrame

# Optionally, use chaining for cleaner code
df = (
    ul.Data(df)
    .move(first=['respondent_id'])
    .rename({"old": "new"})
    .replace({"col": {1: 100}})
    .select(['age', 'gender'])
    .drop(['id', 'language'])
    .to_native() 
)

Metadata

Manages all SPSS metadata independently from data. Uses immutable updates - all update operations return NEW Metadata objects, nothing is modified in place.

import ultrasav as ul

df, meta = ul.read_sav("survey.sav")

meta = ul.Metadata(meta)

# Use .update() to update metadata (returns NEW object)
meta = meta.update(
    column_labels={"Q1": "Question 1"},
    variable_value_labels={"Q1": {1: "Yes", 0: "No"}},
    variable_measure={"age": "scale"},
    variable_format={"age": "F3.0", "city_name": "A50"},
    variable_display_width={"city_name": 50},
    missing_ranges={"Q1": [99], "Q2": [{"lo": 998, "hi": 999}]},
    note="Created on 2025-02-15",
    file_label="My Survey 2025",
    compress=False,  # Set to True for .zsav output
    row_compress=False
)

# Or use convenience with_*() methods for single updates
meta = meta.with_column_labels({"Q2": "Question 2"})
meta = meta.with_file_label("Updated Survey 2025")
meta = meta.with_compress(True)  # For .zsav output

# Chain multiple updates
meta = (meta
    .with_column_labels({"Q1": "Question 1"})
    .with_variable_measure({"Q1": "nominal"})
    .with_file_label("My Survey 2025")
)

# Access metadata properties (read-only)
print(meta.column_labels)          # {'Q1': 'Question 1', ...}
print(meta.variable_value_labels)  # {'Q1': {1: 'Yes', 0: 'No'}, ...}
print(meta.compress)               # True/False

Immutable Design:

  • Original metadata is preserved and never destroyed
  • All update() and with_*() methods return NEW Metadata objects
  • The original object remains unchanged

meta1 = ul.Metadata(meta)
meta2 = meta1.update(column_labels={"Q1": "New Label"})
# meta1 is UNCHANGED, meta2 has the update

Metadata Updating Logic:

  • User updates overlay on top of originals
  • When you update column_labels={"Q1": "New Label"}:
    • This updates Q1's column label if there is an existing column label
    • If Q1 is not in the original metadata, Q1's new label will be appended
    • All other column labels remain unchanged
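The overlay rule behaves like ordinary Python dict merging: user updates win key by key, and untouched keys survive. A plain-dict sketch of the column-label case:

```python
# Existing column labels (as read from the file).
original = {"Q1": "Old label", "Q2": "Question 2"}

# Labels supplied in an update() call.
updates = {"Q1": "New Label", "Q3": "Question 3"}

# Overlay: Q1 is replaced, Q3 is appended, Q2 is left alone.
merged = {**original, **updates}
```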

Note on variable_value_labels Update Behavior:

When updating variable_value_labels, the entire value-label dictionary for a variable is replaced, not merged.

# Original metadata
meta = ul.Metadata({"variable_value_labels": {"Q1": {1: "Yes", 2: "No", 99: "Unsure"}}})

# User update
meta = meta.update(variable_value_labels={"Q1": {1: "Yes", 0: "No"}})

# Result for Q1 becomes:
{"Q1": {1: "Yes", 0: "No"}}  # Previous values 2 and 99 are NOT preserved

This means:

  • Only the value-label pairs explicitly provided in the update are kept
  • The entire dictionary for that variable is replaced at once
  • Variable-level entries are preserved (e.g., "Q1" still exists), but value-level merging does not occur

This follows ultrasav's design principle: metadata updates overlay at the variable level, never partially merged, ensuring clean and intentional metadata after each update.

Critical Design Choice:

  • When you rename an existing column "Q1" to "Q1a" in data, the associated metadata does not automatically carry over
  • You must explicitly provide new metadata for the newly renamed column "Q1a"
  • No automatic tracking or mapping between old and new names
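If you do rename, the carry-over therefore has to be done by hand. A plain-dict sketch of remapping existing labels onto the new names (rename_map and the label values are illustrative):

```python
# Labels keyed by the old variable names.
column_labels = {"Q1": "Satisfaction", "Q2": "Loyalty"}

# The rename you applied on the data side.
rename_map = {"Q1": "Q1a"}

# Rebuild the label dict under the new names; unrenamed keys pass through.
remapped = {rename_map.get(name, name): label
            for name, label in column_labels.items()}
```

The remapped dict can then be passed to a metadata update so the labels follow the renamed columns.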

🦸 Metaman Functions

make_datamap(df, meta=None, output_format=None, include_all=False)

Create a validation datamap from data and optional metadata.

# With metadata (full labels and type detection)
datamap = ul.make_datamap(df, meta)

# Without metadata (df-only mode - dtype-based detection)
datamap = ul.make_datamap(df)

# Include all SPSS debug columns (variable_measure, variable_format, readstat_type)
datamap = ul.make_datamap(df, meta, include_all=True)

map_to_excel(df, file_path, **kwargs)

Export datamap to formatted Excel with merged cells and alternating colors.

ul.map_to_excel(datamap, "report.xlsx") # Saves datamap to Excel
ul.map_to_excel(datamap, "report.xlsx", alternating_group_formats=ul.get_color_scheme("pastel_blue"))

get_meta(meta, output_path=None, include_all=False)

Extract metadata to a Python file or dictionary.

meta_dict = ul.get_meta(meta)  # Returns meta_dict in memory
ul.get_meta(meta, output_path="labels.py")  # Saves to file

make_labels(input_path, output_path=None)

Create label dictionaries from an Excel template.

col_labels, val_labels = ul.make_labels("template.xlsx") # Returns label dicts in memory
col_labels, val_labels = ul.make_labels("template.xlsx", "labels.py") # Saves to file

detect_variable_type(df, var_name, meta=None)

Detect variable type (single-select, multi-select, categorical, numeric, text, date).

# With metadata (full detection)
var_type = ul.detect_variable_type(df, "Q1", meta)

# Without metadata (dtype-based detection)
var_type = ul.detect_variable_type(df, "Q1")

get_color_scheme(name)

Get a color scheme for Excel formatting.

scheme = ul.get_color_scheme("pastel_blue")
# Options: "classic_grey", "pastel_green", "pastel_blue", "pastel_purple", "pastel_indigo"

describe(df, meta, columns)

Quickly view a variable summary, including variable metadata and value distributions:

# Single variable
ul.describe(df, meta, "Q1")

# Multiple variables
ul.describe(df, meta, ["Q1", "Q2", "Q3"])

# Get summary dict without printing
summary = ul.describe(df, meta, "Q1", print_output=False)

⚡ Why "ultrasav"?

The name combines "Ultra" (super-powered) with "SAV" (SPSS file format), representing the ultra-powerful transformation capabilities of this package. Just like Ultraman's Specium Ray, ultrasav splits and recombines data with precision and power!

And metaman? He's the metadata superhero who swoops in to inspect, validate, and report on your SPSS data! 🦸

📄 License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments
