Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a Polars-first API. The project includes cross-library parity tests and roundtrip checks to reduce regressions.

In the project benchmarks the Rust engine is faster than the legacy engines, but performance varies by file shape and options. If you need the legacy C/C++ engine, install version 0.11.1.

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • The API is Polars-first (scan_readstat, read_readstat, write_readstat).

Install

pip install polars-readstat

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat", preserve_order=True)
df = lf.select(["SERIALNO", "AGEP"]).filter(pl.col("AGEP") >= 18).collect()

Key parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| preserve_order | False | Return rows in original file order. Set True when order matters; may be slower with multiple threads. |
| missing_string_as_null | False | Convert empty strings to null. |
| value_labels_as_strings | False | For labeled numeric columns (Stata/SPSS), return the string label instead of the numeric code. |
| schema_overrides | None | Dict mapping column names to Polars types (e.g. {"id": pl.Int64}). Useful when the file header reports a narrower type than the data requires. |
| batch_size | 100_000 | Number of rows per internal chunk during collect. |
| informative_nulls | None | Capture user-defined missing value indicators. See Informative Nulls. |

2) Informative Nulls

SAS, Stata, and SPSS files support user-defined missing value codes (SAS .A through .Z, Stata .a through .z, SPSS discrete/range missings). By default these are read as null. The informative_nulls option captures the missing-value indicator alongside the data value.

from polars_readstat import scan_readstat, read_readstat, InformativeNullOpts

# Track all eligible columns; add a "<col>_null" String indicator column after each
lf = scan_readstat("file.dta", informative_nulls=InformativeNullOpts(columns="all"))
df = read_readstat("file.sas7bdat", informative_nulls={"columns": "all"})

Three output modes (set via the mode parameter):

| Mode | Description |
| --- | --- |
| "separate_column" (default) | Adds a parallel String column <col><suffix> after each tracked column |
| "struct" | Wraps each (value, indicator) pair into a Struct column |
| "merged_string" | Merges into a single String column (value as string, or the indicator code) |

from polars_readstat import InformativeNullOpts

opts = InformativeNullOpts(
    columns=["income", "age"],      # or "all"
    mode="separate_column",         # "separate_column", "struct", or "merged_string"
    suffix="_missing",              # indicator column suffix (separate_column mode only)
    use_value_labels=True,          # use value label for indicator string when defined
)

informative_nulls accepts either an InformativeNullOpts dataclass or a plain dict. It is supported on scan_readstat, read_readstat, and ScanReadstat.
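
To make the three modes more concrete, here is a minimal pure-Python sketch of the shapes they describe, applied to one hypothetical column. It does not call polars_readstat: the sample values, the ".a" indicator text, the struct field names "value" and "indicator", and the helper function names are all assumptions made for illustration, and the library's actual output may differ in detail.

```python
# Illustrative only: plain-Python stand-ins for the three output modes.
# The sample data, the ".a" indicator text, and the struct field names
# ("value", "indicator") are assumptions, not the library's documented output.

values = [52000.0, None, 61000.0]   # data values; user-defined missings read as None
indicators = [None, ".a", None]     # missing-value indicator for each row, if any


def separate_column(name, values, indicators, suffix="_null"):
    """Keep the value column and add a parallel indicator column after it."""
    return {name: list(values), name + suffix: list(indicators)}


def struct(name, values, indicators):
    """Wrap each (value, indicator) pair into a single record."""
    return {name: [{"value": v, "indicator": i} for v, i in zip(values, indicators)]}


def merged_string(name, values, indicators):
    """One string column: the indicator code where present, else the value as text."""
    merged = []
    for v, i in zip(values, indicators):
        if i is not None:
            merged.append(i)          # user-defined missing: keep the code
        elif v is None:
            merged.append(None)       # plain null with no indicator
        else:
            merged.append(str(v))     # ordinary value rendered as a string
    return {name: merged}


print(separate_column("income", values, indicators))
print(struct("income", values, indicators))
print(merged_string("income", values, indicators))
```

The separate-column form keeps the numeric column usable for arithmetic, while the merged form trades that away for a single column that is easy to tabulate.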

3) Eager read

from polars_readstat import read_readstat

df = read_readstat("/path/file.dta")

4) Metadata + schema

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema      # polars.Schema
metadata = reader.metadata  # dict with file info and per-column details
lf = reader.df              # LazyFrame — same as calling scan_readstat(path)

metadata is a dict with a columns list. Each column entry includes:

  • "name" — column name
  • "label" — variable label (description), if present
  • "value_labels" — dict mapping coded values to label strings, if present
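
As a sketch of how this structure might be consumed, the snippet below walks a hand-written dict shaped like the description above and collects the labels by column name. The sample_metadata contents and the collect_labels helper are hypothetical; real metadata comes from reader.metadata and will likely carry additional keys.

```python
# A hand-written stand-in for reader.metadata (hypothetical sample data).
sample_metadata = {
    "columns": [
        {
            "name": "sex",
            "label": "Sex of respondent",
            "value_labels": {1: "Male", 2: "Female"},
        },
        {"name": "age", "label": "Age in years"},
        {"name": "id"},
    ]
}


def collect_labels(metadata):
    """Return (variable_labels, value_labels), each keyed by column name."""
    variable_labels = {}
    value_labels = {}
    for column in metadata["columns"]:
        name = column["name"]
        # "label" and "value_labels" are only present for some columns,
        # so use .get() and skip missing or empty entries.
        if column.get("label"):
            variable_labels[name] = column["label"]
        if column.get("value_labels"):
            value_labels[name] = dict(column["value_labels"])
    return variable_labels, value_labels


variable_labels, value_labels = collect_labels(sample_metadata)
print(variable_labels)
print(value_labels)
```

The two resulting dicts have the same shape as the variable_labels and value_labels arguments shown for write_readstat, although whether the key types in real metadata match what the writer expects is not verified here.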

5) Write (Stata/SPSS) - EXPERIMENTAL

Writing support is experimental and compatibility varies across tools. Stata roundtrip tests are included; SPSS roundtrip coverage is limited. Please report issues.

from polars_readstat import write_readstat

write_readstat(df, "/path/out.dta")
write_readstat(df, "/path/out.sav")

# With value labels and variable labels (both formats)
write_readstat(
    df,
    "/path/out.dta",
    value_labels={"sex": {1: "Male", 2: "Female"}},
    variable_labels={"sex": "Sex of respondent", "age": "Age in years"},
)

# Stata-only options
write_readstat(df, "/path/out.dta", compress=True, threads=8)

write_readstat supports Stata (dta) and SPSS (sav). SAS writing is not supported.

| Parameter | Formats | Description |
| --- | --- | --- |
| value_labels | dta, sav | Dict mapping column names to {coded_value: label_string}. |
| variable_labels | dta, sav | Dict mapping column names to descriptive label strings. |
| compress | dta only | Write compressed Stata file. |
| threads | dta only | Number of threads for writing. |

Tests run

Test coverage includes:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results below).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.

Benchmark

Benchmarks compare four scenarios: 1) load the full file, 2) load a subset of columns (Subset: True), 3) filter to a subset of rows (Filter: True), and 4) load a subset of columns and filter to a subset of rows (Subset: True, Filter: True).

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (16 cores), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Run dates: polars-readstat 0.12.4 (Rust engine) was last run on February 24, 2026; comparison library timings for SAS/Stata (v0.11.1) were last run on August 31, 2025
  • Versions tested: polars-readstat 0.12.4 (new Rust engine) against polars-readstat 0.11.1 (prior C++ and C engines), pandas, and pyreadstat
  • Method: wall-clock timings via Python time.time()
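
For readers who want to take comparable measurements, a minimal sketch of a wall-clock harness and the speedup arithmetic follows. The time_call and speedup helpers are hypothetical names, not the project's benchmark scripts; the only details taken from this page are the use of time.time() and the convention of reporting speedup relative to pandas.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds) using wall-clock time."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline (>1 means faster)."""
    return baseline_seconds / candidate_seconds


# Timing an arbitrary callable; in a real run this would be a file read.
_, elapsed = time_call(sum, range(1_000_000))
print(f"sum took {elapsed:.4f} s")

# Speedup arithmetic using two SAS timings reported on this page:
# pandas 2.07 s vs. the Rust engine 0.72 s (full file),
# pandas 2.06 s vs. the Rust engine 0.04 s (column subset).
print(f"{speedup(2.07, 0.72):.1f}x")
print(f"{speedup(2.06, 0.04):.1f}x")
```

A single wall-clock measurement includes disk caching and other system noise, so repeating each run and comparing medians is a reasonable refinement.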

Compared to pandas and pyreadstat (using read_file_multiprocessing for parallel processing in pyreadstat)

SAS

All times are in seconds; the speedup relative to pandas is shown in parentheses.

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
| --- | --- | --- | --- | --- |
| polars_readstat, new Rust engine | 0.72 (2.9×) | 0.04 (51.5×) | 1.04 (2.9×) | 0.04 (52.5×) |
| polars_readstat, engine="cpp" (fastest for 0.11.1) | 1.31 (1.6×) | 0.09 (22.9×) | 1.56 (1.9×) | 0.09 (23.2×) |
| pandas | 2.07 | 2.06 | 3.03 | 2.09 |
| pyreadstat | 10.75 (0.2×) | 0.46 (4.5×) | 11.93 (0.3×) | 0.50 (4.2×) |

Stata

All times are in seconds; the speedup relative to pandas is shown in parentheses.

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
| --- | --- | --- | --- | --- |
| polars_readstat, new Rust engine | 0.17 (6.7×) | 0.12 (9.8×) | 0.24 (4.1×) | 0.11 (8.7×) |
| polars_readstat, engine="readstat" (the only option for 0.11.1) | 1.80 (0.6×) | 0.27 (4.4×) | 1.31 (0.8×) | 0.29 (3.3×) |
| pandas | 1.14 | 1.18 | 0.99 | 0.96 |
| pyreadstat | 7.46 (0.2×) | 2.18 (0.5×) | 7.66 (0.1×) | 2.24 (0.4×) |

SPSS

All times are in seconds; the speedup relative to pandas is shown in parentheses.

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
| --- | --- | --- | --- | --- |
| polars_readstat, new Rust engine | 0.22 (6.6×) | 0.15 (9.1×) | 0.25 (6.0×) | 0.26 (4.5×) |
| pandas | 1.46 | 1.36 | 1.49 | 1.16 |
| pyreadstat | 9.25 (0.2×) | 4.85 (0.3×) | 9.39 (0.2×) | 4.75 (0.2×) |

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.
