Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a Polars-first API. The project includes cross-library parity tests and roundtrip checks to reduce regressions.

For many workloads the Rust engine is faster than the legacy C/C++ engine, but performance varies by file shape and options. If you need the legacy engine, install version 0.11.1 (see the prior version).

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • The API is Polars-first (scan_readstat, read_readstat, write_readstat).

Install

pip install polars-readstat

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat", preserve_order=True)
df = lf.select(["SERIALNO", "AGEP"]).filter(pl.col("AGEP") >= 18).collect()

Key parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| preserve_order | False | Return rows in original file order. Set True when order matters; may be slower with multiple threads. |
| missing_string_as_null | False | Convert empty strings to null. |
| value_labels_as_strings | False | For labeled numeric columns (Stata/SPSS), return the string label instead of the numeric code. |
| schema_overrides | None | Dict mapping column names to Polars types (e.g. {"id": pl.Int64}). Useful when the file header reports a narrower type than the data requires. |
| batch_size | 100_000 | Number of rows per internal chunk during collect. |
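
As a rough sketch of how these parameters combine in one call, the helper below passes several of them to scan_readstat. The function name scan_survey, the "id" column, and the chosen batch size are hypothetical; only the keyword names come from the parameter list above. The imports sit inside the function so the definition loads even where the packages are not installed.

```python
def scan_survey(path):
    """Return a LazyFrame for `path` using several scan_readstat options."""
    # Imported lazily; polars and polars_readstat must be installed
    # before this function is actually called.
    import polars as pl
    from polars_readstat import scan_readstat

    return scan_readstat(
        path,
        preserve_order=True,                # keep rows in original file order
        missing_string_as_null=True,        # empty strings become null
        value_labels_as_strings=True,       # labels instead of numeric codes
        schema_overrides={"id": pl.Int64},  # "id" is a hypothetical column
        batch_size=50_000,                  # rows per internal chunk
    )
```

Calling scan_survey("/path/file.dta").collect() would then materialize the frame with those options applied.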

2) Eager read

from polars_readstat import read_readstat

df = read_readstat("/path/file.dta")

3) Metadata + schema

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema      # polars.Schema
metadata = reader.metadata  # dict with file info and per-column details
lf = reader.df              # LazyFrame — same as calling scan_readstat(path)

metadata is a dict with a columns list. Each column entry includes:

  • "name" — column name
  • "label" — variable label (description), if present
  • "value_labels" — dict mapping coded values to label strings, if present
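
A minimal sketch of working with that structure, using a hand-written dict in place of reader.metadata. It assumes the list sits under the key "columns" and that an absent label or value_labels entry is either missing or None; the helper names column_labels and decode are illustrative and not part of the package.

```python
def column_labels(metadata):
    """Map column name -> variable label, skipping columns without one."""
    return {
        col["name"]: col["label"]
        for col in metadata.get("columns", [])
        if col.get("label")
    }


def decode(metadata, column, code):
    """Return the label for `code` in `column`, or the code if unlabeled."""
    for col in metadata.get("columns", []):
        if col["name"] == column:
            return (col.get("value_labels") or {}).get(code, code)
    raise KeyError(column)


# Hand-written stand-in for reader.metadata (not output from a real file).
sample_metadata = {
    "columns": [
        {"name": "sex", "label": "Sex of respondent",
         "value_labels": {1: "Male", 2: "Female"}},
        {"name": "age", "label": "Age in years"},
        {"name": "id"},
    ]
}

labels = column_labels(sample_metadata)
```

With a real file, the same helpers would take reader.metadata directly, provided the assumptions about its layout hold.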

4) Write (Stata/SPSS) - EXPERIMENTAL

Writing support is experimental and compatibility varies across tools. Stata roundtrip tests are included; SPSS roundtrip coverage is limited. Please report issues.

from polars_readstat import write_readstat

write_readstat(df, "/path/out.dta")
write_readstat(df, "/path/out.sav")

# With value labels and variable labels (both formats)
write_readstat(
    df,
    "/path/out.dta",
    value_labels={"sex": {1: "Male", 2: "Female"}},
    variable_labels={"sex": "Sex of respondent", "age": "Age in years"},
)

# Stata-only options
write_readstat(df, "/path/out.dta", compress=True, threads=8)

write_readstat supports Stata (dta) and SPSS (sav). SAS writing is not supported.

| Parameter | Formats | Description |
| --- | --- | --- |
| value_labels | dta, sav | Dict mapping column names to {coded_value: label_string}. |
| variable_labels | dta, sav | Dict mapping column names to descriptive label strings. |
| compress | dta only | Write compressed Stata file. |
| threads | dta only | Number of threads for writing. |
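
Because compress and threads apply only to Stata output, code that writes to either format may want to filter options by file extension first. The sketch below is one way to do that; write_options and write_any are hypothetical helpers, and how write_readstat itself reacts to a Stata-only option on a .sav target is not stated here, so the sketch simply drops those options beforehand.

```python
from pathlib import Path

# Options listed as "dta only" in the parameter list above.
STATA_ONLY = {"compress", "threads"}


def write_options(path, **options):
    """Return the options that apply to the output format of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in {".dta", ".sav"}:
        raise ValueError(f"unsupported output format: {suffix!r}")
    if suffix == ".dta":
        return dict(options)
    # SPSS target: keep only the options shared by both formats.
    return {k: v for k, v in options.items() if k not in STATA_ONLY}


def write_any(df, path, **options):
    """Write `df` to a .dta or .sav file, dropping Stata-only options for .sav."""
    # Imported lazily; polars_readstat must be installed before calling.
    from polars_readstat import write_readstat

    write_readstat(df, path, **write_options(path, **options))
```

For example, write_any(df, "/path/out.sav", compress=True, variable_labels={"age": "Age in years"}) would forward only variable_labels.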

Tests run

Test coverage includes:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results below).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.

Benchmark

Benchmarks compare four scenarios: 1) load the full file, 2) load a subset of columns (Subset: True), 3) filter to a subset of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset: True, Filter: True).
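
To make the four scenarios concrete, here is a sketch of each as a function over a LazyFrame such as the one returned by scan_readstat. The column names and the predicate are borrowed from the lazy scan example earlier and are assumptions; the columns and filters actually benchmarked are described in BENCHMARKS.md and are not reproduced here.

```python
def full_file(lf):
    """Scenario 1: load everything."""
    return lf


def subset(lf):
    """Scenario 2: keep a subset of columns (hypothetical column names)."""
    return lf.select(["SERIALNO", "AGEP"])


def filtered(lf):
    """Scenario 3: keep a subset of rows (hypothetical predicate)."""
    import polars as pl  # imported lazily; needed only when called

    return lf.filter(pl.col("AGEP") >= 18)


def subset_filtered(lf):
    """Scenario 4: subset of columns, then subset of rows."""
    return filtered(subset(lf))


# Scenario labels as they appear in the benchmark tables.
SCENARIOS = {
    "Full File": full_file,
    "Subset: True": subset,
    "Filter: True": filtered,
    "Subset: True, Filter: True": subset_filtered,
}
```

Each function returns a LazyFrame, so calling .collect() on the result is what triggers the actual read.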

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (8 cores, 16 threads), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Run dates: polars-readstat 0.12.4 (Rust engine) was last run February 24, 2026; comparison library timings for SAS/Stata (v0.11.1) were last run August 31, 2025
  • Versions tested: polars-readstat 0.12.4 (new Rust engine) against polars-readstat 0.11.1 (prior C++ and C engines), pandas, and pyreadstat
  • Method: wall-clock timings via Python time.time()
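
A minimal sketch of a wall-clock timer built on time.time(), in the spirit of the method named above. The helper name wall_clock, the repeat count, and the choice to keep the best run are assumptions; the project's own harness lives in its scripts and may differ.

```python
import time


def wall_clock(fn, *args, repeats=3, **kwargs):
    """Call fn `repeats` times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.time()
        fn(*args, **kwargs)
        best = min(best, time.time() - start)
    return best


# Stand-in workload; a real run would time a file read instead.
elapsed = wall_clock(sum, range(1_000_000))
```

Timing a read would look like wall_clock(read_readstat, "/path/file.dta"), with read_readstat imported from polars_readstat.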

Compared to pandas and pyreadstat (using read_file_multiprocessing for parallel processing in pyreadstat)

SAS

All times in seconds (speedup relative to pandas in parentheses).

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
| --- | --- | --- | --- | --- |
| polars_readstat, new Rust engine | 0.72 (2.9×) | 0.04 (51.5×) | 1.04 (2.9×) | 0.04 (52.5×) |
| polars_readstat, engine="cpp" (fastest for 0.11.1) | 1.31 (1.6×) | 0.09 (22.9×) | 1.56 (1.9×) | 0.09 (23.2×) |
| pandas | 2.07 | 2.06 | 3.03 | 2.09 |
| pyreadstat | 10.75 (0.2×) | 0.46 (4.5×) | 11.93 (0.3×) | 0.50 (4.2×) |
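
The speedup figures read as the pandas time divided by the library's time for the same scenario, so values below 1 mean slower than pandas. A small sketch of that arithmetic using the SAS timings above; the helper name speedup is illustrative, and because the published timings are rounded to two decimals, a value recomputed this way can differ slightly from the printed one.

```python
def speedup(pandas_seconds, library_seconds):
    """Speedup relative to pandas, rounded to one decimal place."""
    return round(pandas_seconds / library_seconds, 1)


# SAS, full file: pandas 2.07 s vs. new Rust engine 0.72 s
sas_full = speedup(2.07, 0.72)

# SAS, column subset: pandas 2.06 s vs. new Rust engine 0.04 s
sas_subset = speedup(2.06, 0.04)

# SAS, full file: pandas 2.07 s vs. pyreadstat 10.75 s (slower than pandas)
sas_pyreadstat = speedup(2.07, 10.75)
```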

Stata

All times in seconds (speedup relative to pandas in parentheses).

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
| --- | --- | --- | --- | --- |
| polars_readstat, new Rust engine | 0.17 (6.7×) | 0.12 (9.8×) | 0.24 (4.1×) | 0.11 (8.7×) |
| polars_readstat, engine="readstat" (the only option for 0.11.1) | 1.80 (0.6×) | 0.27 (4.4×) | 1.31 (0.8×) | 0.29 (3.3×) |
| pandas | 1.14 | 1.18 | 0.99 | 0.96 |
| pyreadstat | 7.46 (0.2×) | 2.18 (0.5×) | 7.66 (0.1×) | 2.24 (0.4×) |

SPSS

All times in seconds (speedup relative to pandas in parentheses).

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
| --- | --- | --- | --- | --- |
| polars_readstat, new Rust engine | 0.22 (6.6×) | 0.15 (9.1×) | 0.25 (6.0×) | 0.26 (4.5×) |
| pandas | 1.46 | 1.36 | 1.49 | 1.16 |
| pyreadstat | 9.25 (0.2×) | 4.85 (0.3×) | 9.39 (0.2×) | 4.75 (0.2×) |

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.
