Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a Polars-first API. The project includes cross-library parity tests and roundtrip checks to reduce regressions.

The Rust engine is generally faster for many workloads, but performance varies by file shape and options. If you need the legacy C/C++ engine, use version 0.11.1 (see the prior version).

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • The API is Polars-first (scan_readstat, read_readstat, write_readstat, write_sas_csv_import).
  • Because scan_readstat returns a Polars LazyFrame, column selection and row limits are pushed into the reader — only the data you actually need is read from disk.

Install

pip install polars-readstat

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat", preserve_order=True)
df = lf.select(["SERIALNO", "AGEP"]).filter(pl.col("AGEP") >= 18).collect()

2) Getting metadata

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema           # polars.Schema
metadata = reader.metadata       # dict with file info and per-column details
lf = reader.df                   # LazyFrame — same as calling scan_readstat(path)

metadata is a dict with a variables (SPSS/Stata) or columns (SAS) list. Each entry includes:

  • "name" — column name
  • "label" — variable label (description), if present
  • "value_labels" — dict mapping coded values to label strings, if present
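
As a rough sketch of how these entries might be used, the helpers below pull variable labels and value labels out of a metadata dict. The function names (variable_entries, column_labels, decode) and the sample dict are hypothetical and written by hand for illustration; a real reader.metadata will carry more keys, and the key type of "value_labels" (int or str) is an assumption here.

```python
from typing import Any


def variable_entries(metadata: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the per-column entries, whichever key the file format uses."""
    # "variables" for SPSS/Stata, "columns" for SAS, as described above.
    return metadata.get("variables") or metadata.get("columns") or []


def column_labels(metadata: dict[str, Any]) -> dict[str, str]:
    """Map column name -> variable label, skipping columns without a label."""
    return {
        entry["name"]: entry["label"]
        for entry in variable_entries(metadata)
        if entry.get("label")
    }


def decode(metadata: dict[str, Any], column: str, value: Any) -> Any:
    """Translate a coded value to its label; fall back to the raw value."""
    for entry in variable_entries(metadata):
        if entry["name"] == column:
            return (entry.get("value_labels") or {}).get(value, value)
    return value


# Hand-written stand-in for reader.metadata (assumed shape, not real output).
sample = {
    "variables": [
        {"name": "sex", "label": "Sex of respondent",
         "value_labels": {1: "Male", 2: "Female"}},
        {"name": "age", "label": "Age in years"},
        {"name": "id"},
    ]
}

print(column_labels(sample))    # {'sex': 'Sex of respondent', 'age': 'Age in years'}
print(decode(sample, "sex", 2))  # Female
```

With a real file, the same helpers would be called on ScanReadstat(path=...).metadata instead of the sample dict.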

Polars lazy evaluation

scan_readstat returns a LazyFrame, so Polars can push operations into the reader before any data is loaded:

Read only specific columns — column selection is pushed into the reader; unselected columns are never read from disk:

lf = scan_readstat("file.sav")
df = lf.select(["id", "age", "income"]).collect()

Read the first N rows. head() / limit() stops the reader after N rows, so you never load the full file:

df = scan_readstat("file.sas7bdat").head(1000).collect()

Filter rows — filters are applied in Polars after reading, but still benefit from column pushdown if combined with .select():

df = scan_readstat("file.dta").select(["id", "age"]).filter(pl.col("age") >= 18).collect()

The benchmark numbers in the Benchmark section below reflect these optimizations: the large "Subset: True" speedups come from column pushdown.

3) Write (Experimental)

Writing support is experimental and compatibility varies across tools. Stata roundtrip tests are included; SPSS roundtrip coverage is limited. Please report issues.

from polars_readstat import write_readstat, write_sas_csv_import

write_readstat(df, "/path/out.dta")
write_readstat(df, "/path/out.sav")
write_sas_csv_import(df, "/path/out/sas_bundle", dataset_name="my_data")

write_readstat supports Stata (dta) and SPSS (sav).
Use write_sas_csv_import for SAS-ingestible output (.csv + .sas import script). Binary .sas7bdat writing is not currently supported.
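
One way to guard against passing an unsupported target to write_readstat is to check the file extension first. The sketch below is a hypothetical helper (check_write_target is not part of polars_readstat) that encodes only what is stated above: .dta and .sav are writable, and .sas7bdat should go through write_sas_csv_import instead.

```python
from pathlib import Path

# Extensions that write_readstat is described as supporting above.
WRITE_READSTAT_SUFFIXES = {".dta", ".sav"}


def check_write_target(path: str) -> str:
    """Return the normalized suffix if write_readstat can target it, else raise."""
    suffix = Path(path).suffix.lower()
    if suffix in WRITE_READSTAT_SUFFIXES:
        return suffix
    if suffix == ".sas7bdat":
        raise ValueError(
            "binary .sas7bdat writing is not supported; "
            "use write_sas_csv_import for SAS-ingestible output"
        )
    raise ValueError(f"unsupported output extension: {suffix!r}")


print(check_write_target("/path/out.dta"))  # .dta
print(check_write_target("/path/OUT.SAV"))  # .sav
```

A caller would run check_write_target(path) before write_readstat(df, path), so a wrong extension fails early with a readable message.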

Docs

View the docs at https://jrothbaum.github.io/polars_readstat/ for more information on the options you can pass to the scan and write functions.

Benchmark

Benchmarks compare four scenarios: 1) load the full file, 2) load a subset of columns (Subset: True), 3) filter to a subset of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset: True, Filter: True).

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (16 cores), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Last run: May 14, 2026 — polars-readstat v0.17.0 vs pandas and pyreadstat
  • Method: wall-clock timings via Python time.time()
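
For readers who want to reproduce this kind of measurement, here is a minimal wall-clock timing sketch built on time.time(). It is not the project's benchmark script: the time_call helper, the best-of-N policy, and the dummy workload are all assumptions made for illustration.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn `repeats` times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.time()
        result = fn(*args, **kwargs)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.time() - start)
    return best, result


# Dummy workload standing in for a file read.
seconds, total = time_call(sum, range(1_000_000))
print(f"{seconds:.4f} s, result={total}")
```

To time an actual read, the callable could be something like lambda: scan_readstat(path).collect(), with path pointing at a local file.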

Results are compared to pandas and pyreadstat (using read_file_multiprocessing for parallel processing in pyreadstat).

SAS

All times in seconds (speedup relative to pandas in parentheses).

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
|---|---|---|---|---|
| polars_readstat | 0.55 (3.9×) | 0.07 (28.4×) | 1.46 (2.0×) | 0.08 (39.4×) |
| pandas | 2.16 | 1.99 | 2.93 | 3.15 |
| pyreadstat | 6.76 (0.3×) | 1.64 (1.2×) | 7.86 (0.4×) | 2.18 (1.4×) |

Stata

All times in seconds (speedup relative to pandas in parentheses).

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
|---|---|---|---|---|
| polars_readstat | 0.16 (7.3×) | 0.10 (11.7×) | 0.18 (7.3×) | 0.09 (13.8×) |
| pandas | 1.17 | 1.17 | 1.31 | 1.24 |
| pyreadstat | 5.48 (0.2×) | 4.57 (0.3×) | 5.67 (0.2×) | 7.69 (0.2×) |

SPSS

All times in seconds (speedup relative to pandas in parentheses).

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
|---|---|---|---|---|
| polars_readstat | 1.09 (62.5×) | 0.15 (3.9×) | 1.10 (62.4×) | 0.15 (3.9×) |
| pandas | 68.12 | 0.59 | 68.67 | 0.59 |
| pyreadstat | 3.06 (22.3×) | 1.15 (0.5×) | 7.09 (9.7×) | 1.23 (0.5×) |

zsav

All times in seconds (speedup relative to pandas in parentheses).

| Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True |
|---|---|---|---|---|
| polars_readstat | 3.97 (5.9×) | 1.04 (2.1×) | 4.77 (4.7×) | 1.15 (2.0×) |
| pandas | 23.47 | 2.20 | 22.40 | 2.29 |

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.
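
The speedup figures in the tables are the pandas time divided by the library's time for the same scenario, so values below 1× mean slower than pandas. As a quick check, the sketch below recomputes the polars_readstat speedups for the SAS table from the timings listed there. The speedup helper and variable names are illustrative only.

```python
def speedup(baseline_seconds: float, seconds: float) -> float:
    """Speedup relative to the baseline, rounded to one decimal place."""
    return round(baseline_seconds / seconds, 1)


scenarios = ["Full File", "Subset", "Filter", "Subset + Filter"]
pandas_sas = [2.16, 1.99, 2.93, 3.15]           # pandas timings, SAS table
polars_readstat_sas = [0.55, 0.07, 1.46, 0.08]  # polars_readstat timings, SAS table

sas_speedups = [speedup(p, r) for p, r in zip(pandas_sas, polars_readstat_sas)]
for name, value in zip(scenarios, sas_speedups):
    print(f"{name}: {value}x")
# Full File: 3.9x
# Subset: 28.4x
# Filter: 2.0x
# Subset + Filter: 39.4x
```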

Tests run

Test coverage includes:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results in the Benchmark section above).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.

