Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a Polars-first API. The project includes cross-library parity tests and roundtrip checks to reduce regressions.

The Rust engine is faster for many workloads, but performance varies by file shape and options. If you need the legacy C/C++ engine, install version 0.11.1 (polars-readstat==0.11.1).

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • The API is Polars-first (scan_readstat, read_readstat, write_readstat, write_sas_csv_import).
  • Because scan_readstat returns a Polars LazyFrame, column selection and row limits are pushed into the reader — only the data you actually need is read from disk.

Install

pip install polars-readstat

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat")
#   do something
df = lf.collect()

df = (
    scan_readstat("/path/file.sas7bdat")
    .select(["SERIALNO", "AGEP"])   # column pushdown — only these columns are read
    .head(1_000)                    # row limit is pushed down too
    .filter(pl.col("AGEP") >= 18)   # filters applied to streamed batches to avoid loading full file into memory
    .collect()
)

2) Getting metadata

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema           # polars.Schema
metadata = reader.metadata       # dict with file info and per-column details
lf = reader.df                   # LazyFrame — same as calling scan_readstat(path)

metadata is a dict with a variables (SPSS/Stata) or columns (SAS) list. Each entry includes:

  • "name" — column name
  • "label" — variable label (description), if present
  • "value_labels" — dict mapping coded values to label strings, if present
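As a rough sketch of how this structure might be consumed, the helpers below collect variable labels and value labels into plain dicts. They rely only on the keys listed above ("name", "label", "value_labels") and on the top-level "variables" / "columns" list. The function names and the sample dict are hypothetical and not part of polars_readstat.

```python
from typing import Any


def _entries(metadata: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the per-column entries, whichever top-level key the format uses."""
    # SPSS/Stata files expose "variables"; SAS files expose "columns".
    return metadata.get("variables") or metadata.get("columns") or []


def variable_labels(metadata: dict[str, Any]) -> dict[str, str]:
    """Map column name -> variable label, skipping columns without a label."""
    return {e["name"]: e["label"] for e in _entries(metadata) if e.get("label")}


def value_label_maps(metadata: dict[str, Any]) -> dict[str, dict[Any, str]]:
    """Map column name -> {coded value: label string} for labelled columns."""
    return {
        e["name"]: dict(e["value_labels"])
        for e in _entries(metadata)
        if e.get("value_labels")
    }


# Hypothetical metadata shaped like the description above (not real output).
sample = {
    "variables": [
        {"name": "id"},
        {
            "name": "sex",
            "label": "Sex of respondent",
            "value_labels": {1: "Male", 2: "Female"},
        },
    ]
}

labels = variable_labels(sample)
codes = value_label_maps(sample)
```

With a real file, the dict from ScanReadstat(path=...).metadata would be passed in place of sample.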

Polars lazy evaluation

scan_readstat returns a LazyFrame, so Polars can push operations into the reader before any data is loaded:

Read only specific columns — column selection is pushed into the reader; unselected columns are never read from disk:

lf = scan_readstat("file.sav")
df = lf.select(["id", "age", "income"]).collect()

Read the first N rows. head() / limit() stops the reader after N rows, so the full file is never loaded:

df = scan_readstat("file.sas7bdat").head(1000).collect()

Filter rows — filters are applied in Polars after reading, but still benefit from column pushdown if combined with .select():

df = scan_readstat("file.dta").select(["id", "age"]).filter(pl.col("age") >= 18).collect()

The benchmark numbers below reflect these optimizations. The large "Subset: True" speedups come from column pushdown.
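One way to check what a query will ask of the reader is to inspect the optimized plan before collecting. The sketch below uses Polars' LazyFrame.explain(). The helper name and the file path are hypothetical, and the exact plan text depends on the Polars version, so no output is shown.

```python
def explain_subset(path, columns, n_rows=None):
    """Return the optimized query plan for a column/row subset as a string."""
    # Imported here so the helper can be defined without the packages installed.
    from polars_readstat import scan_readstat

    lf = scan_readstat(path).select(columns)
    if n_rows is not None:
        lf = lf.head(n_rows)
    # explain() builds and optimizes the plan; it does not execute the query.
    return lf.explain()


# Example call (hypothetical path):
# print(explain_subset("file.sav", ["id", "age"], n_rows=1000))
```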

3) Write (Experimental)

Writing support is experimental and compatibility varies across tools. Stata roundtrip tests are included; SPSS roundtrip coverage is limited. Please report issues.

from polars_readstat import write_readstat, write_sas_csv_import

write_readstat(df, "/path/out.dta")
write_readstat(df, "/path/out.sav")
write_sas_csv_import(df, "/path/out/sas_bundle", dataset_name="my_data")

write_readstat supports Stata (dta) and SPSS (sav).
Use write_sas_csv_import for SAS-ingestible output (.csv + .sas import script). Binary .sas7bdat writing is not currently supported.
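The rules above can be sketched as a small dispatcher that picks a writer from the output extension. This is only an illustration: pick_writer and write_any are hypothetical helpers, and treating every other extension (including .zsav) as unsupported is an assumption based on the formats listed above.

```python
from pathlib import Path


def pick_writer(path):
    """Return the name of the writer function suited to an output path."""
    suffix = Path(path).suffix.lower()
    if suffix in {".dta", ".sav"}:
        return "write_readstat"
    if suffix == ".sas7bdat":
        # Binary SAS output is not supported; use the CSV + import script instead.
        return "write_sas_csv_import"
    raise ValueError(f"unsupported output format: {str(path)!r}")


def write_any(df, path, dataset_name="my_data"):
    """Write df with whichever function pick_writer selects."""
    # Imported lazily so pick_writer can be used without the package installed.
    from polars_readstat import write_readstat, write_sas_csv_import

    if pick_writer(path) == "write_readstat":
        write_readstat(df, str(path))
    else:
        # Assumption: the bundle directory is the target path minus its suffix.
        bundle_dir = str(Path(path).with_suffix(""))
        write_sas_csv_import(df, bundle_dir, dataset_name=dataset_name)
```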

Docs

View the docs at https://jrothbaum.github.io/polars_readstat/ for more information on the options you can pass to the scan and write functions.

Benchmark

Benchmarks compare four scenarios: 1) load the full file, 2) load a subset of columns (Subset: True), 3) filter to a subset of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset: True, Filter: True).

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (8 cores, 16 threads), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Last run: May 14, 2026 — polars-readstat v0.17.0 vs pandas and pyreadstat
  • Method: wall-clock timings via Python time.time()
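A minimal sketch of this kind of measurement follows. It uses time.time() as stated above; taking the best of several repeats is an assumption, since the number of runs is not specified here, and both function names are hypothetical. The speedup figures in the tables are the pandas time divided by the library's time.

```python
import time


def wall_clock(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.time()
        fn()
        best = min(best, time.time() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Stand-in workload; a real run would time e.g. a scan_readstat(...).collect() call.
elapsed = wall_clock(lambda: sum(range(100_000)))

# Worked example with the SAS full-file timings reported below:
# pandas 2.16 s, polars_readstat 0.55 s.
sas_full = round(speedup(2.16, 0.55), 1)  # 3.9
```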

Compared to Pandas and Pyreadstat (using read_file_multiprocessing for parallel processing in Pyreadstat)

SAS

All times in seconds (speedup relative to pandas in parentheses).

Library          Full File      Subset: True   Filter: True   Subset: True, Filter: True
polars_readstat  0.55 (3.9×)    0.07 (28.4×)   1.46 (2.0×)    0.08 (39.4×)
pandas           2.16           1.99           2.93           3.15
pyreadstat       6.76 (0.3×)    1.64 (1.2×)    7.86 (0.4×)    2.18 (1.4×)

Stata

All times in seconds (speedup relative to pandas in parentheses).

Library          Full File      Subset: True   Filter: True   Subset: True, Filter: True
polars_readstat  0.16 (7.3×)    0.10 (11.7×)   0.18 (7.3×)    0.09 (13.8×)
pandas           1.17           1.17           1.31           1.24
pyreadstat       5.48 (0.2×)    4.57 (0.3×)    5.67 (0.2×)    7.69 (0.2×)

SPSS

All times in seconds (speedup relative to pandas in parentheses).

Library          Full File      Subset: True   Filter: True   Subset: True, Filter: True
polars_readstat  1.09 (62.5×)   0.15 (3.9×)    1.10 (62.4×)   0.15 (3.9×)
pandas           68.12          0.59           68.67          0.59
pyreadstat       3.06 (22.3×)   1.15 (0.5×)    7.09 (9.7×)    1.23 (0.5×)

zsav

All times in seconds (speedup relative to pandas in parentheses).

Library          Full File      Subset: True   Filter: True   Subset: True, Filter: True
polars_readstat  3.97 (5.9×)    1.04 (2.1×)    4.77 (4.7×)    1.15 (2.0×)
pandas           23.47          2.20           22.40          2.29

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.

Tests run

Test coverage includes:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results above).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • polars_readstat-0.20.1-cp39-abi3-win_amd64.whl (21.1 MB): CPython 3.9+, Windows x86-64
  • polars_readstat-0.20.1-cp39-abi3-manylinux_2_28_x86_64.whl (19.5 MB): CPython 3.9+, manylinux (glibc 2.28+) x86-64
  • polars_readstat-0.20.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB): CPython 3.9+, manylinux (glibc 2.17+) x86-64
  • polars_readstat-0.20.1-cp39-abi3-macosx_11_0_arm64.whl (17.2 MB): CPython 3.9+, macOS 11.0+ ARM64
  • polars_readstat-0.20.1-cp39-abi3-macosx_10_15_x86_64.whl (18.8 MB): CPython 3.9+, macOS 10.15+ x86-64
