Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a Polars-first API. The project includes cross-library parity tests and roundtrip checks to reduce regressions.

The Rust engine is faster for many workloads, but performance varies by file shape and options. If you need the legacy C/C++ engine, pin version 0.11.1 (pip install polars-readstat==0.11.1).

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • The API is Polars-first (scan_readstat, read_readstat, write_readstat, write_sas_csv_import).
  • Because scan_readstat returns a Polars LazyFrame, column selection and row limits are pushed into the reader — only the data you actually need is read from disk.

Install

pip install polars-readstat
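
After installing, it can help to confirm which release is active, since this README names 0.11.1 as the release to use for the legacy C/C++ engine. The sketch below uses only the standard library; the helper names installed_version and is_legacy_release are hypothetical and not part of polars_readstat.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "polars-readstat") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_legacy_release(version_string: Optional[str]) -> bool:
    """True only for 0.11.1, the release this README names for the legacy engine.

    Assumption: earlier releases are not considered here because the README
    does not say which engine they use.
    """
    return version_string == "0.11.1"


if __name__ == "__main__":
    found = installed_version()
    print("polars-readstat version:", found)
    print("legacy C/C++ engine release:", is_legacy_release(found))
```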

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat", preserve_order=True)
df = lf.select(["SERIALNO", "AGEP"]).filter(pl.col("AGEP") >= 18).collect()

2) Getting metadata

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema           # polars.Schema
metadata = reader.metadata       # dict with file info and per-column details
lf = reader.df                   # LazyFrame — same as calling scan_readstat(path)

metadata is a dict with a variables (SPSS/Stata) or columns (SAS) list. Each entry includes:

  • "name" — column name
  • "label" — variable label (description), if present
  • "value_labels" — dict mapping coded values to label strings, if present
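
A minimal sketch of working with that structure, assuming only the shape described above: a "variables" or "columns" list whose entries carry "name", "label", and "value_labels". The sample dict is hand-written for illustration, the helper names are hypothetical, and the key type inside "value_labels" (integers here) is an assumption.

```python
from typing import Any


def variable_entries(metadata: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the per-column entries, whichever key the file format uses."""
    # SPSS/Stata metadata uses "variables"; SAS metadata uses "columns".
    return metadata.get("variables") or metadata.get("columns") or []


def label_lookup(metadata: dict[str, Any]) -> dict[str, str]:
    """Map column name to variable label, skipping columns without a label."""
    return {
        entry["name"]: entry["label"]
        for entry in variable_entries(metadata)
        if entry.get("label")
    }


def decode(metadata: dict[str, Any], column: str, value: Any) -> Any:
    """Translate a coded value to its label; return the value unchanged if unmapped."""
    for entry in variable_entries(metadata):
        if entry["name"] == column:
            return (entry.get("value_labels") or {}).get(value, value)
    return value


# Hypothetical metadata, shaped like the description above.
sample = {
    "variables": [
        {"name": "id"},
        {
            "name": "sex",
            "label": "Sex of respondent",
            "value_labels": {1: "Male", 2: "Female"},
        },
    ]
}

labels = label_lookup(sample)
decoded = decode(sample, "sex", 2)
```

With a real file, the same helpers would be called on reader.metadata from the ScanReadstat example.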

Polars lazy evaluation

scan_readstat returns a LazyFrame, so Polars can push operations into the reader before any data is loaded:

Read only specific columns — column selection is pushed into the reader; unselected columns are never read from disk:

lf = scan_readstat("file.sav")
df = lf.select(["id", "age", "income"]).collect()

Read the first N rows. head() / limit() stops the reader after N rows, so the full file is never loaded:

df = scan_readstat("file.sas7bdat").head(1000).collect()

Filter rows — filters are applied in Polars after reading, but still benefit from column pushdown if combined with .select():

df = scan_readstat("file.dta").select(["id", "age"]).filter(pl.col("age") >= 18).collect()

The benchmark numbers below reflect these optimizations: the large "Subset: True" speedups come from column pushdown.

3) Write (Experimental)

Writing support is experimental and compatibility varies across tools. Stata roundtrip tests are included; SPSS roundtrip coverage is limited. Please report issues.

from polars_readstat import write_readstat, write_sas_csv_import

write_readstat(df, "/path/out.dta")
write_readstat(df, "/path/out.sav")
write_sas_csv_import(df, "/path/out/sas_bundle", dataset_name="my_data")

write_readstat supports Stata (dta) and SPSS (sav).
Use write_sas_csv_import for SAS-ingestible output (.csv + .sas import script). Binary .sas7bdat writing is not currently supported.
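
Because writing is experimental, a quick local roundtrip check can be worthwhile. The sketch below uses only write_readstat and scan_readstat as shown in this README; the helper name roundtrip_matches is hypothetical. It compares shape and column names only, on the assumption that dtypes (and possibly names that a target format does not allow) may legitimately change on the way through.

```python
def roundtrip_matches(df, path):
    """Write `df`, read it back, and compare shape and column names.

    `df` is expected to be a Polars DataFrame and `path` a string ending in
    .dta or .sav. Dtypes are deliberately not compared.
    """
    # Imported inside the function so this module still loads where the
    # optional packages are not installed.
    from polars_readstat import scan_readstat, write_readstat

    write_readstat(df, path)
    back = scan_readstat(path).collect()
    return back.shape == df.shape and back.columns == df.columns


# Usage sketch (requires polars and polars_readstat; the path is a placeholder):
#   import polars as pl
#   df = pl.DataFrame({"id": [1, 2, 3], "age": [34, 51, 17]})
#   print(roundtrip_matches(df, "/tmp/roundtrip_check.dta"))
```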

Docs

View the docs at https://jrothbaum.github.io/polars_readstat/ for more information on the options you can pass to the scan and write functions.

Benchmark

Benchmarks compare four scenarios: 1) load the full file, 2) load a subset of columns (Subset: True), 3) filter to a subset of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset: True, Filter: True).
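
As a rough illustration, the four scenarios can be expressed as four query plans built from one lazy frame. This is a sketch of how they might map onto the Polars API, not the project's benchmark code (see BENCHMARKS.md for that); the function name benchmark_scenarios and the scenario keys are hypothetical.

```python
def benchmark_scenarios(lf, columns, predicate):
    """Build the four query shapes from one lazy frame.

    `lf` is expected to be a Polars LazyFrame (for example the result of
    scan_readstat), `columns` a list of column names, and `predicate` a
    Polars expression that only refers to columns listed in `columns`.
    """
    return {
        "full": lf,                                             # 1) full file
        "subset": lf.select(columns),                           # 2) Subset: True
        "filter": lf.filter(predicate),                         # 3) Filter: True
        "subset_filter": lf.select(columns).filter(predicate),  # 4) both
    }


# Usage sketch (requires polars and polars_readstat; the path is a placeholder):
#   import polars as pl
#   from polars_readstat import scan_readstat
#   plans = benchmark_scenarios(
#       scan_readstat("/path/file.sas7bdat"),
#       ["SERIALNO", "AGEP"],
#       pl.col("AGEP") >= 18,
#   )
#   frames = {name: plan.collect() for name, plan in plans.items()}
```

Nothing is read until collect() is called, so building all four plans up front costs almost nothing.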

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (16 cores), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Last run: May 14, 2026 — polars-readstat v0.17.0 vs pandas and pyreadstat
  • Method: wall-clock timings via Python time.time()
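
A minimal sketch of wall-clock timing with time.time(), for readers who want to time their own files in a comparable way. The helper name time_call is hypothetical, and taking the best of several repeats is an assumption: the README does not say whether the published numbers are single runs, minimums, or averages.

```python
import time


def time_call(fn, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls to fn."""
    timings = []
    for _ in range(repeats):
        start = time.time()
        fn(*args, **kwargs)
        timings.append(time.time() - start)
    return min(timings)


# Stand-in workload; replace with a function that reads and collects a file.
elapsed = time_call(sum, range(1_000_000))
print(f"{elapsed:.4f} s")
```

For very short operations, time.perf_counter() is a monotonic, higher-resolution alternative to time.time().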

Compared with pandas and pyreadstat (pyreadstat runs use read_file_multiprocessing for parallel processing)

SAS

All times in seconds; speedup relative to pandas in parentheses.

Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True
--- | --- | --- | --- | ---
polars_readstat | 0.55 (3.9×) | 0.07 (28.4×) | 1.46 (2.0×) | 0.08 (39.4×)
pandas | 2.16 | 1.99 | 2.93 | 3.15
pyreadstat | 6.76 (0.3×) | 1.64 (1.2×) | 7.86 (0.4×) | 2.18 (1.4×)
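
The parenthesized figures are the pandas time divided by the library's time for the same scenario, so values above 1 mean faster than pandas and values below 1 mean slower. A short worked check against the SAS numbers above (the function name speedup is just for illustration):

```python
def speedup(pandas_seconds: float, library_seconds: float) -> float:
    """Speedup relative to pandas: how many times faster the library was."""
    return pandas_seconds / library_seconds


# SAS, Full File: pandas 2.16 s vs polars_readstat 0.55 s
sas_full = round(speedup(2.16, 0.55), 1)          # 3.9

# SAS, Subset: True: pandas 1.99 s vs polars_readstat 0.07 s
sas_subset = round(speedup(1.99, 0.07), 1)        # 28.4

# SAS, Full File: pandas 2.16 s vs pyreadstat 6.76 s (slower than pandas)
sas_pyreadstat = round(speedup(2.16, 6.76), 1)    # 0.3
```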

Stata

All times in seconds; speedup relative to pandas in parentheses.

Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True
--- | --- | --- | --- | ---
polars_readstat | 0.16 (7.3×) | 0.10 (11.7×) | 0.18 (7.3×) | 0.09 (13.8×)
pandas | 1.17 | 1.17 | 1.31 | 1.24
pyreadstat | 5.48 (0.2×) | 4.57 (0.3×) | 5.67 (0.2×) | 7.69 (0.2×)

SPSS

All times in seconds; speedup relative to pandas in parentheses.

Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True
--- | --- | --- | --- | ---
polars_readstat | 1.09 (62.5×) | 0.15 (3.9×) | 1.10 (62.4×) | 0.15 (3.9×)
pandas | 68.12 | 0.59 | 68.67 | 0.59
pyreadstat | 3.06 (22.3×) | 1.15 (0.5×) | 7.09 (9.7×) | 1.23 (0.5×)

zsav

All times in seconds; speedup relative to pandas in parentheses.

Library | Full File | Subset: True | Filter: True | Subset: True, Filter: True
--- | --- | --- | --- | ---
polars_readstat | 3.97 (5.9×) | 1.04 (2.1×) | 4.77 (4.7×) | 1.15 (2.0×)
pandas | 23.47 | 2.20 | 22.40 | 2.29

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.

Tests run

Test coverage includes:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results above).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.
