Skip to main content

Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

Project description

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a Polars-first API. The project includes cross-library parity tests and roundtrip checks to reduce regressions.

The Rust engine is generally faster for many workloads, but performance varies by file shape and options. If you need the legacy C/C++ engine, use version 0.11.1 (see the prior version).

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • API is Polars-first (scan_readstat, read_readstat, write_readstat, write_sas_csv_import).
  • Because scan_readstat returns a Polars LazyFrame, column selection and row limits are pushed into the reader — only the data you actually need is read from disk.

Install

pip install polars-readstat

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat", preserve_order=True)
df = lf.select(["SERIALNO", "AGEP"]).filter(pl.col("AGEP") >= 18).collect()

2) Getting metadata

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema           # polars.Schema
metadata = reader.metadata       # dict with file info and per-column details
lf = reader.df                   # LazyFrame — same as calling scan_readstat(path)

metadata is a dict with a variables (SPSS/Stata) or columns (SAS) list. Each entry includes:

  • "name" — column name
  • "label" — variable label (description), if present
  • "value_labels" — dict mapping coded values to label strings, if present

Polars lazy evaluation

scan_readstat returns a LazyFrame, so Polars can push operations into the reader before any data is loaded:

Read only specific columns — column selection is pushed into the reader; unselected columns are never read from disk:

lf = scan_readstat("file.sav")
df = lf.select(["id", "age", "income"]).collect()

Read the first N rowshead() / limit() stops the reader after N rows, so you never load the full file:

df = scan_readstat("file.sas7bdat").head(1000).collect()

Filter rows — filters are applied in Polars after reading, but still benefit from column pushdown if combined with .select():

df = scan_readstat("file.dta").select(["id", "age"]).filter(pl.col("age") >= 18).collect()

The benchmark numbers above reflect these optimizations — the large "Subset: True" speedups come from column pushdown.

3) Write (Experimental)

Writing support is experimental and compatibility varies across tools. Stata roundtrip tests are included; SPSS roundtrip coverage is limited. Please report issues.

from polars_readstat import write_readstat, write_sas_csv_import

write_readstat(df, "/path/out.dta")
write_readstat(df, "/path/out.sav")
write_sas_csv_import(df, "/path/out/sas_bundle", dataset_name="my_data")

write_readstat supports Stata (dta) and SPSS (sav).
Use write_sas_csv_import for SAS-ingestible output (.csv + .sas import script). Binary .sas7bdat writing is not currently supported.

Docs

View the docs at https://jrothbaum.github.io/polars_readstat/ for more information on the options you can pass to the scan and write functions.

Benchmark

Benchmarks compare four scenarios: 1) load the full file, 2) load a subset of columns (Subset:True), 3) filter to a subset of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset:True, Filter: True).

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (16 cores), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Last run: May 14, 2026 — polars-readstat v0.17.0 vs pandas and pyreadstat
  • Method: wall-clock timings via Python time.time()

Compared to Pandas and Pyreadstat (using read_file_multiprocessing for parallel processing in Pyreadstat)

SAS

all times in seconds (speedup relative to pandas in parenthesis below each)

Library Full File Subset: True Filter: True Subset: True, Filter: True
polars_readstat 0.55
(3.9×)
0.07
(28.4×)
1.46
(2.0×)
0.08
(39.4×)
pandas 2.16 1.99 2.93 3.15
pyreadstat 6.76
(0.3×)
1.64
(1.2×)
7.86
(0.4×)
2.18
(1.4×)

Stata

all times in seconds (speedup relative to pandas in parenthesis below each)

Library Full File Subset: True Filter: True Subset: True, Filter: True
polars_readstat 0.16
(7.3×)
0.10
(11.7×)
0.18
(7.3×)
0.09
(13.8×)
pandas 1.17 1.17 1.31 1.24
pyreadstat 5.48
(0.2×)
4.57
(0.3×)
5.67
(0.2×)
7.69
(0.2×)

SPSS

all times in seconds (speedup relative to pandas in parenthesis below each)

Library Full File Subset: True Filter: True Subset: True, Filter: True
polars_readstat 1.09
(62.5×)
0.15
(3.9×)
1.10
(62.4×)
0.15
(3.9×)
pandas 68.12 0.59 68.67 0.59
pyreadstat 3.06
(22.3×)
1.15
(0.5×)
7.09
(9.7×)
1.23
(0.5×)

zsav

all times in seconds (speedup relative to pandas in parenthesis below each)

Library Full File Subset: True Filter: True Subset: True, Filter: True
polars_readstat 3.97
(5.9×)
1.04
(2.1×)
4.77
(4.7×)
1.15
(2.0×)
pandas 23.47 2.20 22.40 2.29

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.

Tests run

Test coverage includes:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results below).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_readstat-0.19.1-cp39-abi3-win_amd64.whl (20.8 MB view details)

Uploaded CPython 3.9+Windows x86-64

polars_readstat-0.19.1-cp39-abi3-manylinux_2_28_x86_64.whl (19.3 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.28+ x86-64

polars_readstat-0.19.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.3 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

polars_readstat-0.19.1-cp39-abi3-macosx_11_0_arm64.whl (17.0 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

polars_readstat-0.19.1-cp39-abi3-macosx_10_15_x86_64.whl (18.6 MB view details)

Uploaded CPython 3.9+macOS 10.15+ x86-64

File details

Details for the file polars_readstat-0.19.1-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.19.1-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 26ad7ecb3dc0e095625e2b0a618457d2c2aa0aabd4466a090ed4deb949ac6719
MD5 83c5b06b9b9677911c3b6765e579201c
BLAKE2b-256 b97862cda8e7e766c54be948694de33aafb625678c10f33fadb3587dc53f4f83

See more details on using hashes here.

File details

Details for the file polars_readstat-0.19.1-cp39-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.19.1-cp39-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 fb2974d1ade45c4a6f4e6af38f8cfa13a29b6aa05e046237e9eeda5f7a29e38a
MD5 13294252704f40a4376cc699744552be
BLAKE2b-256 6ef76361f40c466834ba6ba6fd525754f027cb3d974c5813379edd1356c88e03

See more details on using hashes here.

File details

Details for the file polars_readstat-0.19.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.19.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 cc72a82428a5f69d2fac4d26e16d570ec7516854ca9c4bb9d2e8f5c4d6a8d814
MD5 0826e115b0f781d403ca97fd8592c82a
BLAKE2b-256 d0f8b7e66eac41f5c986330cb4b1a305d18e746b51d7fdbbc7f5714133bd37c6

See more details on using hashes here.

File details

Details for the file polars_readstat-0.19.1-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.19.1-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d8c71b5f31b655efe8dda6c8d4e6c028135849d69105a0cd53ed645c9996ca1c
MD5 7cae1c43168f187156c40a5f313800f2
BLAKE2b-256 8479d9bb6bd9e6d16eda388b401e5dd6d188c6a766c128e7a1af0ce521cbbad4

See more details on using hashes here.

File details

Details for the file polars_readstat-0.19.1-cp39-abi3-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.19.1-cp39-abi3-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 12234980d728b2d5ce4b192e11e52891b34e9cb0342a5c9fe2b97936dd7b40fd
MD5 e769b9c6c24a8da01367f098c8d14750
BLAKE2b-256 4541a3b413655d0c9d004928aa83f4dc7f9e91106a332ae4643e7a961a3388a2

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page