Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

Project description

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a Polars-first API. The project includes cross-library parity tests and roundtrip checks to reduce regressions.

In the project's benchmarks the Rust engine is faster for many workloads, but performance varies by file shape and options. If you need the legacy C/C++ engine, pin version 0.11.1 (pip install polars-readstat==0.11.1).

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • The API is Polars-first (scan_readstat, read_readstat, write_readstat, write_sas_csv_import).
  • Because scan_readstat returns a Polars LazyFrame, column selection and row limits are pushed into the reader — only the data you actually need is read from disk.

Install

pip install polars-readstat

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat", preserve_order=True)
df = lf.select(["SERIALNO", "AGEP"]).filter(pl.col("AGEP") >= 18).collect()

2) Getting metadata

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema           # polars.Schema
metadata = reader.metadata       # dict with file info and per-column details
lf = reader.df                   # LazyFrame — same as calling scan_readstat(path)

metadata is a dict containing a "variables" list (SPSS/Stata) or a "columns" list (SAS). Each entry includes:

  • "name" — column name
  • "label" — variable label (description), if present
  • "value_labels" — dict mapping coded values to label strings, if present
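As a rough illustration of the shape described above, the sketch below pulls variable labels and value labels out of a metadata dict. The sample dict is hand-written for the example, and the helper names (column_entries, column_labels, decode_value) are hypothetical; only the "variables"/"columns" keys and the "name", "label", and "value_labels" fields come from the description above.

```python
def column_entries(metadata):
    """Return the per-column list, whichever key the file format uses."""
    # SPSS/Stata files use "variables"; SAS files use "columns".
    return metadata.get("variables") or metadata.get("columns") or []


def column_labels(metadata):
    """Map column name -> variable label, skipping columns without one."""
    return {
        entry["name"]: entry["label"]
        for entry in column_entries(metadata)
        if entry.get("label")
    }


def decode_value(metadata, column, value):
    """Translate a coded value to its label; fall back to the raw value."""
    for entry in column_entries(metadata):
        if entry["name"] == column:
            return (entry.get("value_labels") or {}).get(value, value)
    return value


# Hand-written sample in the documented shape (not output from a real file).
sample_metadata = {
    "variables": [
        {"name": "sex", "label": "Sex of respondent", "value_labels": {1: "Male", 2: "Female"}},
        {"name": "age", "label": "Age in years"},
        {"name": "id"},
    ]
}

print(column_labels(sample_metadata))
print(decode_value(sample_metadata, "sex", 2))
```

Whether coded values appear as numbers or strings in a real file's value_labels is not specified here, so inspect reader.metadata before relying on the lookup.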

Polars lazy evaluation

scan_readstat returns a LazyFrame, so Polars can push operations into the reader before any data is loaded:

Read only specific columns — column selection is pushed into the reader; unselected columns are never read from disk:

lf = scan_readstat("file.sav")
df = lf.select(["id", "age", "income"]).collect()

Read the first N rows: head() / limit() stops the reader after N rows, so the full file is never loaded:

df = scan_readstat("file.sas7bdat").head(1000).collect()

Filter rows — filters are applied in Polars after reading, but still benefit from column pushdown if combined with .select():

df = scan_readstat("file.dta").select(["id", "age"]).filter(pl.col("age") >= 18).collect()

The benchmark numbers in the Benchmark section below reflect these optimizations: the large "Subset: True" speedups come from column pushdown.
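As a conceptual picture of what pushdown means, the toy reader below decodes only the requested fields and stops after the requested number of rows. It is a sketch over a made-up fixed-width record layout, not the actual polars_readstat reader; LAYOUT, RECORD_SIZE, and scan_records are hypothetical names.

```python
import io
import struct

# Hypothetical fixed-width layout: name -> (struct format, byte offset in record).
LAYOUT = {"id": ("<i", 0), "age": ("<i", 4), "income": ("<d", 8)}
RECORD_SIZE = 16


def scan_records(buf, columns, n_rows=None):
    """Yield one dict per record with only `columns`, stopping after `n_rows`."""
    emitted = 0
    while n_rows is None or emitted < n_rows:  # row limit: stop reading early
        record = buf.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:  # end of data
            break
        # Projection: decode only the requested fields, skip the rest.
        yield {
            name: struct.unpack_from(LAYOUT[name][0], record, LAYOUT[name][1])[0]
            for name in columns
        }
        emitted += 1


# Build five in-memory records, then read two columns of the first two rows.
data = b"".join(struct.pack("<iid", i, 20 + i, 1000.0 * i) for i in range(5))
buf = io.BytesIO(data)
rows = list(scan_records(buf, ["id", "age"], n_rows=2))
print(rows)
print(buf.tell())  # bytes consumed so far; later records were never read
```

A filter, by contrast, has to see each row's values before it can discard the row, which is consistent with filters being applied after reading.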

3) Write (Experimental)

Writing support is experimental and compatibility varies across tools. Stata roundtrip tests are included; SPSS roundtrip coverage is limited. Please report issues.

from polars_readstat import write_readstat, write_sas_csv_import

write_readstat(df, "/path/out.dta")
write_readstat(df, "/path/out.sav")
write_sas_csv_import(df, "/path/out/sas_bundle", dataset_name="my_data")

write_readstat supports Stata (dta) and SPSS (sav).
Use write_sas_csv_import for SAS-ingestible output (.csv + .sas import script). Binary .sas7bdat writing is not currently supported.
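To make the CSV-plus-script idea concrete, here is a minimal stdlib-only sketch of what a SAS-ingestible bundle could look like. It is not how write_sas_csv_import is implemented and does not claim to match its output; the helper name write_csv_bundle, the file names, and the PROC IMPORT options are all assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_csv_bundle(rows, out_dir, dataset_name):
    """Write <dataset_name>.csv plus a <dataset_name>.sas script that imports it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{dataset_name}.csv"
    sas_path = out_dir / f"{dataset_name}.sas"

    # Data file: header row followed by one line per record.
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    # Import script: a plain PROC IMPORT that reads the CSV by file name.
    script = (
        f'proc import datafile="{csv_path.name}"\n'
        f"    out=work.{dataset_name} dbms=csv replace;\n"
        "    getnames=yes;\n"
        "run;\n"
    )
    sas_path.write_text(script, encoding="utf-8")
    return csv_path, sas_path


with tempfile.TemporaryDirectory() as tmp:
    csv_path, sas_path = write_csv_bundle(
        [{"id": 1, "age": 34}, {"id": 2, "age": 51}], tmp, "my_data"
    )
    csv_text = csv_path.read_text(encoding="utf-8")
    sas_text = sas_path.read_text(encoding="utf-8")

print(csv_text)
print(sas_text)
```

The appeal of this approach is that SAS itself produces the binary dataset when the script runs, which sidesteps writing .sas7bdat directly.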

Docs

View the docs at https://jrothbaum.github.io/polars_readstat/ for more information on the options you can pass to the scan and write functions.

Benchmark

Benchmarks compare four scenarios: 1) load the full file, 2) load a subset of columns (Subset: True), 3) filter to a subset of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset: True, Filter: True).

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (16 cores), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Last run: May 14, 2026 — polars-readstat v0.17.0 vs pandas and pyreadstat
  • Method: wall-clock timings via Python time.time()
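A wall-clock timing helper in the spirit of the method listed above might look like the sketch below. The project's actual harness is not shown here, so the helper name time_call, the repeat count, and the choice to keep the best run are assumptions.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return the best wall-clock duration in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.time()
        func(*args, **kwargs)
        best = min(best, time.time() - start)
    return best


# Stand-in workload; a real run would time a reader call instead.
elapsed = time_call(sum, range(100_000))
print(f"{elapsed:.4f} s")
```

For very short intervals, time.perf_counter() is usually the better clock; time.time() is kept here to mirror the stated method.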

Comparisons are against pandas and pyreadstat (with pyreadstat using read_file_multiprocessing for parallel reads).

SAS

All times in seconds; speedup relative to pandas in parentheses.

Library         | Full File    | Subset: True | Filter: True | Subset: True, Filter: True
polars_readstat | 0.55 (3.9×)  | 0.07 (28.4×) | 1.46 (2.0×)  | 0.08 (39.4×)
pandas          | 2.16         | 1.99         | 2.93         | 3.15
pyreadstat      | 6.76 (0.3×)  | 1.64 (1.2×)  | 7.86 (0.4×)  | 2.18 (1.4×)
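The speedup figures are the pandas time divided by the library's time for the same scenario. The short check below recomputes the polars_readstat speedups from the SAS timings reported in this section; the function name speedup is just a label for this example.

```python
def speedup(pandas_seconds, library_seconds):
    """Speedup relative to pandas: values above 1 mean faster than pandas."""
    return pandas_seconds / library_seconds


# SAS timings from this section: (pandas time, polars_readstat time) per scenario.
sas_times = {
    "Full File": (2.16, 0.55),
    "Subset: True": (1.99, 0.07),
    "Filter: True": (2.93, 1.46),
    "Subset: True, Filter: True": (3.15, 0.08),
}

for scenario, (pandas_s, polars_s) in sas_times.items():
    print(f"{scenario}: {speedup(pandas_s, polars_s):.1f}x")
```

Values below 1, such as several of the pyreadstat entries, mean the library was slower than pandas in that scenario.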

Stata

All times in seconds; speedup relative to pandas in parentheses.

Library         | Full File    | Subset: True | Filter: True | Subset: True, Filter: True
polars_readstat | 0.16 (7.3×)  | 0.10 (11.7×) | 0.18 (7.3×)  | 0.09 (13.8×)
pandas          | 1.17         | 1.17         | 1.31         | 1.24
pyreadstat      | 5.48 (0.2×)  | 4.57 (0.3×)  | 5.67 (0.2×)  | 7.69 (0.2×)

SPSS

All times in seconds; speedup relative to pandas in parentheses.

Library         | Full File     | Subset: True | Filter: True  | Subset: True, Filter: True
polars_readstat | 1.09 (62.5×)  | 0.15 (3.9×)  | 1.10 (62.4×)  | 0.15 (3.9×)
pandas          | 68.12         | 0.59         | 68.67         | 0.59
pyreadstat      | 3.06 (22.3×)  | 1.15 (0.5×)  | 7.09 (9.7×)   | 1.23 (0.5×)

zsav

All times in seconds; speedup relative to pandas in parentheses.

Library         | Full File    | Subset: True | Filter: True | Subset: True, Filter: True
polars_readstat | 3.97 (5.9×)  | 1.04 (2.1×)  | 4.77 (4.7×)  | 1.15 (2.0×)
pandas          | 23.47        | 2.20         | 22.40        | 2.29

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.

Tests run

Test coverage includes:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results are in the Benchmark section above).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.
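For a sense of what a cross-library parity check involves, the sketch below compares two tables cell by cell with a float tolerance and treats None and NaN as the same missing value. It is not the project's test code; the helper names and the missing-value rule are assumptions, and the tables are plain dicts of lists so the example needs no third-party packages.

```python
import math


def values_match(a, b, rel_tol=1e-9):
    """Compare two cells, treating None and NaN as equal missing values."""
    a_missing = a is None or (isinstance(a, float) and math.isnan(a))
    b_missing = b is None or (isinstance(b, float) and math.isnan(b))
    if a_missing or b_missing:
        return a_missing and b_missing
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def tables_match(left, right):
    """Compare two column-name -> list-of-values mappings cell by cell."""
    if list(left) != list(right):  # same columns, same order
        return False
    for name in left:
        if len(left[name]) != len(right[name]):
            return False
        if not all(values_match(a, b) for a, b in zip(left[name], right[name])):
            return False
    return True


# Two hand-written tables that differ only in how a missing value is encoded.
reference = {"id": [1, 2, 3], "income": [10.5, float("nan"), 7.25]}
candidate = {"id": [1, 2, 3], "income": [10.5, None, 7.25]}
print(tables_match(reference, candidate))
```

When both sides are Polars DataFrames, polars.testing.assert_frame_equal covers the same ground with more options.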

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • polars_readstat-0.19.4-cp39-abi3-win_amd64.whl (20.8 MB): CPython 3.9+, Windows x86-64
  • polars_readstat-0.19.4-cp39-abi3-manylinux_2_28_x86_64.whl (19.3 MB): CPython 3.9+, manylinux (glibc 2.28+) x86-64
  • polars_readstat-0.19.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.3 MB): CPython 3.9+, manylinux (glibc 2.17+) x86-64
  • polars_readstat-0.19.4-cp39-abi3-macosx_11_0_arm64.whl (17.1 MB): CPython 3.9+, macOS 11.0+ ARM64
  • polars_readstat-0.19.4-cp39-abi3-macosx_10_15_x86_64.whl (18.6 MB): CPython 3.9+, macOS 10.15+ x86-64
