Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

polars_readstat

Polars plugin for SAS (.sas7bdat), Stata (.dta), and SPSS (.sav/.zsav) files.

The Python package wraps the Rust core in polars_readstat_rs and exposes a simple Polars-first API. I have tried to make sure this release introduces no errors or regressions: it is tested against 178 test files drawn from pandas, pyreadstat, and other sources.

The new Rust engine is on par with or faster than the old one for many files, but not always (at least for SAS data sets). If it is slower for your data, or if I missed a bug, you can find information on the prior version and install version 0.11.1 from PyPI.

Why use this?

  • In project benchmarks, the new Rust-backed engine is typically faster than pandas/pyreadstat on large SAS/Stata files, especially for subset/filter workloads.
  • It avoids the older C/C++ toolchain complexity and ships as standard Python wheels.
  • API is Polars-first (scan_readstat, read_readstat, write_readstat).

Install

pip install polars-readstat
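
Since the Rust engine arrives with the 0.12 series and 0.11.1 is the suggested fallback, it can be handy to check which release is actually installed. The following is a minimal sketch using only the standard library. The helper names installed_version and uses_rust_engine are hypothetical, and the rule that every release from 0.12 onward uses the Rust engine is an assumption, not a documented guarantee. It only handles plain numeric version strings.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist="polars-readstat"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None


def uses_rust_engine(ver):
    """Assumption: releases 0.12 and later use the Rust engine."""
    major, minor = (int(part) for part in ver.split(".")[:2])
    return (major, minor) >= (0, 12)


found = installed_version()
if found is None:
    print("polars-readstat is not installed")
elif uses_rust_engine(found):
    print(found, "-> Rust engine (assumed)")
else:
    print(found, "-> prior C/C++ engines (assumed)")
```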

Core API

1) Lazy scan

import polars as pl
from polars_readstat import scan_readstat

lf = scan_readstat("/path/file.sas7bdat", preserve_order=True)
df = lf.select(["SERIALNO", "AGEP"]).filter(pl.col("AGEP") >= 18).collect()

2) Eager read

from polars_readstat import read_readstat

df = read_readstat("/path/file.dta")

3) Metadata + schema

from polars_readstat import ScanReadstat

reader = ScanReadstat(path="/path/file.sav")
schema = reader.schema
metadata = reader.metadata

4) Write (Stata/SPSS) - EXPERIMENTAL

I have access to Stata, so I can test reading the written files back with it, but I do not have access to SPSS. I can make sure my code roundtrips properly, and I'll be adding read tests from other packages (pyreadstat and pandas) to make sure they can read the files I create. I'll need help from others testing the SPSS output before I'm comfortable with that code.

from polars_readstat import write_readstat

write_readstat(df, "/path/out.dta", threads=8)
write_readstat(df, "/path/out.sav")

write_readstat supports Stata (dta) and SPSS (sav). SAS writing is not supported.
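
Because writing is experimental, a quick roundtrip check on your own data may be worthwhile before relying on the output. The sketch below writes a DataFrame to a temporary file, reads it back with the same package, and lists the columns that changed. The helpers roundtrip and changed_columns are hypothetical names, not part of the package, and the comparison assumes the eager read returns a regular Polars DataFrame.

```python
import os
import tempfile


def roundtrip(df, suffix=".dta"):
    """Write df to a temporary file and read it back with polars_readstat."""
    # Imported lazily so the helper can be defined without the package installed.
    from polars_readstat import read_readstat, write_readstat

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip" + suffix)
        write_readstat(df, path)
        return read_readstat(path)


def changed_columns(before, after):
    """Names of columns that are missing, or whose dtype or values differ."""
    changed = []
    for name in before.columns:
        if name not in after.columns:
            changed.append(name)
        elif before[name].dtype != after[name].dtype:
            changed.append(name)
        elif not before[name].equals(after[name]):
            changed.append(name)
    return changed
```

A call such as changed_columns(df, roundtrip(df, ".sav")) returns an empty list when nothing changed. A non-empty list is a prompt to inspect those columns, not necessarily a bug, since a file format may store some types more widely than Polars does.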

Tests run

We’ve tried to test this thoroughly:

  • Cross-library comparisons on the pyreadstat and pandas test data, checking results against polars-readstat==0.11.1, pyreadstat, and pandas.
  • Stata/SPSS read/write roundtrip tests.
  • Large-file read/write benchmark runs on real-world data (results below).

If you want to run the same checks locally, helper scripts and tests are in scripts/ and tests/.

Benchmark

For each file, I compared four scenarios: 1) load the full file, 2) load a subset of columns (Subset: True), 3) filter to a subset of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset: True, Filter: True).
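
The four scenarios can be expressed as lazy queries, roughly as sketched here. This is an illustration and not the actual benchmark script: the function benchmark_queries and the scenario labels are hypothetical, and the column names in the usage comment are borrowed from the lazy scan example. The filter is applied before the select so that the predicate may refer to columns outside the subset.

```python
SCENARIOS = ("full", "subset", "filter", "subset+filter")


def benchmark_queries(path, columns, predicate):
    """Build one lazy query per scenario; rows are only loaded on .collect()."""
    # Imported lazily so the helper can be defined without the package installed.
    from polars_readstat import scan_readstat

    return {
        "full": scan_readstat(path),
        "subset": scan_readstat(path).select(columns),
        "filter": scan_readstat(path).filter(predicate),
        "subset+filter": scan_readstat(path).filter(predicate).select(columns),
    }


# Usage (paths and column names are placeholders):
#   import polars as pl
#   queries = benchmark_queries(
#       "/path/file.sas7bdat", ["SERIALNO", "AGEP"], pl.col("AGEP") >= 18
#   )
#   df = queries["subset+filter"].collect()
```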

Benchmark context:

  • Machine: AMD Ryzen 7 8845HS (16 cores), 14 GiB RAM, Linux Mint 22
  • Storage: external SSD
  • Last run: August 31, 2025
  • Version tested: polars-readstat 0.12 (new Rust engine) against polars-readstat 0.11.1 (prior C++ and C engines) and pandas and pyreadstat
  • Method: wall-clock timings via Python time.time()
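
Wall-clock timing with time.time() can be sketched as below. The helper time_call is a hypothetical name, and taking the best of several repeats is an assumption made for this example; the notes above do not say how repeated runs were handled.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.time()
        fn()
        best = min(best, time.time() - start)
    return best


# Stand-in workload; in practice fn would be something like
# lambda: read_readstat("/path/file.dta")
elapsed = time_call(lambda: sum(range(1_000_000)))
print(f"{elapsed:.3f} s")
```

For short-running calls, time.perf_counter() is a drop-in alternative with a higher-resolution, monotonic clock.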

Compared to pandas and pyreadstat (using read_file_multiprocessing for parallel processing in pyreadstat)

SAS

All times in seconds (speedup relative to pandas in parentheses).

Library                                            | Full File    | Subset: True | Filter: True | Subset: True, Filter: True
polars_readstat, new Rust engine                   | 0.90 (2.3×)  | 0.07 (29.4×) | 1.23 (2.5×)  | 0.07 (29.9×)
polars_readstat, engine="cpp" (fastest for 0.11.1) | 1.31 (1.6×)  | 0.09 (22.9×) | 1.56 (1.9×)  | 0.09 (23.2×)
pandas                                             | 2.07         | 2.06         | 3.03         | 2.09
pyreadstat                                         | 10.75 (0.2×) | 0.46 (4.5×)  | 11.93 (0.3×) | 0.50 (4.2×)
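
The speedup figures are the pandas time divided by the library's time, so values above 1 mean faster than pandas. As a small worked example, the snippet below recomputes the row for the new Rust engine from the SAS timings in the table above; the function name speedups is hypothetical.

```python
# Times in seconds, copied from the SAS table, in column order:
# full file, subset, filter, subset + filter.
pandas_times = [2.07, 2.06, 3.03, 2.09]
rust_times = [0.90, 0.07, 1.23, 0.07]


def speedups(baseline, candidate):
    """Speedup of candidate over baseline, rounded to one decimal place."""
    return [round(b / c, 1) for b, c in zip(baseline, candidate)]


print(speedups(pandas_times, rust_times))  # [2.3, 29.4, 2.5, 29.9]
```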

Stata

All times in seconds (speedup relative to pandas in parentheses).

Library                                                         | Full File   | Subset: True | Filter: True | Subset: True, Filter: True
polars_readstat, new Rust engine                                | 0.17 (6.7×) | 0.12 (9.8×)  | 0.24 (4.1×)  | 0.11 (8.7×)
polars_readstat, engine="readstat" (the only option for 0.11.1) | 1.80 (0.6×) | 0.27 (4.4×)  | 1.31 (0.8×)  | 0.29 (3.3×)
pandas                                                          | 1.14        | 1.18         | 0.99         | 0.96
pyreadstat                                                      | 7.46 (0.2×) | 2.18 (0.5×)  | 7.66 (0.1×)  | 2.24 (0.4×)

Detailed benchmark notes and dataset descriptions are in BENCHMARKS.md.

Source Distributions

No source distribution files available for this release.

Built Distributions

  • polars_readstat-0.12.3-cp39-abi3-win_amd64.whl (20.5 MB): CPython 3.9+, Windows x86-64
  • polars_readstat-0.12.3-cp39-abi3-manylinux_2_28_x86_64.whl (19.0 MB): CPython 3.9+, manylinux (glibc 2.28+) x86-64
  • polars_readstat-0.12.3-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.0 MB): CPython 3.9+, manylinux (glibc 2.17+) x86-64
  • polars_readstat-0.12.3-cp39-abi3-macosx_11_0_arm64.whl (16.8 MB): CPython 3.9+, macOS 11.0+ ARM64
  • polars_readstat-0.12.3-cp39-abi3-macosx_10_15_x86_64.whl (18.3 MB): CPython 3.9+, macOS 10.15+ x86-64
