Skip to main content

Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

Project description

polars_readstat

Polars IO plugin to read SAS (sas7bdat), Stata (dta), and SPSS (sav) files. It's competitive with pandas and pyreadstat for Stata and SPSS files, and can be much faster for SAS files - 50% faster for full files and 20x faster for reading a small subset of columns from a wide file. See some basic benchmarks below.

Install

pip install polars-readstat

Basic usage

import polars as pl
from polars_readstat import scan_readstat
df_stata = scan_readstat("/path/file.dta")
df_sas = scan_readstat("/path/file.sas7bdat")
df_spss = scan_readstat("/path/file.sav")


# Then do any normal thing you'd do in polars
df_stata = (df_stata.head(1000)
                    .filter(pl.col("a") > 0.5)
                    .select(["b","c"]))
...
df_stata = df_stata.collect()


# For sas7bdat files, there are two "engines"
#   1. readstat:  generally, but not always slower
#   2. cpp:       faster, the default, used to be more likely to have errors, but not the case anymore in my testing
#                 If it's going to throw an error, it usually does so quickly
df = scan_readstat("/path/file.sas7bdat",
                   engine="readstat")

df = scan_readstat("/path/file.sas7bdat",
                   engine="cpp")


# If you want to get the metadata, use the ScanReadstat python class:

from polars_readstat import ScanReadstat
reader = ScanReadstat(path=path)  # You can pass engine for sas7bdat files, as above

# Python dictionary with metadata information, including value labels
metadata = reader.metadata  

# Polars schema
schema = reader.schema  

# LazyFrame
df = reader.df  

# You can also use memory mapping to speed up reads with the readstat engine
#   Some caveats, 
#     1) it can use a LOT of ram)
#     2) It doesn't really speed up reads for sas files (and I haven't tested SPSS)
df = scan_readstat("/path/file.dta",
                   use_mmap=True) # default is False 
                   
# Then do any normal thing you'd do in polars

# That's it

:key: Dependencies

This plugin calls rust bindings to load files in chunks, it is only possible due to the following excellent projects:

This takes a modified version of the readstat-rs bindings to readstat's C functions. My modifications:

  • Swapped out the now unmaintained arrow2 crate for Polars
  • Removed the CLI and write capabilities
  • Added read support for Stata (dta) and SPSS (sav) files
  • Removed some intermediate steps that resulted in processing full vectors of data repeatedly before creating polars dataframe
  • Modified the parsing of SAS and Stata data formats (particularly dates and datetimes) to provide a better (?... hopefully) mapping to polars data types
  • Modified readstat to read from a shared memory map to improve multithreaded performance (speeds up Stata file reads in my test, but not SAS)

Because of concerns about the performance of readstat reading large SAS files, I have also started integrating a different engine from the cpp-sas7bdat library. To do so, I have modified it as follows:

  • Added an Arrow sink to read the sas7bdat file to Arrow arrays using the c++ Arrow library
  • Updated the package build to use UV instead of pip for loading conan to manage the C++ packages
  • Added rust ffi bindings to the C++ code to zero-copy pass the Arrow array to rust and polars

Other notable features

  • Multithreaded using the number of pl.thread_pool_size
  • Currently comparable to pandas and pyreadstat or faster (see benchmarks below)

Pending tasks:

  • Write support for Stata (dta) and SPSS (sav) files. Readstat itself cannot write SAS (sas7bdat) files that SAS can read, and I'm not fool enough to try to figure that out. Also, any workflow that involves SAS should be one-way (SAS->something else) so you should only read SAS files, never write them.
  • Unit tests on the data sets used by pyreadstat to confirm that my output matches theirs

Benchmark

For each file, I compared 4 different scenarios: 1) load the full file, 2) load a subset of columns (Subset:True), 3) filter to a subet of rows (Filter: True), 4) load a subset of columns and filter to a subset of rows (Subset:True, Filter: True).

Compared to Pandas and Pyreadstat (using read_file_multiprocessing for parallel processing in Pyreadstat)

SAS

all times in seconds (speedup relative to pandas in parenthesis below each)

Library Full File Subset: True Filter: True Subset: True, Filter: True
polars_readstat
engine="cpp"
(the default option)
1.31
(1.6×)
0.09
(22.9×)
1.56
(1.9×)
0.09
(23.2×)
polars_readstat
engine="readstat"
5.27
(0.4×)
0.69
(3.0×)
7.62
(0.4×)
0.79
(2.6×)
pandas 2.07 2.06 3.03 2.09
pyreadstat 10.75
(0.2×)
0.46
(4.5×)
11.93
(0.3×)
0.50
(4.2×)

Stata

all times in seconds (speedup relative to pandas in parenthesis below each)

Library Full File Subset: True Filter: True Subset: True, Filter: True
polars_readstat
engine="readstat"
(the only option)
1.80
(0.6×)
0.27
(4.4×)
1.31
(0.8×)
0.29
(3.3×)
pandas 1.14 1.18 0.99 0.96
pyreadstat 7.46
(0.2×)
2.18
(0.5×)
7.66
(0.1×)
2.24
(0.4×)

Details

This was run on my computer, with the following specs (and reading the data from an external SSD):
CPU: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics
Cores: 16
RAM: 14Gi
OS: Linux Mint 22
Last Run: August 31, 2025 Version: 0.7 (with mmap in readstat)

This is not intended to be a scientific benchmark, just a test of loading realistic files. The Stata and SAS files used are different. One is tall and narrow (lots of rows, few columns) and the other is shorter and wider (fewer rows, many more columns).

All reported times are in seconds using python's time.time() (I know...).

File details:

  • Stata (dta)
    • 2000 5% sample decennial census file from ipums
    • Schema: Schema({'index': Int32, 'YEAR': Int32, 'SAMPLE': Int32, 'SERIAL': Int32, 'HHWT': Float64, 'CLUSTER': Float64, 'STRATA': Int32, 'GQ': Int32, 'PERNUM': Int32, 'PERWT': Float64, 'RELATE': Int32, 'RELATED': Int32, 'SEX': Int32, 'AGE': Int32, 'MARST': Int32, 'BIRTHYR': Int32})
    • Rows: 10,000,000 (limited to 10 million to fit in laptop memory)
  • SAS (sas7bdat)
    • American Community Survey 5-year file for Illinois, available here as the sas_pil.zip file.
    • Schema: Schema({'RT': String, 'SERIALNO': String, 'DIVISION': String, 'SPORDER': Float64, 'PUMA': String, 'REGION': String, 'STATE': String, 'ADJINC': String, 'PWGTP': Float64, 'AGEP': Float64, 'CIT': String, 'CITWP': Float64, 'COW': String, 'DDRS': String, 'DEAR': String, 'DEYE': String, 'DOUT': String, 'DPHY': String, 'DRAT': String, 'DRATX': String, 'DREM': String, 'ENG': String, 'FER': String, 'GCL': String, 'GCM': String, 'GCR': String, 'HINS1': String, 'HINS2': String, 'HINS3': String, 'HINS4': String, 'HINS5': String, 'HINS6': String, 'HINS7': String, 'INTP': Float64, 'JWMNP': Float64, 'JWRIP': Float64, 'JWTRNS': String, 'LANX': String, 'MAR': String, 'MARHD': String, 'MARHM': String, 'MARHT': String, 'MARHW': String, 'MARHYP': Float64, 'MIG': String, 'MIL': String, 'MLPA': String, 'MLPB': String, 'MLPCD': String, 'MLPE': String, 'MLPFG': String, 'MLPH': String, 'MLPIK': String, 'MLPJ': String, 'NWAB': String, 'NWAV': String, 'NWLA': String, 'NWLK': String, 'NWRE': String, 'OIP': Float64, 'PAP': Float64, 'RELSHIPP': String, 'RETP': Float64, 'SCH': String, 'SCHG': String, 'SCHL': String, 'SEMP': Float64, 'SEX': String, 'SSIP': Float64, 'SSP': Float64, 'WAGP': Float64, 'WKHP': Float64, 'WKL': String, 'WKWN': Float64, 'WRK': String, 'YOEP': Float64, 'ANC': String, 'ANC1P': String, 'ANC2P': String, 'DECADE': String, 'DIS': String, 'DRIVESP': String, 'ESP': String, 'ESR': String, 'FOD1P': String, 'FOD2P': String, 'HICOV': String, 'HISP': String, 'INDP': String, 'JWAP': String, 'JWDP': String, 'LANP': String, 'MIGPUMA': String, 'MIGSP': String, 'MSP': String, 'NAICSP': String, 'NATIVITY': String, 'NOP': String, 'OC': String, 'OCCP': String, 'PAOC': String, 'PERNP': Float64, 'PINCP': Float64, 'POBP': String, 'POVPIP': Float64, 'POWPUMA': String, 'POWSP': String, 'PRIVCOV': String, 'PUBCOV': String, 'QTRBIR': String, 'RAC1P': String, 'RAC2P19': String, 'RAC2P23': String, 'RAC3P': String, 'RACAIAN': String, 'RACASN': String, 'RACBLK': String, 'RACNH': String, 'RACNUM': String, 'RACPI': String, 'RACSOR': String, 'RACWHT': String, 'RC': String, 'SCIENGP': String, 'SCIENGRLP': String, 'SFN': String, 'SFR': String, 'SOCP': String, 'VPS': String, 'WAOB': String, 'FAGEP': String, 'FANCP': String, 'FCITP': String, 'FCITWP': String, 'FCOWP': String, 'FDDRSP': String, 'FDEARP': String, 'FDEYEP': String, 'FDISP': String, 'FDOUTP': String, 'FDPHYP': String, 'FDRATP': String, 'FDRATXP': String, 'FDREMP': String, 'FENGP': String, 'FESRP': String, 'FFERP': String, 'FFODP': String, 'FGCLP': String, 'FGCMP': String, 'FGCRP': String, 'FHICOVP': String, 'FHINS1P': String, 'FHINS2P': String, 'FHINS3C': String, 'FHINS3P': String, 'FHINS4C': String, 'FHINS4P': String, 'FHINS5C': String, 'FHINS5P': String, 'FHINS6P': String, 'FHINS7P': String, 'FHISP': String, 'FINDP': String, 'FINTP': String, 'FJWDP': String, 'FJWMNP': String, 'FJWRIP': String, 'FJWTRNSP': String, 'FLANP': String, 'FLANXP': String, 'FMARP': String, 'FMARHDP': String, 'FMARHMP': String, 'FMARHTP': String, 'FMARHWP': String, 'FMARHYP': String, 'FMIGP': String, 'FMIGSP': String, 'FMILPP': String, 'FMILSP': String, 'FOCCP': String, 'FOIP': String, 'FPAP': String, 'FPERNP': String, 'FPINCP': String, 'FPOBP': String, 'FPOWSP': String, 'FPRIVCOVP': String, 'FPUBCOVP': String, 'FRACP': String, 'FRELSHIPP': String, 'FRETP': String, 'FSCHGP': String, 'FSCHLP': String, 'FSCHP': String, 'FSEMP': String, 'FSEXP': String, 'FSSIP': String, 'FSSP': String, 'FWAGP': String, 'FWKHP': String, 'FWKLP': String, 'FWKWNP': String, 'FWRKP': String, 'FYOEP': String, 'PWGTP1': Float64, 'PWGTP2': Float64, 'PWGTP3': Float64, 'PWGTP4': Float64, 'PWGTP5': Float64, 'PWGTP6': Float64, 'PWGTP7': Float64, 'PWGTP8': Float64, 'PWGTP9': Float64, 'PWGTP10': Float64, 'PWGTP11': Float64, 'PWGTP12': Float64, 'PWGTP13': Float64, 'PWGTP14': Float64, 'PWGTP15': Float64, 'PWGTP16': Float64, 'PWGTP17': Float64, 'PWGTP18': Float64, 'PWGTP19': Float64, 'PWGTP20': Float64, 'PWGTP21': Float64, 'PWGTP22': Float64, 'PWGTP23': Float64, 'PWGTP24': Float64, 'PWGTP25': Float64, 'PWGTP26': Float64, 'PWGTP27': Float64, 'PWGTP28': Float64, 'PWGTP29': Float64, 'PWGTP30': Float64, 'PWGTP31': Float64, 'PWGTP32': Float64, 'PWGTP33': Float64, 'PWGTP34': Float64, 'PWGTP35': Float64, 'PWGTP36': Float64, 'PWGTP37': Float64, 'PWGTP38': Float64, 'PWGTP39': Float64, 'PWGTP40': Float64, 'PWGTP41': Float64, 'PWGTP42': Float64, 'PWGTP43': Float64, 'PWGTP44': Float64, 'PWGTP45': Float64, 'PWGTP46': Float64, 'PWGTP47': Float64, 'PWGTP48': Float64, 'PWGTP49': Float64, 'PWGTP50': Float64, 'PWGTP51': Float64, 'PWGTP52': Float64, 'PWGTP53': Float64, 'PWGTP54': Float64, 'PWGTP55': Float64, 'PWGTP56': Float64, 'PWGTP57': Float64, 'PWGTP58': Float64, 'PWGTP59': Float64, 'PWGTP60': Float64, 'PWGTP61': Float64, 'PWGTP62': Float64, 'PWGTP63': Float64, 'PWGTP64': Float64, 'PWGTP65': Float64, 'PWGTP66': Float64, 'PWGTP67': Float64, 'PWGTP68': Float64, 'PWGTP69': Float64, 'PWGTP70': Float64, 'PWGTP71': Float64, 'PWGTP72': Float64, 'PWGTP73': Float64, 'PWGTP74': Float64, 'PWGTP75': Float64, 'PWGTP76': Float64, 'PWGTP77': Float64, 'PWGTP78': Float64, 'PWGTP79': Float64, 'PWGTP80': Float64})
    • Rows: 623,757

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_readstat-0.11.1-cp39-abi3-win_amd64.whl (7.8 MB view details)

Uploaded CPython 3.9+Windows x86-64

polars_readstat-0.11.1-cp39-abi3-manylinux_2_28_x86_64.whl (8.9 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.28+ x86-64

polars_readstat-0.11.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.1 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.17+ x86-64

polars_readstat-0.11.1-cp39-abi3-macosx_11_0_arm64.whl (6.0 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

polars_readstat-0.11.1-cp39-abi3-macosx_10_15_x86_64.whl (4.7 MB view details)

Uploaded CPython 3.9+macOS 10.15+ x86-64

File details

Details for the file polars_readstat-0.11.1-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.11.1-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 f1dbb4bfbf63975e1ce91fe3824c2577f5db88ff66990f372caed9cdf222d3e5
MD5 0b0e44a59d0047ddb849c4b812e66de3
BLAKE2b-256 99e45281c3f184b6a8b05375331b0a8a958db41ab64e6b47c9be8ae2cb40379f

See more details on using hashes here.

File details

Details for the file polars_readstat-0.11.1-cp39-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.11.1-cp39-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b19b85ff33df1c50ba01f95b0ee5a2df0a4005f3220fc2a3a268865647f76c6b
MD5 04e4e2f00ae1a9d39fd3bbf756f6d659
BLAKE2b-256 4104adc84af911ab48d346996e46a6874d85a71af89d8a8f132ff6bccad6b975

See more details on using hashes here.

File details

Details for the file polars_readstat-0.11.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.11.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2e6fabe3689354f43085eb54aacb4633f793021667abba1a5f1e44dff6c8355d
MD5 3b014e8ac5fb1d437d05efdbfad9f4d8
BLAKE2b-256 3da5749f6c0d4f6fcea1bea32d7e377674c64b3aa7054f3ca127688265827425

See more details on using hashes here.

File details

Details for the file polars_readstat-0.11.1-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.11.1-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4e0dbd62d850e1c4d7e47bccc95bc62850e34ab7fa14041774554160ef15a92e
MD5 dbd185d5c2289f46f0d80f6f4017ee1d
BLAKE2b-256 01fe2874a555840ba94747f37e454512ed891f7120ad5ec8c9cdfb3f5fec1bbc

See more details on using hashes here.

File details

Details for the file polars_readstat-0.11.1-cp39-abi3-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.11.1-cp39-abi3-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 abab34060a9abbaeadd989b4a2005ad3bb2de065531dccf8b750f7712dc28ccb
MD5 f5d7657ce7020de3b1577bba3b0a07e2
BLAKE2b-256 bcda7308039667c97ee84065cb0947e55439bd462f1f2098041277787938a901

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page