Skip to main content

Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

Project description

polars_readstat

Polars IO plugin to read SAS (sas7bdat), Stata (dta), and SPSS (sav) files

Basic usage

import polars as pl
from polars_readstat import scan_readstat
df_stata = scan_readstat("/path/file.dta")
df_sas = scan_readstat("/path/file.sas7bdat")
df_spss = scan_readstat("/path/file.sav")


# Then do any normal thing you'd do in polars
df_stata = (df_stata.head(1000)
                    .filter(pl.col("a") > 0.5)
                    .select(["b","c"]))
...
df_stata = df_stata.collect()
# That's it

:key: Dependencies

This plugin calls rust bindings to load files in chunks, it is only possible due to the following excellent projects:

This takes a modified version of the readstat-rs bindings to readstat's C functions. My modifications:

  • Swapped out the now unmaintained arrow2 crate for Polars
  • Removed the CLI and write capabilities
  • Added read support for Stata (dta) and SPSS (sav) files
  • Removed some intermediate steps that resulted in processing full vectors of data repeatedly before creating polars dataframe
  • Modified the parsing of SAS and Stata data formats (particularly dates and datetimes) to provide a better (?... hopefully) mapping to polars data types

Because of concerns about the performance of readstat reading large SAS files, I have also started integrating a different engine from the cpp-sas7bdat library. To do so, I have modified it as follows:

  • Added an Arrow sink to read the sas7bdat file to Arrow arrays using the c++ Arrow library
  • Updated the package build to use UV instead of pip for loading conan to manage the C++ packages
  • Added rust ffi bindings to the C++ code to zero-copy pass the Arrow array to rust and polars

Other notable features

  • Multithreaded using the number of pl.thread_pool_size (readstat serialization to Arrow only)
  • Currently comparable to pandas and pyreadstat or faster (see benchmarks below)

Pending tasks:

  • Write support for Stata (dta) and SPSS (sav) files. Readstat itself cannot write SAS (sas7bdat) files that SAS can read, and I'm not fool enough to try to figure that out. Also, any workflow that involves SAS should be one-way (SAS->something else) so you should only read SAS files, never write them.
  • Unit tests on the data sets used by pyreadstat to confirm that my output matches theirs

Benchmark

This was run on my computer, with the following specs (and reading the data from an external SSD):
CPU: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics
Cores: 16
RAM: 14Gi
OS: Linux Mint 22

This is not intended to be a scientific benchmark, just a test of loading realistic files. The Stata and SAS files used are different. One is tall and narrow (lots of rows, few columns) and the other is shorter and wider (fewer rows, many more columns).

For each file, I compared 4 different scenarios: 1) load the full file, 2) load a subset of columns, 3) filter to a subet of rows, 4) load a subset of columns and filter to a subset of rows.

All reported times are in seconds using python's time.time() (I know...).

Compared to Pyreadstat (using read_file_multiprocessing for parallel processing)

  • Stata

    • Subset: False, Filter: False
      • Polars: 2.245
      • Pyreadstat: 7.200
      • Are identical: True
    • Subset: True, Filter: False
      • Polars: 0.892
      • Pyreadstat: 2.209
      • Are identical: True
    • Subset: False, Filter: True
      • Polars: 2.250
      • Pyreadstat: 7.598
      • Are identical: True
    • Subset: True, Filter: True
      • Polars: 0.892
      • Pyreadstat: 2.361
      • Are identical: True
  • SAS

    • Subset: False, Filter: False
      • Polars (readstat engine): 5.262
      • Polars (cpp engine): 4.042
      • Pyreadstat: 13.693
      • Are identical: True
    • Subset: True, Filter: False
      • Polars (readstat engine): 2.575
      • Polars (cpp engine): 0.098
      • Pyreadstat: 0.855
      • Are identical: True
    • Subset: False, Filter: True
      • Polars (readstat engine): 3.912
      • Polars (cpp engine): 3.552
      • Pyreadstat: 13.485
      • Are identical: True
    • Subset: True, Filter: True
      • Polars (readstat engine): 1.0439
      • Polars (cpp engine): 0.097
      • Pyreadstat: 1.445
      • Are identical: True

Compared to Pandas

  • Stata
    • Subset: False, Filter: False
      • Polars: 2.312
      • Pandas: 1.276
      • Are identical: True
    • Subset: True, Filter: False
      • Polars: 0.877
      • Pandas: 1.312
      • Are identical: True
    • Subset: False, Filter: True
      • Polars: 2.335
      • Pandas: 1.366
      • Are identical: True
    • Subset: True, Filter: True
      • Polars: 0.908
      • Pandas: 1.378
      • Are identical: True
  • SAS
    • Subset: False, Filter: False
      • Polars: 3.843
      • Pandas: 8.432
      • Are identical: False
    • Subset: True, Filter: False
      • Polars: 1.027
      • Pandas: 4.125
      • Are identical: True
    • Subset: False, Filter: True
      • Polars: 3.848
      • Pandas: 8.534
      • Are identical: False
    • Subset: True, Filter: True
      • Polars: 1.624
      • Pandas: 4.531
      • Are identical: True

File details:

  • Stata (dta)
    • 2000 5% sample decennial census file from ipums
    • Schema: Schema({'index': Int32, 'YEAR': Int32, 'SAMPLE': Int32, 'SERIAL': Int32, 'HHWT': Float64, 'CLUSTER': Float64, 'STRATA': Int32, 'GQ': Int32, 'PERNUM': Int32, 'PERWT': Float64, 'RELATE': Int32, 'RELATED': Int32, 'SEX': Int32, 'AGE': Int32, 'MARST': Int32, 'BIRTHYR': Int32})
    • Rows: 14,081,466
  • SAS (sas7bdat)
    • American Community Survey 5-year file for Illinois, available here as the sas_pil.zip file.
    • Schema: Schema({'RT': String, 'SERIALNO': String, 'DIVISION': String, 'SPORDER': Float64, 'PUMA': String, 'REGION': String, 'STATE': String, 'ADJINC': String, 'PWGTP': Float64, 'AGEP': Float64, 'CIT': String, 'CITWP': Float64, 'COW': String, 'DDRS': String, 'DEAR': String, 'DEYE': String, 'DOUT': String, 'DPHY': String, 'DRAT': String, 'DRATX': String, 'DREM': String, 'ENG': String, 'FER': String, 'GCL': String, 'GCM': String, 'GCR': String, 'HINS1': String, 'HINS2': String, 'HINS3': String, 'HINS4': String, 'HINS5': String, 'HINS6': String, 'HINS7': String, 'INTP': Float64, 'JWMNP': Float64, 'JWRIP': Float64, 'JWTRNS': String, 'LANX': String, 'MAR': String, 'MARHD': String, 'MARHM': String, 'MARHT': String, 'MARHW': String, 'MARHYP': Float64, 'MIG': String, 'MIL': String, 'MLPA': String, 'MLPB': String, 'MLPCD': String, 'MLPE': String, 'MLPFG': String, 'MLPH': String, 'MLPIK': String, 'MLPJ': String, 'NWAB': String, 'NWAV': String, 'NWLA': String, 'NWLK': String, 'NWRE': String, 'OIP': Float64, 'PAP': Float64, 'RELSHIPP': String, 'RETP': Float64, 'SCH': String, 'SCHG': String, 'SCHL': String, 'SEMP': Float64, 'SEX': String, 'SSIP': Float64, 'SSP': Float64, 'WAGP': Float64, 'WKHP': Float64, 'WKL': String, 'WKWN': Float64, 'WRK': String, 'YOEP': Float64, 'ANC': String, 'ANC1P': String, 'ANC2P': String, 'DECADE': String, 'DIS': String, 'DRIVESP': String, 'ESP': String, 'ESR': String, 'FOD1P': String, 'FOD2P': String, 'HICOV': String, 'HISP': String, 'INDP': String, 'JWAP': String, 'JWDP': String, 'LANP': String, 'MIGPUMA': String, 'MIGSP': String, 'MSP': String, 'NAICSP': String, 'NATIVITY': String, 'NOP': String, 'OC': String, 'OCCP': String, 'PAOC': String, 'PERNP': Float64, 'PINCP': Float64, 'POBP': String, 'POVPIP': Float64, 'POWPUMA': String, 'POWSP': String, 'PRIVCOV': String, 'PUBCOV': String, 'QTRBIR': String, 'RAC1P': String, 'RAC2P19': String, 'RAC2P23': String, 'RAC3P': String, 'RACAIAN': String, 'RACASN': String, 'RACBLK': String, 'RACNH': String, 'RACNUM': String, 'RACPI': String, 'RACSOR': String, 'RACWHT': String, 'RC': String, 'SCIENGP': String, 'SCIENGRLP': String, 'SFN': String, 'SFR': String, 'SOCP': String, 'VPS': String, 'WAOB': String, 'FAGEP': String, 'FANCP': String, 'FCITP': String, 'FCITWP': String, 'FCOWP': String, 'FDDRSP': String, 'FDEARP': String, 'FDEYEP': String, 'FDISP': String, 'FDOUTP': String, 'FDPHYP': String, 'FDRATP': String, 'FDRATXP': String, 'FDREMP': String, 'FENGP': String, 'FESRP': String, 'FFERP': String, 'FFODP': String, 'FGCLP': String, 'FGCMP': String, 'FGCRP': String, 'FHICOVP': String, 'FHINS1P': String, 'FHINS2P': String, 'FHINS3C': String, 'FHINS3P': String, 'FHINS4C': String, 'FHINS4P': String, 'FHINS5C': String, 'FHINS5P': String, 'FHINS6P': String, 'FHINS7P': String, 'FHISP': String, 'FINDP': String, 'FINTP': String, 'FJWDP': String, 'FJWMNP': String, 'FJWRIP': String, 'FJWTRNSP': String, 'FLANP': String, 'FLANXP': String, 'FMARP': String, 'FMARHDP': String, 'FMARHMP': String, 'FMARHTP': String, 'FMARHWP': String, 'FMARHYP': String, 'FMIGP': String, 'FMIGSP': String, 'FMILPP': String, 'FMILSP': String, 'FOCCP': String, 'FOIP': String, 'FPAP': String, 'FPERNP': String, 'FPINCP': String, 'FPOBP': String, 'FPOWSP': String, 'FPRIVCOVP': String, 'FPUBCOVP': String, 'FRACP': String, 'FRELSHIPP': String, 'FRETP': String, 'FSCHGP': String, 'FSCHLP': String, 'FSCHP': String, 'FSEMP': String, 'FSEXP': String, 'FSSIP': String, 'FSSP': String, 'FWAGP': String, 'FWKHP': String, 'FWKLP': String, 'FWKWNP': String, 'FWRKP': String, 'FYOEP': String, 'PWGTP1': Float64, 'PWGTP2': Float64, 'PWGTP3': Float64, 'PWGTP4': Float64, 'PWGTP5': Float64, 'PWGTP6': Float64, 'PWGTP7': Float64, 'PWGTP8': Float64, 'PWGTP9': Float64, 'PWGTP10': Float64, 'PWGTP11': Float64, 'PWGTP12': Float64, 'PWGTP13': Float64, 'PWGTP14': Float64, 'PWGTP15': Float64, 'PWGTP16': Float64, 'PWGTP17': Float64, 'PWGTP18': Float64, 'PWGTP19': Float64, 'PWGTP20': Float64, 'PWGTP21': Float64, 'PWGTP22': Float64, 'PWGTP23': Float64, 'PWGTP24': Float64, 'PWGTP25': Float64, 'PWGTP26': Float64, 'PWGTP27': Float64, 'PWGTP28': Float64, 'PWGTP29': Float64, 'PWGTP30': Float64, 'PWGTP31': Float64, 'PWGTP32': Float64, 'PWGTP33': Float64, 'PWGTP34': Float64, 'PWGTP35': Float64, 'PWGTP36': Float64, 'PWGTP37': Float64, 'PWGTP38': Float64, 'PWGTP39': Float64, 'PWGTP40': Float64, 'PWGTP41': Float64, 'PWGTP42': Float64, 'PWGTP43': Float64, 'PWGTP44': Float64, 'PWGTP45': Float64, 'PWGTP46': Float64, 'PWGTP47': Float64, 'PWGTP48': Float64, 'PWGTP49': Float64, 'PWGTP50': Float64, 'PWGTP51': Float64, 'PWGTP52': Float64, 'PWGTP53': Float64, 'PWGTP54': Float64, 'PWGTP55': Float64, 'PWGTP56': Float64, 'PWGTP57': Float64, 'PWGTP58': Float64, 'PWGTP59': Float64, 'PWGTP60': Float64, 'PWGTP61': Float64, 'PWGTP62': Float64, 'PWGTP63': Float64, 'PWGTP64': Float64, 'PWGTP65': Float64, 'PWGTP66': Float64, 'PWGTP67': Float64, 'PWGTP68': Float64, 'PWGTP69': Float64, 'PWGTP70': Float64, 'PWGTP71': Float64, 'PWGTP72': Float64, 'PWGTP73': Float64, 'PWGTP74': Float64, 'PWGTP75': Float64, 'PWGTP76': Float64, 'PWGTP77': Float64, 'PWGTP78': Float64, 'PWGTP79': Float64, 'PWGTP80': Float64})
    • Rows: 623,757

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_readstat-0.5.1-cp39-abi3-win_amd64.whl (7.7 MB view details)

Uploaded CPython 3.9+Windows x86-64

polars_readstat-0.5.1-cp39-abi3-manylinux_2_28_x86_64.whl (7.7 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.28+ x86-64

polars_readstat-0.5.1-cp39-abi3-macosx_11_0_arm64.whl (5.8 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

polars_readstat-0.5.1-cp39-abi3-macosx_10_15_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.9+macOS 10.15+ x86-64

File details

Details for the file polars_readstat-0.5.1-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.5.1-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b5e17ebc78cd05908efa7414316ab1d4a66baa4962a6d7bd5405cfa0523ba5e7
MD5 d430a15820451472b4dd65a2ffb0f729
BLAKE2b-256 3a202856aebbe1f9301fefdd9b12131f5722a74be8463e6d63ba489ec7867f28

See more details on using hashes here.

File details

Details for the file polars_readstat-0.5.1-cp39-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.5.1-cp39-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 456aa987e2b4100113713f852c298982dc668915019290ccee0b26515cb512f0
MD5 d6ea8a2d114d96ac2256ca09b582878b
BLAKE2b-256 9983cbdd94ad735e0d2bb474f532fe26807dcc68d5e557261202986da05d3a27

See more details on using hashes here.

File details

Details for the file polars_readstat-0.5.1-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.5.1-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fa586378ea1c416e798e4be1f34dc2cbac6f1eb1125261cbe3a3dbae5dcebd6d
MD5 ce3b07b3cf016aa7e8f6f11e6271b166
BLAKE2b-256 4323df5819aa81f933de70c27ee099c97919d2dba537933386572355aa112be6

See more details on using hashes here.

File details

Details for the file polars_readstat-0.5.1-cp39-abi3-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.5.1-cp39-abi3-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 aa49e952cc884bd984edd88be02d2bae6546646fb7ad9574d01e83fab0c5f963
MD5 41353441d6679bb983e943368282c8d9
BLAKE2b-256 9836bd07eede4ab47aa5e2129649de6c05d23795ce481c72f5adb23c7883e919

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page