Skip to main content

Read SAS (sas7bdat), Stata (dta), and SPSS (sav) files with polars

Project description

polars_readstat

Polars IO plugin to read SAS (sas7bdat), Stata (dta), and SPSS (sav) files

Basic usage

import polars as pl
from polars_readstat import scan_readstat
df_stata = scan_readstat("/path/file.dta")
df_sas = scan_readstat("/path/file.sas7bdat")
df_spss = scan_readstat("/path/file.sav")


# Then do any normal thing you'd do in polars
df_stata = (df_stata.head(1000)
                    .filter(pl.col("a") > 0.5)
                    .select(["b","c"]))
...
df_stata = df_stata.collect()
# That's it

:key: Dependencies

This plugin calls rust bindings to load files in chunks, it is only possible due to the following excellent projects:

This takes a modified version of the readstat-rs bindings to readstat's C functions. My modifications:

  • Swapped out the now unmaintained arrow2 crate for Polars
  • Removed the CLI and write capabilities
  • Added read support for Stata (dta) and SPSS (sav) files
  • Removed some intermediate steps that resulted in processing full vectors of data repeatedly before creating polars dataframe
  • Modified the parsing of SAS and Stata data formats (particularly dates and datetimes) to provide a better (?... hopefully) mapping to polars data types

Because of concerns about the performance of readstat reading large SAS files, I have also started integrating a different engine from the cpp-sas7bdat library. To do so, I have modified it as follows:

  • Added an Arrow sink to read the sas7bdat file to Arrow arrays using the c++ Arrow library
  • Updated the package build to use UV instead of pip for loading conan to manage the C++ packages
  • Added rust ffi bindings to the C++ code to zero-copy pass the Arrow array to rust and polars

Other notable features

  • Multithreaded using the number of pl.thread_pool_size (readstat serialization to Arrow only)
  • Currently comparable to pandas and pyreadstat or faster (see benchmarks below)

Pending tasks:

  • Write support for Stata (dta) and SPSS (sav) files. Readstat itself cannot write SAS (sas7bdat) files that SAS can read, and I'm not fool enough to try to figure that out. Also, any workflow that involves SAS should be one-way (SAS->something else) so you should only read SAS files, never write them.
  • Unit tests on the data sets used by pyreadstat to confirm that my output matches theirs

Benchmark

This was run on my computer, with the following specs (and reading the data from an external SSD):
CPU: AMD Ryzen 7 8845HS w/ Radeon 780M Graphics
Cores: 16
RAM: 14Gi
OS: Linux Mint 22

This is not intended to be a scientific benchmark, just a test of loading realistic files. The Stata and SAS files used are different. One is tall and narrow (lots of rows, few columns) and the other is shorter and wider (fewer rows, many more columns).

For each file, I compared 4 different scenarios: 1) load the full file, 2) load a subset of columns, 3) filter to a subet of rows, 4) load a subset of columns and filter to a subset of rows.

All reported times are in seconds using python's time.time() (I know...).

Compared to Pyreadstat (using read_file_multiprocessing for parallel processing)

  • Stata

    • Subset: False, Filter: False
      • Polars: 2.245
      • Pyreadstat: 7.200
      • Are identical: True
    • Subset: True, Filter: False
      • Polars: 0.892
      • Pyreadstat: 2.209
      • Are identical: True
    • Subset: False, Filter: True
      • Polars: 2.250
      • Pyreadstat: 7.598
      • Are identical: True
    • Subset: True, Filter: True
      • Polars: 0.892
      • Pyreadstat: 2.361
      • Are identical: True
  • SAS

    • Subset: False, Filter: False
      • Polars (readstat engine): 5.262
      • Polars (cpp engine): 4.042
      • Pyreadstat: 13.693
      • Are identical: True
    • Subset: True, Filter: False
      • Polars (readstat engine): 2.575
      • Polars (cpp engine): 0.098
      • Pyreadstat: 0.855
      • Are identical: True
    • Subset: False, Filter: True
      • Polars (readstat engine): 3.912
      • Polars (cpp engine): 3.552
      • Pyreadstat: 13.485
      • Are identical: True
    • Subset: True, Filter: True
      • Polars (readstat engine): 1.0439
      • Polars (cpp engine): 0.097
      • Pyreadstat: 1.445
      • Are identical: True

Compared to Pandas

  • Stata
    • Subset: False, Filter: False
      • Polars: 2.312
      • Pandas: 1.276
      • Are identical: True
    • Subset: True, Filter: False
      • Polars: 0.877
      • Pandas: 1.312
      • Are identical: True
    • Subset: False, Filter: True
      • Polars: 2.335
      • Pandas: 1.366
      • Are identical: True
    • Subset: True, Filter: True
      • Polars: 0.908
      • Pandas: 1.378
      • Are identical: True
  • SAS
    • Subset: False, Filter: False
      • Polars: 3.843
      • Pandas: 8.432
      • Are identical: False
    • Subset: True, Filter: False
      • Polars: 1.027
      • Pandas: 4.125
      • Are identical: True
    • Subset: False, Filter: True
      • Polars: 3.848
      • Pandas: 8.534
      • Are identical: False
    • Subset: True, Filter: True
      • Polars: 1.624
      • Pandas: 4.531
      • Are identical: True

File details:

  • Stata (dta)
    • 2000 5% sample decennial census file from ipums
    • Schema: Schema({'index': Int32, 'YEAR': Int32, 'SAMPLE': Int32, 'SERIAL': Int32, 'HHWT': Float64, 'CLUSTER': Float64, 'STRATA': Int32, 'GQ': Int32, 'PERNUM': Int32, 'PERWT': Float64, 'RELATE': Int32, 'RELATED': Int32, 'SEX': Int32, 'AGE': Int32, 'MARST': Int32, 'BIRTHYR': Int32})
    • Rows: 14,081,466
  • SAS (sas7bdat)
    • American Community Survey 5-year file for Illinois, available here as the sas_pil.zip file.
    • Schema: Schema({'RT': String, 'SERIALNO': String, 'DIVISION': String, 'SPORDER': Float64, 'PUMA': String, 'REGION': String, 'STATE': String, 'ADJINC': String, 'PWGTP': Float64, 'AGEP': Float64, 'CIT': String, 'CITWP': Float64, 'COW': String, 'DDRS': String, 'DEAR': String, 'DEYE': String, 'DOUT': String, 'DPHY': String, 'DRAT': String, 'DRATX': String, 'DREM': String, 'ENG': String, 'FER': String, 'GCL': String, 'GCM': String, 'GCR': String, 'HINS1': String, 'HINS2': String, 'HINS3': String, 'HINS4': String, 'HINS5': String, 'HINS6': String, 'HINS7': String, 'INTP': Float64, 'JWMNP': Float64, 'JWRIP': Float64, 'JWTRNS': String, 'LANX': String, 'MAR': String, 'MARHD': String, 'MARHM': String, 'MARHT': String, 'MARHW': String, 'MARHYP': Float64, 'MIG': String, 'MIL': String, 'MLPA': String, 'MLPB': String, 'MLPCD': String, 'MLPE': String, 'MLPFG': String, 'MLPH': String, 'MLPIK': String, 'MLPJ': String, 'NWAB': String, 'NWAV': String, 'NWLA': String, 'NWLK': String, 'NWRE': String, 'OIP': Float64, 'PAP': Float64, 'RELSHIPP': String, 'RETP': Float64, 'SCH': String, 'SCHG': String, 'SCHL': String, 'SEMP': Float64, 'SEX': String, 'SSIP': Float64, 'SSP': Float64, 'WAGP': Float64, 'WKHP': Float64, 'WKL': String, 'WKWN': Float64, 'WRK': String, 'YOEP': Float64, 'ANC': String, 'ANC1P': String, 'ANC2P': String, 'DECADE': String, 'DIS': String, 'DRIVESP': String, 'ESP': String, 'ESR': String, 'FOD1P': String, 'FOD2P': String, 'HICOV': String, 'HISP': String, 'INDP': String, 'JWAP': String, 'JWDP': String, 'LANP': String, 'MIGPUMA': String, 'MIGSP': String, 'MSP': String, 'NAICSP': String, 'NATIVITY': String, 'NOP': String, 'OC': String, 'OCCP': String, 'PAOC': String, 'PERNP': Float64, 'PINCP': Float64, 'POBP': String, 'POVPIP': Float64, 'POWPUMA': String, 'POWSP': String, 'PRIVCOV': String, 'PUBCOV': String, 'QTRBIR': String, 'RAC1P': String, 'RAC2P19': String, 'RAC2P23': String, 'RAC3P': String, 'RACAIAN': String, 'RACASN': String, 'RACBLK': String, 'RACNH': String, 'RACNUM': String, 'RACPI': String, 'RACSOR': String, 'RACWHT': String, 'RC': String, 'SCIENGP': String, 'SCIENGRLP': String, 'SFN': String, 'SFR': String, 'SOCP': String, 'VPS': String, 'WAOB': String, 'FAGEP': String, 'FANCP': String, 'FCITP': String, 'FCITWP': String, 'FCOWP': String, 'FDDRSP': String, 'FDEARP': String, 'FDEYEP': String, 'FDISP': String, 'FDOUTP': String, 'FDPHYP': String, 'FDRATP': String, 'FDRATXP': String, 'FDREMP': String, 'FENGP': String, 'FESRP': String, 'FFERP': String, 'FFODP': String, 'FGCLP': String, 'FGCMP': String, 'FGCRP': String, 'FHICOVP': String, 'FHINS1P': String, 'FHINS2P': String, 'FHINS3C': String, 'FHINS3P': String, 'FHINS4C': String, 'FHINS4P': String, 'FHINS5C': String, 'FHINS5P': String, 'FHINS6P': String, 'FHINS7P': String, 'FHISP': String, 'FINDP': String, 'FINTP': String, 'FJWDP': String, 'FJWMNP': String, 'FJWRIP': String, 'FJWTRNSP': String, 'FLANP': String, 'FLANXP': String, 'FMARP': String, 'FMARHDP': String, 'FMARHMP': String, 'FMARHTP': String, 'FMARHWP': String, 'FMARHYP': String, 'FMIGP': String, 'FMIGSP': String, 'FMILPP': String, 'FMILSP': String, 'FOCCP': String, 'FOIP': String, 'FPAP': String, 'FPERNP': String, 'FPINCP': String, 'FPOBP': String, 'FPOWSP': String, 'FPRIVCOVP': String, 'FPUBCOVP': String, 'FRACP': String, 'FRELSHIPP': String, 'FRETP': String, 'FSCHGP': String, 'FSCHLP': String, 'FSCHP': String, 'FSEMP': String, 'FSEXP': String, 'FSSIP': String, 'FSSP': String, 'FWAGP': String, 'FWKHP': String, 'FWKLP': String, 'FWKWNP': String, 'FWRKP': String, 'FYOEP': String, 'PWGTP1': Float64, 'PWGTP2': Float64, 'PWGTP3': Float64, 'PWGTP4': Float64, 'PWGTP5': Float64, 'PWGTP6': Float64, 'PWGTP7': Float64, 'PWGTP8': Float64, 'PWGTP9': Float64, 'PWGTP10': Float64, 'PWGTP11': Float64, 'PWGTP12': Float64, 'PWGTP13': Float64, 'PWGTP14': Float64, 'PWGTP15': Float64, 'PWGTP16': Float64, 'PWGTP17': Float64, 'PWGTP18': Float64, 'PWGTP19': Float64, 'PWGTP20': Float64, 'PWGTP21': Float64, 'PWGTP22': Float64, 'PWGTP23': Float64, 'PWGTP24': Float64, 'PWGTP25': Float64, 'PWGTP26': Float64, 'PWGTP27': Float64, 'PWGTP28': Float64, 'PWGTP29': Float64, 'PWGTP30': Float64, 'PWGTP31': Float64, 'PWGTP32': Float64, 'PWGTP33': Float64, 'PWGTP34': Float64, 'PWGTP35': Float64, 'PWGTP36': Float64, 'PWGTP37': Float64, 'PWGTP38': Float64, 'PWGTP39': Float64, 'PWGTP40': Float64, 'PWGTP41': Float64, 'PWGTP42': Float64, 'PWGTP43': Float64, 'PWGTP44': Float64, 'PWGTP45': Float64, 'PWGTP46': Float64, 'PWGTP47': Float64, 'PWGTP48': Float64, 'PWGTP49': Float64, 'PWGTP50': Float64, 'PWGTP51': Float64, 'PWGTP52': Float64, 'PWGTP53': Float64, 'PWGTP54': Float64, 'PWGTP55': Float64, 'PWGTP56': Float64, 'PWGTP57': Float64, 'PWGTP58': Float64, 'PWGTP59': Float64, 'PWGTP60': Float64, 'PWGTP61': Float64, 'PWGTP62': Float64, 'PWGTP63': Float64, 'PWGTP64': Float64, 'PWGTP65': Float64, 'PWGTP66': Float64, 'PWGTP67': Float64, 'PWGTP68': Float64, 'PWGTP69': Float64, 'PWGTP70': Float64, 'PWGTP71': Float64, 'PWGTP72': Float64, 'PWGTP73': Float64, 'PWGTP74': Float64, 'PWGTP75': Float64, 'PWGTP76': Float64, 'PWGTP77': Float64, 'PWGTP78': Float64, 'PWGTP79': Float64, 'PWGTP80': Float64})
    • Rows: 623,757

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

polars_readstat-0.4.2-cp39-abi3-win_amd64.whl (7.7 MB view details)

Uploaded CPython 3.9+Windows x86-64

polars_readstat-0.4.2-cp39-abi3-manylinux_2_28_x86_64.whl (7.7 MB view details)

Uploaded CPython 3.9+manylinux: glibc 2.28+ x86-64

polars_readstat-0.4.2-cp39-abi3-macosx_11_0_arm64.whl (5.8 MB view details)

Uploaded CPython 3.9+macOS 11.0+ ARM64

polars_readstat-0.4.2-cp39-abi3-macosx_10_15_x86_64.whl (6.3 MB view details)

Uploaded CPython 3.9+macOS 10.15+ x86-64

File details

Details for the file polars_readstat-0.4.2-cp39-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.4.2-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 4f576f42f36c715a37fee70c288b763a7266a0bdf269aa834ed5313dc7896beb
MD5 0a49fbec9b2a38baf633b3b7ee3c7112
BLAKE2b-256 c2976029e07dadac214065578fa72f2aec5bea70a6d4e9ccf744ef8c71f409eb

See more details on using hashes here.

File details

Details for the file polars_readstat-0.4.2-cp39-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.4.2-cp39-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 318f855580619992defddb0ebe3774132744f0929d1fd2a4a01e2c4fc240e98b
MD5 a678e8e319fa213f10847bf66ea8f9df
BLAKE2b-256 57cb161e886533c038051048fd5e0b0955c394ab32aeb0228564f1aea0a80e6d

See more details on using hashes here.

File details

Details for the file polars_readstat-0.4.2-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.4.2-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 e4f5cb7dadd84b42a31fbd7ef878e1dffa7f485a13e35dd854403f6d881af12a
MD5 8881788f42fec95554cb523ea310748d
BLAKE2b-256 a50fbe2104d0ace498d531de97dd58c6465396648f4ac8fa88b4686b0bc27ffd

See more details on using hashes here.

File details

Details for the file polars_readstat-0.4.2-cp39-abi3-macosx_10_15_x86_64.whl.

File metadata

File hashes

Hashes for polars_readstat-0.4.2-cp39-abi3-macosx_10_15_x86_64.whl
Algorithm Hash digest
SHA256 19f9876688a9f5f29c90cd9a8a5cdfc169b82f50c9a1594eb669ef5ada11f315
MD5 88941c3317553f8e6a6e6bad57278cd9
BLAKE2b-256 b693ff593580105f9303d3c0c1c7e33818750bcbfd09c0c0e1887dc2d2187692

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page