Python tools for working with data from the Healthcare Cost and Utilization Project (http://hcup-us.ahrq.gov).
PyHCUP is a Python library for parsing and importing data obtained from the Healthcare Cost and Utilization Project (HCUP, http://hcup-us.ahrq.gov).
Most of the data provided by HCUP comes as fixed-width text files (ASCII, *.asc), with the metadata needed to read them supplied in separate load files. This library is built to use the SAS-format load files (*.sas).
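To make the fixed-width idea concrete, here is a minimal sketch that needs neither PyHCUP nor pandas. The column names and widths in LAYOUT are hypothetical stand-ins for what a load file would describe; they are not taken from a real HCUP file.

```python
# Hypothetical layout: (field name, width) pairs. A real layout would come
# from the load file that accompanies the data file.
LAYOUT = [("AGE", 3), ("FEMALE", 2), ("LOS", 5)]


def parse_fixed_width(line, layout):
    """Slice one fixed-width record into a dict of stripped string values."""
    record = {}
    start = 0
    for name, width in layout:
        # Each field occupies the next `width` characters of the line.
        record[name] = line[start:start + width].strip()
        start += width
    return record


# One made-up 10-character record: AGE=' 45', FEMALE=' 1', LOS='    3'.
sample = " 45 1    3"
row = parse_fixed_width(sample, LAYOUT)
```

The data file carries no delimiters or header of its own, which is why the separate load file is required to interpret it.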
Example Usage
Load a datafile/loadfile combination:

    import pyhcup

    # specify where your data and loadfiles live
    datafile = 'D:\\Users\\hcup\\sid\\NY_SID_2009_CORE.asc'
    loadfile = 'D:\\Users\\hcup\\sid\\sasload\\NY_SID_2009_CORE.sas'

    # pull basic meta from SAS loadfile
    meta_df = pyhcup.sas.meta_from_sas(loadfile)

    # use meta knowledge to parse datafile into a pandas DataFrame
    df = pyhcup.sas.df_from_sas(datafile, meta_df)

Very large files that cannot be held in memory can be handled in two ways.
To import a subset of rows, such as for preliminary work or troubleshooting, specify skiprows and/or readrows when calling sas.df_from_sas():

    # optionally specify readrows and/or skiprows to handle larger files
    df = pyhcup.sas.df_from_sas(datafile, meta_df, readrows=5*10**5, skiprows=10**6)

To iterate through chunks of rows, such as for importing into a database, first use the metadata to build lists of column names and widths. Next, pass them to the pandas read_fwf function with a chunksize to get a reader that yields manageable-sized chunks:

    import pandas as pd

    names = [x for x in meta_df.field]
    widths = [int(x) for x in meta_df.width]
    chunk_size = 500000

    reader = pd.read_fwf(datafile, header=None, widths=widths, names=names, chunksize=chunk_size)
    for df in reader:
        # do your business
        # such as replacing sentinel values (below)
        # or inserting into a database with another Python library
        pass

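As one possible way to fill in the loop body, the sketch below appends each chunk to a SQLite table. It is an illustration under assumptions, not part of PyHCUP: the in-memory text, the column names and widths, and the table name core are all made up, and a real run would take names and widths from meta_df as shown above.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical stand-in for an HCUP data file: three fixed-width columns.
raw = io.StringIO(
    " 45 1    3\n"
    " 67 0   12\n"
    "  9 1    1\n"
)
names = ["AGE", "FEMALE", "LOS"]  # assumed field names
widths = [3, 2, 5]                # assumed field widths

conn = sqlite3.connect(":memory:")

# chunksize=2 is tiny on purpose so the loop runs more than once.
reader = pd.read_fwf(raw, header=None, widths=widths, names=names, chunksize=2)
for chunk in reader:
    # Append each chunk; the table is created on the first pass.
    chunk.to_sql("core", conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM core").fetchone()[0]
```

Only one chunk is held in memory at a time, so peak memory use is governed by the chunk size rather than by the size of the file.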
Whether you are pulling in all records or just a chunk of records, you can also replace the placeholder values that HCUP uses for missing or invalid data. This is tailored to HCUP conventions and is less useful as a generic missing-value parser for non-HCUP files:

    # note: this bulldozes through all values in all columns with no per-column control
    replaced = pyhcup.parser.replace_df_sentinels(df)

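Where per-column control is needed, something along the following lines can be done with plain pandas. This is a hedged sketch and not a PyHCUP function: replace_sentinels_by_column is a hypothetical helper, and the placeholder codes in SENTINELS are assumptions for illustration that should be checked against the HCUP documentation for each data element.

```python
import pandas as pd

# Assumed placeholder codes per column (illustrative only).
SENTINELS = {"AGE": [-99, -88], "LOS": [-9999]}


def replace_sentinels_by_column(df, sentinels):
    """Return a copy of df with the listed codes set to NaN, column by column."""
    out = df.copy()
    for col, codes in sentinels.items():
        if col in out.columns:
            # mask() blanks out only the rows whose value is one of the codes.
            out[col] = out[col].mask(out[col].isin(codes))
    return out


sample_df = pd.DataFrame({"AGE": [45, -99, 67], "LOS": [3, -9999, -99]})
cleaned = replace_sentinels_by_column(sample_df, SENTINELS)
```

Here -99 is treated as a placeholder in AGE but left alone in LOS, which a replacement applied uniformly to every column cannot express.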