Format data for National Death Index (NDI) requests.
Simple module, released under the MIT license, to convert a data table (CSV/SAS7BDAT/JSON) into National Death Index (NDI) format.
About
The formatter converts supported data files into NDI datasets that are acceptable for submission, and the validator checks the result. The validator is not intended to handle an arbitrary NDI file, only one which has been generated by the included formatter.
Disclaimer
No guarantee of any kind is made that this code produces the desired output. Please inspect your own data to ensure that it is correct, and contribute to improve the current formatter/validator.
References
- National Center for Health Statistics. National Death Index user’s guide. Hyattsville, MD. 2013.
- Above cited is available at http://www.cdc.gov/nchs/data/ndi/NDI_Users_Guide.pdf
Requirements
- Python 3.3+
- Optional packages:
- dateutil: enables inference of date (for birthdate)
- sas7bdat: enables parsing of sas7bdat files
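As a rough illustration of what an optional dependency means in practice, the sketch below parses a birthdate with an explicit format when one is given and falls back to dateutil inference only when that package is importable. The function name `parse_birthdate` is hypothetical and is not taken from ndi_formatter; this shows the general pattern, not the module's actual code.

```python
from datetime import datetime

try:
    # Optional dependency: only needed when no explicit format is supplied.
    from dateutil import parser as date_parser
except ImportError:
    date_parser = None


def parse_birthdate(text, date_format=None):
    """Return a datetime.date parsed from `text` (hypothetical helper)."""
    if date_format:
        # Explicit format, using the standard strptime directives.
        return datetime.strptime(text, date_format).date()
    if date_parser is None:
        raise ValueError('no date format given and dateutil is not installed')
    # Let dateutil infer the layout of the date string.
    return date_parser.parse(text).date()


birthdate = parse_birthdate('03/15/1948', '%m/%d/%Y')
print(birthdate.year, birthdate.month, birthdate.day)
```

With an explicit format the helper works without any optional package; inference needs dateutil to be installed.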
Prerequisites
- A supported data file with information that needs to be converted to NDI format.
- Each subject/record must have either...
- FIRST and LAST NAME and SOCIAL SECURITY NUMBER
- FIRST and LAST NAME and MONTH and YEAR OF BIRTH
- SOCIAL SECURITY NUMBER and full DATE OF BIRTH and SEX
- Install Python 3.6+
- (Optional) Install optional packages:
- Install sas7bdat by running
pip install sas7bdat
- Install dateutil by running
pip install python-dateutil
- For issues with proxy, try the answers to this SO question: http://stackoverflow.com/questions/14149422/
Documentation
Installation
Either with pip:
pip install ndi_formatter
Or download the repository:
git clone git@bitbucket.org:dcronkite/ndi_formatter.git
cd ndi_formatter
python setup.py install
Basics
The best way to get started is to figure out which options you need to pass.
# create a sample configuration file
ndi-formatter --create-sample >> sample.config
# see all arguments
ndi-formatter --help
# run with a config file
ndi-formatter "@configfile.conf"
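The "@" prefix looks like the standard argparse `fromfile_prefix_chars` mechanism, in which a file supplies the arguments, one per line. Whether ndi-formatter is built exactly this way is an assumption. The sketch below uses a small stand-in parser with a few of the documented options to show how such a config file would be read.

```python
import argparse
import os
import tempfile


def build_parser():
    """Stand-in parser with a few of the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(fromfile_prefix_chars='@')
    parser.add_argument('-i', '--input-file')
    parser.add_argument('-o', '--output-file')
    parser.add_argument('--fname')
    parser.add_argument('--lname')
    return parser


def parse_from_config(lines):
    """Write `lines` to a temporary config file and parse it via '@path'."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.conf')
        with open(path, 'w') as fh:
            fh.write('\n'.join(lines) + '\n')
        return build_parser().parse_args(['@' + path])


# By default argparse treats each line of the file as exactly one argument,
# so the "--option=value" form keeps an option and its value together.
args = parse_from_config([
    '--input-file=cohort.csv',
    '--output-file=cohort.ndi',
    '--fname=FIRST',
    '--lname=LAST',
])
print(args.input_file, args.fname)
```

The file names and column names here are placeholders.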
Program Options
Once the sample config has been created, you can customize the parameters. The following should be helpful in more explicitly documenting the parameters. Most of these options are mapping a variable/column name in a CSV, SAS, etc. dataset to the type of data which that variable/column contains.
-i INPUT_FILE, --input-file INPUT_FILE
Input file path.
-o OUTPUT_FILE, --output-file OUTPUT_FILE
NDI-formatted output file.
-f {sas,csv,json}, --input-format {sas,csv,json}
Input file format.
-L LOG_FILE, --log-file LOG_FILE
Logfile name.
--fname FNAME Name/index of column with first name
--lname LNAME Name/index of column with last name
--mname MNAME Name/index of column with middle name/initial
--sname SNAME Name/index of column with father name
--name NAME Name/index of column with full name
--ssn SSN Name/index of column with ssn; accepts multiple
columns
--birth-day BIRTH_DAY
Name/index of column with birth day
--birth-month BIRTH_MONTH
Name/index of column with birth month
--birth-year BIRTH_YEAR
Name/index of column with birth year
--birthdate BIRTHDATE
Name/index of column with birthdate
--sex SEX Name/index of column with sex; accepts multiple
columns
--death-age DEATH_AGE
Name/index of column with age at death (in years)
--race RACE Name/index of column with race; accepts multiple
columns
--marital-status MARITAL_STATUS
Name/index of column with marital status; accepts
multiple columns
--state-of-residence STATE_OF_RESIDENCE
Name/index of column with state of residence; accepts
multiple columns
--state-of-birth STATE_OF_BIRTH
Name/index of column with state of birth; accepts
multiple columns
--id ID Name/index of column with id number
--race-mapping OA/PI WH BA NA/IN CH JP HI Onon-WH FL
Mapping of variable to NDI race in following order:
Other Asian/Pacific Islander, White, Black, Native American,
Chinese, Japanese, Hawaiian, Other nonwhite, Filipino;
everything else will be treated as unknown; use an "X"
instead of a value to skip a race
--marital-status-mapping Single Married Widowed Divorced
Mapping of variable to NDI marital status in following
order: Never married/single, Married, Widowed,
Divorced; everything else will be treated as unknown;
use an "X" instead of a value to skip a status
--same-state-of-residence-for-all SAME_STATE_OF_RESIDENCE_FOR_ALL
State abbreviation/number for all subjects
--same-state-of-birth-for-all SAME_STATE_OF_BIRTH_FOR_ALL
State abbreviation/number for all subjects
--age-at-death-units-for-all {MONTH,WEEK,DAY,HOUR,MINUTE}
Specify units for age at death if not years.
--name-format NAME_FORMAT
Format to parse full names. L=Last name, F=first name,
M=Middle name, S=father name, X=ignore; algorithm will
continue to add any character found to the name until
the next non-[LFMSX] character is found
--date-format DATE_FORMAT
Date format for parsing year/month/day from a date;
for more documentation, see
https://docs.python.org/dev/library/datetime.html#strftime-and-strptime-behavior
--sex-format SEX_FORMAT
Specify the values for male/female if different than
NDI using "MALE,FEMALE"; NDI default is "M,F" or "1,2"
or "M1,F2"
--validate-generated-file [VALIDATE_GENERATED_FILE]
Validate NDI file and output results to specified
file.
--strip-lname-suffix [STRIP_LNAME_SUFFIX]
Look for suffixes in lname column and strip them out;
default: JR, SR, II, III, IV; if specifying an
argument, use a comma-separated list as a single
string
--strip-lname-suffix-attached [STRIP_LNAME_SUFFIX_ATTACHED]
Look for suffixes in last word of lname column and
strip them out even if they are attached to the word
itself; default: JR, SR, II, III, IV; if specifying an
argument, use a comma-separated list as a single
string
optional arguments:
--duplicate-records-on-lname
If space or hyphen in last name, duplicate the subject
into three records: 1) both together; 2) only the
first part; 3) only the second part
--female-hyphen-lname-to-sname
If hyphen in last name of female, duplicate the
subject into two records: 1) both together; 2) only
the first part with the second part in the father last
name field
--duplicate-records-on-year-only
Create 12 duplicate records if only a year and no
month
--ignore-invalid-records
Ignore records which are invalid per NDI requirements
due to insufficient information
--include-invalid-records
Include records which are invalid per NDI requirements
due to insufficient information
--case-sensitive-columns
All columns will be treated as case-sensitive.
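To make the last-name options more concrete, here is a minimal sketch of the behavior described for --strip-lname-suffix and --duplicate-records-on-lname. The function names, the handling of commas and periods, and the choice to join "both together" without a separator are assumptions made for illustration; the formatter's actual rules may differ.

```python
import re

# Default suffixes as listed in the option help.
DEFAULT_SUFFIXES = ('JR', 'SR', 'II', 'III', 'IV')


def strip_suffix(lname, suffixes=DEFAULT_SUFFIXES):
    """Drop a trailing suffix that stands as its own word (hypothetical helper)."""
    words = lname.replace(',', ' ').split()
    if len(words) > 1 and words[-1].rstrip('.').upper() in suffixes:
        words = words[:-1]
    return ' '.join(words)


def lname_variants(lname):
    """Return last-name variants for a name split by one space or hyphen.

    Assumption: the name is split at the first separator only, and the
    "both together" variant joins the parts with no separator.
    """
    parts = re.split(r'[\s-]+', lname.strip(), maxsplit=1)
    if len(parts) < 2:
        return [lname.strip()]  # nothing to split, keep a single record
    first, second = parts
    return [first + second, first, second]


print(strip_suffix('Smith Jr.'))
print(lname_variants('GARCIA-LOPEZ'))
```

Each variant would become the last name of a separate output record for the same subject.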
Advanced
Multiple Columns
You can specify multiple columns for most options by passing a comma-separated list of column names as the argument. Names and birthdate are excluded because of complexities in how they are handled, and id is excluded because multiple ids would not make sense.
If the columns hold the same value for a record, only one output record is produced. If the columns hold different values, multiple records are output.
# option to look at two columns for state of residence
# if PRIMARY_STATE == SECONDARY_STATE, only one record will be output
--state-of-residence=PRIMARY_STATE,SECONDARY_STATE
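The collapsing rule described above can be sketched as follows. The function name `expand_records`, the `ID` column, and the output field name are hypothetical; the sketch only illustrates that identical values yield one record and differing values yield several.

```python
def expand_records(row, option_value, field):
    """Return one output record per distinct value across the listed columns."""
    columns = [c.strip() for c in option_value.split(',')]
    distinct = []
    for column in columns:
        value = row[column]
        if value not in distinct:  # identical values collapse into one record
            distinct.append(value)
    return [{'id': row['ID'], field: value} for value in distinct]


option = 'PRIMARY_STATE,SECONDARY_STATE'
same = expand_records(
    {'ID': '1', 'PRIMARY_STATE': 'WA', 'SECONDARY_STATE': 'WA'},
    option, 'state_of_residence')
different = expand_records(
    {'ID': '2', 'PRIMARY_STATE': 'WA', 'SECONDARY_STATE': 'OR'},
    option, 'state_of_residence')
print(len(same), len(different))
```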
Validation
Validation is done during formatting to ensure that patients are eligible to be submitted to NDI (unless suppressed by the --ignore-invalid-records option).
Additional validation is available by including the [recommended] --validate-generated-file option, optionally followed by the name of a file (VALIDATION_ERROR_FILE) to receive the results. This will launch the validator on the NDI file generated by the formatter.
Validation comes in two forms:
- Is the data formatted correctly? (Done by validator)
- Is the record eligible for NDI review? (Done by both formatter and validator)
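The eligibility question corresponds to the three field combinations listed under Prerequisites. A minimal sketch of that check is shown here; the record keys (`fname`, `birth_month`, and so on) are hypothetical names chosen for the example, and the real formatter and validator may apply further rules.

```python
def is_eligible(record):
    """Check a record against the three minimum combinations from Prerequisites."""
    def has(*keys):
        # A field counts as present only if it is non-empty.
        return all(record.get(key) for key in keys)

    return (
        has('fname', 'lname', 'ssn')
        or has('fname', 'lname', 'birth_month', 'birth_year')
        or has('ssn', 'birth_day', 'birth_month', 'birth_year', 'sex')
    )


name_and_ssn = {'fname': 'ANA', 'lname': 'LOPEZ', 'ssn': '123456789'}
name_and_year_only = {'fname': 'ANA', 'lname': 'LOPEZ', 'birth_year': '1948'}
print(is_eligible(name_and_ssn), is_eligible(name_and_year_only))
```

A record with a name and only a birth year fails the check because the second combination also requires the month.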
License
MIT licensed: https://kpwhri.mit-license.org/