A pre-processed longitudinal aggregate table of NVSS birth data in the US from 1968 onward
US Birth Data
This package simplifies the analysis of official birth records maintained by the National Vital Statistics System (NVSS). It does this by aggregating a limited set of common attributes across all years for which the data are available, then storing the resulting data set in the highly compressed Parquet format, which is small enough to be included as part of this package.
Install
The recommended installation method is via pip. This package requires Python 3.8 or higher.
pip install us_birth_data
Use
import us_birth_data as usb
df = usb.load_full_data()
print(df)
year month day_of_week state births
0 1968 April NaN Alabama 4838
1 1968 August NaN Alabama 5754
2 1968 December NaN Alabama 5490
3 1968 February NaN Alabama 4916
4 1968 January NaN Alabama 5172
.. ... ... ... ... ...
79 2015 September Saturday NaN 36236
80 2015 September Sunday NaN 31619
81 2015 September Thursday NaN 53171
82 2015 September Tuesday NaN 65511
83 2015 September Wednesday NaN 64926
[442944 rows x 5 columns]
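The loaded object is a regular pandas DataFrame, so the usual pandas operations apply. The sketch below uses a small hand-built frame with the same five columns so it stays self-contained; with the package installed you would substitute the result of `usb.load_full_data()` for the stand-in:

```python
import pandas as pd

# Stand-in for usb.load_full_data(); same columns as the real table.
df = pd.DataFrame({
    'year': [1968, 1968, 2015, 2015],
    'month': ['April', 'August', 'September', 'September'],
    'day_of_week': [None, None, 'Saturday', 'Sunday'],
    'state': ['Alabama', 'Alabama', None, None],
    'births': [4838, 5754, 36236, 31619],
})

# Total recorded births per year.
births_by_year = df.groupby('year', as_index=False)['births'].sum()
print(births_by_year)
```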
Documentation
Please see the full documentation at readthedocs.
Why
The birth records are quite comprehensive, and go back to 1968. However, longitudinal analysis of these records is challenging. The data sets have gone through numerous schema changes over the decades. Some information that used to be available is no longer included in the public data sets (e.g. state of occurrence), some new information has been added (e.g. delivery method), and many of the fields have undergone transformations over time (e.g. place of delivery used to include "En route or born on arrival (BOA)", but this value was dropped from the records in 1988). None of this is terribly problematic when analysis is performed on only one or two years of records, but spanning the entire length of these data sets requires complex processing.
The raw birth certificate data exceed 5 GB when compressed. Decompressing all of it at once is impractical on a typical workstation, and even after aggressive pruning of columns, loading hundreds of millions of records directly into memory will exhaust most machines.
This issue is solved via a multi-step data processing pipeline that incrementally decompresses the raw birth record data, prunes columns, and then reduces rows through aggregation of grouped records. The years are then combined, with additional logic to map similar attributes to consistent values over time. The result is a data set which can easily be shared, but still rich enough to perform meaningful analysis.
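The per-year "prune then aggregate" step can be sketched roughly as follows. This is an illustrative reconstruction, not the package's actual internals: the column names and function are hypothetical, but the technique (keep a handful of grouping columns, then collapse identical rows into a count) is the one described above.

```python
import pandas as pd

def reduce_year(raw: pd.DataFrame) -> pd.DataFrame:
    """Prune a year's raw records to a few columns, then collapse
    identical rows into a single row with a 'births' count."""
    kept = ['year', 'month', 'day_of_week', 'state']
    cols = [c for c in kept if c in raw.columns]  # tolerate schema changes
    return (
        raw[cols]
        .groupby(cols, dropna=False)  # keep groups with missing values
        .size()
        .reset_index(name='births')
    )

# Three raw records collapse to two aggregated rows.
raw = pd.DataFrame({
    'year': [1968, 1968, 1968],
    'month': ['April', 'April', 'May'],
    'day_of_week': [None, None, None],
    'state': ['Alabama', 'Alabama', 'Alabama'],
})
out = reduce_year(raw)
print(out)
```

Years reduced this way can then be stacked with `pd.concat`, with any value-mapping logic applied before the concatenation.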
Most attributes of the birth data are excluded. If you need additional detail, you can use this package to generate your own data sets.