A pre-processed longitudinal aggregate table of NVSS birth data in the US from 1968 onward

US Birth Data

This package simplifies analysis of the official birth records maintained by the National Vital Statistics System (NVSS). It aggregates a limited set of common attributes across every year for which data are available, then stores the resulting data set in the highly compressed Parquet format, which keeps it small enough to download and share easily.

Install

The recommended installation method is pip. This package requires Python 3.8 or higher.

pip install us_birth_data

Because of its size, the data set is not bundled with the pip installation. Instead, the package provides a function that fetches the data and makes it available for use.

After installation, call the download_full_data function to obtain the data from the GitHub repository that hosts the source code.

from us_birth_data import download_full_data
download_full_data()

Use

import us_birth_data as usb
df = usb.load_full_data()
print(df)
        year      month day_of_week  ... age_of_mother parity births
0       1968      April         NaN  ...          13.0    NaN      2
1       1968      April         NaN  ...          14.0    NaN     10
2       1968      April         NaN  ...          15.0    NaN     22
3       1968      April         NaN  ...          16.0    NaN     56
4       1968      April         NaN  ...          17.0    NaN    102
      ...        ...         ...  ...           ...    ...    ...
100279  2019  September   Wednesday  ...          27.0    3.0      1
100280  2019  September   Wednesday  ...          28.0    NaN      1
100281  2019  September   Wednesday  ...          30.0    7.0      1
100282  2019  September   Wednesday  ...          35.0    NaN      1
100283  2019  September   Wednesday  ...          36.0    NaN      1
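Because each row is already an aggregate, totals come from summing the births column rather than counting rows. The sketch below illustrates this with a small hand-built DataFrame that stands in for the result of load_full_data(). The column names are taken from the printed output above; the sample values and the helper functions births_per_year and mean_age_of_mother are illustrative assumptions, not part of this package.

```python
import pandas as pd

# Hypothetical stand-in for usb.load_full_data(): a few rows using column
# names visible in the printed output above (values are illustrative only).
sample = pd.DataFrame(
    {
        "year": [1968, 1968, 1968, 2019, 2019],
        "month": ["April", "April", "April", "September", "September"],
        "age_of_mother": [13.0, 14.0, 15.0, 27.0, 28.0],
        "parity": [float("nan"), float("nan"), float("nan"), 3.0, float("nan")],
        "births": [2, 10, 22, 1, 1],
    }
)


def births_per_year(df):
    """Total births per year: sum the pre-aggregated counts, not the row count."""
    return df.groupby("year")["births"].sum()


def mean_age_of_mother(df):
    """Birth-weighted mean maternal age per year, skipping rows with unknown age."""
    known = df.dropna(subset=["age_of_mother"])
    weighted = (known["age_of_mother"] * known["births"]).groupby(known["year"]).sum()
    return weighted / known.groupby("year")["births"].sum()


totals = births_per_year(sample)
mean_age = mean_age_of_mother(sample)
print(totals)
print(mean_age)
```

With the real data, the same helpers could be called on the DataFrame returned by load_full_data(), assuming the column types match those of the sample.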

Documentation

Please see the full documentation on Read the Docs.

Why

The birth records are comprehensive and go back to 1968. However, longitudinal analysis of these records is challenging. The data sets have gone through numerous schema changes over the decades. Some information that used to be available is no longer included in the public data sets (e.g. state of occurrence), some new information has been added (e.g. delivery method), and many of the fields have changed over time (e.g. place of delivery used to include "En route or born on arrival (BOA)", but this value was dropped from the records in 1988). None of this is much of a problem when analysis covers only one or two years of records, but spanning the entire length of these data sets requires complex processing.

The raw birth certificate data exceed 5 GB when compressed. Decompressing all of these data at once is impractical on a typical workstation, and even after aggressive pruning of columns, loading hundreds of millions of records directly into memory will exceed the capacity of most workstations.

This issue is solved with a multi-step data processing pipeline that incrementally decompresses the raw birth record data, prunes columns, and then reduces rows by aggregating grouped records. The years are then combined, with additional logic to map similar attributes to consistent values over time. The result is a data set that is easy to share but still rich enough to support meaningful analysis.
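The steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not this package's actual pipeline: it assumes gzip-compressed CSV input (the real raw files may use a different layout), and the names KEEP, HARMONIZE, aggregate_year, combine, and make_raw_file, as well as the mapping of the dropped place-of-delivery value to "Other", are hypothetical.

```python
import csv
import gzip
import io
from collections import Counter

# Hypothetical columns to keep; every other column is pruned while streaming.
KEEP = ("year", "month", "place_of_delivery")

# Hypothetical harmonization table: a value that only exists in older years
# is mapped onto a label that is valid across all years.
HARMONIZE = {"place_of_delivery": {"En route or BOA": "Other"}}


def aggregate_year(gz_bytes, keep=KEEP):
    """Stream one compressed CSV and count births per combination of kept columns."""
    counts = Counter()
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):  # rows are read one at a time
            key = tuple(
                HARMONIZE.get(col, {}).get(row[col], row[col]) for col in keep
            )
            counts[key] += 1
    return counts


def combine(yearly_counts):
    """Merge the per-year aggregates into one longitudinal table."""
    total = Counter()
    for counts in yearly_counts:
        total.update(counts)
    return total


def make_raw_file(rows):
    """Build a small compressed CSV in memory to stand in for a raw yearly file."""
    buffer = io.StringIO()
    fields = ["year", "month", "place_of_delivery", "unused"]
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return gzip.compress(buffer.getvalue().encode("utf-8"))


raw_1987 = make_raw_file([
    {"year": "1987", "month": "April", "place_of_delivery": "Hospital", "unused": "x"},
    {"year": "1987", "month": "April", "place_of_delivery": "Hospital", "unused": "y"},
    {"year": "1987", "month": "April", "place_of_delivery": "En route or BOA", "unused": "z"},
])
raw_1989 = make_raw_file([
    {"year": "1989", "month": "May", "place_of_delivery": "Hospital", "unused": "x"},
    {"year": "1989", "month": "May", "place_of_delivery": "Other", "unused": "y"},
])

table = combine(aggregate_year(raw) for raw in (raw_1987, raw_1989))
for key, births in sorted(table.items()):
    print(key, births)
```

Processing one year at a time keeps memory use bounded by the number of distinct attribute combinations rather than by the number of raw records, which is why the aggregate stays small.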

Most attributes of the birth data are excluded. If you need additional detail, you can use this package to generate your own data sets.
