Some helper functions for working with Census 2020 data
Helper functions for Census 2020 data
Every decade the US Census Bureau releases data from its decennial census. However, the files it provides are quite complicated, and while the Bureau offers help for SAS and R, it offers none for Python.
This package provides some convenience functions for playing around with all of this Census data in Python.
Requirements
We require Python 3.7.1 or above. This package uses pyarrow to make manipulating these large data sets easier, and on some systems you may encounter installation troubles. If you do, feel free to file an issue!
To install the package, simply run
pip install census2020
Usage
Getting the data
To use this package, you should first download the Census data. We've included a simple CLI for you to grab all of the data and preprocess it:
census2020 pull-all --output data
Here data is a folder into which all the processed data will be dumped. WARNING: It totals about 1.4GB after it's processed.
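Given the size warning above, it can be worth checking free disk space before pulling. The following is a minimal sketch using only the standard library; the helper name has_room and the constant NEEDED_BYTES are made up for illustration and are not part of census2020. It only checks the final figure quoted above, not any temporary space the download itself may need.

```python
import shutil

# Rough size of the processed data, taken from the warning above (about 1.4GB).
NEEDED_BYTES = int(1.4 * 1024**3)


def has_room(folder, needed_bytes=NEEDED_BYTES):
    """Return True if the filesystem holding `folder` has enough free space."""
    return shutil.disk_usage(folder).free >= needed_bytes


# Check the current directory, where a relative output folder would be created.
if not has_room("."):
    print("Less than about 1.4GB free; consider another output location.")
```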
If for some reason the CLI doesn't work, you can pull the data by hand as follows:
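from pathlib import Path
import pyarrow.parquet as pq
import us
from census2020 import downloader
output_dir = Path("data")
# Create the output folder if it does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)
# All 50 states plus DC, sorted by abbreviation so the order is deterministic
for state in sorted(set(us.STATES) | {us.states.DC}, key=lambda s: s.abbr):
    print(f"Downloading {state.name}...")
    table = downloader.get_state(state.abbr)
    pq.write_table(table, output_dir / f"{state.abbr.lower()}.parquet")
    print(f"Done with {state.name}")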
Historical data
There's a good chance you're downloading this data to compare it to historical data. If so, you can download the PL94 data from the 2010 Census similarly to above. Just run
census2020 pull-all --output data2010 --year 2010
Or if the CLI doesn't work for you, change the download call inside the loop above to read:
table = downloader.get_state(state.abbr, year=2010)
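Once both years are on disk, a comparison usually comes down to matching records by some identifier and computing a change. Here is a minimal sketch of that step with plain dictionaries. The function percent_change, the keys "A" through "D", and all of the counts are made-up placeholders, not real Census values and not part of census2020.

```python
def percent_change(old, new):
    """Percent change for every key present in both dicts with a nonzero old value."""
    return {
        key: 100.0 * (new[key] - old[key]) / old[key]
        for key in old.keys() & new.keys()
        if old[key] != 0
    }


# Placeholder counts keyed by a hypothetical geography identifier.
counts_2010 = {"A": 200, "B": 50, "C": 0}
counts_2020 = {"A": 250, "B": 40, "D": 10}

# "C" is skipped (missing in 2020) and "D" is skipped (missing in 2010).
changes = percent_change(counts_2010, counts_2020)
```

Keys that appear in only one year are dropped rather than guessed at, so it is worth checking how many records fail to match before trusting the result.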
Reading the data
Reading all of the data into memory can be difficult, so we have provided some interfaces to pyarrow's filtering features to help.
For example, suppose you wanted the total population of people who identify as both White and Asian in all Census Tracts in Kentucky, Indiana, and Ohio. Assuming you have downloaded all the data, you can run the following code:
from census2020 import readers
from census2020.constants import SummaryLevel
df = readers.read_filtered_dataset(
    "data",
    states=["KY", "IN", "OH"],
    levels=SummaryLevel.STATE_COUNTY_TRACT,
    columns="P0010013",
).to_pandas()
Here "data" is the location to which you downloaded the Census data, which can be either the 2020 or 2010 data.
Each of states, columns, and levels can be either a single value or a list of values. If no value is specified, then all states, columns, and levels available will be returned.
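One common way to support this "single value or list" convention is to normalize each argument up front. The sketch below shows the idea; as_list is a hypothetical helper and is not taken from the census2020 source, which may handle this differently.

```python
def as_list(value):
    """Normalize a scalar-or-list argument; None means "no filter"."""
    if value is None:
        return None
    # A string is iterable, but here it is one value, not a list of characters.
    if isinstance(value, str):
        return [value]
    try:
        return list(value)
    except TypeError:
        # Not iterable (for example an int or an enum member): wrap it.
        return [value]


single = as_list("P0010013")
several = as_list(["KY", "IN", "OH"])
everything = as_list(None)
```

Treating None separately keeps "no filter" distinct from an empty list, which would otherwise select nothing.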
Codebook
More detail on the information in these files is available from the Census Bureau. In particular, a summary of the fields' meanings is available in this Excel file, reproduced in this repo as field_names.xlsx.
License
MIT