Some helper functions for working with Census 2020 data

# Helper functions for Census 2020 data

Every decade the US Census Bureau releases data from its decennial census. However, the files they provide are quite complicated. And while they provide SAS and R, they don't provide any help for Python.

This package provides some convenience functions for playing around with all of this Census data in Python.

## Requirements

We require Python 3.7.1 or above. This package does use pyarrow to make manipulating these large data sets easier. However, on some systems, you may encounter installation troubles. If you do, feel free to file an issue!

To install the package, simply run

pip install census2020


## Usage

### Getting the data

To use this package, you should first download the Census data. We've included a simple CLI for you to grab all of the data and preprocess it:

census2020 pull-all --output data


Here data is a folder into which all the processed data will be dumped. WARNING: It totals about 1.4GB after it's processed.

If for some reason CLI doesn't work, you can pull it by hand as follows:

from pathlib import Path

import pyarrow.parquet as pq
import us

output_dir = Path("data")

for state in sorted(set(us.STATES) | {us.states.DC}):
pq.write_table(table, output_dir / f"{state.abbr.lower()}.parquet")
print(f"Done with {state.name}")


### Historical data

There's a good chance you're downloading this data to compare it to historical data. If so, you can download the PL94 data from the 2010 Census similarly to above. Just run

census2020 pull-all --output data2010 --year 2010


Or if the CLI doesn't work for you, change the loop above to read:

table = downloader.get_state(state.abbr, year=2010)


Reading in all the data into memory can be a bit of a difficult task, so we have provided some interfaces to pyarrow's filtering features to help.

For example, suppose you wanted the total population of people who identify as both White and Asian in all Census Tracts in Kentucky, Indiana, and Ohio. Assuming you have downloaded all the data, you can run the following code:

from census2020 import readers
from census2020.constants import SummaryLevel

"data",
states=["KY", "IN", "OH"],
levels=SummaryLevel.STATE_COUNTY_TRACT,
columns="P0010013",
).to_pandas()


Here "data" is the location to which you downloaded the Census data, which can be either the 2020 or 2010 data.

Each of states, columns, and levels can be either singular values or lists of values. If no value is specified, then all states, columns, and levels available will be returned.

## Codebook

More detail on the information in these files is available from the Census Bureau. In particular, a summary of the fields meanings is available in this Excel file, reporduced in this repo as field_names.xlsx.

MIT

## Project details

Uploaded source
Uploaded py3