Tool to get and query US Census data

The Census

Want to work with US Census data? Look no further.

Getting started

View all datasets

If you're not sure which Census dataset you're interested in, the following code will take care of you:

from the_census import Census

Census.list_available_datasets()

This will present you with a pandas DataFrame listing all available datasets from the US Census API. (This includes only aggregate datasets, as the other types [of which there are very few] don't play nicely with the client.)

Help with terminology

Some of the terms used in the data returned can be a bit opaque. To get a clearer sense of what some of those mean, run this:

Census.help()

This will print out links to documentation for various datasets, along with what their group/variable names mean, and how statistics were calculated.

Selecting a dataset

Before getting started, you need to get a Census API key and set the environment variable CENSUS_API_KEY to that key, either with

export CENSUS_API_KEY=<your key>

or in a .env file:

CENSUS_API_KEY=<your key>
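Before constructing a Census object, it can help to confirm that the key is actually visible to Python. The check below is a minimal sketch using only the standard library. The helper name read_census_api_key is hypothetical and not part of the_census, and it only inspects the process environment (it does not parse a .env file itself).

```python
import os


def read_census_api_key(environ=None):
    """Return the Census API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the_census.
    """
    env = os.environ if environ is None else environ
    key = env.get("CENSUS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CENSUS_API_KEY is not set; export it or add it to your .env file"
        )
    return key
```

Calling read_census_api_key() at the top of a notebook surfaces a missing key immediately, rather than at the first query.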

Say you're interested in the American Community Survey 1-year estimates for 2019. Look up the dataset and survey name in the table provided by list_available_datasets, and execute the following code:

>>> from the_census import Census
>>> census = Census(year=2019, dataset="acs", survey="acs1")
>>> census
<Census year=2019 dataset=acs survey=acs1>

This Census object lets you query any data from the 2019 ACS 1-year estimates. The examples below assume it is stored in a variable named census. We'll now dive into how to query this dataset with the tool. However, if you aren't familiar with dataset "architecture", check out the Dataset "architecture" section below.

Arguments to Census

This is the signature of Census:

class Census:
    def __init__(self,
                 year: int,
                 dataset: str = "acs",
                 survey: str = "acs1",
                 cache_dir: str = CACHE_DIR,        # cache
                 should_load_from_existing_cache: bool = False,
                 should_cache_on_disk: bool = False,
                 replace_column_headers: bool = True,
                 log_file: str = DEFAULT_LOG_FILE): # census.log
        pass
  • year: the year of the dataset
  • dataset: type of the dataset, specified by list_available_datasets
  • survey: type of the survey, specified by list_available_datasets
  • cache_dir: if you opt in to on-disk caching (more on this below), the name of the directory in which to store cached data
  • should_load_from_existing_cache: if you have cached data from a previous session, this will reload cached data into the Census object, instead of hitting the Census API when that data is queried
  • should_cache_on_disk: whether or not to cache data on disk, to avoid repeat API calls. The following data will be cached:
    • Supported Geographies
    • Group codes
    • Variable codes
  • replace_column_headers: whether or not to replace column header names for variables with more intelligible names instead of their codes
  • log_file: name of the file in which to store logging information

A note on caching

While on-disk caching is optional, this tool, by design, performs in-memory caching. So a call to census.get_groups() will hit the Census API one time at most. All subsequent calls will retrieve the value cached in-memory.
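To make the "one API call at most" behavior concrete, here is a minimal sketch of in-memory caching. It illustrates the general technique only and is not the_census's actual implementation; CachedFetcher and fake_fetch_groups are hypothetical names, and the "API" is a plain function standing in for a network request.

```python
class CachedFetcher:
    """Illustration of in-memory caching; not the_census's real code."""

    def __init__(self, fetch):
        self._fetch = fetch   # function that performs the expensive call
        self._cache = {}      # results keyed by the call's arguments
        self.api_calls = 0    # how many times the expensive call really ran

    def get(self, *args):
        if args not in self._cache:
            self.api_calls += 1
            self._cache[args] = self._fetch(*args)
        return self._cache[args]


def fake_fetch_groups(year):
    # Stand-in for a network request to the Census API.
    return [f"groups for {year}"]


fetcher = CachedFetcher(fake_fetch_groups)
first = fetcher.get(2019)
second = fetcher.get(2019)  # served from the cache, no second "API" call
```

The cache lives only as long as the object does, which is why on-disk caching is offered separately for reuse across sessions.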

Making queries

Supported geographies

Getting the supported geographies for a dataset is as simple as this:

census.get_supported_geographies()

This will output a DataFrame with all possible supported geographies (e.g., whether you can query all school districts across all states).

Supported geographies autocomplete

If you don't want to have to keep on typing supported geographies after this, you can use tab-completion in Jupyter by typing:

census.supported_geographies.<TAB>

Geography codes

If you decide you want to query a particular geography (e.g., a particular school district within a particular state), you'll need the FIPS codes for that school district and state.

So, if you're interested in all school districts in Colorado, here's what you'd do:

  1. Get FIPS codes for all states:
from the_census import GeoDomain

census.get_geography_codes(GeoDomain("state", "*"))

Or, if you don't want to import GeoDomain, and prefer to use tuples:

census.get_geography_codes(("state", "*"))

  2. Get FIPS codes for all school districts within Colorado (FIPS code 08):
census.get_geography_codes(GeoDomain("school district", "*"),
                           GeoDomain("state", "08"))

Or, if you don't want to import GeoDomain, and prefer to use tuples:

census.get_geography_codes(("school district", "*"),
                           ("state", "08"))

Note that geography code queries must follow supported geography guidelines.
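One practical detail: state FIPS codes are two-character strings, so Colorado is "08", not 8. If your codes come from a spreadsheet or an integer column, the leading zero may have been dropped. The helper below is a small sketch for restoring it; pad_state_fips is a hypothetical name, not part of the_census.

```python
def pad_state_fips(code):
    """Return a state FIPS code as a two-character, zero-padded string."""
    text = str(code).strip()
    if not (text.isascii() and text.isdigit()) or len(text) > 2:
        raise ValueError(f"not a valid state FIPS code: {code!r}")
    return text.zfill(2)


colorado = pad_state_fips(8)  # "08", ready to use in ("state", colorado)
```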

Groups

Want to figure out what groups are available for your dataset? No problem. This will do the trick for ya:

census.get_groups()

...and you'll get a DataFrame with all groups for your census.

Searching groups

census.get_groups() will return a lot of data that might be difficult to slog through. In that case, run this:

census.search_groups(regex=r"my regex")

and you'll get a filtered DataFrame with matches to your regex.
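If you'd like to try out a regex before running it against the full group list, the sketch below applies one to a few hand-typed rows using only the standard library. The row layout (a "code" and a "description" key) and the case-insensitive matching are assumptions made for illustration; the DataFrame returned by the_census may be shaped differently, and search_groups may match differently.

```python
import re

# Sample rows typed by hand for illustration; a real call returns many more.
groups = [
    {"code": "B01001", "description": "SEX BY AGE"},
    {"code": "B01002", "description": "MEDIAN AGE BY SEX"},
    {"code": "B19013", "description": "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS"},
]


def search_descriptions(rows, pattern):
    """Return rows whose description matches the regex, ignoring case."""
    compiled = re.compile(pattern, re.IGNORECASE)
    return [row for row in rows if compiled.search(row["description"])]


# \b keeps "age" from matching inside longer words.
matches = search_descriptions(groups, r"\bage\b")
codes = [row["code"] for row in matches]
```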

Groups autocomplete

If you're working in a Jupyter notebook and have autocomplete enabled, running census.groups., followed by a tab, will trigger an autocomplete menu for possible groups by their name (as opposed to their code, which doesn't have any inherent meaning in and of itself).

census.groups.SexByAge   # code for this group

Variables

You can either get a DataFrame of variables based on a set of groups:

census.get_variables_by_group(census.groups.SexByAge,
                              census.groups.MedianAgeBySex)

Or, you can get a DataFrame with all variables for a given dataset:

census.get_all_variables()

This second operation can, however, take a long time.

Searching variables

Similar to groups, you can search variables by regex:

census.search_variables(r"my regex")

And, you can limit that search to variables of a particular group or groups:

census.search_variables(r"my regex", census.groups.SexByAge)

Variables autocomplete

As with groups, variables support autocomplete by name, and each name resolves to the variable's code.

census.variables.EstimateTotal_B01001  # code for this variable

(These names must be suffixed with the group code: variable codes are unique across groups, but variable names are not.)
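To see why the suffix matters, here is a sketch of how a readable attribute name could be derived from a variable's label and its group code. The naming rule below is an assumption for illustration and not necessarily how the_census builds its names; attribute_name is a hypothetical helper, and B17020 is simply a second group code chosen as an example.

```python
import re


def attribute_name(label, group_code):
    """Build an identifier-friendly name from a variable label and group code."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    camel = "".join(word[:1].upper() + word[1:] for word in words)
    return f"{camel}_{group_code}"


# The same label appears in two different groups...
sex_by_age_total = attribute_name("Estimate!!:Total:", "B01001")
poverty_total = attribute_name("Estimate!!:Total:", "B17020")
# ...so only the group-code suffix tells the two names apart.
```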

Statistics

Once you have the variables you want to query, along with the geography you're interested in, you can now make statistics queries from your dataset:

from the_census import GeoDomain

variables = census.get_variables_by_group(census.groups.SexByAge)

census.get_stats(variables["code"].tolist(),
                 GeoDomain("school district", "*"),
                 GeoDomain("state", "08"))

Or, if you'd rather use tuples instead of GeoDomain:

variables = census.get_variables_by_group(census.groups.SexByAge)

census.get_stats(variables["code"].tolist(),
                 ("school district", "*"),
                 ("state", "08"))
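For context, the US Census API itself expresses this kind of query with get, for, and in URL parameters. The sketch below assembles such a URL by hand. It assumes the first geography maps to for and the second to in, which mirrors the argument order above but is not confirmed by this README; build_stats_url is a hypothetical helper, the API key is omitted, and county is used as the example geography.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def build_stats_url(year, dataset, survey, variable_codes, for_domain, in_domain=None):
    """Assemble a Census API query URL (illustrative; omits the API key)."""
    params = {
        "get": ",".join(variable_codes),
        "for": f"{for_domain[0]}:{for_domain[1]}",
    }
    if in_domain is not None:
        params["in"] = f"{in_domain[0]}:{in_domain[1]}"
    # Keep ":", "*" and "," readable instead of percent-encoding them.
    query = urlencode(params, safe=":*,")
    return f"{BASE_URL}/{year}/{dataset}/{survey}?{query}"


url = build_stats_url(2019, "acs", "acs1", ["NAME", "B01001_001E"],
                      ("county", "*"), ("state", "08"))
```

Seeing the raw URL can be handy when debugging a query that the API rejects, since the error usually refers to these parameters.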

General notes on autocomplete

Jupyter notebook/lab has been having an issue with autocomplete lately (see this GitHub issue), so running the following in your environment should help you take advantage of the autocomplete offerings of this package:

pip install jedi==0.17.2

Dataset "architecture"

US Census datasets have 3 primary components:

  1. Groups
  2. Variables
  3. Supported Geographies

Groups

A group is a "category" of data gathered for a particular census. For example, the SEX BY AGE group would provide breakdowns of gender and age demographics in a given region in the United States.

Some of these groups' names, however, are not as clear as SEX BY AGE. In that case, I recommend heading over to the technical documentation for the survey in question, which elaborates on what certain terms mean with respect to particular groups. Unfortunately, the above link might be complicated to navigate, but if you're looking for ACS group documentation, here's a handy link.

(You can also get these links by running Census.help().)

Variables

Variables measure a particular data point. While each has its own code, you might find variables that share the same name (e.g., Estimate!!:Total:). This is because each variable belongs to a group. So, the Estimate!!:Total: variable for the SEX BY AGE group is the total of all queried individuals in that group, while the Estimate!!:Total: variable for the POVERTY STATUS IN THE PAST 12 MONTHS BY AGE group is the total of queried individuals for that group. (It's important when calculating percentages that you work within the same group. So if I want the percentage of men in the US, whose total number I got from SEX BY AGE, I should use the Estimate!!:Total: of that group as my denominator, and not the Estimate!!:Total: of the POVERTY STATUS group.)
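A quick worked example shows how much the choice of denominator matters. The counts below are made up for illustration and are not real Census figures; the dictionary keys are hypothetical stand-ins for variable codes.

```python
# Made-up counts for illustration only; not real Census figures.
sex_by_age = {"total": 1000, "male": 490}
poverty_status = {"total": 950}  # a different universe of respondents


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Numerator and denominator from the same group: 49.0
right = percent(sex_by_age["male"], sex_by_age["total"])

# Denominator borrowed from another group: about 51.58, which is misleading
wrong = percent(sex_by_age["male"], poverty_status["total"])
```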

Variables on their own, however, do nothing. They mean something only when you query a particular geography for them.

Supported Geographies

Supported geographies dictate the kinds of queries you can make for a given census. For example, in the ACS-1, I might be interested in looking at stats across all school districts. The survey's supported geographies will tell me if I can actually do that; or, if I need to refine my query to look at school districts in a given state or smaller region.
