Tool to get and query US Census data
The Census
Want to work with US Census data? Look no further.
Getting started
View all datasets
If you're not sure which Census dataset you're interested in, the following code will take care of you:
from the_census import Census
Census.list_available_datasets()
This will present you with a pandas DataFrame listing all available datasets from the US Census API. (This includes only aggregate datasets, as the other types [of which there are very few] don't play nice with the client.)
Help with terminology
Some of the terms used in the data returned can be a bit opaque. To get a clearer sense of what some of those mean, run this:
Census.help()
This will print out links to documentation for various datasets, along with what their group/variable names mean, and how statistics were calculated.
Selecting a dataset
Before getting started, you need to get a Census API key and set the environment variable CENSUS_API_KEY to that key, either with
export CENSUS_API_KEY=<your key>
or in a .env file:
CENSUS_API_KEY=<your key>
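Before constructing a Census object, it can help to confirm that the key is actually visible to Python. Here is a minimal sketch using only the standard library; the helper name require_census_api_key is hypothetical and not part of the_census:

```python
import os


def require_census_api_key(env=None):
    """Return the value of CENSUS_API_KEY, or raise if it is missing or blank."""
    # Default to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get("CENSUS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "CENSUS_API_KEY is not set; export it or add it to a .env file"
        )
    return key
```

Note that values in a .env file only show up in os.environ once something has loaded that file into the process environment.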
Say you're interested in the American Community Survey 1-year estimates for 2019. Look up the dataset and survey name in the table provided by list_available_datasets, and execute the following code:
>>> from the_census import Census
>>> Census(year=2019, dataset="acs", survey="acs1")
<Census year=2019 dataset=acs survey=acs1>
The dataset object will now let you query any census data for the ACS 1-year estimates of 2019. We'll now dive into how to query this dataset with the tool. However, if you aren't familiar with dataset "architecture", check out the Dataset "architecture" section below.
Arguments to Census
This is the signature of Census:
class Census:
    def __init__(
        self,
        year: int,
        dataset: str = "acs",
        survey: str = "acs1",
        cache_dir: str = CACHE_DIR,  # default: cache
        should_load_from_existing_cache: bool = False,
        should_cache_on_disk: bool = False,
        replace_column_headers: bool = True,
        log_file: str = DEFAULT_LOG_FILE,  # default: census.log
    ):
        pass
- year: the year of the dataset
- dataset: type of the dataset, specified by list_available_datasets
- survey: type of the survey, specified by list_available_datasets
- cache_dir: if you opt in to on-disk caching (more on this below), the name of the directory in which to store cached data
- should_load_from_existing_cache: if you have cached data from a previous session, this will reload cached data into the Census object, instead of hitting the Census API when that data is queried
- should_cache_on_disk: whether or not to cache data on disk, to avoid repeat API calls. The following data will be cached:
  - Supported Geographies
  - Group codes
  - Variable codes
- replace_column_headers: whether or not to replace column header names for variables with more intelligible names instead of their codes
- log_file: name of the file in which to store logging information
A note on caching
While on-disk caching is optional, this tool, by design, performs in-memory caching. So a call to census.get_groups() will hit the Census API one time at most. All subsequent calls will retrieve the value cached in-memory.
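The "one time at most" behavior can be pictured with a toy example. This is only a sketch of the general memoization idea, not the package's actual implementation; CachedClient and fake_fetch are hypothetical names:

```python
class CachedClient:
    """Toy client that keeps fetched results in memory."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable standing in for an API request
        self._cache = {}     # maps a query name to its fetched result

    def get(self, name):
        # Only call the fetcher the first time a given name is requested.
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]


calls = []


def fake_fetch(name):
    """Stand-in for a network call; records each invocation."""
    calls.append(name)
    return [f"{name}-row"]


client = CachedClient(fake_fetch)
first = client.get("groups")   # triggers the (fake) request
second = client.get("groups")  # served from the in-memory cache
```

After both calls, calls contains a single entry, and first and second are the same object.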
Making queries
Supported geographies
Getting the supported geographies for a dataset is as simple as this:
census.get_supported_geographies()
This will output a DataFrame with all possible supported geographies (e.g., whether you can query all school districts across all states).
Supported geographies autocomplete
If you don't want to have to keep on typing supported geographies after this, you can use tab-completion in Jupyter by typing:
census.supported_geographies.<TAB>
Geography codes
If you decide you want to query a particular geography (e.g., a particular school district within a particular state), you'll need the FIPS codes for that school district and state.
So, if you're interested in all school districts in Colorado, here's what you'd do:
- Get FIPS codes for all states:
from the_census import GeoDomain
census.get_geography_codes(GeoDomain("state", "*"))
Or, if you don't want to import GeoDomain, and prefer to use tuples:
census.get_geography_codes(("state", "*"))
- Get FIPS codes for all school districts within Colorado (FIPS code 08):
census.get_geography_codes(GeoDomain("school district", "*"),
GeoDomain("state", "08"))
Or, if you don't want to import GeoDomain, and prefer to use tuples:
census.get_geography_codes(("school district", "*"),
("state", "08"))
Note that geography code queries must follow supported geography guidelines.
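For intuition, the first geography in such a call plays the role of the Census API's `for` parameter, and the enclosing geographies play the role of `in`. The sketch below shows one way name/code pairs could be turned into those parameters. Geo and to_query_params are hypothetical stand-ins, not the package's GeoDomain or its internals, and the example uses county rather than school district:

```python
from typing import NamedTuple


class Geo(NamedTuple):
    """Hypothetical stand-in for a (geography name, FIPS code or "*") pair."""
    name: str
    code: str = "*"


def to_query_params(target, *within):
    """Build Census-API-style `for`/`in` parameters from geography pairs."""
    # Accept either Geo instances or plain 2-tuples.
    target = Geo(*target)
    params = {"for": f"{target.name}:{target.code}"}
    if within:
        enclosing = [Geo(*pair) for pair in within]
        # Enclosing geographies go space-separated into a single `in` value.
        params["in"] = " ".join(f"{g.name}:{g.code}" for g in enclosing)
    return params


# All counties within Colorado (state FIPS code 08).
params = to_query_params(("county", "*"), ("state", "08"))
```

Because a NamedTuple is a tuple, passing Geo("state", "08") or ("state", "08") gives the same result, which mirrors the choice between GeoDomain and plain tuples above.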
Groups
Want to figure out what groups are available for your dataset? No problem. This will do the trick for ya:
census.get_groups()
...and you'll get a DataFrame with all groups for your census.
Searching groups
census.get_groups() will return a lot of data that might be difficult to slog through. In that case, run this:
census.search_groups(regex=r"my regex")
and you'll get a filtered DataFrame with matches to your regex.
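To make the filtering concrete, here is a rough standard-library sketch of a regex search over group descriptions. The sample rows, the code/description field names, and the case-insensitive matching are assumptions for illustration; the real method returns a pandas DataFrame and may match differently:

```python
import re

# Illustrative sample rows; a real call returns a pandas DataFrame instead.
groups = [
    {"code": "B01001", "description": "SEX BY AGE"},
    {"code": "B01002", "description": "MEDIAN AGE BY SEX"},
    {"code": "B19013", "description": "MEDIAN HOUSEHOLD INCOME IN THE PAST 12 MONTHS"},
]


def search(rows, regex):
    """Return the rows whose description matches the regex anywhere."""
    pattern = re.compile(regex, re.IGNORECASE)
    return [row for row in rows if pattern.search(row["description"])]


by_sex = search(groups, r"\bsex\b")    # groups that mention sex
medians = search(groups, r"^median")   # groups whose name starts with "median"
```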
Groups autocomplete
If you're working in a Jupyter notebook and have autocomplete enabled, typing census.groups. and then pressing Tab will trigger an autocomplete menu of possible groups by name (as opposed to by code, which has no inherent meaning on its own).
census.groups.SexByAge # code for this group
Variables
You can either get a DataFrame of variables based on a set of groups:
census.get_variables_by_group(census.groups.SexByAge,
census.groups.MedianAgeBySex)
Or, you can get a DataFrame with all variables for a given dataset:
census.get_all_variables()
This second operation can, however, take a lot of time.
Searching variables
Similar to groups, you can search variables by regex:
census.search_variables(r"my regex")
And, you can limit that search to variables of a particular group or groups:
census.search_variables(r"my regex", census.groups.SexByAge)
Variables autocomplete
Variables also support autocomplete for their codes, as with groups.
census.variables.EstimateTotal_B01001 # code for this variable
(These names must be suffixed with the group code, since, while variable codes are unique across groups, their names are not unique across groups.)
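The attribute names offered by autocomplete look like CamelCase versions of the group and variable names. Here is a sketch of how such a name could be derived and why the group-code suffix keeps variable names distinct; the function attribute_name is hypothetical, and the package's actual naming rules may differ:

```python
import re


def attribute_name(label, group_code=None):
    """Turn a label into a CamelCase identifier, optionally suffixed by group code."""
    # Keep only runs of letters and digits, dropping separators such as "!!" and ":".
    words = re.findall(r"[A-Za-z0-9]+", label)
    name = "".join(word[:1].upper() + word[1:].lower() for word in words)
    return f"{name}_{group_code}" if group_code else name


group_attr = attribute_name("SEX BY AGE")
variable_attr = attribute_name("Estimate!!:Total:", "B01001")
# The same variable name in a different group yields a different attribute.
other_attr = attribute_name("Estimate!!:Total:", "B01002")
```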
Statistics
Once you have the variables you want to query, along with the geography you're interested in, you can now make statistics queries from your dataset:
from the_census import GeoDomain
variables = census.get_variables_by_group(census.groups.SexByAge)
census.get_stats(variables["code"].tolist(),
GeoDomain("school district", "*"),
GeoDomain("state", "08"))
Or, if you'd rather use tuples instead of GeoDomain:
variables = census.get_variables_by_group(census.groups.SexByAge)
census.get_stats(variables["code"].tolist(),
("school district", "*"),
("state", "08"))
General notes on autocomplete
Jupyter notebook/lab has been having an issue with autocomplete lately (see this GitHub issue), so running the following in your environment should help you take advantage of the autocomplete offerings of this package:
pip install jedi==0.17.2
Dataset "architecture"
US Census datasets have 3 primary components:
Groups
A group is a "category" of data gathered for a particular census. For example, the SEX BY AGE group would provide breakdowns of gender and age demographics in a given region in the United States.
Some of these groups' names, however, are not as clear as SEX BY AGE. In that case, I recommend heading over to the technical documentation for the survey in question, which elaborates on what certain terms mean with respect to particular groups. Unfortunately, that documentation can be complicated to navigate, but if you're looking for ACS group documentation, here's a handy link.
(You can also get these links by running Census.help().)
Variables
Variables measure a particular data point. While they have their own codes, you might find variables which share the same name (e.g., Estimate!!:Total:). This is because each variable belongs to a group. So, the Estimate!!:Total: variable for the SEX BY AGE group is the total of all queried individuals in that group, but the Estimate!!:Total: variable for the POVERTY STATUS IN THE PAST 12 MONTHS BY AGE group is the total of queried individuals for that group. (It's important when calculating percentages that you work within the same group. So if I want the percent of men in the US, whose total number I got from SEX BY AGE, I should use the Estimate!!:Total: of that group as my denominator, and not the Estimate!!:Total: of the POVERTY STATUS group.)
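The denominator rule can be shown with a tiny worked example. The counts below are made up purely for illustration and are not real Census figures; the dictionary layout and the helper percent_of_group_total are likewise hypothetical:

```python
# Made-up counts keyed by (group, variable label); not real Census figures.
counts = {
    ("SEX BY AGE", "Total"): 1000,
    ("SEX BY AGE", "Male"): 490,
    ("POVERTY STATUS", "Total"): 970,
}


def percent_of_group_total(counts, group, label):
    """Percentage of a variable relative to the total of its own group."""
    # Numerator and denominator are both looked up under the same group.
    return 100 * counts[(group, label)] / counts[(group, "Total")]


pct_male = percent_of_group_total(counts, "SEX BY AGE", "Male")
```

Dividing by the POVERTY STATUS total instead would silently give a different, incorrect percentage, since the two groups count different sets of individuals.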
Variables on their own, however, do nothing. They mean something only when you query a particular geography for them.
Supported Geographies
Supported geographies dictate the kinds of queries you can make for a given census. For example, in the ACS-1, I might be interested in looking at stats across all school districts. The survey's supported geographies will tell me if I can actually do that; or, if I need to refine my query to look at school districts in a given state or smaller region.