A helper library for exploring and fetching data from the U.S. Census Bureau API.
cendat: A Python Helper for the Census API
Introduction
cendat is a Python library designed to simplify the process of exploring and retrieving data from the U.S. Census Bureau’s API. It provides a high-level, intuitive workflow for discovering available datasets, filtering geographies and variables, and fetching data concurrently.
The library handles the complexities of the Census API’s structure, such as geographic hierarchies and inconsistent product naming, allowing you to focus on getting the data you need.
You can find regular cendat updates and musings on the developer blog.
Workflow
The library is designed around a simple, four-step “List -> Set -> Get -> Convert/Analyze” workflow:
- List: Use the list_* methods (list_products, list_geos, list_groups, list_variables) with patterns to explore what’s available and filter down to what you need.
- Set: Use the set_* methods (set_products, set_geos, set_groups, set_variables) to lock in your selections. You can call these methods without arguments to use the results from your last “List” call. The describe_groups method is especially helpful for variable selection in programs with many variables, like the ACS.
- Get: Call the get_data() method to build and execute all the necessary API calls. This method handles complex geographic requirements automatically and uses thread pooling for speed.
- Convert & Analyze: Use the to_polars() or to_pandas() methods on the response object to get your data in a ready-to-use DataFrame format. The response object also includes a tabulate() method for quick, Stata-like frequency tables.
Installation
You can install cendat using pip.
pip install cendat
The library has optional dependencies for converting the response data into pandas or polars DataFrames. You can install the support you need:
Install with pandas support
pip install cendat[pandas]
Install with geopandas support
pip install cendat[geopandas]
Install with polars support
pip install cendat[polars]
Install with all three
pip install cendat[all]
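Before calling a conversion method, it can help to confirm which optional backends are importable in the current environment. The sketch below uses only the standard library. The available_backends helper is hypothetical and is not part of cendat.

```python
from importlib.util import find_spec


def available_backends(names=("pandas", "geopandas", "polars")):
    """Map each optional package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


backends = available_backends()
for name, installed in backends.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If a backend shows as missing, installing the matching extra (for example, cendat[polars]) should make the corresponding conversion method usable.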
API Reference
CenDatHelper Class
This is the main class for building and executing queries.
__init__(self, years=None, key=None)
Initializes the helper object.
- years (int | list[int], optional): The year or years of interest. Can be a single integer or a list of integers. Defaults to None.
- key (str, optional): Your Census API key. Providing a key is recommended to avoid strict rate limits. Defaults to None.
set_years(self, years)
Sets the primary year or years for data queries.
- years (int | list[int]): The year or years to set.
load_key(self, key=None)
Loads a Census API key for authenticated requests.
- key (str, optional): The API key to load.
list_products(self, years=None, patterns=None, to_dicts=True, logic=all, match_in='title')
Lists available data products, filtered by year and search patterns.
- years (int | list[int], optional): Filters products available for the specified year(s). Defaults to the years set on the object.
- patterns (str | list[str], optional): Regex pattern(s) to search for within the product metadata.
- to_dicts (bool): If True (default), returns a list of dictionaries with full product details. If False, returns a list of product titles.
- logic (callable): The logic to use when multiple patterns are provided. Can be all (default) or any.
- match_in (str): The field to match patterns against. Can be 'title' (default) or 'desc'.
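To illustrate how the patterns and logic parameters interact, here is a minimal sketch of regex filtering with the built-in all and any functions. It is not cendat's implementation: the matches helper is hypothetical, the titles are made-up sample strings, and case-insensitive matching is an assumption.

```python
import re


def matches(text, patterns, logic=all):
    """Return True if `text` satisfies the regex patterns under `logic`."""
    if isinstance(patterns, str):
        patterns = [patterns]
    # Assumption: matching is case-insensitive; the library may differ.
    return logic(
        re.search(p, text, flags=re.IGNORECASE) is not None for p in patterns
    )


# Made-up titles, used only to exercise the helper.
titles = [
    "American Community Survey: 5-Year Estimates (acs/acs5)",
    "American Community Survey: 1-Year PUMS (acs/acs1/pums)",
    "Current Population Survey: Tobacco Use Supplement (cps/tobacco)",
]

# logic=all keeps titles matching every pattern (AND).
both = [t for t in titles if matches(t, [r"acs", r"pums\b"], logic=all)]
# logic=any keeps titles matching at least one pattern (OR).
either = [t for t in titles if matches(t, [r"pums\b", r"tobacco"], logic=any)]
```

Passing the built-ins themselves, rather than strings such as "and" or "or", keeps the combining rule a plain callable over booleans.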
set_products(self, titles=None)
Sets the active data products for the session and unsets any previously set variables, geos, and groups.
- titles (str | list[str], optional): The title or list of titles of the products to set. If None, it sets all products from the last list_products() call.
list_geos(self, to_dicts=False, patterns=None, logic=all)
Lists available geographies for the currently set products.
- to_dicts (bool): If True, returns a list of dictionaries with full geography details. If False (default), returns a list of unique summary level (sumlev) strings.
- patterns (str | list[str], optional): Regex pattern(s) to search for within the geography description.
- logic (callable): The logic to use when multiple patterns are provided. Can be all (default) or any.
set_geos(self, values=None, by='sumlev')
Sets the active geographies for the session.
- values (str | list[str], optional): The geography values to set. If None, sets all geos from the last list_geos() call.
- by (str): The key to use for matching values. Must be either 'sumlev' (default) or 'desc'.
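To make the two matching modes concrete, here is a small sketch of selecting geographies either by summary level or by description. The lookup table reuses the summary levels listed under include_geometry later in this reference. The select_geos helper is hypothetical and only mimics the idea, not cendat's internals.

```python
# Summary levels and descriptions as listed elsewhere in this document.
GEOS = {
    "020": "region",
    "030": "division",
    "040": "state",
    "050": "county",
    "060": "county subdivision",
    "140": "census tract",
    "150": "census block group",
    "160": "place",
}


def select_geos(values, by="sumlev"):
    """Return the summary-level codes matching `values` by code or description."""
    if by not in ("sumlev", "desc"):
        raise ValueError("by must be 'sumlev' or 'desc'")
    if isinstance(values, str):
        values = [values]
    wanted = set(values)
    if by == "sumlev":
        return [code for code in GEOS if code in wanted]
    return [code for code, desc in GEOS.items() if desc in wanted]


by_desc = select_geos("state", by="desc")
by_code = select_geos(["160", "050"])
```

Matching by description is convenient when exploring, while summary-level codes are unambiguous once you know them.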
list_groups(self, to_dicts=True, patterns=None, logic=all, match_in='description')
Lists available variable groups for the currently set products. Not all products have groups, in which case the resulting list will be empty.
- to_dicts (bool): If True (default), returns a list of dictionaries with full group details. If False, returns a list of unique group names.
- patterns (str | list[str], optional): Regex pattern(s) to search for within the group metadata.
- logic (callable): The logic to use when multiple patterns are provided. Can be all (default) or any.
- match_in (str): The field to match patterns against. Can be 'description' (default) or 'name'.
set_groups(self, names=None)
Sets the active variable groups for the session. If the call to set_groups results in a single group for each product vintage and all group variables are wanted, set_variables may be skipped.
- names (str | list[str], optional): The name or list of names of the groups to set. If None, sets all groups from the last list_groups() call.
describe_groups(self, groups=None)
Prints hierarchically nested group descriptions to facilitate variable selection.
- groups (str | list[str], optional): The name or list of names of the groups to describe. If None, describes all set groups or reports an error with instructions to set groups or use the groups parameter.
list_variables(self, to_dicts=True, patterns=None, logic=all, match_in='label', groups=None)
Lists available variables for the currently set products.
- to_dicts (bool): If True (default), returns a list of dictionaries with full variable details. If False, returns a list of unique variable names.
- patterns (str | list[str], optional): Regex pattern(s) to search for within the variable metadata.
- logic (callable): The logic to use when multiple patterns are provided. Can be all (default) or any.
- match_in (str): The field to match patterns against. Can be 'label' (default), 'name', or 'concept'.
- groups (str | list[str], optional): Variable groups to which the listing is limited. Groups provided here override any groups that are set. If None, the set groups are used.
set_variables(self, names=None)
Sets the active variables for the session. If exactly one group is set for each product vintage and all group variables are wanted, set_variables may be skipped. Doing so allows for more than the standard API max of 50 variables.
- names (str | list[str], optional): The name or list of names of the variables to set. If None, sets all variables from the last list_variables() call.
get_data(self, within='us', max_workers=100, timeout=30, preview_only=False, include_names=False, include_geoids=False, include_attributes=False, include_geometry=False, in_place=False)
Executes the API calls based on the set parameters and retrieves the data.
- within (str | dict | list[dict], optional): Defines the geographic scope of the query.
  - For aggregate data, this can be a dictionary filtering parent geographies (e.g., {'state': '06'} for California). A list of dictionaries can be provided to query multiple scopes.
  - For microdata, this must be a dictionary specifying the target geography and its codes (e.g., {'public use microdata area': ['7701', '7702']}).
  - Defaults to 'us' for nationwide data where applicable.
- max_workers (int, optional): The maximum number of concurrent threads to use for making API calls. For requests generating thousands of calls, it's wise to keep this value lower (e.g., < 100) to avoid server-side connection issues. Defaults to 100.
- timeout (int, optional): Request timeout in seconds for each API call. Defaults to 30.
- preview_only (bool): If True, builds the list of API calls but does not execute them. Useful for debugging. Defaults to False.
- include_names (bool): If True, includes the geography name (NAME) in the API request. This variable is a special keyword understood by the data endpoint but is not included in variables.json and is therefore not discoverable through list_variables(). Note that NAME requests for microdata products will be ignored (with a message). Defaults to False.
- include_geoids (bool): If True, includes the geography ID (GEO_ID) in the API request. This variable is a special keyword understood by the data endpoint but is not included in variables.json and is therefore not discoverable through list_variables(). Note that GEO_ID requests for microdata products will be ignored (with a message). Defaults to False.
- include_attributes (bool): If True, includes attributes associated with set variables (e.g., margins of error) in the API request if available. Defaults to False.
- include_geometry (bool): If True, concurrent queries are issued to the TIGERweb REST Services for eligible products and geographies. Defaults to False. Note that only aggregate data products and certain geographies (currently region: 020, division: 030, state: 040, county: 050, county subdivision: 060, census tract: 140, census block group: 150, and place: 160) are supported.
- in_place (bool): If True, data and geometries are not purged from the instantiated helper object's params and the method returns None. Defaults to False.
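The max_workers and timeout parameters describe a thread-pooled fan-out of many API calls. The sketch below shows that general pattern with the standard library's concurrent.futures. It is not cendat's code: fake_fetch is a stand-in that performs no network request, fetch_all is a hypothetical helper, and the example.invalid URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url, timeout=30):
    """Stand-in for an HTTP request; returns a header row and one data row."""
    state = url.rsplit(":", 1)[-1]
    return [["NAME", "state"], [f"State {state}", state]]


def fetch_all(urls, max_workers=100, timeout=30):
    """Run fake_fetch over all URLs concurrently and collect results by URL."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fake_fetch, url, timeout): url for url in urls}
        # as_completed yields each future as soon as its call finishes.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


# Placeholder URLs, one per state code; nothing is actually requested.
urls = [
    f"https://example.invalid/data?get=NAME&for=state:{code}"
    for code in ("06", "08", "48")
]
results = fetch_all(urls, max_workers=4)
```

Lowering max_workers caps how many requests are in flight at once, which is the lever the note above suggests for very large jobs.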
CenDatResponse Class
A container for the data returned by CenDatHelper.get_data().
to_polars(self, schema_overrides=None, concat=False, destring=False)
Converts the raw response data into a list of Polars DataFrames.
- schema_overrides (dict, optional): A dictionary mapping column names to Polars data types to override the inferred schema. Example: {'POP': pl.Int64}.
- concat (bool): If True, concatenates all resulting DataFrames into a single DataFrame. Defaults to False.
- destring (bool): If True, attempts to convert string representations of numbers into native numeric types. Defaults to False.
to_pandas(self, dtypes=None, concat=False, destring=False)
Converts the raw response data into a list of Pandas DataFrames.
- dtypes (dict, optional): A dictionary mapping column names to Pandas data types, which is passed to the .astype() method. Example: {'POP': 'int64'}.
- concat (bool): If True, concatenates all resulting DataFrames into a single DataFrame. Defaults to False.
- destring (bool): If True, attempts to convert string representations of numbers into native numeric types. Defaults to False.
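The destring option reflects the fact that API responses arrive as strings. As a rough illustration of what such a conversion involves, the sketch below casts numeric-looking strings to int or float. The destring_value and destring_rows helpers, the skip argument, and the sample row are all hypothetical, and the library's actual rules may differ.

```python
def destring_value(value):
    """Convert a numeric-looking string to int or float; leave the rest alone."""
    if not isinstance(value, str):
        return value
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def destring_rows(rows, skip=()):
    """Apply destring_value to every column except those named in `skip`."""
    return [
        {k: v if k in skip else destring_value(v) for k, v in row.items()}
        for row in rows
    ]


# Made-up sample row; values are illustrative only.
rows = [{"state": "06", "POP": "1000", "rate": "0.12", "NAME": "California"}]
clean = destring_rows(rows, skip={"state"})
```

In this sketch a naive cast would turn a code such as "06" into 6 and lose the leading zero, which is why identifier columns are skipped here.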
to_gpd(self, dtypes=None, destring=False, join_strategy='left')
Converts the raw response data into a single GeoPandas GeoDataFrame with geometries included.
- dtypes (dict, optional): A dictionary mapping column names to Pandas data types, which is passed to the .astype() method. Example: {'POP': 'int64'}.
- destring (bool): If True, attempts to convert string representations of numbers into native numeric types. Defaults to False.
- join_strategy (str): Determines how geometries are joined onto data. Can be 'left' (default) or 'outer'. Note that 'left' may result in data rows with no geometries. This can happen for data products with no directly matching TIGERweb map server, and generally should not be the case for ACS or Decennial (> 2010) products.
tabulate(self, *variables, strat_by=None, weight_var=None, weight_div=None, where=None, logic=all, digits=1)
Generates and prints a frequency table.
- *variables (str): One or more column names to include in the tabulation.
- strat_by (str, optional): A column name to stratify the results by. Percentages and cumulative stats will be calculated within each stratum. Defaults to None.
- weight_var (str, optional): The name of the column to use for weighting. If None, each row has a weight of 1. Defaults to None.
- weight_div (int, optional): A positive integer to divide the weight by, useful for pooled tabulations across multiple product vintages. weight_var must be provided if this is used. Defaults to None.
- where (str | list[str], optional): A string or list of strings representing conditions to filter the data before tabulation. Each condition should be in a format like "variable operator value" (e.g., "AGE > 30") or "variable1 / variable2 operator value" (e.g., "B17001_002E / B17001_001E < 0.01"). Defaults to None.
- logic (callable): The function to apply when multiple where conditions are provided. Use all for AND logic (default) or any for OR logic.
- digits (int): The number of decimal places to display for floating-point numbers in the output table. Defaults to 1.
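To show what weight_var, weight_div, and where mean for a frequency table, here is a minimal standard-library sketch of a weighted one-way tabulation. It is a simplification under stated assumptions, not tabulate() itself: weighted_freq is a hypothetical helper, where is a Python callable rather than a condition string, and the rows are made-up.

```python
from collections import defaultdict


def weighted_freq(rows, variable, weight_var=None, weight_div=None, where=None):
    """Return {value: (weighted count, percent)} for one column of dict rows."""
    if weight_div is not None and weight_var is None:
        raise ValueError("weight_var must be provided when weight_div is used")
    totals = defaultdict(float)
    for row in rows:
        if where is not None and not where(row):
            continue  # drop rows that fail the filter before counting
        weight = row[weight_var] if weight_var else 1
        totals[row[variable]] += weight / (weight_div or 1)
    grand_total = sum(totals.values())
    return {
        value: (count, round(100 * count / grand_total, 1))
        for value, count in sorted(totals.items())
    }


# Made-up microdata-style rows with a person weight column.
rows = [
    {"SEX": 1, "AGEP": 34, "PWGTP": 100},
    {"SEX": 2, "AGEP": 52, "PWGTP": 300},
    {"SEX": 1, "AGEP": 12, "PWGTP": 50},
    {"SEX": 2, "AGEP": 29, "PWGTP": 100},
]

adults = lambda r: r["AGEP"] > 17
table = weighted_freq(rows, "SEX", weight_var="PWGTP", where=adults)
pooled = weighted_freq(rows, "SEX", weight_var="PWGTP", weight_div=2, where=adults)
```

Dividing every weight by the same constant scales the counts but leaves the percentages unchanged, which is why weight_div suits pooling several vintages into an average-year estimate.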
Usage Examples
import os
from cendat import CenDatHelper
from dotenv import load_dotenv
load_dotenv()
# --- ACS PUMS ANALYSIS ---
# 1. Initialize and set up the query
cdh = CenDatHelper(years=[2022], key=os.getenv("CENSUS_API_KEY"))
cdh.list_products(patterns=r"acs/acs1/pums\b")
cdh.set_products()
cdh.set_geos(values="state", by="desc")
cdh.set_variables(names=["SEX", "AGEP", "ST", "PWGTP"])
# 2. Get data for two states
response = cdh.get_data(
    within={"state": ["06", "48"]},  # California and Texas
)
# 3. Create a stratified tabulation
print("Age Distribution by Sex, Stratified by State")
response.tabulate(
    "SEX", "AGEP",
    strat_by="ST",
    weight_var="PWGTP",
    where="AGEP > 17",  # Filter for adults
)
# 4. Convert to DataFrame for further analysis
df = response.to_polars(concat=True, destring=True)
print(df.head())
# --- ACS 5YR AGGREGATE ANALYSIS ---
cdh = CenDatHelper(key=os.getenv("CENSUS_API_KEY"))
cdh.list_products(years=[2023], patterns=r"acs/acs5\)")
cdh.set_products()
cdh.list_groups(patterns="sex by age")
cdh.set_groups("B17001")
cdh.describe_groups()
cdh.set_variables(["B17001_001E", "B17001_002E"])
cdh.set_geos(["160"])
response = cdh.get_data(
    include_names=True,
    include_attributes=True,
)
df = response.to_polars(concat=True, destring=True)
df.glimpse()
# --- ACS 5YR AGGREGATE ANALYSIS ---
cdh = CenDatHelper(key=os.getenv("CENSUS_API_KEY"))
cdh.list_products(years=[2023], patterns=r"/acs/acs5\)")
cdh.set_products()
cdh.set_variables("B01001_001E") # total population
cdh.set_geos("150")
response = cdh.get_data()
# how many block groups (summary level 150) have more than 10,000 people, by state
response.tabulate("state", where="B01001_001E > 10_000")
# how many people live in those block groups, by state
response.tabulate("state", weight_var="B01001_001E", where="B01001_001E > 10_000")
# --- CPS MICRODATA ANALYSIS ---
cdh = CenDatHelper(key=os.getenv("CENSUS_API_KEY"))
cdh.list_products(years=[2022, 2023], patterns="/cps/tobacco")
cdh.set_products()
cdh.list_groups()
cdh.set_variables(["PEA1", "PEA3", "PWNRWGT"])
cdh.set_geos("state", "desc")
response = cdh.get_data(within={"state": ["06", "48"]})
response.tabulate(
    "PEA1",
    "PEA3",
    strat_by="state",
    weight_var="PWNRWGT",
    weight_div=3,
)
# --- ACS ANALYSIS: see Colorado incorporated places with very low poverty across years ---
cdh = CenDatHelper(key=os.getenv("CENSUS_API_KEY"))
cdh.list_products(years=[2020, 2021, 2022, 2023], patterns=r"acs/acs5\)")
cdh.set_products()
cdh.list_groups(patterns="sex by age")
cdh.set_groups(["B17001"])
cdh.describe_groups()
cdh.set_geos(["160"])
response = cdh.get_data(
    include_names=True,
    within={"state": "08"},
)
response.tabulate(
    "NAME",
    "B17001_002E",
    "B17001_001E",
    where=[
        "B17001_001E > 1_000",
        "B17001_002E / B17001_001E < 0.01",
        "'CDP' not in NAME",
    ],
    weight_var="B17001_001E",
    strat_by="vintage",
)
# --- ACS ANALYSIS: get race group variables and geometry for regions in 2011 ---
cdh = CenDatHelper(key=os.getenv("CENSUS_API_KEY"))
cdh.list_products(years=[2011], patterns=r"acs/acs5\)")
cdh.set_products()
cdh.list_groups(patterns=r"^race")
cdh.set_groups(["B02001"])
cdh.describe_groups()
cdh.set_geos(["020"])
response = cdh.get_data(
    include_names=True,
    include_geometry=True,
)
gdf = response.to_gpd(destring=True, join_strategy="inner")
print(gdf)