Pythonic wrapper for downloading data from CKAN databases.
ckanpy
Purpose
ckanpy aims to simplify the process of downloading datasets from
CKAN databases. Existing CKAN python packages (namely ckanapi) seem
designed with system administrators in mind, not so much data consumers.
By contrast, this package is designed solely for data consumers.
There are two intended audiences:
- Data engineers, who wish to create a wrapper around a specific CKAN Package, making it trivial for data analysts to then download from it.
- Data analysts, who, seeking to download data from a CKAN Package that does not yet have a python wrapper, wish to hack together a simple script that satisfies their specific use-case.
Dependencies
ckanpy only truly depends on pydantic and requests. All other dependencies
may be removed in the future, for the sake of supply chain cybersecurity.
pydantic
pydantic is used for creating type-validated, easy-to-access schemas representing
CKAN data structures.
requests
Certain requests do not seem possible through ckanapi,
and are thus done instead through requests.
ckanapi
ckanapi simplifies downloading most CKAN data. In the future,
it may be removed: ckanapi is primarily a sysadmin package, and
ckanpy barely uses it.
pandas
pandas is used for parsing CSVs. This may be replaced in the
future by an in-house solution.
numpy
numpy is used solely to access np.nan, when cleaning
downloaded CSVs of None elements. May be replaced in the future.
Use Cases
Downloading Tabular Data
The whole point of ckanpy is to facilitate downloading
tabular data, be that through a SQL database or a CSV file.
- download_sql(ckan_url, query) → download from a CKAN Resource using a SQL query. This is the preferred method of download, so that as much data cleaning as possible may be done server-side.
- download_csv(url) → download a CSV. Unless the user wants to download literally all the data available, this option serves more as a fallback in case either:
  - a given Resource lacks a Resource ID, and therefore cannot be SQL-queried, or
  - the user, for whatever reason, cannot generate their desired SQL query, and therefore must filter the data client-side.
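For context, CKAN's DataStore extension exposes SQL queries through the `datastore_search_sql` API action. The sketch below only builds the request URL that such a download would hit. It is not ckanpy's implementation, and `build_sql_url` is a hypothetical helper name used for illustration.

```python
from urllib.parse import urlencode, urljoin


def build_sql_url(ckan_url: str, query: str) -> str:
    """Build the URL for CKAN's datastore_search_sql action (hypothetical helper)."""
    endpoint = urljoin(ckan_url, "api/3/action/datastore_search_sql")
    # urlencode percent-escapes the quotes and spaces in the SQL statement
    return f"{endpoint}?{urlencode({'sql': query})}"


url = build_sql_url(
    "https://data.ca.gov/",
    'SELECT "CollisionId" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46" LIMIT 5',
)
print(url)
```

Note that the table name in the query is the Resource ID, which is why a Resource without an ID cannot be queried this way.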
Modeling CKAN Data Structures
To facilitate downloading tabular data, ckanpy creates type-validated
models of CKAN data structures. Below are examples of each modeled
data structure from the CCRS CKAN package:
- Package → CKAN package, e.g. California Crash Reporting System
- Resource → CKAN resource, e.g. Crashes_2021
- ResourceCollection → group of CKAN Resources with a name pattern, e.g. r"Crashes_[0-9]+"
- DatastoreField → maps each column / field of a given Resource to its SQL type, e.g. {"NumberInjured": "numeric"}
- DatastoreInfo → collection of DatastoreFields pertaining to a given Resource
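To illustrate how a name pattern can group Resources, here is a minimal sketch. It uses standard-library dataclasses as a stand-in for the pydantic models, and the field names, the `filter` method, and the placeholder IDs are assumptions rather than ckanpy's actual API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    # Simplified stand-in; the real model is a pydantic schema
    name: str
    resource_id: str


@dataclass(frozen=True)
class ResourceCollection:
    name_pattern: str

    def filter(self, resources: list[Resource]) -> list[Resource]:
        # fullmatch keeps "Crashes_2021" but would reject "Crashes_2021_backup"
        return [r for r in resources if re.fullmatch(self.name_pattern, r.name)]


resources = [
    Resource("Crashes_2021", "id-1"),  # placeholder IDs, not real Resource IDs
    Resource("Crashes_2022", "id-2"),
    Resource("Parties_2021", "id-3"),
]
crashes = ResourceCollection(r"Crashes_[0-9]+").filter(resources)
print([r.name for r in crashes])
# prints: ['Crashes_2021', 'Crashes_2022']
```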
Downloading Package info is easy:
from ckanpy import Package
package_ccrs = Package(
ckan_url="https://data.ca.gov/",
name_or_id="ccrs"
)
# Download occurs when Package.resources is called, and is cached afterward
print(package_ccrs.resources)
# Information about each Resource may then be easily accessed
# Note that resources are stored as a list; this is because Resource names,
# for whatever reason, are not necessarily unique
# (e.g. the sysadmin uploaded a test duplicate)
print(package_ccrs.resources[0].resource_id)
package_duplicate = Package(
ckan_url="https://data.ca.gov/",
name_or_id="ccrs"
)
# Package downloads are cached, meaning this second package triggered no superfluous downloads
print(package_duplicate.resources)
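The caching behavior described above can be pictured with a small sketch. This is one possible approach using `functools.lru_cache` keyed on the URL and package name; how ckanpy actually caches is not stated here, and `fetch_resource_names` is a hypothetical function that simulates the download instead of calling CKAN.

```python
from functools import lru_cache

download_count = 0  # counts how many simulated downloads actually run


@lru_cache(maxsize=None)
def fetch_resource_names(ckan_url: str, name_or_id: str) -> tuple[str, ...]:
    """Simulated download; a real version would call the CKAN API."""
    global download_count
    download_count += 1
    return ("Crashes_2021", "Crashes_2022")


first = fetch_resource_names("https://data.ca.gov/", "ccrs")
second = fetch_resource_names("https://data.ca.gov/", "ccrs")
print(download_count)
# prints: 1
```

Because both calls share the same arguments, the second call is served from the cache and no second download runs.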
Utility
- download_package_names(ckan_url) → downloads the list of Package names within a CKAN database. Though the CKAN web GUI is very useful, it does not seem easy to find the internal name of a given Package, so this function fills the gap.
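For reference, CKAN's Action API offers a `package_list` action that returns package names inside a JSON envelope with `success` and `result` keys. The sketch below builds that URL and parses a made-up sample response. The helper names and the sample body are illustrative assumptions, not ckanpy's code or real server output.

```python
import json
from urllib.parse import urljoin


def package_list_url(ckan_url: str) -> str:
    """URL of CKAN's package_list action (hypothetical helper)."""
    return urljoin(ckan_url, "api/3/action/package_list")


def parse_package_names(response_body: str) -> list[str]:
    """Extract package names from a CKAN Action API response body."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    return payload["result"]


# Fabricated sample body for illustration only
sample_body = '{"success": true, "result": ["ccrs", "example-package"]}'
print(package_list_url("https://data.ca.gov/"))
print(parse_package_names(sample_body))
```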
Constructing SQL Statements
Although ckanpy allows the user to input custom queries, doing so is somewhat
unwieldy, in large part due to tables being named after Resource IDs. As an
alternative for users who wish to make SQL queries through a pythonic interface,
ckanpy comes packaged with the following tools:
- StatementAssembler → given inputs, it outputs a simple SQL query, SELECT'ing data from a single table, and filtering with zero or more WHERE statements (see StatementWhere).
- StatementWhere → given inputs, it outputs a WHERE statement. Depending on the inputted assumption of what type the column is (as it can vary over time), it automatically CASTs the column so that the operation may take place (e.g. filtering by longitude when the longitude is stored as a string column, so it is CAST as a numeric column instead).
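The CAST behavior can be sketched as follows. This is a simplified illustration, not the StatementWhere implementation: `where_clause`, its parameters, the raw SQL operators, the allow-lists, and the "Longitude" column are all assumptions made for the example.

```python
ALLOWED_OPERATORS = {"=", "<>", "<", ">", "<=", ">="}
ALLOWED_CASTS = {"numeric", "text", "timestamp"}


def quote_identifier(name: str) -> str:
    # Double any embedded double quotes, per standard SQL identifier quoting
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    # Double any embedded single quotes, per standard SQL string quoting
    return "'" + value.replace("'", "''") + "'"


def where_clause(column_name, column_value, operator="=", cast_to=None) -> str:
    """Build one parenthesized WHERE condition, optionally CASTing the column."""
    if operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    column = quote_identifier(column_name)
    if cast_to is not None:
        if cast_to not in ALLOWED_CASTS:
            raise ValueError(f"unsupported cast type: {cast_to}")
        column = f"CAST({column} AS {cast_to})"
    if isinstance(column_value, str):
        value = quote_literal(column_value)
    else:
        value = str(column_value)
    return f"({column} {operator} {value})"


print(where_clause("City Name", "San Diego"))
# prints: ("City Name" = 'San Diego')
print(where_clause("Longitude", -117.5, operator=">", cast_to="numeric"))
# prints: (CAST("Longitude" AS numeric) > -117.5)
```

Restricting operators and cast types to fixed sets is one way to keep user input out of the SQL keywords, leaving only quoted identifiers and literals as free text.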
Examples with and without WHERE statements:
from uuid import UUID
from ckanpy import (
StatementAssembler,
StatementWhere,
)
ckan_url = "https://data.ca.gov/"
resource_id_crashes_2025 = UUID("9f4fc839-122d-4595-a146-43bc4ed16f46")
columns_to_select = ["CollisionId","City Name"]
# Without WHERE statements
assembler = StatementAssembler(
column_names=columns_to_select,
resource_id=resource_id_crashes_2025,
ckan_url=ckan_url
)
print(assembler.assemble())
# returns:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46"
# With WHERE statements
where1 = StatementWhere(
column_name="City Name",
column_value="San Diego",
operator="equals"
)
where2 = StatementWhere(
column_name="Day Of Week",
column_value="Monday",
operator="equals"
)
assembler2 = StatementAssembler(
column_names=columns_to_select,
resource_id=resource_id_crashes_2025,
ckan_url=ckan_url,
where_statements=[
where1,
where2
]
)
print(assembler2.assemble())
# returns:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46" WHERE ("City Name" = 'San Diego') AND ("Day Of Week" = 'Monday')
Developing CKAN Package Wrappers
The main enterprise of ckanpy, however, is serving as a framework for
creating Package-specific wrappers. In fact, I wrote ckanpy in order
to write a wrapper for the CCRS package, pyccrs.
- ResourceMapper → maps ResourceCollections to named attributes. For example, pyccrs uses ResourceMapper to map a ResourceCollection for each CCRS table: Crashes, Parties, and InjuredWitnessPassenger (or "People", as I renamed it).
- Downloader → implements, however necessary, a download_records(**kwargs) → list[dict] method. Additional public download methods may be added, as is the case with pyccrs, but they should be derived from download_records.
- Pydantic Table Models → each table in the Resource should have a Pydantic model representing a given row of data. This serves as the parsing and validation layer.
- ColumnNames → EnumStr which maps pythonic column names to the CKAN Resource's actual names. This mapping is necessary not just for cleanliness, but also for compatibility with pydantic, and for facilitating context-switching when a given representation is needed (e.g. the user provides Pythonic key names, which are then translated to the original names when constructing the SQL query).
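As a rough picture of the ColumnNames idea, the sketch below uses a plain `str`-mixin `Enum` to translate between Pythonic keys and the Resource's original column names. The member names and the two helper functions are hypothetical, and a real wrapper such as pyccrs may structure this differently.

```python
from enum import Enum


class ColumnNames(str, Enum):
    # Member names are Pythonic; values are the Resource's actual column names
    COLLISION_ID = "CollisionId"
    CITY_NAME = "City Name"


def to_source_keys(record: dict) -> dict:
    """Translate Pythonic keys (e.g. 'city_name') to the original column names."""
    return {ColumnNames[key.upper()].value: value for key, value in record.items()}


def to_pythonic_keys(row: dict) -> dict:
    """Translate original column names back to Pythonic keys."""
    return {ColumnNames(key).name.lower(): value for key, value in row.items()}


print(to_source_keys({"city_name": "San Diego"}))
# prints: {'City Name': 'San Diego'}
print(to_pythonic_keys({"CollisionId": 1, "City Name": "San Diego"}))
# prints: {'collision_id': 1, 'city_name': 'San Diego'}
```

The first direction is what a SQL query builder would need, while the second suits rows coming back from CKAN before they reach a validation model.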
Contributing
Though functional, there are still many ways this package can be improved.
Feel free to look around for leftover TODOs, send a pull request suggesting
a change, or reach out to me by email to discuss specific improvements.
My dream is for ckanpy to help improve data quality for public datasets,
enabling data analysts to focus on what they know best: analyzing the data!