Pythonic wrapper for downloading data from CKAN databases.

ckanpy

Purpose

ckanpy aims to simplify the process of downloading datasets from CKAN databases. Existing CKAN Python packages (namely ckanapi) seem designed with system administrators in mind rather than data consumers. By contrast, this package is designed solely for data consumers.

There are two intended audiences:

  • Data engineers, who wish to create a wrapper around a specific CKAN Package, making it trivial for data analysts to then download from it.
  • Data analysts, who want to download data from a CKAN Package that does not yet have a Python wrapper, and who wish to hack together a simple script for their specific use case.

Dependencies

ckanpy only truly depends on pydantic and requests. All other dependencies may be removed in the future, for the sake of supply chain cybersecurity.

pydantic

pydantic is used for creating type-validated, easy-to-access schemas representing CKAN data structures.

requests

Certain HTTP requests do not seem possible through ckanapi, and are therefore made directly with the requests library.

ckanapi

ckanapi simplifies downloading most CKAN data. It may be removed in the future, since ckanapi is primarily a sysadmin package and ckanpy barely uses it.

pandas

pandas is used for parsing CSVs. This may be replaced in the future by an in-house solution.

numpy

numpy is used solely to access np.nan when cleaning None elements out of downloaded CSVs. It may be replaced in the future.

Use Cases

Downloading Tabular Data

The whole point of ckanpy is to facilitate downloading tabular data, be that through a SQL database or a CSV file.

  • download_sql(ckan_url, query) → download from a CKAN Resource using a SQL query. This is the preferred method of download, so that as much data cleaning may be done server-side as possible.
  • download_csv(url) → download a CSV. Unless the user wants to download literally all the data available, this option serves more as a fallback in case either:
    1. a given Resource lacks a Resource ID, and therefore cannot be SQL-queried, or
    2. the user, for whatever reason, cannot generate their desired SQL query, and therefore must filter the data client-side.
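To make the SQL route more concrete, here is a minimal sketch of the kind of request it corresponds to on the CKAN side. CKAN's DataStore API exposes a datastore_search_sql action that takes the query in a sql parameter. Whether download_sql builds its request this way is an assumption, and build_sql_url is a hypothetical helper, not part of ckanpy. No network request is sent.

```python
from urllib.parse import urlencode, urljoin


def build_sql_url(ckan_url: str, query: str) -> str:
    """Build the URL for CKAN's datastore_search_sql action (hypothetical helper)."""
    endpoint = urljoin(ckan_url, "api/3/action/datastore_search_sql")
    # urlencode percent-encodes the quotes and turns spaces into '+'
    return f"{endpoint}?{urlencode({'sql': query})}"


# In the DataStore, tables are named after Resource IDs
resource_id = "9f4fc839-122d-4595-a146-43bc4ed16f46"
query = f'SELECT "CollisionId" FROM "{resource_id}" LIMIT 5'

url = build_sql_url("https://data.ca.gov/", query)
print(url)
```

Filtering in the query itself (a WHERE clause, a LIMIT) is what keeps the cleaning server-side; the CSV route would instead fetch the whole file and filter afterward.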

Modeling CKAN Data Structures

To facilitate downloading tabular data, ckanpy creates type-validated models of CKAN data structures. Below are examples of each modeled data structure from the CCRS CKAN package:

  • Package → CKAN package, e.g. California Crash Reporting System
  • Resource → CKAN resource, e.g. Crashes_2021
  • ResourceCollection → group of CKAN Resources with a name pattern, e.g. r"Crashes_[0-9]+"
  • DatastoreField → Maps each column / field of a given Resource to its SQL type, e.g. {"NumberInjured": "numeric"}
  • DatastoreInfo → Collection of DatastoreFields pertaining to a given Resource
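As a rough illustration of how a ResourceCollection's name pattern might select Resources, the sketch below uses plain dataclasses as a stand-in for ckanpy's pydantic models. ResourceSketch, collect, the placeholder IDs, and the choice of a full-string match are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSketch:
    """Stand-in for a modeled CKAN Resource (hypothetical, not ckanpy's class)."""
    name: str
    resource_id: str


def collect(resources, name_pattern):
    """Return the resources whose full name matches the pattern."""
    pattern = re.compile(name_pattern)
    return [r for r in resources if pattern.fullmatch(r.name)]


resources = [
    ResourceSketch("Crashes_2021", "id-a"),  # placeholder IDs, not real ones
    ResourceSketch("Crashes_2022", "id-b"),
    ResourceSketch("Parties_2021", "id-c"),
]

crashes = collect(resources, r"Crashes_[0-9]+")
print([r.name for r in crashes])
# prints: ['Crashes_2021', 'Crashes_2022']
```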

Downloading Package info is easy:

from ckanpy import Package

package_ccrs = Package(
  ckan_url="https://data.ca.gov/",
  name_or_id="ccrs"
)
# Download occurs when Package.resources is accessed, and is cached afterward
print(package_ccrs.resources)

# Information about each Resource may then be easily accessed
# Note that resources are stored as a list; this is because Resource names,
# for whatever reason, are not necessarily unique
# (e.g. the sysadmin uploaded a test duplicate)
print(package_ccrs.resources[0].resource_id)

package_duplicate = Package(
  ckan_url="https://data.ca.gov/",
  name_or_id="ccrs"
)
# Package downloads are cached, meaning this second package triggered no superfluous downloads
print(package_duplicate.resources)

Utility

  • download_package_names(ckan_url) → Downloads list of Package names within a CKAN database. Though the CKAN web GUI is very useful, it does not seem easy to find the internal name of a given Package, so this function fills the gap.
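For context, CKAN's Action API offers a package_list action whose JSON response wraps the list of package names in a "success" / "result" envelope. The sketch below parses such a response; whether download_package_names relies on this action is an assumption, and parse_package_list and the sample body are made up for illustration.

```python
import json


def parse_package_list(body: str) -> list[str]:
    """Extract package names from a CKAN package_list response body."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported a failed request")
    return payload["result"]


# Sample body shaped like a CKAN Action API response (names are illustrative)
sample_body = '{"success": true, "result": ["ccrs", "example-package"]}'
print(parse_package_list(sample_body))
# prints: ['ccrs', 'example-package']
```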

Constructing SQL Statements

Although ckanpy allows the user to input custom queries, doing so is somewhat unwieldy, in large part due to tables being named after Resource IDs. As an alternative for users who wish to make SQL queries through a pythonic interface, ckanpy comes packaged with the following tools:

  • StatementAssembler → given inputs, it outputs a simple SQL query that SELECTs data from a single table and filters it with zero or more WHERE statements (see StatementWhere).
  • StatementWhere → given inputs, it outputs a WHERE statement. Based on the column type the user supplies (the actual type can vary over time), it automatically CASTs the column so that the comparison can take place (e.g. when filtering by longitude while longitude is stored as a string column, the column is CAST to numeric).

Examples with and without WHERE statements:

from uuid import UUID
from ckanpy import (
    StatementAssembler,
    StatementWhere,
)

ckan_url = "https://data.ca.gov/"
resource_id_crashes_2025 = UUID("9f4fc839-122d-4595-a146-43bc4ed16f46")
columns_to_select = ["CollisionId","City Name"]

# Without WHERE statements
assembler = StatementAssembler(
    column_names=columns_to_select,
    resource_id=resource_id_crashes_2025,
    ckan_url=ckan_url
)
print(assembler.assemble())
# prints:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46"



# With WHERE statements
where1 = StatementWhere(
    column_name="City Name",
    column_value="San Diego",
    operator="equals"
)
where2 = StatementWhere(
    column_name="Day Of Week",
    column_value="Monday",
    operator="equals"
)

assembler2 = StatementAssembler(
    column_names=columns_to_select,
    resource_id=resource_id_crashes_2025,
    ckan_url=ckan_url,
    where_statements=[
        where1,
        where2
    ]
)
print(assembler2.assemble())
# returns:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46" WHERE ("City Name" = 'San Diego') AND ("Day Of Week" = 'Monday')
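The CAST behaviour described for StatementWhere is not shown in the examples above, so here is a minimal sketch of how such a clause could be assembled. This is not ckanpy's implementation: where_clause, the cast_to parameter, the "less_than" operator name, and the "Longitude" column are assumptions for illustration; only "equals" appears in the examples above.

```python
# Hypothetical operator names; only "equals" is confirmed by the examples above
OPERATORS = {"equals": "=", "less_than": "<", "greater_than": ">"}


def where_clause(column_name, column_value, operator="equals", cast_to=None):
    """Render one parenthesized condition, optionally CASTing the column."""
    column = '"' + column_name.replace('"', '""') + '"'
    if cast_to is not None:
        # e.g. a longitude stored as text, compared as a number
        column = f"CAST({column} AS {cast_to})"
    if isinstance(column_value, str):
        value = "'" + column_value.replace("'", "''") + "'"
    else:
        value = str(column_value)
    return f"({column} {OPERATORS[operator]} {value})"


print(where_clause("City Name", "San Diego"))
# prints: ("City Name" = 'San Diego')
print(where_clause("Longitude", -117.0, operator="less_than", cast_to="numeric"))
# prints: (CAST("Longitude" AS numeric) < -117.0)
```

Several such clauses can then be joined with " AND " to produce the kind of WHERE section shown in the assembled statement above.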

Developing CKAN Package Wrappers

The main enterprise of ckanpy, however, is serving as a framework for creating Package-specific wrappers. In fact, I wrote ckanpy in order to write a wrapper for the CCRS package, pyccrs.

  • ResourceMapper → Maps ResourceCollections to named attributes. For example, pyccrs uses ResourceMapper to map a ResourceCollection for each CCRS table, namely Crashes, Parties, and InjuredWitnessPassenger (or "People", as I renamed it).
  • Downloader → Implements, in whatever way the Package requires, a download_records(**kwargs) → list[dict] method. Additional public download methods may be added, as is the case with pyccrs, but they should be derived from download_records.
  • Pydantic Table Models → Each table in the Resource should have a Pydantic model representing a given row of data. This serves as the parsing and validation layer.
  • ColumnNamesEnumStr → Maps Pythonic column names to the CKAN Resource's actual names. This mapping is necessary not just for cleanliness, but also for compatibility with pydantic, and for context-switching when a given representation is needed (e.g. the user provides Pythonic key names, which are then translated to the original names when constructing the SQL query).
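The column-name mapping can be pictured with a plain string-valued Enum. This is only a sketch of the idea, not ckanpy's ColumnNamesEnumStr: CrashColumns and to_ckan_names are hypothetical, and the member names are simply Pythonic spellings of the CCRS columns used earlier.

```python
from enum import Enum


class CrashColumns(str, Enum):
    """Pythonic member names mapped to the Resource's actual column names."""
    COLLISION_ID = "CollisionId"
    CITY_NAME = "City Name"
    DAY_OF_WEEK = "Day Of Week"


def to_ckan_names(record):
    """Translate Pythonic keys (e.g. 'city_name') to the original column names."""
    return {CrashColumns[key.upper()].value: value for key, value in record.items()}


filters = {"city_name": "San Diego", "day_of_week": "Monday"}
print(to_ckan_names(filters))
# prints: {'City Name': 'San Diego', 'Day Of Week': 'Monday'}
```

Keeping the mapping in one place means a wrapper can accept keyword-friendly names from the user and still emit the exact column names that the SQL query needs.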

Contributing

Though functional, there are still many ways this package can be improved. Feel free to look around for leftover TODOs, send a pull request suggesting a change, or reach out to me by email to discuss specific improvements. My dream is for ckanpy to help improve data quality for public datasets, enabling data analysts to focus on what they know best: analyzing the data!
