Pythonic wrapper for downloading data from CKAN databases.

ckanpy

Purpose

ckanpy aims to simplify downloading datasets from CKAN databases. Existing CKAN Python packages (namely ckanapi) seem designed with system administrators in mind rather than data consumers. By contrast, this package is designed solely for data consumers.

There are two intended audiences:

  • Data engineers, who wish to create a wrapper around a specific CKAN Package, making it trivial for data analysts to download from it.
  • Data analysts, who want to download data from a CKAN Package that does not yet have a Python wrapper, and wish to hack together a simple script for their specific use case.

Dependencies

ckanpy only truly depends on pydantic and requests. All other dependencies may be removed in the future, for the sake of supply chain cybersecurity.

pydantic

pydantic is used for creating type-validated, easy-to-access schemas representing CKAN data structures.

requests

Certain HTTP requests do not seem possible through ckanapi, and are made directly with requests instead.

ckanapi

ckanapi simplifies downloading most CKAN data. It may be removed in the future: ckanapi is primarily a sysadmin package, and ckanpy barely uses it.

pandas

pandas is used for parsing CSVs. This may be replaced in the future by an in-house solution.

numpy

numpy is used solely to access np.nan when cleaning downloaded CSVs of None elements. It may be replaced in the future.

Use Cases

Downloading Tabular Data

The whole point of ckanpy is to facilitate downloading tabular data, be that through a SQL database or a CSV file.

  • download_sql(ckan_url, query) → download from a CKAN Resource using a SQL query. This is the preferred method of download, so that as much data cleaning may be done server-side as possible.
  • download_csv(url) → download a CSV. Unless the user wants to download literally all the data available, this option serves more as a fallback in case either:
    1. a given Resource lacks a Resource ID, and therefore cannot be SQL-queried, or
    2. the user, for whatever reason, cannot generate their desired SQL query, and therefore must filter the data client-side.
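As a rough illustration of what a server-side SQL download involves, the sketch below builds the request URL for CKAN's datastore_search_sql action using only the standard library. It is not ckanpy's implementation, and it sends no request; build_sql_request_url is a hypothetical helper name, and the query is only an example.

```python
from urllib.parse import urlencode, urljoin


def build_sql_request_url(ckan_url: str, query: str) -> str:
    """Build the URL for CKAN's datastore_search_sql action (no request is sent)."""
    endpoint = urljoin(ckan_url, "api/3/action/datastore_search_sql")
    # urlencode percent-encodes the query so quotes and spaces survive transport.
    return f"{endpoint}?{urlencode({'sql': query})}"


# Resource ID borrowed from the Crashes_2025 example in this README.
resource_id = "9f4fc839-122d-4595-a146-43bc4ed16f46"
query = f'SELECT "CollisionId" FROM "{resource_id}" LIMIT 5'

url = build_sql_request_url("https://data.ca.gov/", query)
print(url)
```

Note that the table name in the query is the Resource ID itself, which is why a Resource without an ID cannot be queried this way.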

Modeling CKAN Data Structures

To facilitate downloading tabular data, ckanpy creates type-validated models of CKAN data structures. Below are examples of each modeled data structure from the CCRS CKAN package:

  • Package → CKAN package, e.g. California Crash Reporting System
  • Resource → CKAN resource, e.g. Crashes_2021
  • ResourceCollection → group of CKAN Resources with a name pattern, e.g. r"Crashes_[0-9]+"
  • DatastoreField → Maps each column / field of a given Resource to its SQL type, e.g. {"NumberInjured": "numeric"}
  • DatastoreInfo → Collection of DatastoreFields pertaining to a given Resource
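To make the ResourceCollection idea concrete, here is a minimal sketch of grouping Resources by a name pattern. It uses standard-library dataclasses rather than pydantic, and ResourceStub, collect_by_pattern, and the placeholder IDs are all hypothetical; ckanpy's actual models may look quite different.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceStub:
    """Hypothetical stand-in for a CKAN Resource (name and ID only)."""
    name: str
    resource_id: str


def collect_by_pattern(resources: list[ResourceStub], pattern: str) -> list[ResourceStub]:
    """Return the resources whose full name matches the regex, sorted by name."""
    compiled = re.compile(pattern)
    return sorted(
        (r for r in resources if compiled.fullmatch(r.name)),
        key=lambda r: r.name,
    )


# Illustrative names and made-up IDs, not real CCRS metadata.
resources = [
    ResourceStub("Crashes_2021", "id-a"),
    ResourceStub("Parties_2021", "id-b"),
    ResourceStub("Crashes_2022", "id-c"),
    ResourceStub("Crashes_readme", "id-d"),
]

crashes = collect_by_pattern(resources, r"Crashes_[0-9]+")
print([r.name for r in crashes])  # → ['Crashes_2021', 'Crashes_2022']
```

Using fullmatch rather than search keeps near-misses such as "Crashes_readme" out of the collection.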

Downloading Package info is easy:

from ckanpy import Package

package_ccrs = Package(
  ckan_url="https://data.ca.gov/",
  name_or_id="ccrs"
)
# Download occurs when Package.resources is first accessed, and is cached afterward
print(package_ccrs.resources)

# Information about each Resource may then be easily accessed
# Note that resources are stored as a list; this is because Resource names,
# for whatever reason, are not necessarily unique
# (e.g. the sysadmin uploaded a test duplicate)
print(package_ccrs.resources[0].resource_id)

package_duplicate = Package(
  ckan_url="https://data.ca.gov/",
  name_or_id="ccrs"
)
# Package downloads are cached, meaning this second package triggered no superfluous downloads
print(package_duplicate.resources)
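One way such lazy, shared caching could work is sketched below with functools.lru_cache. This is an assumption about the general technique, not ckanpy's actual mechanism; CachedPackage, fetch_resource_names, and the placeholder payload are hypothetical, and the download is simulated.

```python
from functools import lru_cache

FETCH_COUNT = 0  # counts simulated downloads, for demonstration only


@lru_cache(maxsize=None)
def fetch_resource_names(ckan_url: str, name_or_id: str) -> tuple[str, ...]:
    """Simulate one download per unique (ckan_url, name_or_id) pair."""
    global FETCH_COUNT
    FETCH_COUNT += 1
    return ("Crashes_2021", "Crashes_2022")  # placeholder payload


class CachedPackage:
    """Hypothetical stand-in for Package; not ckanpy's actual class."""

    def __init__(self, ckan_url: str, name_or_id: str) -> None:
        self.ckan_url = ckan_url
        self.name_or_id = name_or_id

    @property
    def resources(self) -> tuple[str, ...]:
        # Nothing is fetched until this property is first accessed.
        return fetch_resource_names(self.ckan_url, self.name_or_id)


first = CachedPackage("https://data.ca.gov/", "ccrs")
second = CachedPackage("https://data.ca.gov/", "ccrs")
first.resources
second.resources
print(FETCH_COUNT)  # → 1
```

Because the cache is keyed on the URL and package name rather than on the instance, the second object reuses the first object's result.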

Utility

  • download_package_names(ckan_url) → Downloads list of Package names within a CKAN database. Though the CKAN web GUI is very useful, it does not seem easy to find the internal name of a given Package, so this function fills the gap.

Constructing SQL Statements

Although ckanpy allows the user to input custom queries, doing so is somewhat unwieldy, in large part due to tables being named after Resource IDs. As an alternative for users who wish to make SQL queries through a pythonic interface, ckanpy comes packaged with the following tools:

  • StatementAssembler → given inputs, it outputs a simple SQL query, SELECT'ing data from a single table, and filtering with zero or more WHERE statements (see StatementWhere).
  • StatementWhere → given inputs, it outputs a WHERE statement. Because a column's stored type can vary over time, the user states what type the column should be treated as, and StatementWhere automatically CASTs the column so that the operation can take place (e.g. when filtering by longitude where longitude is stored as a string column, it is CAST to numeric first).

Examples with and without WHERE statements:

from uuid import UUID
from ckanpy import (
    StatementAssembler,
    StatementWhere,
)

ckan_url = "https://data.ca.gov/"
resource_id_crashes_2025 = UUID("9f4fc839-122d-4595-a146-43bc4ed16f46")
columns_to_select = ["CollisionId","City Name"]

# Without WHERE statements
assembler = StatementAssembler(
    column_names=columns_to_select,
    resource_id=resource_id_crashes_2025,
    ckan_url=ckan_url
)
print(assembler.assemble())
# returns:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46"



# With WHERE statements
where1 = StatementWhere(
    column_name="City Name",
    column_value="San Diego",
    operator="equals"
)
where2 = StatementWhere(
    column_name="Day Of Week",
    column_value="Monday",
    operator="equals"
)

assembler2 = StatementAssembler(
    column_names=columns_to_select,
    resource_id=resource_id_crashes_2025,
    ckan_url=ckan_url,
    where_statements=[
        where1,
        where2
    ]
)
print(assembler2.assemble())
# returns:
# SELECT "CollisionId", "City Name" FROM "9f4fc839-122d-4595-a146-43bc4ed16f46" WHERE ("City Name" = 'San Diego') AND ("Day Of Week" = 'Monday')
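The CAST behavior described for StatementWhere can be illustrated with a small standalone function. This is a sketch under stated assumptions, not ckanpy's code: build_where, its cast_to parameter, the "less_than" and "greater_than" operator names, the allowed type list, and the "Longitude" column are all hypothetical. Only the "equals" operator and the output shape come from the example above.

```python
from typing import Optional, Union

OPERATORS = {"equals": "=", "less_than": "<", "greater_than": ">"}
ALLOWED_CASTS = {"numeric", "text", "timestamp"}  # assumed allowlist


def build_where(
    column_name: str,
    column_value: Union[str, int, float],
    operator: str,
    cast_to: Optional[str] = None,
) -> str:
    """Render one parenthesized condition, optionally CASTing the column."""
    sql_operator = OPERATORS[operator]
    # Quote the identifier, doubling any embedded double quotes.
    column = '"' + column_name.replace('"', '""') + '"'
    if cast_to is not None:
        if cast_to not in ALLOWED_CASTS:
            raise ValueError(f"unsupported cast type: {cast_to}")
        column = f"CAST({column} AS {cast_to})"
    if isinstance(column_value, str):
        # Quote string literals, doubling any embedded single quotes.
        literal = "'" + column_value.replace("'", "''") + "'"
    else:
        literal = str(column_value)
    return f"({column} {sql_operator} {literal})"


conditions = [
    build_where("City Name", "San Diego", "equals"),
    build_where("Longitude", -117.0, "less_than", cast_to="numeric"),
]
print(" AND ".join(conditions))
# → ("City Name" = 'San Diego') AND (CAST("Longitude" AS numeric) < -117.0)
```

Restricting cast_to to an allowlist avoids interpolating arbitrary text into the query.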

Developing CKAN Package Wrappers

The main enterprise of ckanpy, however, is serving as a framework for creating Package-specific wrappers. In fact, I wrote ckanpy in order to write a wrapper for the CCRS package, pyccrs.

  • ResourceMapper → Maps ResourceCollections to named attributes. For example, pyccrs uses ResourceMapper to map a ResourceCollection for each CCRS table, namely Crashes, Parties, and InjuredWitnessPassenger (or "People", as I renamed it).
  • Downloader → Implements, however necessary, a download_records(**kwargs) → list[dict] method. Additional public download methods may be added, as is the case with pyccrs, but they should be derived from download_records.
  • Pydantic Table Models → Each table in the Resource should have a Pydantic model representing a given row of data. This serves as the parsing and validation layer.
  • ColumnNamesEnumStr → Maps Pythonic column names to the CKAN Resource's actual names. This mapping is necessary not just for cleanliness, but also for compatibility with pydantic, and for switching between representations as needed (e.g. the user provides Pythonic key names, which are then translated to the original names when constructing the SQL query).
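The column-name mapping in the last bullet can be sketched with a plain string-valued Enum. This is only an illustration of the idea; CrashColumns, to_ckan_names, and to_pythonic_record are hypothetical names, the record values are placeholders, and ckanpy's ColumnNamesEnumStr may be structured differently.

```python
from enum import Enum


class CrashColumns(str, Enum):
    """Hypothetical mapping: member name is Pythonic, value is the CKAN column name."""
    collision_id = "CollisionId"
    city_name = "City Name"
    day_of_week = "Day Of Week"


def to_ckan_names(pythonic_names: list[str]) -> list[str]:
    """Translate Pythonic names to CKAN column names, e.g. for building SQL."""
    return [CrashColumns[name].value for name in pythonic_names]


def to_pythonic_record(record: dict) -> dict:
    """Translate a downloaded row's keys from CKAN names to Pythonic names."""
    return {CrashColumns(key).name: value for key, value in record.items()}


print(to_ckan_names(["collision_id", "city_name"]))
# → ['CollisionId', 'City Name']

row = {"CollisionId": 1, "City Name": "San Diego"}  # placeholder values
print(to_pythonic_record(row))
# → {'collision_id': 1, 'city_name': 'San Diego'}
```

Keeping both directions in one Enum means the SQL layer and the validation layer cannot drift apart on column names.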

Contributing

Though functional, there are still many ways this package can be improved. Feel free to look around for leftover TODOs, send a pull request suggesting a change, or reach out to me by email to discuss specific improvements. My dream is for ckanpy to help improve data quality for public datasets, enabling data analysts to focus on what they know best: analyzing the data!
