Skip to main content

No project description provided

Project description

Tests

ckanext-federated-index

Lightweight solution for storing and searching remote datasets locally and redirecting to the original portal upon when dataset details page is opened.

Current extension is similar to ckanext-harvest. Main differences are:

  • ckanext-harvest is a generic solution for harvesting data from any kind of source. ckanext-federated-index works only with CKAN instances
  • ckanext-harvest uses background processes for harvesting. It's more sophisticated, customizable and flexible, but at the same time, it's more complex. ckanext-federated-index relies only on CKAN API and can be triggered via HTTP requests, CLI commands or cron-tasks without additional complexity.
  • ckanext-harvest creates copies of remote datasets locally. This is more appropriate if you want to create references to these dataasets, edit them or modify local copies. ckanext-federated-index adds datasets to search index, but does not create real datasets locally. So you can search these datasets, but cannot open them locally. Instead you can use original URL of the dataset and redirect user to the original portal.

As result, ckanext-federated-index works best if you are building lightweight aggregator of data from multiple portals, but do not provide any view or edit capabilities. ckanext-harvest suits better for any other case, as it basically allows you to do anything with remote datasets.

Requirements

Compatibility with core CKAN versions:

CKAN version Compatible?
2.9 no
2.10 yes
2.11 yes

Installation

To install ckanext-federated-index:

  1. Install it via pip:

    pip install ckanext-federated-index
    
  2. Add federated-index to the ckan.plugins setting in your CKAN config file.

Usage

To index remote datasets, you need to configure one or multiple federation profiles. Each profile describes the remote portal and defines, how its data is fetched and stored.

Each profile must have a unique name and its configuration options are defined as ckanext.federated_index.profile.<PROFILE_NAME>.<OPTION>. For example, if you decided to index demo.ckan.org and want to use name demo for it, you have to add the following option to the config file:

ckanext.federated_index.profile.demo.url = https://demo.ckan.org

If, in addition to URL you want to specify an API Token for requests:

ckanext.federated_index.profile.demo.url = https://demo.ckan.org
ckanext.federated_index.profile.demo.api_key = 123-abc

All available config options are mentioned in config settings section.

When profile is configured, the only thin you need to do is to refresh portal data. If you local and remote portals have similar metadata schemas, it should work without any additional efforts. If schemas are different, check advanced usage section.

ckanapi action federated_index_profile_refresh profile=demo index=true

Advanced usage

Align metadata schemas

Usually, when remote portal is heavily customized and defines a lot of custom metadata fields, the easiest option is to drop all fields that are not defined in the local metadata schema. It can be done via config option:

ckanext.federated_index.align_with_local_schema = true

It it's not enough, you can hook into indexation process and alter dataset dictionary before it's sent to search index. For this purpose you can use IFederatedIndex interface:

import ckan.plugins as p
from ckanext.federated_index.interfaces import IFederatedIndex
from ckanext.federated_index.shared import Profile

class CustomFederatedIndexPlugin(p.SingletonPlugin):
    p.implements(interfaces.IFederatedIndex, inherit=True)

    def federated_index_before_index(
        self,
        pkg_dict: dict[str, Any],
        profile: Profile,
    ) -> dict[str, Any]:

        # modify data. For example, remove all tags with vocabulary_id,
        # because local instance usually does not have same vocabulary IDs
        pkg_dict["tags"] = [
            t for t in pkg_dict.setdefault("tags", []) if "vocabulary_id" not in t
        ]

        return pkg_dict

Fetch datasets that are newer than the newest locally-indexed dataset

On initial synchronization you often need to pull all the datasets from remote portal. But after that you are only interested in datasets with metadata_modified value greater that the newest metadata_modified among synchronized datasets. Basically, you want to fetch only updated datasets to speed-up the process. To achieve it, add since_last_refresh flag to the action that refreshes the index:

ckanapi action federated_index_profile_refresh profile=demo index=true since_last_refresh=true

Configure remote data fetch process

ckanext-federated-index fetches data from the remote portal using package_search API action with default parameters. If you want to increase the number of packages fetched via single request, or filter out certains datasets using q/fq, you can add search configuration to profile via settings:

ckanext.federated_index.profile.demo.extras = {"search_payload": {"rows": 100, "q": "test"}}

extras option of the profile contains a valid JSON object with additional settings of the profile. search_payload specifies default parameters used for package_search.

In addition to it, if you want to use custom search payload only once, you can pass search_payload to the refresh action:

ckanapi action federated_index_profile_refresh profile=demo index=true search_payload='{"q": "test"}'

Configure storage for remote data

By default, remote data stored inside a separate DB table. It allows you to pull remote data once, re-build index of remote packages multiple times without making additional requests to the remote portal. DB table is chosen as default value because it can efficiently use space, allows fast access to the data and is available on every CKAN instance, as CKAN doesn't work without DB.

But there are other storage types and every federation profile can be configured to use a different storage type via extras option:

ckanext.federated_index.profile.demo.extras = {"storage": {"type": "redis"}}

type key is a required member of the storage object. Depending on the storage type, other keys can be supported as well. For example, filesystem storage allows you to specify path, where the data is stored:

ckanext.federated_index.profile.demo.extras = {"storage": {"type": "fs", "path": "/tmp/demo_profile"}}

Storage types:

  • db: default storage. Keeps data inside a custom table created via migration
  • fs: keeps data as separate JSON files in the filesystem. By default files created under ckan.storage_path/federated_index/PROFILENAME. Path can be changed via path option.
  • redis: keeps data inside Redis
  • sqlite: keeps data as separate SQLite DB for eachprofile. By default database is created at ckan.storage_path/federated_index/PROFILENAME.sqlite3.db. Path can be changed via url option.

Config settings

# Remove from dataset any field that is not defined in the local dataset
# schema.
# (optional, default: false)
ckanext.federated_index.align_with_local_schema = false

# Redirect user to the original dataset URL when user opens federated dataset
# that is not recorded in local DB.
# (optional, default: true)
ckanext.federated_index.redirect_missing_federated_datasets = true

# Endpoints that are affected by `redirect_missing_federated_datasets` config
# option.
# (optional, default: dataset.read)
ckanext.federated_index.dataset_read_endpoints = dataset.read dataset.edit

# Name of the dataset extra field that holds original URL of the federated
# dataset.
# (optional, default: federated_index_remote_url)
ckanext.federated_index.index_url_field = federated_index_remote_url

# Name of the dataset extra field that holds federation profile name.
# (optional, default: federated_index_profile)
ckanext.federated_index.index_profile_field = federated_index_profile

# URL of the federation profile.
ckanext.federated_index.profile.<profile>.url = https://demo.ckan.org

# API Token for the federation profile.
ckanext.federated_index.profile.<profile>.api_key = 123-abc

# Extra configuration for federation profile. Must be a valid JSON object
# with the following keys:
#  * search_payload: payload sent to remote portal with
#    `package_search` API action when profile is refreshed
#  * storage: storage configuration for remote data. Requires `type`
#    parameter with one of the following values: redis, db, sqlite, fs.
ckanext.federated_index.profile.<profile>.extras = {"search_payload": {"rows": 100}, "storage": {"type": "fs"}}

# Request timeout for remote portal requests.
ckanext.federated_index.profile.<profile>.timeout = 5

Developer installation

To install ckanext-federated-index for development, activate your CKAN virtualenv and do:

git clone https://github.com/DataShades/ckanext-federated-index.git
cd ckanext-federated-index
pip install -e.

Tests

To run the tests, do:

pytest

License

AGPL

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ckanext_federated_index-0.1.0.tar.gz (34.8 kB view details)

Uploaded Source

Built Distribution

ckanext_federated_index-0.1.0-py3-none-any.whl (37.7 kB view details)

Uploaded Python 3

File details

Details for the file ckanext_federated_index-0.1.0.tar.gz.

File metadata

File hashes

Hashes for ckanext_federated_index-0.1.0.tar.gz
Algorithm Hash digest
SHA256 5697a39d7863ee3776e026c7bfef640f7c726b244eb6e37a021fec616513edc8
MD5 9ee24cbad5502183c41980b8e817ca71
BLAKE2b-256 d1576019b1b678f2fe1202ce4c50c7282193ada9c86c93c664f4279e41be9d25

See more details on using hashes here.

File details

Details for the file ckanext_federated_index-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ckanext_federated_index-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ce2482c6f02f889e8e22b96e13167f391180206d584fcaf432b44e85dd9ff57b
MD5 adbad94b8951d14df3d3bbd84277de6b
BLAKE2b-256 0bd5c0e1b937af319e06b243a10791daaac74bab102101ca410b2008714d80a1

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page