ckanext-federated-index

Lightweight solution for storing and searching remote datasets locally and redirecting to the original portal when the dataset details page is opened.

This extension is similar to ckanext-harvest. The main differences are:

  • ckanext-harvest is a generic solution for harvesting data from any kind of source. ckanext-federated-index works only with CKAN instances.
  • ckanext-harvest uses background processes for harvesting. It is more sophisticated, customizable and flexible, but also more complex. ckanext-federated-index relies only on the CKAN API and can be triggered via HTTP requests, CLI commands or cron tasks without additional complexity.
  • ckanext-harvest creates local copies of remote datasets. This is more appropriate if you want to create references to these datasets, edit them or modify the local copies. ckanext-federated-index adds datasets to the search index but does not create real datasets locally, so you can search these datasets but cannot open them locally. Instead, you can use the original URL of the dataset and redirect the user to the original portal.

As a result, ckanext-federated-index works best if you are building a lightweight aggregator of data from multiple portals and do not need to provide any view or edit capabilities. ckanext-harvest is a better fit for any other case, as it basically allows you to do anything with remote datasets.

Requirements

Compatibility with core CKAN versions:

CKAN version    Compatible?
2.9             no
2.10            yes
2.11            yes

Installation

To install ckanext-federated-index:

  1. Install it via pip:

    pip install ckanext-federated-index
    
  2. Add federated-index to the ckan.plugins setting in your CKAN config file.

Usage

To index remote datasets, you need to configure one or more federation profiles. Each profile describes a remote portal and defines how its data is fetched and stored.

Each profile must have a unique name, and its configuration options are defined as ckanext.federated_index.profile.<PROFILE_NAME>.<OPTION>. For example, if you decide to index demo.ckan.org and want to use the name demo for it, add the following option to the config file:

ckanext.federated_index.profile.demo.url = https://demo.ckan.org

If, in addition to the URL, you want to specify an API token for requests to the remote portal:

ckanext.federated_index.profile.demo.url = https://demo.ckan.org
ckanext.federated_index.profile.demo.api_key = 123-abc

All available config options are listed in the Config settings section.
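To make the naming scheme concrete, here is a minimal sketch of how keys of the form ckanext.federated_index.profile.<PROFILE_NAME>.<OPTION> could be grouped by profile name. The collect_profiles helper is hypothetical and written only for illustration; it is not part of the extension, and the extension's own parsing may differ.

```python
from collections import defaultdict

PREFIX = "ckanext.federated_index.profile."


def collect_profiles(config: dict[str, str]) -> dict[str, dict[str, str]]:
    """Group profile options by profile name (hypothetical helper)."""
    profiles: dict[str, dict[str, str]] = defaultdict(dict)
    for key, value in config.items():
        if not key.startswith(PREFIX):
            continue
        # Everything after the prefix has the shape "<NAME>.<OPTION>".
        name, _, option = key[len(PREFIX):].partition(".")
        if name and option:
            profiles[name][option] = value
    return dict(profiles)


config = {
    "ckan.site_url": "http://localhost:5000",
    "ckanext.federated_index.profile.demo.url": "https://demo.ckan.org",
    "ckanext.federated_index.profile.demo.api_key": "123-abc",
}
profiles = collect_profiles(config)
print(profiles)
```

Options that do not start with the profile prefix are ignored, so one config file can describe any number of profiles next to the regular CKAN settings.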

Once a profile is configured, the only thing you need to do is refresh the portal data. If your local and remote portals have similar metadata schemas, it should work without any additional effort. If the schemas are different, check the Advanced usage section.

ckanapi action federated_index_profile_refresh profile=demo index=true
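Because the refresh is a regular API action, it can also be triggered over HTTP. The sketch below builds such a request with the standard library, assuming the usual CKAN action API layout (/api/3/action/<action name>, JSON body, token in the Authorization header). The local site URL and the token are placeholders, and build_refresh_request is a hypothetical helper, not something the extension provides.

```python
import json
from typing import Any, Optional
from urllib.request import Request


def build_refresh_request(
    site_url: str,
    profile: str,
    api_token: Optional[str] = None,
    **params: Any,
) -> Request:
    """Prepare (but do not send) a POST request for the refresh action."""
    payload = {"profile": profile, "index": True, **params}
    headers = {"Content-Type": "application/json"}
    if api_token:
        # Token of a local user who is allowed to call the action (placeholder).
        headers["Authorization"] = api_token
    return Request(
        f"{site_url.rstrip('/')}/api/3/action/federated_index_profile_refresh",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_refresh_request(
    "http://localhost:5000", "demo", api_token="local-token"
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would send it to the local portal, for example from a cron job.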

Advanced usage

Align metadata schemas

Usually, when the remote portal is heavily customized and defines a lot of custom metadata fields, the easiest option is to drop all fields that are not defined in the local metadata schema. This can be done via a config option:

ckanext.federated_index.align_with_local_schema = true

If that is not enough, you can hook into the indexing process and alter the dataset dictionary before it is sent to the search index. For this purpose you can use the IFederatedIndex interface:

from __future__ import annotations

from typing import Any

import ckan.plugins as p
from ckanext.federated_index.interfaces import IFederatedIndex
from ckanext.federated_index.shared import Profile


class CustomFederatedIndexPlugin(p.SingletonPlugin):
    # inherit=True keeps the default behaviour of every interface method
    # that is not overridden here
    p.implements(IFederatedIndex, inherit=True)

    def federated_index_before_index(
        self,
        pkg_dict: dict[str, Any],
        profile: Profile,
    ) -> dict[str, Any]:
        # modify data. For example, remove all tags with vocabulary_id,
        # because local instance usually does not have same vocabulary IDs
        pkg_dict["tags"] = [
            t for t in pkg_dict.setdefault("tags", []) if "vocabulary_id" not in t
        ]

        # the returned dictionary is what gets sent to the search index
        return pkg_dict
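The tag filter from the hook can be tried outside CKAN as a plain function. The sketch below is a variation rather than a copy: it checks the value of vocabulary_id instead of the presence of the key, on the assumption that the remote API may report free tags with vocabulary_id set to null. The drop_vocabulary_tags name and the sample data are made up for illustration.

```python
from typing import Any


def drop_vocabulary_tags(pkg_dict: dict[str, Any]) -> dict[str, Any]:
    """Keep only free tags, i.e. tags without a vocabulary ID."""
    pkg_dict["tags"] = [
        tag
        for tag in pkg_dict.setdefault("tags", [])
        # A missing key and an explicit null are both treated as "free tag".
        if not tag.get("vocabulary_id")
    ]
    return pkg_dict


remote_package = {
    "name": "air-quality",
    "tags": [
        {"name": "air", "vocabulary_id": None},
        {"name": "pm10", "vocabulary_id": "remote-vocabulary-id"},
        {"name": "soil"},
    ],
}
cleaned = drop_vocabulary_tags(remote_package)
print(cleaned["tags"])
```

A package without any tags key comes back with an empty tags list, which mirrors the setdefault call in the hook.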

Fetch datasets that are newer than the newest locally-indexed dataset

On the initial synchronization you often need to pull all the datasets from the remote portal. After that, you are only interested in datasets whose metadata_modified value is greater than the newest metadata_modified among the already synchronized datasets. In other words, you want to fetch only updated datasets to speed up the process. To achieve this, add the since_last_refresh flag to the action that refreshes the index:

ckanapi action federated_index_profile_refresh profile=demo index=true since_last_refresh=true
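The idea behind the flag can be sketched as follows: find the newest metadata_modified among the stored packages and turn it into a Solr range filter for package_search. This is only an illustration of the concept under stated assumptions; the function names are hypothetical, timestamps are assumed to be naive UTC values in ISO format, and the extension may build its filter differently.

```python
from datetime import datetime
from typing import Any, Iterable, Optional


def newest_modified(packages: Iterable[dict[str, Any]]) -> Optional[datetime]:
    """Return the latest metadata_modified value, or None if there is none."""
    stamps = [
        datetime.fromisoformat(pkg["metadata_modified"])
        for pkg in packages
        if pkg.get("metadata_modified")
    ]
    return max(stamps, default=None)


def incremental_filter(packages: Iterable[dict[str, Any]]) -> dict[str, str]:
    """Build a package_search filter that skips already known, unchanged data."""
    newest = newest_modified(packages)
    if newest is None:
        # Nothing stored yet: no filter, so everything is fetched.
        return {}
    # Inclusive lower bound, so the newest known dataset is fetched again.
    return {"fq": f"metadata_modified:[{newest.isoformat()}Z TO *]"}


stored = [
    {"name": "old", "metadata_modified": "2024-01-05T08:00:00"},
    {"name": "new", "metadata_modified": "2024-03-01T12:30:00"},
]
payload = incremental_filter(stored)
print(payload)
```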

Configure remote data fetch process

ckanext-federated-index fetches data from the remote portal using the package_search API action with default parameters. If you want to increase the number of packages fetched per request, or filter out certain datasets using q/fq, you can add a search configuration to the profile via settings:

ckanext.federated_index.profile.demo.extras = {"search_payload": {"rows": 100, "q": "test"}}

The extras option of the profile contains a valid JSON object with additional settings for the profile. search_payload specifies the default parameters used for package_search.
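To show why rows matters, here is a rough sketch of paging through package_search results with a search payload taken from extras. It relies on the start and rows parameters and the count and results keys of the package_search response; the iter_packages helper and the in-memory fake_search stand-in for the remote portal are hypothetical.

```python
import json
from typing import Any, Callable, Iterator

Search = Callable[[dict[str, Any]], dict[str, Any]]


def iter_packages(
    search: Search, search_payload: dict[str, Any]
) -> Iterator[dict[str, Any]]:
    """Yield every package matching the payload, one page at a time."""
    payload = dict(search_payload)
    payload.setdefault("rows", 10)
    payload.setdefault("start", 0)
    while True:
        result = search(payload)
        if not result["results"]:
            break
        yield from result["results"]
        # Advance by what was actually returned: the remote may cap `rows`.
        payload["start"] += len(result["results"])
        if payload["start"] >= result["count"]:
            break


DATASETS = [{"name": f"dataset-{i}"} for i in range(5)]


def fake_search(payload: dict[str, Any]) -> dict[str, Any]:
    """In-memory stand-in for the remote package_search action."""
    start, rows = payload["start"], payload["rows"]
    return {"count": len(DATASETS), "results": DATASETS[start:start + rows]}


extras = json.loads('{"search_payload": {"rows": 2}}')
names = [pkg["name"] for pkg in iter_packages(fake_search, extras["search_payload"])]
print(names)
```

A larger rows value means fewer round trips to the remote portal for the same number of datasets.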

In addition, if you want to use a custom search payload only once, you can pass search_payload to the refresh action:

ckanapi action federated_index_profile_refresh profile=demo index=true search_payload='{"q": "test"}'

Configure storage for remote data

By default, remote data is stored in a separate DB table. This allows you to pull remote data once and then rebuild the index of remote packages multiple times without making additional requests to the remote portal. A DB table is the default because it uses space efficiently, allows fast access to the data and is available on every CKAN instance, as CKAN doesn't work without a DB.

Other storage types are available, and every federation profile can be configured to use a different storage type via the extras option:

ckanext.federated_index.profile.demo.extras = {"storage": {"type": "redis"}}

The type key is a required member of the storage object. Depending on the storage type, other keys may be supported as well. For example, filesystem storage allows you to specify the path where the data is stored:

ckanext.federated_index.profile.demo.extras = {"storage": {"type": "fs", "path": "/tmp/demo_profile"}}

Storage types:

  • db: default storage. Keeps data inside a custom table created via migration.
  • fs: keeps data as separate JSON files in the filesystem. By default, files are created under ckan.storage_path/federated_index/PROFILENAME. The path can be changed via the path option.
  • redis: keeps data inside Redis.
  • sqlite: keeps data in a separate SQLite DB for each profile. By default, the database is created at ckan.storage_path/federated_index/PROFILENAME.sqlite3.db. The path can be changed via the url option.
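As an illustration of what a storage does, the following sketch keeps each package as a separate JSON file, in the spirit of the fs type. The JsonFileStorage class and its add, scan and count methods are hypothetical and do not reflect the extension's actual storage interface.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Iterator


class JsonFileStorage:
    """Keep every package as <id>.json inside a single directory."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def add(self, pkg: dict[str, Any]) -> None:
        # Writing the same ID again replaces the previous copy.
        target = self.path / f"{pkg['id']}.json"
        target.write_text(json.dumps(pkg), encoding="utf-8")

    def scan(self) -> Iterator[dict[str, Any]]:
        # Sorted so that repeated index rebuilds see a stable order.
        for file in sorted(self.path.glob("*.json")):
            yield json.loads(file.read_text(encoding="utf-8"))

    def count(self) -> int:
        return sum(1 for _ in self.path.glob("*.json"))


with tempfile.TemporaryDirectory() as tmp:
    storage = JsonFileStorage(Path(tmp) / "demo_profile")
    storage.add({"id": "b", "title": "Second"})
    storage.add({"id": "a", "title": "First"})
    storage.add({"id": "a", "title": "First, updated"})
    stored = list(storage.scan())
    total = storage.count()

print(total, stored)
```

Because the data stays on disk between runs, the index can be rebuilt from scan() without contacting the remote portal again, which is the benefit described for the default DB storage as well.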

Config settings

# Remove from dataset any field that is not defined in the local dataset
# schema.
# (optional, default: false)
ckanext.federated_index.align_with_local_schema = false

# Redirect user to the original dataset URL when user opens federated dataset
# that is not recorded in local DB.
# (optional, default: true)
ckanext.federated_index.redirect_missing_federated_datasets = true

# Endpoints that are affected by `redirect_missing_federated_datasets` config
# option.
# (optional, default: dataset.read)
ckanext.federated_index.dataset_read_endpoints = dataset.read dataset.edit

# Name of the dataset extra field that holds original URL of the federated
# dataset.
# (optional, default: federated_index_remote_url)
ckanext.federated_index.index_url_field = federated_index_remote_url

# Name of the dataset extra field that holds federation profile name.
# (optional, default: federated_index_profile)
ckanext.federated_index.index_profile_field = federated_index_profile

# URL of the federation profile.
ckanext.federated_index.profile.<profile>.url = https://demo.ckan.org

# API Token for the federation profile.
ckanext.federated_index.profile.<profile>.api_key = 123-abc

# Extra configuration for federation profile. Must be a valid JSON object
# with the following keys:
#  * search_payload: payload sent to remote portal with
#    `package_search` API action when profile is refreshed
#  * storage: storage configuration for remote data. Requires `type`
#    parameter with one of the following values: redis, db, sqlite, fs.
ckanext.federated_index.profile.<profile>.extras = {"search_payload": {"rows": 100}, "storage": {"type": "fs"}}

# Request timeout for remote portal requests.
ckanext.federated_index.profile.<profile>.timeout = 5
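Since extras must hold a valid JSON object, it can be handy to check a value before putting it into the config file. The sketch below validates the two documented keys; parse_extras is a hypothetical helper, and the checks go no further than what is stated above (storage needs a type that is one of redis, db, sqlite or fs).

```python
import json
from typing import Any

STORAGE_TYPES = {"redis", "db", "sqlite", "fs"}


def parse_extras(raw: str) -> dict[str, Any]:
    """Parse the extras option and check the documented keys."""
    extras = json.loads(raw) if raw.strip() else {}
    if not isinstance(extras, dict):
        raise ValueError("extras must be a JSON object")

    search_payload = extras.get("search_payload", {})
    if not isinstance(search_payload, dict):
        raise ValueError("search_payload must be a JSON object")

    storage = extras.get("storage")
    if storage is not None:
        if not isinstance(storage, dict):
            raise ValueError("storage must be a JSON object")
        storage_type = storage.get("type")
        if not isinstance(storage_type, str) or storage_type not in STORAGE_TYPES:
            raise ValueError(f"storage.type must be one of {sorted(STORAGE_TYPES)}")
    return extras


extras = parse_extras('{"search_payload": {"rows": 100}, "storage": {"type": "fs"}}')
print(extras["storage"]["type"])

try:
    parse_extras('{"storage": {"path": "/tmp/demo_profile"}}')
except ValueError as err:
    error_message = str(err)
    print(error_message)
```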

Developer installation

To install ckanext-federated-index for development, activate your CKAN virtualenv and do:

git clone https://github.com/DataShades/ckanext-federated-index.git
cd ckanext-federated-index
pip install -e .

Tests

To run the tests, do:

pytest

License

AGPL
