Worker repo to execute the steps for CCI Opensearch Ingestion, namely facet and tag scanning and updates to the CEDA FBI


CCI Opensearch Worker repository


CEDA Dependencies


See release notes for change history

This package wraps the CCI Opensearch Workflow, which involves several independent packages with multiple dependencies. It primarily combines the CCI Tagger (cci-tag-scanner) and the Facet scanner (cci-facet-scanner) with elements of the CEDA FBS (ceda-fbs-cci) package to create the components of Opensearch records in Elasticsearch.

NOTE: When publishing a new tagged release of this package, please make sure to rebuild the corresponding Docker image in the CEDA GitLab repository cci_opensearch_base. That repository has a single build-image step that should be rerun (following the steps found there) to ensure any changes to this package are picked up by the OS worker deployment.

CCI Opensearch Workflow

1. Installation

This package can be cloned directly or used as a dependency in a pyproject file.

Set up a python virtual environment:

 $ python -m venv .venv
 $ source .venv/bin/activate
 $ pip install cci-os-worker
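
To confirm that the install succeeded, the installed version can be queried from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative and is not part of cci-os-worker.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "cci-os-worker") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    # Prints the version string when installed, otherwise a short hint.
    print(found if found else "cci-os-worker is not installed in this environment")
```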

NOTE: As of 22nd Jan 2025 the cci-os-worker repository has been upgraded for use with Poetry version 2. The temporary workaround of using a requirements_fix.txt file has been removed, as this package is now on PyPI.

1.1. Use in other packages

Poetry 1.8.5 and older

To use this package as a dependency of another package, add the following to your pyproject under [tool.poetry.dependencies]:

cci-os-worker = { git = "https://github.com/cedadev/cci-os-worker.git", tag="v0.3.1"}

Poetry 2.0.1 and later

As of 11th April 2025 this package is published on PyPI and is pip-installable. Packages using Poetry 2 or higher can therefore add cci-os-worker via Poetry at version 0.5.0 or higher:

poetry add cci-os-worker@^0.5.0

2. Usage

2.1 Find datasets

The set of files to operate over can be determined in two ways using the built-in scripts here, or by any other means. If the intention is to submit to a rabbit queue, however, this script is required, with the additional -R parameter to submit to a queue and the queue configuration given by a yaml file provided as --conf.

rescan_dir path/to/json/directory/ --extension nc -l 1 -o path/to/dataset/filelist.txt

NOTE: As of v0.5.0 this changed from fbi_rescan_dir to simply rescan_dir.

In the above command:

- -r enables a recursive search through the identified directories.
- -l sets the scan level. Scan level 1 finds all the JSON files and expands each dataset's path into a list.
- -o is the output file to which the list of datasets is written.
- --extension applies to the files identified and added to the output file. nc is the default value, so it is redundant here.
- --file-regex is an alternative to supplying just the extension: if a valid regex pattern can identify specific files, it can be supplied here.

This command can also be run for a known directory to expand into a list of datasets:

rescan_dir my/datasets/path/ -l 2 -o path/to/dataset/filelist.txt

In this case we specify -l as 2 since there are no JSON files involved. The --extension and --file-regex options can also be added here, but since nc is the default extension it has been omitted.
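
To make the idea of expanding a directory into a file list concrete, the following is a rough sketch of that kind of scan. It is not the rescan_dir implementation: the function names are hypothetical, and the one-path-per-line layout of the output file is an assumption rather than something stated above.

```python
import re
from pathlib import Path
from typing import List, Optional


def list_dataset_files(root: str, extension: str = "nc",
                       file_regex: Optional[str] = None) -> List[str]:
    """Recursively collect files under root matching a regex or an extension."""
    pattern = re.compile(file_regex) if file_regex else None
    matches = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if pattern is not None:
            # A regex, when given, takes the place of the extension filter.
            keep = pattern.search(path.name) is not None
        else:
            keep = path.suffix == f".{extension}"
        if keep:
            matches.append(str(path))
    return matches


def write_filelist(paths: List[str], output: str) -> None:
    """Write one path per line (assumed layout for the file list)."""
    Path(output).write_text("".join(f"{p}\n" for p in paths))
```

For real ingestion runs the rescan_dir script remains the tool to use; the sketch only illustrates how the extension and regex filters relate to each other.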

2.2 Run the facet scan workflow

The facet scanner workflow utilises both the facet and tag scanners to produce the set of facets under project.opensearch in the resulting opensearch records. This workflow can be run using the facetscan entrypoint script installed with this package.

The environment variable JSON_TAGGER_ROOT should be set, which should be the path to the top-level directory under which all JSON files are placed. These JSON files provide defaults and mappings to values placed in the opensearch records - supplementary material to aid facet scanning or replace found values.
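
Before starting a scan it can help to check that JSON_TAGGER_ROOT points somewhere sensible. The sketch below is an illustrative pre-flight check, not part of the package: `find_tagger_json` is a hypothetical helper, and it assumes the JSON files carry a .json suffix.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def find_tagger_json(env: Optional[Mapping[str, str]] = None) -> List[Path]:
    """Return the JSON files under JSON_TAGGER_ROOT, raising if it is unusable."""
    env = os.environ if env is None else env
    root = env.get("JSON_TAGGER_ROOT")
    if not root:
        raise RuntimeError("JSON_TAGGER_ROOT is not set")
    root_path = Path(root)
    if not root_path.is_dir():
        raise RuntimeError(f"JSON_TAGGER_ROOT is not a directory: {root}")
    # Assumption: the supplementary files use a .json suffix.
    return sorted(root_path.rglob("*.json"))
```

Passing a mapping as `env` makes the check easy to exercise without touching the real environment.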

As of v0.5.0 the two workflows (facet and FBI) have been combined into a single workflow that generates all portions of the Opensearch records. This can be run with the following command:

 $ cci_os_update path/to/dataset/filelist.txt path/to/config/file.yaml

(Note: Verbose flag -v can be added to the above command.)

The yaml file should look something like this:

elasticsearch:
  # Fill in with key value
  x-api-key: ""
facet_files_index:
  name: facet-index-staging
facet_files_test_index:
  name: facet-index-staging
ldap_configuration:
  hosts:
    - ldap://homer.esc.rl.ac.uk
    - ldap://marge.esc.rl.ac.uk
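
A configuration like the one above can be sanity-checked once it has been loaded into a dictionary. The following is a minimal sketch: `check_config` is a hypothetical helper, and treating every section shown in the example as required is an assumption, since this description does not say which keys are mandatory.

```python
from typing import Any, List, Mapping


def check_config(conf: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems found in a loaded config."""
    problems = []
    es = conf.get("elasticsearch") or {}
    if not es.get("x-api-key"):
        problems.append("elasticsearch.x-api-key is empty")
    # Assumption: both index sections need a name, as in the example file.
    for section in ("facet_files_index", "facet_files_test_index"):
        if not (conf.get(section) or {}).get("name"):
            problems.append(f"{section}.name is missing")
    hosts = (conf.get("ldap_configuration") or {}).get("hosts") or []
    if not hosts:
        problems.append("ldap_configuration.hosts is empty")
    return problems
```

The dictionary could come from any YAML parser, for example `yaml.safe_load` from PyYAML; an empty list returned by the check means none of these problems were found.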

Project details


Intended Audience
- Developers
License
- OSI Approved :: BSD License
Operating System
- OS Independent
Programming Language
- Python
- Python :: 3.8

Release history

- 0.8.1 (Oct 16, 2025)
- 0.8 (Sep 16, 2025)
- 0.7.2 (Jul 8, 2025)
- 0.7.1 (Jun 2, 2025), this version
- 0.7.0 (May 19, 2025)
- 0.6.0 (May 14, 2025)
- 0.5.4 (Apr 14, 2025)
- 0.5.3 (Apr 11, 2025)
- 0.5.2 (Apr 11, 2025)
- 0.5.1 (Apr 11, 2025)
- 0.5.0 (Apr 11, 2025)

Download files


Source Distribution

cci_os_worker-0.7.1.tar.gz (30.5 kB)

Uploaded Jun 2, 2025 Source

Built Distribution


cci_os_worker-0.7.1-py3-none-any.whl (37.2 kB)

Uploaded Jun 2, 2025 Python 3

File details

Details for the file cci_os_worker-0.7.1.tar.gz.

File metadata

Download URL: cci_os_worker-0.7.1.tar.gz
Upload date: Jun 2, 2025
Size: 30.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.12.8 Darwin/24.5.0

File hashes

Hashes for cci_os_worker-0.7.1.tar.gz
Algorithm	Hash digest
SHA256	`74753b0569bf4c301c2932839541e466b2d8860efcc9e4af1ed9b9fc9fc3e3d8`
MD5	`281c25a19f81fa4fdc0c7af4f63d8af7`
BLAKE2b-256	`71edf30e43ae973cb346f14fc8e6e6c7125661d7820bb88857b703f0967918ea`


File details

Details for the file cci_os_worker-0.7.1-py3-none-any.whl.

File metadata

Download URL: cci_os_worker-0.7.1-py3-none-any.whl
Upload date: Jun 2, 2025
Size: 37.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.12.8 Darwin/24.5.0

File hashes

Hashes for cci_os_worker-0.7.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`99c769d6e5f24944e64d5e96b95cc4722b679cb2de8dd1db7de34133c89c9c2c`
MD5	`0b24fcbc49ea5001af4dd73eb61b8d42`
BLAKE2b-256	`6c6e80cd916e946fd735257b8dfdd945e3bded55a42ef5e3054d2ee677b988fd`

