Extracts facets from datasets to add to elasticsearch.

Reason this release was yanked:

Syntax issue with superseded check

Project description

CEDA Facet Scanner

May 2026 - Merge with Tag Scanner

The dependency package cci-tag-scanner has been merged into this repository, because keeping the two packages separate no longer serves a useful purpose. All required content will be migrated from the tag scanner into this package.

  • tagger.processDatasets
  • es_connection_kwargs/ElasticsearchConnection

CEDA Dependencies

Takes datasets and extracts facets from the files and file paths. The scanner maps each file path to the handler for that collection of files; for example, ESA CCI datasets are scanned by the cci handler. Handlers are per collection because each collection of datasets has different characteristics and needs to be treated differently.
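The path-to-handler mapping described above can be pictured with a minimal sketch. The names HANDLER_MAP and select_handler are hypothetical and are not the package's API; only the pairing of ESA CCI data with the cci handler comes from the text, and the /neodc/esacci prefix is borrowed from the example configuration further down this page.

```python
from pathlib import PurePosixPath

# Hypothetical registry: archive path prefix -> handler name.
# Assumption: the real package keeps its own mapping; this is illustration only.
HANDLER_MAP = {
    "/neodc/esacci": "cci",
}


def select_handler(path, handler_map=HANDLER_MAP):
    """Return the handler name for the longest matching path prefix, or None."""
    target = PurePosixPath(path)
    best_prefix, best_handler = None, None
    for prefix, handler in handler_map.items():
        prefix_path = PurePosixPath(prefix)
        # Compare whole path components so /neodc/esacci2 does not match /neodc/esacci.
        if target == prefix_path or prefix_path in target.parents:
            if best_prefix is None or len(prefix_path.parts) > len(best_prefix.parts):
                best_prefix, best_handler = prefix_path, handler
    return best_handler


print(select_handler("/neodc/esacci/cloud/data/file.nc"))
```

Choosing the longest matching prefix lets a more specific collection override a broader one if both are registered.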

ReadTheDocs Documentation

Installation

Create a Python virtual environment (Python 3 is required):

python -m venv venv
source venv/bin/activate

This package can be installed with the following:

git clone https://github.com/facet-scanner
cd facet-scanner
pip install -e .

NOTE: As of 22nd Jan 2025, the facet-scanner repository has been upgraded for use with Poetry version 2. This requires an additional requirements_fix.txt patch while a solution for Poetry dependencies in GitHub is worked on. The above installation MUST be supplemented with:

pip install -r requirements_fix.txt

This is a temporary fix and will be removed when Poetry is patched.

Running the code

facet_scanner (path_to_scan) [--conf <path_to_config.ini>]

Required:

Argument | Description
path_to_scan | File path in the archive to use as the basis of the scan. The scanner will take this path and retrieve all the files in the elasticsearch index at this point.

Optional:

Option | No. Arguments | Description
--conf | 1 | Allows you to set a different location for the config file. This defaults to ../conf/facet_scanner.ini relative to the script.
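As an illustration of the command line described above, here is a minimal argparse sketch. It is not the package's own entry point: build_parser and the script location /opt/facet_scanner/scripts/facet_scanner.py are assumptions made for the example. Only the positional path_to_scan, the --conf option, and the default of ../conf/facet_scanner.ini relative to the script come from the documentation.

```python
import argparse
from pathlib import Path


def build_parser(script_path):
    """Build a parser mirroring the documented options (a sketch, not the package's code)."""
    # Documented default: ../conf/facet_scanner.ini relative to the script.
    default_conf = Path(script_path).parent / ".." / "conf" / "facet_scanner.ini"
    parser = argparse.ArgumentParser(prog="facet_scanner")
    parser.add_argument(
        "path_to_scan",
        help="File path in the archive to use as the basis of the scan",
    )
    parser.add_argument(
        "--conf",
        default=str(default_conf),
        help="Location of the config file",
    )
    return parser


# Hypothetical script location, used only to resolve the default config path.
parser = build_parser("/opt/facet_scanner/scripts/facet_scanner.py")
args = parser.parse_args(["/neodc/esacci"])
print(args.path_to_scan, args.conf)
```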

Adding a new Collection

Adding a Handler Documentation

Facet Indexing with Rabbit

This repo also provides the code to read from the rabbit queue and process updates to the files index, adding facets.

Watched events:

  • DEPOSIT

Exposed Queue Consumer Classes:

  • rabbit_facet_indexer.queue_consumers.FacetScannerQueueConsumer

Configuration

Configuration is handled using a YAML file. The full configuration options are described in the rabbit_indexer repo.

This process also requires the environment variable JSON_TAGGER_ROOT. Set it to the json directory that contains the tagging JSON.
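A process that depends on JSON_TAGGER_ROOT can check it at start-up rather than failing later. The sketch below assumes nothing about how the package itself reads the variable; get_json_tagger_root is a hypothetical helper.

```python
import os
from pathlib import Path


def get_json_tagger_root(environ=os.environ):
    """Return JSON_TAGGER_ROOT as a Path, failing early if unset or not a directory."""
    value = environ.get("JSON_TAGGER_ROOT")
    if not value:
        raise RuntimeError("JSON_TAGGER_ROOT is not set")
    root = Path(value)
    if not root.is_dir():
        raise RuntimeError(f"JSON_TAGGER_ROOT is not a directory: {root}")
    return root
```

Passing the environment as a parameter keeps the check easy to exercise without modifying the real process environment.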

The required sections for the facet indexer are:

  • rabbit_server
  • indexer
  • logging
  • moles
  • elasticsearch
  • files_index
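The required sections listed above can be checked once the YAML has been parsed into a mapping. This sketch assumes the file has already been loaded (for example with a YAML library) and only inspects top-level keys; missing_sections is a hypothetical helper, not part of rabbit_indexer.

```python
# Section names taken from the list of required sections above.
REQUIRED_SECTIONS = (
    "rabbit_server",
    "indexer",
    "logging",
    "moles",
    "elasticsearch",
    "files_index",
)


def missing_sections(config):
    """Return the required top-level sections absent from a parsed config mapping."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# A deliberately incomplete config, standing in for a parsed YAML file.
partial_config = {"rabbit_server": {}, "indexer": {}, "logging": {"log_level": "info"}}
print(missing_sections(partial_config))
```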

An example YAML file (secrets noted by ***** ):

---
rabbit_server:
  name: "*****"
  user: "*****"
  password: "*****"
  vhost: "*****"
  source_exchange:
    name: deposit_logs
    type: fanout
  dest_exchange:
    name: fbi_fanout
    type: fanout
  queues:
    - name: elasticsearch_update_queue_opensearch_tags_test
      kwargs:
        auto_delete: false
    - name: elasticsearch_update_queue_opensearch_tags_test
      bind_kwargs:
        routing_key: opensearch.tagger.cci
indexer:
  queue_consumer_class: rabbit_facet_indexer.queue_consumers.FacetScannerQueueConsumer
  path_filter:
    paths:
      - /neodc/esacci
    filter_policy: 2
logging:
  log_level: info
moles:
  moles_obs_map_url: http://api.catalogue.ceda.ac.uk/api/v2/observations.json/?publicationState__in=citable,published,preview,removed&fields=publicationState,result_field,title,uuid
elasticsearch:
  es_api_key: "*****"
files_index:
  name: ceda-fbi
  calculate_md5: false
  scan_level: 2

Running

The indexer can be run using the helper script provided by the rabbit_indexer repo. This uses an entry script and parses the config file to run your selected queue_consumer_class:

rabbit_event_indexer --conf <path_to_configuration_file>

Download files

Source Distribution

cci_facet_scanner-0.8.6.tar.gz (55.2 kB)

Uploaded Source

Built Distribution

cci_facet_scanner-0.8.6-py3-none-any.whl (73.4 kB)

Uploaded Python 3

File details

Details for the file cci_facet_scanner-0.8.6.tar.gz.

File metadata

  • Download URL: cci_facet_scanner-0.8.6.tar.gz
  • Upload date:
  • Size: 55.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.9 Darwin/25.5.0

File hashes

Hashes for cci_facet_scanner-0.8.6.tar.gz
Algorithm Hash digest
SHA256 dd381a8b5422d17d037dfd0dc0dabe965c19db4912fd23245f4842d4893d0245
MD5 288e6ca8f23da41ce2e5cae3bd8dd553
BLAKE2b-256 4a6e7123629539be86caf79377a6adcfcb9034e2e5006ec0b6b9301aaf1b6062

File details

Details for the file cci_facet_scanner-0.8.6-py3-none-any.whl.

File metadata

  • Download URL: cci_facet_scanner-0.8.6-py3-none-any.whl
  • Upload date:
  • Size: 73.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.1.3 CPython/3.13.9 Darwin/25.5.0

File hashes

Hashes for cci_facet_scanner-0.8.6-py3-none-any.whl
Algorithm Hash digest
SHA256 16a442b9e90540c2fd4300008885248f8a4b66f27a3910df7efae5f54d10473f
MD5 c4ee5844c2bb377e7764075c992dcbc6
BLAKE2b-256 0888e566d10b11845b7c8ec1b3ea85371fa1b45ca75b6daeb475d48acd89811c
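The SHA256 digests listed above can be compared against a downloaded file using the standard library. This is a minimal sketch; sha256_of and matches_published_digest are hypothetical helper names, and the file path in the usage comment is a placeholder for wherever the download was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage (placeholder path; the digest is the SHA256 value published for the wheel):
# matches_published_digest(
#     "cci_facet_scanner-0.8.6-py3-none-any.whl",
#     "16a442b9e90540c2fd4300008885248f8a4b66f27a3910df7efae5f54d10473f",
# )
```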
