Extracts facets from datasets to add to elasticsearch.
Reason this release was yanked:
Fatal bug with filtering
CEDA Facet Scanner
Takes datasets and extracts facets from the files and their file paths. The scanner maps a file path to the handler for that collection of files. For example, ESA CCI datasets are scanned by the cci handler. Handlers are collection-specific because each collection of datasets has different characteristics and needs to be treated differently.
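The path-to-handler mapping described above can be pictured as a lookup on path prefixes. The following is a minimal sketch, not the scanner's actual implementation: `HANDLER_MAP` and `find_handler` are hypothetical names, and the `/neodc/esacci` prefix is borrowed from the example configuration further down.

```python
from typing import Dict, Optional

# Hypothetical registry: archive path prefix -> handler name.
# Only the "cci" handler is named in this README; the prefix is an assumption.
HANDLER_MAP: Dict[str, str] = {
    "/neodc/esacci": "cci",
}


def find_handler(path: str, handler_map: Dict[str, str] = HANDLER_MAP) -> Optional[str]:
    """Return the handler registered for the longest prefix matching `path`."""
    matches = [
        prefix
        for prefix in handler_map
        # Match the prefix itself or anything below it, but not sibling
        # directories that merely share the same leading characters.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/")
    ]
    if not matches:
        return None
    return handler_map[max(matches, key=len)]


print(find_handler("/neodc/esacci/cloud/data/file.nc"))
print(find_handler("/badc/other/file.nc"))
```

Choosing the longest matching prefix lets a more specific collection override a broader one if both are registered.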
Installation
Create a Python virtual environment (Python 3 is required):
python -m venv venv
source venv/bin/activate
This package can be installed with the following:
git clone https://github.com/facet-scanner
cd facet-scanner
pip install -e .
NOTE: As of 22nd Jan 2025 the facet-scanner repository has been upgraded for use with Poetry version 2. This requires an additional requirements_fix.txt patch while a solution for Poetry dependencies in GitHub is worked on. The above installation MUST be supplemented with:
pip install -r requirements_fix.txt
This is a temporary fix and will be removed when poetry is patched.
Running the code
facet_scanner (path_to_scan) [--conf <path_to_config.ini>]
Required:
| Argument | Description |
|---|---|
| path_to_scan | File path in the archive to use as the basis of the scan. The scanner takes this path and retrieves all the files in the elasticsearch index under it. |
Optional:
| Option | No. Arguments | Description |
|---|---|---|
| --conf | 1 | Allows you to set a different location for the config file. This defaults to ../conf/facet_scanner.ini relative to the script. |
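For illustration, the command-line interface described in the tables above could be expressed with `argparse` roughly as follows. This is a sketch under assumptions, not the package's actual parser: `build_parser` is a hypothetical name, and the default is kept as the literal string from the table rather than being resolved relative to the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="facet_scanner")
    # Required positional argument: where in the archive to start scanning.
    parser.add_argument(
        "path_to_scan",
        help="File path in the archive to use as the basis of the scan.",
    )
    # Optional argument taking one value: alternative config file location.
    parser.add_argument(
        "--conf",
        default="../conf/facet_scanner.ini",
        help="Location of the config file.",
    )
    return parser


args = build_parser().parse_args(["/neodc/esacci", "--conf", "my_config.ini"])
print(args.path_to_scan, args.conf)
```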
Adding a new Collection
Adding a Handler Documentation
Facet Indexing with Rabbit
This repo also provides the code to read from the rabbit queue and process updates to the files index, adding facets.
Watched events:
- DEPOSIT
Exposed Queue Consumer Classes:
rabbit_facet_indexer.queue_consumers.FacetScannerQueueConsumer
Configuration
Configuration is handled using a YAML file. The full configuration options are described in the rabbit_indexer repo.
This process also requires the environment variable JSON_TAGGER_ROOT, which should be set to
the json directory that contains the tagging json.
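Because a missing JSON_TAGGER_ROOT is easy to overlook, it can help to check it before starting the indexer. A minimal sketch follows; `get_json_tagger_root` is a hypothetical helper and is not part of this package.

```python
import os
from pathlib import Path
from typing import Mapping


def get_json_tagger_root(environ: Mapping[str, str] = os.environ) -> Path:
    """Return JSON_TAGGER_ROOT as a Path, failing early if it is unusable."""
    value = environ.get("JSON_TAGGER_ROOT")
    if not value:
        raise RuntimeError("JSON_TAGGER_ROOT is not set")
    root = Path(value)
    if not root.is_dir():
        raise RuntimeError(f"JSON_TAGGER_ROOT is not a directory: {root}")
    return root
```

Calling `get_json_tagger_root()` at start-up turns a confusing failure later in processing into an immediate, explicit error.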
The required sections for the facet indexer are:
- rabbit_server
- indexer
- logging
- moles
- elasticsearch
- files_index
An example YAML file (secrets are shown as *****):
---
rabbit_server:
  name: "*****"
  user: "*****"
  password: "*****"
  vhost: "*****"
  source_exchange:
    name: deposit_logs
    type: fanout
  dest_exchange:
    name: fbi_fanout
    type: fanout
  queues:
    - name: elasticsearch_update_queue_opensearch_tags_test
      kwargs:
        auto_delete: false
    - name: elasticsearch_update_queue_opensearch_tags_test
      bind_kwargs:
        routing_key: opensearch.tagger.cci
indexer:
  queue_consumer_class: rabbit_facet_indexer.queue_consumers.FacetScannerQueueConsumer
  path_filter:
    paths:
      - /neodc/esacci
    filter_policy: 2
logging:
  log_level: info
moles:
  moles_obs_map_url: http://api.catalogue.ceda.ac.uk/api/v2/observations.json/?publicationState__in=citable,published,preview,removed&fields=publicationState,result_field,title,uuid
elasticsearch:
  es_api_key: "*****"
files_index:
  name: ceda-fbi
  calculate_md5: false
  scan_level: 2
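Before launching the indexer, a configuration can be checked against the list of required sections given above. The sketch below works on an already-parsed mapping (for example the result of PyYAML's `yaml.safe_load`, which is an assumption about how the file would be read); `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names.

```python
from typing import Any, List, Mapping

# Section names taken from the "required sections" list in this README.
REQUIRED_SECTIONS = (
    "rabbit_server",
    "indexer",
    "logging",
    "moles",
    "elasticsearch",
    "files_index",
)


def missing_sections(config: Mapping[str, Any]) -> List[str]:
    """Return the required top-level sections absent from `config`."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Example with a deliberately incomplete configuration.
partial = {"rabbit_server": {}, "indexer": {}}
print(missing_sections(partial))
```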
Running
The indexer can be run using the helper script provided by the rabbit_indexer repo. The script parses the config file and runs your selected queue_consumer_class:
rabbit_event_indexer --conf <path_to_configuration_file>
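The queue_consumer_class value is a dotted path, so running the selected class involves importing it by name. A common way to do this in Python is sketched below; this illustrates the general technique and is not necessarily how rabbit_indexer implements it. `load_class` is a hypothetical helper.

```python
import importlib


def load_class(dotted_path: str) -> type:
    """Import and return the class named by a 'package.module.ClassName' string."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got: {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the example runs anywhere.
# With this package installed, the configured value would be
# "rabbit_facet_indexer.queue_consumers.FacetScannerQueueConsumer".
ordered_dict_cls = load_class("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```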
File details
Details for the file cci_facet_scanner-0.4.0.tar.gz.
File metadata
- Download URL: cci_facet_scanner-0.4.0.tar.gz
- Upload date:
- Size: 18.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.12.8 Darwin/24.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b444cc24b04e09d68dda9ce410f3472b956c8ac95160707dcc51832fe84c793b |
| MD5 | 4e7b492894801fd1119229944a23e52c |
| BLAKE2b-256 | 3cc38aadd097b239217758341d3cec8cd0204cbb01ebe153cae989806138a9fc |
File details
Details for the file cci_facet_scanner-0.4.0-py3-none-any.whl.
File metadata
- Download URL: cci_facet_scanner-0.4.0-py3-none-any.whl
- Upload date:
- Size: 28.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.12.8 Darwin/24.4.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9629df0d1a1a6149b059db1e7f28d847f601d936c1c2b132e9e9c2c2bd819708 |
| MD5 | d78bf49a33a35228007b7713114c931f |
| BLAKE2b-256 | 83ad6640451420ea4c738717bcb1f4b065e50e6e011e749e81c1d89009ccdb49 |