A set of utility scripts which transform/augment PDS registry metadata to support additional capabilities.
Registry Sweepers
This package provides supplementary metadata generation for registry documents, which is required for registry-api to function correctly, and for common user queries. Execution is idempotent and should be scheduled on a recurring basis.
Components
RepairKit
The repairkit sweeper applies idempotent transformations to targeted subsets of properties, for example ensuring that all properties expected to have array-like values are in fact arrays (as opposed to single-element arrays being flattened to strings during harvest). Documents are processed based on whether their ops:Provenance/ops:registry_sweepers_repairkit_version metadata value is up-to-date relative to the sweeper codebase.
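The array repair described above can be sketched as follows. This is a minimal illustration, not the sweeper's actual code: the function names, the integer version scheme, and the example field names are assumptions; only the version metadata key is taken from the text.

```python
from typing import Any

# Key named in the description above; comparing it as an integer is an assumption.
REPAIRKIT_VERSION_KEY = "ops:Provenance/ops:registry_sweepers_repairkit_version"


def needs_repair(doc: dict[str, Any], current_version: int) -> bool:
    """Return True if the document was last repaired by an older sweeper version."""
    return doc.get(REPAIRKIT_VERSION_KEY, 0) < current_version


def repair_document(doc: dict[str, Any], array_fields: set[str]) -> dict[str, Any]:
    """Return only the fields that must change; empty if the doc is already repaired."""
    updates: dict[str, Any] = {}
    for field in array_fields:
        # A flattened single-element array shows up as a bare scalar: wrap it.
        if field in doc and not isinstance(doc[field], list):
            updates[field] = [doc[field]]
    return updates


# Hypothetical document with one flattened field and one intact field.
doc = {"title": "Example", "ref_lid_target": ["urn:nasa:pds:x"]}
first_pass = repair_document(doc, {"title", "ref_lid_target"})
doc.update(first_pass)
second_pass = repair_document(doc, {"title", "ref_lid_target"})
```

Because the second pass yields no updates, applying the repair repeatedly leaves the document unchanged, which is the idempotence property the sweeper relies on.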
Provenance
The provenance sweeper generates metadata for linking each version-superseded product with the versioned product which supersedes it. The value of the successor is stored in the ops:Provenance/ops:superseded_by property. This property will not be set for the latest version of any product. All documents are processed, but db writes are optimised based on whether their ops:Provenance/ops:registry_sweepers_provenance_version metadata value is up-to-date relative to the sweeper codebase.
Ancestry
The ancestry sweeper generates membership metadata for each product, i.e. which bundle lidvids and which collection lidvids reference a given product. These values are stored in the properties ops:Provenance/ops:parent_bundle_identifier and ops:Provenance/ops:parent_collection_identifier, respectively. All bundles/collections are processed to populate a lookup table, but db writes are optimised based on whether their ops:Provenance/ops:registry_sweepers_ancestry_version metadata value is up-to-date relative to the sweeper codebase. Collection non-aggregate reference pages in registry-refs are skipped entirely if they are marked as up-to-date.
The ancestry sweeper accepts environment variables to tune performance, primarily trading increased runtime duration for reduced peak memory usage.
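The lookup-table idea can be illustrated with a small in-memory sketch. It assumes the bundle-to-collection and collection-to-product references have already been fetched into plain dictionaries; the function name and input shapes are hypothetical, and the real sweeper may organise this differently (for example, to limit memory use).

```python
from collections import defaultdict


def build_ancestry(
    bundle_to_collections: dict[str, list[str]],
    collection_to_products: dict[str, list[str]],
) -> tuple[dict[str, set[str]], dict[str, set[str]]]:
    """Return (parent_bundles, parent_collections), each keyed by member lidvid."""
    parent_bundles: dict[str, set[str]] = defaultdict(set)
    parent_collections: dict[str, set[str]] = defaultdict(set)

    # First pass: record which bundles reference each collection.
    for bundle, collections in bundle_to_collections.items():
        for collection in collections:
            parent_bundles[collection].add(bundle)

    # Second pass: products inherit the bundles of every collection they belong to.
    for collection, products in collection_to_products.items():
        for product in products:
            parent_collections[product].add(collection)
            parent_bundles[product] |= parent_bundles[collection]

    return parent_bundles, parent_collections


bundles, collections = build_ancestry(
    {"urn:nasa:pds:b::1.0": ["urn:nasa:pds:b:c::1.0"]},
    {"urn:nasa:pds:b:c::1.0": ["urn:nasa:pds:b:c:p1::1.0", "urn:nasa:pds:b:c:p2::1.0"]},
)
```

The two returned mappings correspond to the values that would be written to ops:Provenance/ops:parent_bundle_identifier and ops:Provenance/ops:parent_collection_identifier.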
Reindexer
The reindexer sweeper ensures that the registry index mappings are updated with all fields available in the registry-dd index, and then triggers reindexing of all products which have not yet been successfully processed. This ensures that all products are searchable on all fields, provided a field type mapping is defined in the registry-dd index at the time of processing.
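The mapping-update step amounts to a set difference between the fields known to the data dictionary and the fields already mapped in the registry index. The sketch below assumes both have been read into plain Python structures; the function names and example fields are hypothetical, and the request body simply follows the general `properties` shape of an OpenSearch mapping update.

```python
def find_missing_fields(dd_field_types: dict[str, str], mapped_fields: set[str]) -> dict[str, str]:
    """Return data-dictionary fields (name -> type) absent from the index mapping."""
    return {name: ftype for name, ftype in dd_field_types.items() if name not in mapped_fields}


def build_mapping_update(missing: dict[str, str]) -> dict:
    """Build a mapping-update body that declares a type for each missing field."""
    return {"properties": {name: {"type": ftype} for name, ftype in missing.items()}}


# Hypothetical inputs: types from registry-dd, and fields already mapped in the index.
dd_types = {"lidvid": "keyword", "pds:Time_Coordinates/pds:start_date_time": "date"}
already_mapped = {"lidvid"}

missing = find_missing_fields(dd_types, already_mapped)
body = build_mapping_update(missing)
```

A field with no type in the data dictionary never appears in `missing`, which mirrors the caveat above that a mapping must exist in registry-dd at processing time.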
Legacy Registry Sync
The legacy registry sync tool migrates data from the legacy Solr-based PDS registry to the new OpenSearch-based registry. It includes a dry-run mode for testing Solr data retrieval without affecting OpenSearch.
Console Script Usage:
# Install the package first
pip install -e .
# Run dry-run mode (Solr data retrieval only, no OpenSearch operations)
pds-legacy-registry-sync --dry-run --max-docs 10
# Comprehensive dry-run with logging
pds-legacy-registry-sync --dry-run --max-docs 100 --log-file dry_run.log
# Dry-run without showing sample documents
pds-legacy-registry-sync --dry-run --max-docs 50 --no-samples
# Dry-run with more sample documents
pds-legacy-registry-sync --dry-run --max-docs 20 --sample-size 10
# Get help
pds-legacy-registry-sync --dry-run --help
Dry-Run Features:
- Solr-only operations: No OpenSearch connection required
- Data analysis: Shows node distribution, product class breakdown, and data quality metrics
- Sample documents: Displays sample documents with key fields
- Progress tracking: Logs progress every 1000 documents
- Error handling: Continues processing and reports errors
Dry-Run Output: The tool provides comprehensive statistics including:
- Total documents processed
- Documents with/without lidvid
- Node distribution (PDS_ENG, PDS_IMG, etc.)
- Product class distribution
- Node-by-product-class breakdown with percentages
- Sample documents with key fields
- Domain and node ID analysis
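For readers who want to reproduce similar statistics on their own exports, the counts listed above can be approximated as below. This is only a sketch: the document field names (`lidvid`, `node`, `product_class`) and the `summarize` function are assumptions, not the tool's actual schema or API.

```python
from collections import Counter


def summarize(docs: list[dict]) -> dict:
    """Compute dry-run style statistics for a list of Solr-like documents."""
    nodes = Counter(d.get("node", "UNKNOWN") for d in docs)
    classes = Counter(d.get("product_class", "UNKNOWN") for d in docs)
    pairs = Counter((d.get("node", "UNKNOWN"), d.get("product_class", "UNKNOWN")) for d in docs)
    with_lidvid = sum(1 for d in docs if d.get("lidvid"))
    return {
        "total": len(docs),
        "with_lidvid": with_lidvid,
        "without_lidvid": len(docs) - with_lidvid,
        "nodes": dict(nodes),
        "product_classes": dict(classes),
        # Share of each product class within its node, as a percentage.
        "node_class_pct": {pair: round(100 * n / nodes[pair[0]], 1) for pair, n in pairs.items()},
    }


stats = summarize(
    [
        {"lidvid": "a::1.0", "node": "PDS_ENG", "product_class": "Product_Bundle"},
        {"lidvid": "b::1.0", "node": "PDS_ENG", "product_class": "Product_Collection"},
        {"node": "PDS_IMG", "product_class": "Product_Bundle"},
        {"lidvid": "c::1.0", "node": "PDS_IMG", "product_class": "Product_Bundle"},
    ]
)
```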
Full synchronization (Solr + OpenSearch) is available programmatically via the run() function but is not yet implemented as a console script option.
Developer Quickstart
Prerequisites
Dependencies
- Python >=3.12
Environment Variables
## General Config
LOGLEVEL=INFO # an integer log level (0, 1, 2, 3) or string matching a
# log level (e.g. `INFO`). Default: `INFO`
## Applicable when using AWS AOSS
MULTITENANCY_NODE_ID= # If running in a multitenant environment, the id of the node.
# Previously, distinguished registry/registry-refs index instances
SWEEPERS_IAM_ROLE_NAME=<value> # AWS IAM role name, if targeting AWS AOSS
PROV_ENDPOINT=https://localhost:9200 # OpenSearch host url and port
## Applicable when using non-AWS OpenSearch
PROV_CREDENTIALS={"admin": "admin"} # OpenSearch username/password
# Only applicable for hosts other than AWS AOSS
## For Developers
DEV_MODE=1 # disables host verification
# tqdm dependency may cause fatal crashes on some architectures when
# breakpoints are used in debug mode with Cython speedup extension enabled
PYDEVD_USE_CYTHON=NO # disables Cython speedup extension
When using the --legacy-sync option, a "registry" alias that maps all of the discipline node indexes is required.
Use the connection aliases found in the 'Connections' tab of the Engineering Node OpenSearch Domain on AWS.
After cloning the repository and setting the repository root as the current working directory, install the package with pip install -e .
The wrapper script for the suite of components may be run with python ./docker/sweepers_driver.py
Alternatively, registry-sweepers may be built from its Dockerfile with docker image build --file ./docker/Dockerfile . and run as a container, providing those same environment variables when running the container.
Performance
Rough Benchmarks
When run against the production OpenSearch instance with ~1.1M products, no cross-cluster remotes, and (only) ~1k multi-version products, from a local development machine, the runtime is ~20min on first run and ~12min subsequently. It appears that OpenSearch optimizes away no-op update calls, resulting in significant speedup despite the fact that registry-sweepers reprocesses metadata from scratch, every run.
The overwhelming bottleneck ops are the O(docs_count) db writes in ancestry.
Code of Conduct
All users and developers of the NASA-PDS software are expected to abide by our Code of Conduct. Please read it to ensure you understand the expectations of our community.
Development
To develop this project, use your favorite text editor, or an integrated development environment with Python support, such as PyCharm.
Contributing
For information on how to contribute to NASA-PDS codebases please take a look at our Contributing guidelines.
Build
pip install build
python3 -m build .
Publication
NASA PDS packages can publish automatically using the Roundup Action, which leverages GitHub Actions to perform automated continuous integration and continuous delivery. A default workflow that includes the Roundup is provided in the .github/workflows/unstable-cicd.yaml file. (Unstable here means an interim release.)
Manual Publication
Create the package:
python3 -m build .
Publish it as a GitHub release.
Publish on PyPI (you need a PyPI account and must configure $HOME/.pypirc):
pip install twine
twine upload dist/*
Or publish on Test PyPI (you need a Test PyPI account and must configure $HOME/.pypirc):
pip install twine
twine upload --repository testpypi dist/*
CI/CD
The template repository comes with our two "standard" CI/CD workflows, stable-cicd and unstable-cicd. The unstable build runs on any push to main (apart from changes to specific ignored files), and the stable build runs on push of a release branch of the form release/<release version>. Both use our GitHub Actions build step, Roundup. The unstable-cicd workflow generates (and continually updates) a SNAPSHOT release. If you haven't done a formal software release, you will end up with a v0.0.0-SNAPSHOT release (see NASA-PDS/roundup-action#56 for specifics).
File details
Details for the file pds_registry_sweepers-1.6.2-py3-none-any.whl.
File metadata
- Download URL: pds_registry_sweepers-1.6.2-py3-none-any.whl
- Upload date:
- Size: 77.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 312c54f4193390908f30b9fdfb028bc99440f8f4f823d68a6824236f6b609e71 |
| MD5 | 73a43165ffbaec4f7f866ca026d1ad1c |
| BLAKE2b-256 | f3fdc723b826680c1a7dc1e3045e3c1506a49ef175175f23be200bac94264f25 |