
A set of utility scripts which transform/augment PDS registry metadata to support additional capabilities.


Registry Sweepers

This package provides supplementary metadata generation for registry documents, which is required for registry-api to function correctly, and for common user queries. Execution is idempotent and should be scheduled on a recurring basis.

Components

RepairKit

The repairkit sweeper applies idempotent transformations to targeted subsets of properties, for example ensuring that all properties expected to have array-like values are in fact arrays (as opposed to single-element arrays being flattened to strings during harvest). Documents are processed based on whether their ops:Provenance/ops:registry_sweepers_repairkit_version is up-to-date relative to the sweeper codebase.
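The array repair described above can be sketched as follows. This is a minimal illustration, not the sweeper's actual code: the property names in `ARRAY_PROPERTIES` and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable

# Hypothetical property names, chosen for illustration only.
ARRAY_PROPERTIES = ("ref_lid_instrument", "ref_lid_target")


def ensure_array(value: Any) -> list:
    """Wrap a scalar in a list; return an existing list unchanged."""
    return value if isinstance(value, list) else [value]


def repair_document(doc: Dict[str, Any], array_properties: Iterable[str] = ARRAY_PROPERTIES) -> Dict[str, Any]:
    """Return only the properties that need rewriting (empty if none do)."""
    updates = {}
    for name in array_properties:
        if name in doc and not isinstance(doc[name], list):
            updates[name] = ensure_array(doc[name])
    return updates


doc = {
    "ref_lid_instrument": "urn:nasa:pds:context:instrument:example",  # flattened to a string
    "ref_lid_target": ["urn:nasa:pds:context:target:example"],  # already an array
}
first_pass = repair_document(doc)
doc.update(first_pass)
second_pass = repair_document(doc)  # idempotent: nothing is left to repair
```

Because the second pass yields no updates, rerunning the repair on an already-repaired document is a no-op, which is the property that makes recurring execution safe.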

Provenance

The provenance sweeper generates metadata for linking each version-superseded product with the versioned product which supersedes it. The value of the successor is stored in the ops:Provenance/ops:superseded_by property. This property will not be set for the latest version of any product.
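A minimal sketch of the successor calculation follows. It assumes that each LIDVID has the form `<lid>::<major>.<minor>` and that the successor is the immediately following version; the function names are hypothetical and the real sweeper may differ on both points.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def split_lidvid(lidvid: str) -> Tuple[str, Tuple[int, int]]:
    """Split 'lid::major.minor' into the LID and a numerically sortable version."""
    lid, vid = lidvid.split("::")
    major, minor = vid.split(".")
    return lid, (int(major), int(minor))


def compute_superseded_by(lidvids: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each LIDVID to its successor, or to None for the latest version."""
    by_lid = defaultdict(list)
    for lidvid in lidvids:
        lid, version = split_lidvid(lidvid)
        by_lid[lid].append((version, lidvid))

    successors: Dict[str, Optional[str]] = {}
    for versions in by_lid.values():
        ordered = [lidvid for _, lidvid in sorted(versions)]
        for current, successor in zip(ordered, ordered[1:]):
            successors[current] = successor
        successors[ordered[-1]] = None  # latest version: property is not set
    return successors


successors = compute_superseded_by([
    "urn:nasa:pds:b:c:p::1.0",
    "urn:nasa:pds:b:c:p::1.10",
    "urn:nasa:pds:b:c:p::1.9",
    "urn:nasa:pds:b:c:q::2.0",
])
```

Versions are compared as integer pairs rather than strings, so `1.10` correctly sorts after `1.9`.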

Ancestry

The ancestry sweeper generates membership metadata for each product, i.e. which bundle lidvids and which collection lidvids reference a given product. These values will be stored in properties ops:Provenance/ops:parent_bundle_identifier and ops:Provenance/ops:parent_collection_identifier, respectively.

The ancestry sweeper accepts environment variables to tune performance, primarily trading increased runtime duration for reduced peak memory usage.
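The membership metadata can be pictured with the sketch below. It assumes two in-memory mappings (bundle to referenced collections, collection to referenced products); the function name and input shapes are hypothetical, and the real sweeper reads these relationships from the registry rather than from dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

BUNDLE_KEY = "ops:Provenance/ops:parent_bundle_identifier"
COLLECTION_KEY = "ops:Provenance/ops:parent_collection_identifier"


def build_ancestry(
    bundle_members: Dict[str, Iterable[str]],
    collection_members: Dict[str, Iterable[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Return, per collection/product lidvid, the bundles and collections referencing it."""
    bundles_of = defaultdict(set)
    collections_of = defaultdict(set)

    for bundle, collections in bundle_members.items():
        for collection in collections:
            bundles_of[collection].add(bundle)

    for collection, products in collection_members.items():
        for product in products:
            collections_of[product].add(collection)
            # A product inherits the bundles of every collection that references it.
            bundles_of[product] |= bundles_of.get(collection, set())

    lidvids = set(bundles_of) | set(collections_of)
    return {
        lidvid: {
            BUNDLE_KEY: sorted(bundles_of.get(lidvid, ())),
            COLLECTION_KEY: sorted(collections_of.get(lidvid, ())),
        }
        for lidvid in lidvids
    }


ancestry = build_ancestry(
    {"urn:nasa:pds:b::1.0": ["urn:nasa:pds:b:c::1.0"]},
    {"urn:nasa:pds:b:c::1.0": ["urn:nasa:pds:b:c:p1::1.0", "urn:nasa:pds:b:c:p2::1.0"]},
)
```

Holding every relationship in memory, as this sketch does, is the kind of cost that the memory-versus-runtime tuning mentioned above would trade against.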

Developer Quickstart

Prerequisites

Dependencies

  • Python >=3.9

Environment Variables

PROV_CREDENTIALS={"admin": "admin"}  // OpenSearch username/password
PROV_ENDPOINT=https://localhost:9200  // OpenSearch host url and port
LOGLEVEL=INFO  // optional: an integer log level or a case-insensitive Python log level name such as `INFO` (defaults to `INFO`)
DEV_MODE=1  // disables host verification
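For readers unsure how these values are shaped, the sketch below parses them in plain Python. It is illustrative only: `load_settings` and its helpers are hypothetical names, and the assumptions that `PROV_CREDENTIALS` holds exactly one username/password pair and that only `DEV_MODE=1` disables host verification are mine, not confirmed by the package.

```python
import json
import logging
import os
from typing import Any, Dict, Mapping, Tuple


def parse_credentials(raw: str) -> Tuple[str, str]:
    """Parse a JSON object of the form {"username": "password"}."""
    creds = json.loads(raw)
    if not isinstance(creds, dict) or len(creds) != 1:
        raise ValueError("expected a JSON object with exactly one username/password pair")
    username, password = next(iter(creds.items()))
    return username, password


def parse_log_level(raw: str) -> int:
    """Accept an integer level or a case-insensitive level name such as 'info'."""
    raw = raw.strip()
    if raw.isdigit():
        return int(raw)
    level = logging.getLevelName(raw.upper())  # returns an int for known names
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {raw!r}")
    return level


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    username, password = parse_credentials(env["PROV_CREDENTIALS"])
    return {
        "endpoint": env["PROV_ENDPOINT"],
        "username": username,
        "password": password,
        "log_level": parse_log_level(env.get("LOGLEVEL", "INFO")),
        # Assumption: only the literal value "1" turns host verification off.
        "verify_host": env.get("DEV_MODE", "0") != "1",
    }


settings = load_settings({
    "PROV_CREDENTIALS": '{"admin": "admin"}',
    "PROV_ENDPOINT": "https://localhost:9200",
    "LOGLEVEL": "debug",
    "DEV_MODE": "1",
})
```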

With the --legacy-sync option, you must also provide the list of cross-cluster search connections configured to reach each node's OpenSearch domain:

CCS_CONN=naif-prod-ccs,rms-prod,sbnumd-prod-ccs,geo-prod-ccs,atm-prod-ccs,sbnpsi-prod-ccs,ppi-prod-ccs,img-prod-ccs

Use the connection aliases found in the 'Connections' tab of the Engineering Node OpenSearch Domain on AWS.

https://us-west-2.console.aws.amazon.com/aos/home?region=us-west-2#opensearch/domains/en-prod?tabId=ccs
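As a rough illustration of how such a list might be consumed, the sketch below splits `CCS_CONN` into aliases and builds an index expression using OpenSearch's `<connection-alias>:<index>` cross-cluster search syntax. The index name `registry` and both function names are assumptions made for the example; the sweepers may build their queries differently.

```python
from typing import List


def parse_ccs_conn(raw: str) -> List[str]:
    """Split the comma-separated alias list, ignoring blanks and stray whitespace."""
    return [alias.strip() for alias in raw.split(",") if alias.strip()]


def cross_cluster_index_expression(index: str, aliases: List[str]) -> str:
    """Combine the local index with the same index on each remote connection."""
    return ",".join([index] + [f"{alias}:{index}" for alias in aliases])


aliases = parse_ccs_conn("naif-prod-ccs, rms-prod,,geo-prod-ccs")
expression = cross_cluster_index_expression("registry", aliases)  # "registry" is hypothetical
```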

After cloning the repository and setting the repository root as the current working directory, install the package with pip install -e .

The wrapper script for the suite of components may be run with python ./docker/sweepers_driver.py

Alternatively, registry-sweepers may be built from its Dockerfile with docker image build --file ./docker/Dockerfile . and run as a container, providing those same environment variables when running the container.

Performance

Rough Benchmarks

When run from a local development machine against the production OpenSearch instance with ~1.1M products, no cross-cluster remotes, and only ~1k multi-version products, the runtime is ~20min on the first run and ~12min on subsequent runs. It appears that OpenSearch optimizes away no-op update calls, resulting in a significant speedup even though registry-sweepers reprocesses metadata from scratch on every run.

The dominant bottleneck is the O(docs_count) database writes performed by the ancestry sweeper.
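To make the write cost concrete, the sketch below builds newline-delimited request bodies in the format accepted by the OpenSearch `_bulk` API, one partial-document update per product, grouped into fixed-size batches. Whether the sweepers batch writes this way is an assumption; the index name `registry`, the batch size, and the function names are hypothetical.

```python
import json
from typing import Dict, Iterator


def batched(updates: Dict[str, dict], batch_size: int) -> Iterator[Dict[str, dict]]:
    """Yield the updates in insertion order, batch_size documents at a time."""
    items = list(updates.items())
    for start in range(0, len(items), batch_size):
        yield dict(items[start:start + batch_size])


def bulk_update_body(index: str, updates: Dict[str, dict]) -> str:
    """Serialize updates as an action line followed by a partial-document line."""
    lines = []
    for doc_id, fields in updates.items():
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"doc": fields}))
    return "\n".join(lines) + "\n"  # the bulk format requires a trailing newline


updates = {
    "urn:nasa:pds:b:c:p1::1.0": {"ops:Provenance/ops:parent_bundle_identifier": ["urn:nasa:pds:b::1.0"]},
    "urn:nasa:pds:b:c:p2::1.0": {"ops:Provenance/ops:parent_bundle_identifier": ["urn:nasa:pds:b::1.0"]},
    "urn:nasa:pds:b:c:p3::1.0": {"ops:Provenance/ops:parent_bundle_identifier": ["urn:nasa:pds:b::1.0"]},
}
bodies = [bulk_update_body("registry", batch) for batch in batched(updates, batch_size=2)]
```

The number of update actions still grows linearly with the number of documents; batching only reduces the number of HTTP round trips.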

Code of Conduct

All users and developers of the NASA-PDS software are expected to abide by our Code of Conduct. Please read this to ensure you understand the expectations of our community.

Development

To develop this project, use your favorite text editor, or an integrated development environment with Python support, such as PyCharm.

Contributing

For information on how to contribute to NASA-PDS codebases please take a look at our Contributing guidelines.

Installation

Install in editable mode and with extra developer dependencies into your virtual environment of choice:

pip install --editable '.[dev]'

Configure the pre-commit hooks:

pre-commit install
pre-commit install -t pre-push
pre-commit install -t prepare-commit-msg
pre-commit install -t commit-msg

These hooks check code formatting and also abort commits that contain secrets such as passwords or API keys. However, a one-time setup is required in your global Git configuration. See the wiki entry on Git Secrets to learn how.

Packaging

To isolate the environment for this package and make it reproducible, you should use a Python virtual environment. To do so, run:

python -m venv venv

Then exclusively use venv/bin/python, venv/bin/pip, etc.

If you have tox installed and would like it to create your environment and install dependencies for you, run:

tox --devenv <name you'd like for env> -e dev

Dependencies for development are specified as the dev extras_require in setup.cfg; they are installed into the virtual environment as follows:

pip install --editable '.[dev]'

All the source code is in a sub-directory under src.

Tests

This section describes testing for your package.

A complete "build" including test execution, linting (mypy, black, flake8, etc.), and documentation build is executed via:

tox

Unit tests

Your project should have built-in unit, functional, validation, and acceptance tests.

For unit testing, check out the unittest module, built into Python 3.

Test objects should live in test modules within the packages or, preferably, in a project 'tests' directory that mirrors the project package structure.

Our unit tests are launched with the command:

pytest

If you want your tests to run automatically as you make changes, start pytest in watch mode with:

ptw

Build

pip install wheel
python setup.py sdist bdist_wheel

Publication

NASA PDS packages can publish automatically using the Roundup Action, which leverages GitHub Actions to perform automated continuous integration and continuous delivery. A default workflow that includes the Roundup is provided in the .github/workflows/unstable-cicd.yaml file. (Unstable here means an interim release.)

Manual Publication

Create the package:

python setup.py bdist_wheel

Publish it as a GitHub release.

Publish on PyPI (you need a PyPI account and a configured $HOME/.pypirc):

pip install twine
twine upload dist/*

Or publish on Test PyPI (you need a Test PyPI account and a configured $HOME/.pypirc):

pip install twine
twine upload --repository testpypi dist/*

CI/CD

The template repository comes with our two "standard" CI/CD workflows, stable-cicd and unstable-cicd. The unstable build runs on any push to main (apart from changes to certain ignored files), and the stable build runs on the push of a release branch of the form release/<release version>. Both make use of our GitHub Actions build step, Roundup. The unstable-cicd workflow will generate (and constantly update) a SNAPSHOT release. If you haven't done a formal software release, you will end up with a v0.0.0-SNAPSHOT release (see NASA-PDS/roundup-action#56 for specifics).
