Store file metadata information in a file catalog

file_catalog

Prerequisites

To install the prerequisites for the File Catalog:

pip install -r requirements.txt

Running the server

To start an instance of the server:

python -m file_catalog

Configuration

All configuration is done using environment variables. To list the available configuration parameters and their defaults, run:

python -m file_catalog --show-config-spec

Interface

The primary interface is an HTTP server. TLS and other security hardening mechanisms are handled by a reverse proxy server as for normal web applications.

Browser

Requests to the main URL / are browsable like a standard website. They use JavaScript to call the REST API as necessary.

REST API

Requests with URLs of the form /api/RESOURCE can access the REST API. Responses are in HAL JSON format.

File-Entry Fields

File-Metadata Schema:

Mandatory Fields:

  • uuid (provided by File Catalog)
  • logical_name
  • locations (with at least one non-empty URL)
  • file_size
  • checksum.sha512
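
As a rough illustration of the mandatory fields above, the sketch below assembles a metadata record and reports which mandatory fields are missing or empty. The helper names (`sha512_of`, `lookup`, `missing_mandatory_fields`), the example path, and the shape of each `locations` entry (a dict with a `url` key) are assumptions made for this example; the File-Metadata Schema remains the authoritative reference. `uuid` is left out because it is provided by the File Catalog.

```python
import hashlib

# Mandatory fields a client supplies; `uuid` is provided by the File Catalog.
CLIENT_MANDATORY = ("logical_name", "locations", "file_size", "checksum.sha512")


def sha512_of(data: bytes) -> str:
    """Return the hex SHA-512 digest of a file's contents."""
    return hashlib.sha512(data).hexdigest()


def lookup(metadata: dict, dotted_key: str):
    """Resolve a dotted key such as 'checksum.sha512'; None if absent."""
    value = metadata
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def missing_mandatory_fields(metadata: dict) -> list:
    """List the mandatory fields that are absent or empty."""
    missing = [key for key in CLIENT_MANDATORY
               if lookup(metadata, key) in (None, "", [])]
    locations = metadata.get("locations") or []
    # ASSUMPTION: each location is a dict whose URL lives under a "url" key.
    if locations and not any(
        isinstance(loc, dict) and loc.get("url") for loc in locations
    ):
        missing.append("locations")
    return missing


payload = b"example file contents"
record = {
    "logical_name": "/data/example/file.dat",  # hypothetical path
    "locations": [{"url": "gsiftp://example.org/data/example/file.dat"}],
    "file_size": len(payload),
    "checksum": {"sha512": sha512_of(payload)},
}
print(missing_mandatory_fields(record))  # prints []
```

Checking a record locally like this only catches missing fields early; the server still performs its own validation and may reject a record for other reasons.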

Route: /api/files

Resource representing the collection of all files in the catalog.

Method: GET

Obtain list of files

REST-Query Parameters
  • see More About REST-Query Parameters
HTTP Response Status Codes
  • 200: Response contains collection of file resources
  • 400: Bad request (query parameters invalid)
  • 429: Too many requests (if server is being hammered)
  • 500: Unspecified server error
  • 503: Service unavailable (maintenance, etc.)
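
A minimal sketch of how a client might build a GET request URL for this route from the query parameters described under More About REST-Query Parameters. The function name `files_query_url`, the base URL `http://localhost:8888`, and the choice to send the MongoDB query as a JSON-encoded string are assumptions for illustration, not confirmed details of the server.

```python
import json
from urllib.parse import urlencode


def files_query_url(base_url, query=None, keys=None, limit=None, start=None):
    """Build a /api/files URL from optional REST-query parameters."""
    params = {}
    if query is not None:
        # ASSUMPTION: the MongoDB query is sent as a JSON-encoded string.
        params["query"] = json.dumps(query)
    if keys:
        # keys is a |-delimited string-list, e.g. "foo|bar|baz".
        params["keys"] = "|".join(keys)
    if limit is not None:
        params["limit"] = limit
    if start is not None:
        params["start"] = start
    url = base_url.rstrip("/") + "/api/files"
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical server address; replace with your own deployment.
print(files_query_url(
    "http://localhost:8888",
    query={"run.run_number": 12345},
    keys=["logical_name", "file_size"],
    limit=100,
))
```

`urlencode` percent-encodes the `|` delimiter and the JSON braces, so the resulting URL is safe to pass to any HTTP client.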

Method: POST

Create a new file or add a replica

If a file exists and the checksum is the same, a replica is added. If the checksum is different, a conflict error is returned.

REST-Body
HTTP Response Status Codes
  • 200: Replica has been added. Response contains link to file resource
  • 201: Response contains link to newly created file resource
  • 400: Bad request (metadata failed validation)
  • 409: Conflict (if the file-version already exists); includes link to existing file
  • 429: Too many requests (if server is being hammered)
  • 500: Unspecified server error
  • 503: Service unavailable (maintenance, etc.)
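
One way a client could act on the status codes listed above is sketched below. The table mirrors the documented codes; the name `describe_post_result` and the decision to treat 429 and 503 as worth retrying are assumptions of this example rather than documented server behavior.

```python
# Outcomes for POST /api/files, taken from the status-code list above.
POST_FILES_OUTCOMES = {
    200: "replica added",
    201: "file created",
    400: "metadata failed validation",
    409: "conflict: file-version already exists",
    429: "too many requests",
    500: "unspecified server error",
    503: "service unavailable",
}

# ASSUMPTION: these two codes signal a temporary condition worth retrying.
RETRYABLE = {429, 503}


def describe_post_result(status: int):
    """Return (description, should_retry) for a POST /api/files status."""
    outcome = POST_FILES_OUTCOMES.get(status, "unexpected status")
    return outcome, status in RETRYABLE


print(describe_post_result(201))  # prints ('file created', False)
print(describe_post_result(429))  # prints ('too many requests', True)
```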

Method: DELETE

Not supported

Method: PUT

Not supported

Method: PATCH

Not supported

Route: /api/files/{uuid}

Resource representing the metadata for a file in the file catalog.

Method: GET

Obtain file metadata information

REST-Query Parameters
  • None
HTTP Response Status Codes
  • 200: Response contains metadata of file resource
  • 404: Not Found (file resource does not exist)
  • 429: Too many requests (if server is being hammered)
  • 500: Unspecified server error
  • 503: Service unavailable (maintenance, etc.)

Method: POST

Not supported

Method: DELETE

Delete the metadata for the file

REST-Query Parameters
  • None
HTTP Response Status Codes
  • 204: No Content (file resource is successfully deleted)
  • 404: Not Found (file resource does not exist)
  • 429: Too many requests (if server is being hammered)
  • 500: Unspecified server error
  • 503: Service unavailable (maintenance, etc.)

Method: PUT

Fully update/replace file metadata information

REST-Body
HTTP Response Status Codes
  • 200: Response indicates metadata of file resource has been updated/replaced
  • 404: Not Found (file resource does not exist); includes link to “files” resource for POST
  • 409: Conflict (if updating an outdated resource; use the ETag hash to compare)
  • 429: Too many requests (if server is being hammered)
  • 500: Unspecified server error
  • 503: Service unavailable (maintenance, etc.)

Method: PATCH

Partially update/replace file metadata information

The JSON provided as the body of a PATCH request need not contain all the keys, only those that need to be updated. If a key is provided with a value of null, then that key can be removed from the metadata.

REST-Body
HTTP Response Status Codes
  • 200: Response indicates metadata of file resource has been updated/replaced
  • 404: Not Found (file resource does not exist); includes link to “files” resource for POST
  • 409: Conflict (if updating an outdated resource; use the ETag hash to compare)
  • 429: Too many requests (if server is being hammered)
  • 500: Unspecified server error
  • 503: Service unavailable (maintenance, etc.)
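
The PATCH semantics described above (only the supplied keys change, and a null value removes a key) can be modeled on the client side as follows. This is a sketch of the documented behavior for top-level keys only, not the server's implementation; `apply_patch` and the example field values are hypothetical.

```python
def apply_patch(metadata: dict, patch: dict) -> dict:
    """Return a copy of metadata with the PATCH body applied.

    Keys present in `patch` overwrite the stored value; keys whose
    value is None (JSON null) are removed. Other keys are untouched.
    """
    updated = dict(metadata)
    for key, value in patch.items():
        if value is None:
            updated.pop(key, None)  # null removes the key if present
        else:
            updated[key] = value
    return updated


stored = {"logical_name": "/data/a.dat", "file_size": 10, "note": "temp"}
body = {"file_size": 12, "note": None}  # JSON null becomes None in Python
print(apply_patch(stored, body))
# prints {'logical_name': '/data/a.dat', 'file_size': 12}
```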

More About REST-Query Parameters

limit
  • positive integer; number of results to provide (default: 10000)
  • NOTE: The server may honor the limit parameter. If it does not, it should return fewer resources than requested; treat limit as the client’s upper bound on the number of resources in the response.
start
  • non-negative integer; index of the first result to return (default: 0)
  • NOTE: the server should honor the start parameter
  • TIP: increment start by limit to paginate through many results
query
  • MongoDB query; use to specify file-entry fields/ranges; forwarded to MongoDB daemon
keys
  • a |-delimited string-list of keys; defines what fields to include in result(s)
  • ex: "foo|bar|baz"
  • different routes/methods define differing defaults
  • NOTE: there is no performance hit for including more fields
  • see all-keys
max_time_ms
  • non-negative integer OR None; timeout to kill long queries in MILLISECONDS
  • overrides the default timeout of 600000 ms (10 minutes)
  • None indicates no timeout (this can hang the server -- you have been warned)
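
The pagination tip for start and limit can be sketched as a generator. `fetch_page` is a hypothetical callable standing in for one GET /api/files request; it takes `(start, limit)` and returns a list of file entries. Because the server may return fewer resources than limit, this sketch advances start by the number of results actually received, which equals limit whenever a full page comes back.

```python
from typing import Callable, Iterator, List


def iterate_files(
    fetch_page: Callable[[int, int], List[dict]], limit: int = 10000
) -> Iterator[dict]:
    """Yield every file entry by paging with start/limit."""
    start = 0
    while True:
        page = fetch_page(start, limit)
        if not page:
            return  # an empty page means there are no more results
        yield from page
        start += len(page)  # equals `limit` for a full page


# Stand-in for the server: 25 fake entries, at most 7 returned per request.
catalog = [{"uuid": str(i)} for i in range(25)]


def fake_fetch(start: int, limit: int) -> List[dict]:
    return catalog[start:start + min(limit, 7)]


print(len(list(iterate_files(fake_fetch, limit=10))))  # prints 25
```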
Shortcut Parameters: logical-name-regex, logical_name, directory, filename

In decreasing order of precedence...

  • logical-name-regex

    • query by regex pattern (at your own risk... performance-wise)
    • equivalent to: query: {"logical_name": {"$regex": p}}
  • logical_name

    • equivalent to: query["logical_name"]
  • directory

    • query by absolute directory filepath
    • equivalent to: query: {"logical_name": {"$regex": "^/your/path/.*"}}
    • NOTE: a trailing-/ will be inserted if you don't provide one
    • TIP: use in conjunction with filename (i.e., /root/dirs/.../filename)
  • filename

    • query by filename (no parent-directory path needed)
    • equivalent to: query: {"logical_name": {"$regex": ".*/your-file$"}}
    • NOTE: a leading-/ will be inserted if you don't provide one
    • TIP: use in conjunction with directory (i.e., /root/dirs/.../filename)
Shortcut Parameter: run_number
  • equivalent to: query["run.run_number"]
Shortcut Parameter: dataset
  • equivalent to: query["iceprod.dataset"]
Shortcut Parameter: event_id
  • equivalent to: query: {"run.first_event":{"$lte": e}, "run.last_event":{"$gte": e}}
Shortcut Parameter: processing_level
  • equivalent to: query["processing_level"]
Shortcut Parameter: season
  • equivalent to: query["offline_processing_metadata.season"]
Shortcut Parameter: all-keys
  • boolean (True/"True"/"true"/1); include all fields in result(s)
  • NOTE: there is no performance hit for including more fields
  • TIP: this is preferred over querying /api/files, grabbing the uuid, then querying /api/files/{uuid}
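
To make the stated equivalences concrete, the sketch below builds the MongoDB query documents that the directory, filename, and event_id shortcuts are described as standing for. The function names are hypothetical, and the values are inserted into the regex without escaping, mirroring the equivalences as written; how the server actually handles special characters is not specified here.

```python
import re


def directory_query(path: str) -> dict:
    """Query equivalent to the `directory` shortcut."""
    if not path.endswith("/"):
        path += "/"  # a trailing / is inserted if missing
    return {"logical_name": {"$regex": f"^{path}.*"}}


def filename_query(name: str) -> dict:
    """Query equivalent to the `filename` shortcut."""
    if not name.startswith("/"):
        name = "/" + name  # a leading / is inserted if missing
    return {"logical_name": {"$regex": f".*{name}$"}}


def event_id_query(event_id: int) -> dict:
    """Query equivalent to the `event_id` shortcut."""
    return {
        "run.first_event": {"$lte": event_id},
        "run.last_event": {"$gte": event_id},
    }


# Check the patterns against a hypothetical logical_name.
logical_name = "/data/exp/2020/file.dat"
dir_pattern = directory_query("/data/exp")["logical_name"]["$regex"]
file_pattern = filename_query("file.dat")["logical_name"]["$regex"]
print(bool(re.search(dir_pattern, logical_name)))   # prints True
print(bool(re.search(file_pattern, logical_name)))  # prints True
```

The inserted slashes are what keep `/data/exp` from matching `/data/experiment/...` and `file.dat` from matching `/data/myfile.dat`.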

Development

Establishing a development environment

Follow these steps to get a development environment for the File Catalog:

cd ~/projects
git clone git@github.com:WIPACrepo/file_catalog.git
cd file_catalog
./setupenv.sh

MongoDB Instance for Testing

This command will spin up a disposable MongoDB instance using Docker:

docker run \
    --detach \
    --name test-mongo \
    --network=host \
    --rm \
    circleci/mongo:latest-ram

Building a Docker container

The following commands will build a Docker image for the file-catalog:

docker build -t file-catalog:{version} -f Dockerfile .
docker image tag file-catalog:{version} file-catalog:latest

Where {version} is found in file_catalog/__init__.py; e.g.:

__version__ = '1.2.0'       # For {version} use: 1.2.0
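
If you would rather not copy the version by hand, a small helper along these lines could extract it. `read_version` is a hypothetical name, and the sketch assumes the version is assigned as a plain quoted string, as in the example above.

```python
import re


def read_version(init_source: str) -> str:
    """Extract the value of __version__ from module source text."""
    match = re.search(
        r"^__version__\s*=\s*[\"']([^\"']+)[\"']", init_source, re.MULTILINE
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


# In practice the text would come from reading file_catalog/__init__.py,
# e.g. pathlib.Path("file_catalog/__init__.py").read_text().
source = "__version__ = '1.2.0'       # For {version} use: 1.2.0\n"
print(f"file-catalog:{read_version(source)}")  # prints file-catalog:1.2.0
```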

Pushing Docker containers to local registry in Kubernetes

Here are some commands to push the Docker container to the Docker registry in our Kubernetes cluster:

kubectl -n kube-system port-forward $(kubectl get pods --namespace kube-system -l "app=docker-registry,release=docker-registry" -o jsonpath="{.items[0].metadata.name}") 5000:5000 &
docker tag file-catalog:{version} localhost:5000/file-catalog:{version}
docker push localhost:5000/file-catalog:{version}
