

Project description


file-catalog-indexer

Indexing package and scripts for the File Catalog

How To

API

from indexer.index import index

  • The flagship indexing function
  • Finds files rooted at the given path(s), computes their metadata, and uploads it to the File Catalog
  • Configurable for multi-processing (default: 1 process) and recursive file-traversal (default: on)
  • Internally communicates with the File Catalog asynchronously
  • Note: Symbolic links are never followed.
  • Note: index() runs the current event loop (asyncio.get_event_loop().run_until_complete())
  • Ex:
index(
    fc_token,  # File Catalog REST token
    'WIPAC',  # site
    paths=['/data/exp/IceCube/2018/filtered/level2/0820', '/data/exp/IceCube/2018/filtered/level2/0825'],
    blacklist=['/data/exp/IceCube/2018/filtered/level2/0820/Run00131410_74'],
    n_processes=4,
)

from indexer.index import index_file

  • Computes the metadata of a single file and uploads it to the File Catalog, i.e. indexes one file
  • Single-processed, single-threaded
await index_file(
    filepath='/data/exp/IceCube/2018/filtered/level2/0820/Run00131410_74/Level2_IC86.2018_data_Run00131410_Subrun00000000_00000172.i3.zst',
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)

from indexer.index import index_paths

  • A wrapper around index_file() that indexes multiple files and returns any nested sub-directories
  • Single-processed, single-threaded
  • Note: Symbolic links are never followed.
sub_dirs = await index_paths(
    paths=['/data/exp/IceCube/2018/filtered/level2/0820', '/data/exp/IceCube/2018/filtered/level2/0825'],
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)

from indexer.metadata_manager import MetadataManager

  • The internal brain of the Indexer. This has minimal guardrails, does not communicate with the File Catalog, and does not traverse the file directory tree.
  • Metadata is produced for one individual file at a time.
  • Ex:
manager = MetadataManager(...)  # caches connections & directory info, manages metadata collection
metadata_file = manager.new_file(filepath)  # returns an instance (computationally light)
metadata = metadata_file.generate()  # returns a dict (computationally intense)

Scripts

python -m indexer.index
  • A command-line alternative to using from indexer.index import index
  • Use with -h to see usage.
  • Note: Symbolic links are never followed.
python -m indexer.generate
  • Like python -m indexer.index, but prints the metadata (using pprint) instead of posting it to the File Catalog.
  • Simply wraps file-traversal logic around calls to indexer.metadata_manager.MetadataManager
  • Note: Symbolic links are never followed.
python -m indexer.delocate
  • Finds files rooted at the given path(s); for each, removes the matching locations entry from its File Catalog record.
  • Note: Symbolic links are never followed.

.i3 File Processing-Level Detection and Embedded Filename-Metadata Extraction

Regex is used heavily to detect the processing level of a .i3 file and to extract any metadata embedded in the filename. The exact process depends on the type of data:

Real Data (/data/exp/*)

This is a two-stage process (see MetadataManager._new_file_real()):

  1. Processing-Level Detection (Base Pattern Screening)
    • The filename is matched against multiple generic patterns to detect whether it is L2, PFFilt, PFDST, or PFRaw
    • If the filename does not match, only basic metadata is collected (logical_name, checksum, file_size, locations, and create_date)
  2. Embedded Filename-Metadata Extraction
    • Once the processing level is known, the filename is parsed using one of (possibly) several tokenizing regex patterns, taking the best match (greedy matching)
    • If the filename does not match any pattern, the function raises an exception (and the script exits). This likely indicates that a new pattern needs to be added to the list.
      • see indexer.metadata.real.filename_patterns
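The two-stage flow above can be sketched as follows. This is a minimal illustration only: the pattern strings and the parse_real() helper are hypothetical stand-ins for the actual regexes in indexer.metadata.real.filename_patterns.

```python
import re

# Stage 1: generic base patterns screen the filename for a processing level.
# (Illustrative patterns only, not the package's actual regexes.)
BASE_PATTERNS = {
    "L2": re.compile(r"Level2"),
    "PFFilt": re.compile(r"PFFilt"),
    "PFDST": re.compile(r"PFDST"),
    "PFRaw": re.compile(r"PFRaw"),
}

# Stage 2: per-level tokenizing patterns extract the embedded metadata.
# Only one hypothetical L2 pattern is shown here.
L2_TOKENIZERS = [
    re.compile(
        r"Level2_IC86\.(?P<year>\d{4})_data_Run(?P<run>\d+)"
        r"_Subrun(?P<subrun>\d+)_(?P<part>\d+)\.i3\.zst"
    ),
]

def parse_real(filename):
    # Stage 1: detect the processing level, or fall back to basic metadata.
    level = next(
        (lvl for lvl, pat in BASE_PATTERNS.items() if pat.search(filename)), None
    )
    if level is None:
        return None  # stage-1 miss: collect basic metadata only
    # Stage 2: tokenize the filename for embedded metadata.
    for tokenizer in L2_TOKENIZERS:
        match = tokenizer.match(filename)
        if match:
            return level, match.groupdict()
    # Stage-2 miss: a new pattern probably needs to be added to the list.
    raise ValueError(f"no tokenizing pattern matched: {filename}")
```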

Simulation Data (/data/sim/*)

This is a three-stage process (see MetadataManager._new_file_simulation()):

  1. Base Pattern Screening
    • The filename is checked for .i3 file extensions: .i3, .i3.gz, .i3.bz2, .i3.zst
    • If the filename does not match, only basic metadata is collected (logical_name, checksum, file_size, locations, and create_date)
      • there are a couple of hard-coded "anti-patterns" used for rejecting known false-positives (see code)
  2. Embedded Filename-Metadata Extraction
    • The filename is parsed using one of MANY (around a thousand) tokenizing regex patterns, taking the best match (greedy matching)
    • If the filename does not match any pattern, the function raises an exception (and the script exits). This likely indicates that a new pattern needs to be added to the list.
      • see indexer.metadata.sim.filename_patterns
  3. Processing-Level Detection
    • The filename is parsed for substrings corresponding to a processing level
      • see DataSimI3FileMetadata.figure_processing_level()
    • If there is no match, processing_level will be set to None, since the processing level is less important for simulation data.
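Stage 3 can be sketched as a simple substring lookup. The mapping below is illustrative only; the real table lives in DataSimI3FileMetadata.figure_processing_level().

```python
from typing import Optional

# Illustrative substring -> processing-level table (not the actual mapping).
LEVEL_SUBSTRINGS = [
    ("level2", "L2"),
    ("triggered", "Triggered"),
    ("propagated", "Propagated"),
    ("generated", "Generated"),
]

def figure_processing_level(filename: str) -> Optional[str]:
    lowered = filename.lower()
    for substring, level in LEVEL_SUBSTRINGS:
        if substring in lowered:
            return level
    # No match is acceptable: processing level is optional for simulation data.
    return None
```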

Metadata Schema

see: Google Doc
also: [File Catalog Types](https://github.com/WIPACrepo/file_catalog/blob/master/file_catalog/schema/types.py)

Warnings

Re-indexing Files is Tricky (Two Scenarios)

  1. Indexing files that have not changed locations is okay; this probably means that the rest of the metadata has also not changed. A guardrail query checks whether the file exists in the FC with that locations entry, and if so, the file is not processed further.
  2. HOWEVER, don't point the indexer at restored files (of the same file-version), i.e. those that had their initial locations entry removed (e.g. removed from WIPAC, then moved back). Unlike re-indexing an unchanged file, such a file will be fully processed locally (opened, read, and check-summed) before the checksum conflict is encountered and the upload aborts. These files will be skipped (not sent to the FC), unless you use --patch (which replaces the locations list wholesale), which is DANGEROUS.
    • Example Conflict: It's possible for a file-version to pass the initial guardrails yet already exist in the FC
      1. file was at WIPAC & indexed
      2. then moved to NERSC (location added) & deleted from WIPAC (location removed)
      3. file was brought back to WIPAC
      4. now is being re-indexed at WIPAC
      5. CONFLICT -> has the same logical_name+checksum.sha512 but differing locations
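The scenario-1 guardrail amounts to checking whether the file's location is already on record before doing any expensive local processing. A minimal sketch, assuming hypothetical record dicts shaped roughly like File Catalog records (the helper name and record shape are illustrative, not the package's actual code):

```python
def already_indexed(fc_records, filepath):
    """Return True if any File Catalog record already lists `filepath` as a location."""
    for record in fc_records:
        for location in record.get("locations", []):
            if location.get("path") == filepath:
                return True
    return False

# Hypothetical FC query result for one file-version.
records = [
    {
        "logical_name": "/data/exp/IceCube/2018/example.i3.zst",
        "checksum": {"sha512": "abc123"},  # placeholder digest
        "locations": [{"site": "WIPAC", "path": "/data/exp/IceCube/2018/example.i3.zst"}],
    }
]

# Scenario 1: the location is unchanged, so the file is skipped before any checksumming.
already_indexed(records, "/data/exp/IceCube/2018/example.i3.zst")
```

A restored file (scenario 2) fails this check because its old locations entry was removed, so it proceeds to full local processing before the logical_name+checksum conflict surfaces.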

Download files

Download the file for your platform.

Source Distribution

wipac-file-catalog-indexer-0.2.1.tar.gz (38.7 kB)

Uploaded Source

Built Distribution

wipac_file_catalog_indexer-0.2.1-py3-none-any.whl (44.2 kB)

Uploaded Python 3

File details

Details for the file wipac-file-catalog-indexer-0.2.1.tar.gz.

File metadata

  • Download URL: wipac-file-catalog-indexer-0.2.1.tar.gz
  • Upload date:
  • Size: 38.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.64.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.12

File hashes

Hashes for wipac-file-catalog-indexer-0.2.1.tar.gz

  • SHA256: c3dd0f1e753d7e7ef30d27328129b79c5fc69567648c033772fad1d63c6c3127
  • MD5: d80fb95e951374ac56eb0ba921720603
  • BLAKE2b-256: 2c2467ba86eb108d671151a7805370bb80cc8e9fc95319167bfb73d6ce420501


File details

Details for the file wipac_file_catalog_indexer-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: wipac_file_catalog_indexer-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 44.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.9 tqdm/4.64.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.12

File hashes

Hashes for wipac_file_catalog_indexer-0.2.1-py3-none-any.whl

  • SHA256: 9f8a062205b8bca5e3be3213af405b8dac6546e1d6888fb372172db48bb8ccf8
  • MD5: 3b167abdf706e54a142e5a5b17093628
  • BLAKE2b-256: 05ddd92a1affd9e7b29e248abb418f48bffcd698f264e762204e5ce17e0fda1b

