# file-catalog-indexer

Indexing package and scripts for the File Catalog
## How To

### API

#### `from indexer.index import index`

- The flagship indexing function
- Finds files rooted at the given path(s), computes their metadata, and uploads it to the File Catalog
- Configurable for multi-processing (default: 1 process) and recursive file-traversal (default: on)
- Internally communicates asynchronously with the File Catalog
- Note: symbolic links are never followed.
- Note: `index()` runs the current event loop (`asyncio.get_event_loop().run_until_complete()`)
- Example:
```python
index(
    fc_token,
    'WIPAC',
    paths=['/data/exp/IceCube/2018/filtered/level2/0820', '/data/exp/IceCube/2018/filtered/level2/0825'],
    blacklist=['/data/exp/IceCube/2018/filtered/level2/0820/Run00131410_74'],
    n_processes=4,
)
```
#### `from indexer.index import index_file`

- Computes the metadata of a single file and uploads it to the File Catalog, i.e. indexes one file
- Single-processed, single-threaded
```python
await index_file(
    filepath='/data/exp/IceCube/2018/filtered/level2/0820/Run00131410_74/Level2_IC86.2018_data_Run00131410_Subrun00000000_00000172.i3.zst',
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)
```
#### `from indexer.index import index_paths`

- A wrapper around `index_file()` which indexes multiple files and returns any nested sub-directories
- Single-processed, single-threaded
- Note: symbolic links are never followed.
```python
sub_dirs = await index_paths(
    paths=['/data/exp/IceCube/2018/filtered/level2/0820', '/data/exp/IceCube/2018/filtered/level2/0825'],
    manager=MetadataManager(...),
    fc_rc=RestClient(...),
)
```
#### `from indexer.metadata_manager import MetadataManager`

- The internal brain of the Indexer. It has minimal guardrails, does not communicate with the File Catalog, and does not traverse the file directory tree.
- Metadata is produced for one file at a time.
- Example:
```python
manager = MetadataManager(...)  # caches connections & directory info, manages metadata collection
metadata_file = manager.new_file(filepath)  # returns an instance (computationally light)
metadata = metadata_file.generate()  # returns a dict (computationally intense)
```
### Scripts

#### `python -m indexer.index`

- A command-line alternative to using `from indexer.index import index`
- Use with `-h` to see usage.
- Note: symbolic links are never followed.
#### `python -m indexer.generate`

- Like `python -m indexer.index`, but prints the metadata (using `pprint`) instead of posting to the File Catalog.
- Simply wraps file-traversing logic around calls to `indexer.metadata_manager.MetadataManager`
- Note: symbolic links are never followed.
#### `python -m indexer.delocate`

- Finds files rooted at the given path(s); for each, removes the matching location entry from its File Catalog record.
- Note: symbolic links are never followed.
## `.i3` File Processing-Level Detection and Embedded Filename-Metadata Extraction

Regex is used heavily to detect the processing level of an `.i3` file and to extract any metadata embedded in the filename. The exact process depends on the type of data:

### Real Data (`/data/exp/*`)

This is a two-stage process (see `MetadataManager._new_file_real()`):
1. Processing-Level Detection (Base Pattern Screening)
    - The filename is matched against multiple generic patterns to detect whether it is L2, PFFilt, PFDST, or PFRaw
    - If the filename does not trigger a match, only basic metadata is collected (`logical_name`, `checksum`, `file_size`, `locations`, and `create_date`)
2. Embedded Filename-Metadata Extraction
    - After the processing level is known, the filename is parsed using one of (possibly) several tokenizing regex patterns for the best match (greedy matching)
    - If the filename does not trigger a match, the function raises an exception (the script exits). This probably indicates that a new pattern needs to be added to the list.
    - See `indexer.metadata.real.filename_patterns`
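The two-stage flow above can be sketched as follows. Note that the patterns here are simplified, hypothetical stand-ins for illustration only -- the real patterns live in `indexer.metadata.real.filename_patterns`:

```python
import re

# Hypothetical, simplified stand-ins for the base screening patterns
# (the real ones live in indexer.metadata.real.filename_patterns).
BASE_PATTERNS = {
    "L2": re.compile(r"Level2.*\.i3(\.(gz|bz2|zst))?$"),
    "PFFilt": re.compile(r"PFFilt.*\.tar\.(gz|bz2|zst)$"),
}

# Stage 2: a tokenizing pattern that extracts the embedded metadata (simplified).
L2_TOKENIZER = re.compile(
    r"Level2_(?P<season>IC86\.\d{4})_data_"
    r"Run(?P<run>\d+)_Subrun(?P<subrun>\d+)_(?P<part>\d+)\.i3\.zst$"
)

def screen_and_extract(filename: str):
    # Stage 1: base-pattern screening to detect the processing level
    level = next((lvl for lvl, pat in BASE_PATTERNS.items() if pat.search(filename)), None)
    if level is None:
        return None  # only basic metadata would be collected
    # Stage 2: tokenize the filename for its embedded metadata
    match = L2_TOKENIZER.search(filename)
    if match is None:
        raise ValueError(f"new pattern needed for: {filename}")
    return {"processing_level": level, **match.groupdict()}

meta = screen_and_extract(
    "Level2_IC86.2018_data_Run00131410_Subrun00000000_00000172.i3.zst"
)
# meta["processing_level"] == "L2", meta["run"] == "00131410"
```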
### Simulation Data (`/data/sim/*`)

This is a three-stage process (see `MetadataManager._new_file_simulation()`):
1. Base Pattern Screening
    - The filename is checked for `.i3` file extensions: `.i3`, `.i3.gz`, `.i3.bz2`, `.i3.zst`
    - If the filename does not trigger a match, only basic metadata is collected (`logical_name`, `checksum`, `file_size`, `locations`, and `create_date`)
    - There are a couple of hard-coded "anti-patterns" used for rejecting known false positives (see code)
2. Embedded Filename-Metadata Extraction
    - The filename is parsed using one of MANY (around a thousand) tokenizing regex patterns for the best match (greedy matching)
    - If the filename does not trigger a match, the function raises an exception (the script exits). This probably indicates that a new pattern needs to be added to the list.
    - See `indexer.metadata.sim.filename_patterns`
3. Processing-Level Detection
    - The filename is parsed for substrings corresponding to a processing level
    - See `DataSimI3FileMetadata.figure_processing_level()`
    - If there is no match, `processing_level` will be set to `None`, since the processing level is less important for simulation data.
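A minimal sketch of the best-match selection in stage 2, assuming "best" means the pattern that captures the most tokens -- the actual selection logic and the ~1,000 real patterns live in `indexer.metadata.sim.filename_patterns`, and the two patterns below are invented examples:

```python
import re

# A tiny, hypothetical stand-in for the ~1,000 tokenizing patterns in
# indexer.metadata.sim.filename_patterns.
SIM_PATTERNS = [
    re.compile(r"(?P<gen>\w+)\.(?P<dataset>\d+)\.(?P<job>\d+)\.i3\.zst$"),
    re.compile(r"Level(?P<level>\d)_(?P<gen>\w+)\.(?P<dataset>\d+)\.(?P<job>\d+)\.i3\.zst$"),
]

def best_match(filename: str) -> dict:
    # Of all patterns that match, keep the one capturing the most tokens
    # (an assumption for this sketch -- see the package for the real logic).
    candidates = [m.groupdict() for p in SIM_PATTERNS if (m := p.search(filename))]
    if not candidates:
        # no pattern matched: a new pattern probably needs to be added
        raise ValueError(f"new pattern needed for: {filename}")
    return max(candidates, key=len)

tokens = best_match("Level2_nugen.021145.000000.i3.zst")
# both patterns match, but the second captures more tokens:
# tokens["gen"] == "nugen", tokens["level"] == "2"
```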
## Metadata Schema

See the Google Doc; also: [File Catalog Types](https://github.com/WIPACrepo/file_catalog/blob/master/file_catalog/schema/types.py)
## Warnings

### Re-indexing Files is Tricky (Two Scenarios)

- Indexing files that have not changed locations is okay--this probably means that the rest of the metadata has also not changed. A guardrail query will check whether the file exists in the FC with that `locations` entry, and will not process the file further.
- HOWEVER, don't point the indexer at restored files (of the same file-version)--those that had their initial `locations` entry removed (i.e. removed from WIPAC, then moved back). Unlike re-indexing an unchanged file, such a file will be fully processed locally (opened, read, and check-summed) before hitting the checksum conflict and aborting. These files will be skipped (not sent to the FC), unless you use `--patch` (which replaces the `locations` list wholesale), which is DANGEROUS.
    - Example conflict: it's possible that a file-version exists in the FC after passing the initial guardrails:
        1. the file was at WIPAC & indexed
        2. then moved to NERSC (`location` added) & deleted from WIPAC (`location` removed)
        3. the file was brought back to WIPAC
        4. now it is being re-indexed at WIPAC
        5. CONFLICT -> it has the same `logical_name` + `checksum.sha512` but differing `locations`
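The two scenarios above can be modeled with a toy in-memory catalog. The real checks are REST queries against the File Catalog; the record fields, paths, and function name here are illustrative only:

```python
# Toy model of the re-indexing guardrails: a cheap locations check first,
# then (after expensive local check-summing) a checksum-conflict check.
def guardrail(catalog: list, logical_name: str, location: dict, checksum: str) -> str:
    # Cheap guardrail: record already has this exact locations entry -> skip early
    for rec in catalog:
        if rec["logical_name"] == logical_name and location in rec["locations"]:
            return "skip: already indexed at this location"
    # Otherwise the file is fully processed locally (opened, read, check-summed)
    # before this comparison can happen:
    for rec in catalog:
        if rec["logical_name"] == logical_name and rec["checksum"] == checksum:
            # same file-version, differing locations -> skipped unless --patch
            return "conflict: same logical_name + checksum, differing locations"
    return "index: new file-version"

# A record whose WIPAC location was removed after a move to NERSC:
catalog = [{
    "logical_name": "/data/exp/.../file.i3.zst",
    "checksum": "sha512...",
    "locations": [{"site": "NERSC", "path": "/tape/.../file.i3.zst"}],
}]
wipac_loc = {"site": "WIPAC", "path": "/data/exp/.../file.i3.zst"}
result = guardrail(catalog, "/data/exp/.../file.i3.zst", wipac_loc, "sha512...")
# hits the conflict branch: the restored file is skipped
```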