Implement filesystem scanners and landing zones
iRODS Automated Ingest Framework
The automated ingest framework gives iRODS an enterprise solution that solves two major use cases: getting existing data under management and ingesting incoming data hitting a landing zone.
Based on the Python iRODS Client and Celery, this framework can scale up to match the demands of data coming off instruments, satellites, or parallel filesystems.
The example diagrams below show a filesystem scanner and a landing zone.
Usage options
Redis options
option | effect | default |
---|---|---|
redis_host | Domain or IP address of Redis host | localhost |
redis_port | Port number for Redis | 6379 |
redis_db | Redis DB number to use for ingest | 0 |
S3 options
To scan an S3 bucket, the minimum requirements are --s3_keypair and a source path of the form /bucket_name/path/to/root/folder.
option | effect | default |
---|---|---|
s3_keypair | path to S3 keypair file | None |
s3_endpoint_domain | S3 endpoint domain | s3.amazonaws.com |
s3_region_name | S3 region name | us-east-1 |
s3_proxy_url | URL to proxy for S3 access | None |
s3_insecure_connection | Do not use SSL when connecting to S3 endpoint | False |
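As a rough illustration of the source path form described above, the sketch below splits such a path into a bucket name and a key prefix. The helper name split_s3_source_path is hypothetical and is not part of the framework; it only shows how /bucket_name/path/to/root/folder decomposes.

```python
def split_s3_source_path(source_path):
    """Split '/bucket_name/path/to/root/folder' into (bucket, prefix).

    Hypothetical helper for illustration only; not the framework's own parser.
    """
    if not source_path.startswith("/"):
        raise ValueError("expected a path of the form /bucket_name/path/to/root/folder")
    # The first path component is the bucket; the remainder is the key prefix.
    bucket, _, prefix = source_path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError("missing bucket name")
    return bucket, prefix


bucket, prefix = split_s3_source_path("/bucket_name/path/to/root/folder")
print(bucket)  # bucket_name
print(prefix)  # path/to/root/folder
```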
Logging/Profiling options
option | effect | default |
---|---|---|
log_filename | Path to output file for logs | None |
log_level | Minimum level of message to log | None |
log_interval | Time interval with which to rollover ingest log file | None |
log_when | Type/units of log_interval (see TimedRotatingFileHandler) | None |
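The table points at Python's TimedRotatingFileHandler for the meaning of log_interval and log_when. The sketch below shows how those two values would map onto the handler's interval and when arguments, assuming the options are passed through unchanged (the framework's own wiring is not shown in this document). The function name make_rotating_logger and the logger name are made up for the example.

```python
import logging
import logging.handlers
import os
import tempfile


def make_rotating_logger(log_filename, log_level="INFO", log_when="h", log_interval=1):
    """Build a logger whose file rolls over every `log_interval` units of `log_when`."""
    handler = logging.handlers.TimedRotatingFileHandler(
        log_filename, when=log_when, interval=log_interval
    )
    logger = logging.getLogger("ingest_sketch")  # hypothetical logger name
    logger.setLevel(log_level)  # messages below this level are dropped
    logger.addHandler(handler)
    return logger, handler


log_path = os.path.join(tempfile.mkdtemp(), "ingest.log")
# Roll the log over every 30 minutes and keep WARNING and above.
logger, handler = make_rotating_logger(
    log_path, log_level="WARNING", log_when="m", log_interval=30
)
logger.warning("recorded: at or above the minimum level")
logger.info("dropped: below the minimum level")
handler.close()
```

Valid log_when values ('s', 'm', 'h', 'd', 'midnight', and the weekday forms) are listed in the standard library documentation for the handler.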
--profile allows you to use vis to visualize a profile of the Celery workers over the course of an ingest job.
option | effect | default |
---|---|---|
profile_filename | Specify name of profile filename (JSON output) | None |
profile_level | Minimum level of message to log for profiling | None |
profile_interval | Time interval with which to rollover ingest profile file | None |
profile_when | Type/units of profile_interval (see TimedRotatingFileHandler) | None |
Ingest start options
These options are used at the "start" of an ingest job.
option | effect | default |
---|---|---|
job_name | Reference name for ingest job | a generated uuid |
interval | Restart interval (in seconds). If absent, will only sync once. | None |
file_queue | Name for the file queue. | file |
path_queue | Name for the path queue. | path |
restart_queue | Name for the restart queue. | restart |
event_handler | Path to event handler file | None (see "event_handler methods" below) |
synchronous | Block until sync job is completed | False |
progress | Show progress bar and task counts (must have --synchronous flag) | False |
ignore_cache | Ignore last sync time in cache - like starting a new sync | False |
Optimization options
option | effect | default |
---|---|---|
exclude_file_type | types of files to exclude: regular, directory, character, block, socket, pipe, link | None |
exclude_file_name | a list of space-separated python regular expressions defining the file names to exclude such as "(\S+)exclude" "(\S+).hidden" | None |
exclude_directory_name | a list of space-separated python regular expressions defining the directory names to exclude such as "(\S+)exclude" "(\S+).hidden" | None |
files_per_task | Number of paths to process in a given task on the queue | 50 |
initial_ingest | Use this flag on initial ingest to avoid check for data object paths already in iRODS | False |
irods_idle_disconnect_seconds | Seconds to hold open iRODS connection while idle | 60 |
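To make exclude_file_name and files_per_task more concrete, here is a minimal sketch of filtering names by regular expression and then grouping the survivors into task-sized batches. It assumes patterns are applied with re.match against the bare file name, which this document does not specify, and filter_and_batch is a hypothetical helper rather than framework code.

```python
import re


def filter_and_batch(names, exclude_file_name=(), files_per_task=50):
    """Drop names matching any exclusion pattern, then split the rest into batches."""
    patterns = [re.compile(p) for p in exclude_file_name]
    # Assumption: a name is excluded when any pattern matches from its start.
    kept = [n for n in names if not any(p.match(n) for p in patterns)]
    return [kept[i:i + files_per_task] for i in range(0, len(kept), files_per_task)]


names = ["data.hidden", "tmp_exclude", "notes.txt"] + ["f%03d.dat" % i for i in range(120)]
batches = filter_and_batch(names, exclude_file_name=[r"(\S+)exclude", r"(\S+).hidden"])
print([len(b) for b in batches])  # [50, 50, 21]
```

Note that the dot in "(\S+).hidden" is unescaped, so it matches any single character, not only a literal period.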
Available --event_handler methods
method | effect | default |
---|---|---|
pre_data_obj_create | user-defined python | none |
post_data_obj_create | user-defined python | none |
pre_data_obj_modify | user-defined python | none |
post_data_obj_modify | user-defined python | none |
pre_coll_create | user-defined python | none |
post_coll_create | user-defined python | none |
pre_coll_modify | user-defined python | none |
post_coll_modify | user-defined python | none |
character_map | user-defined python | none |
as_user | takes action as this iRODS user | authenticated user |
target_path | set mount path on the irods server which can be different from client mount path | client mount path |
to_resource | defines target resource request of operation | as provided by client environment |
operation | defines the mode of operation | Operation.REGISTER_SYNC |
max_retries | defines max number of retries on failure | 0 |
timeout | defines seconds until job times out | 3600 |
delay | defines seconds between retries | 0 |
Event handlers can use logger to write logs. See structlog for available logging methods and signatures.
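A small event handler file combining a few of the methods above might look like the following sketch. The method signatures (session, meta, **options and hdlr_mod, logger, ...) are assumptions modeled on the project's examples directory and should be checked there; the resource name demoResc is a placeholder. The fallback classes exist only so the sketch runs without the package installed.

```python
try:
    from irods_capability_automated_ingest.core import Core
    from irods_capability_automated_ingest.utils import Operation
except ImportError:
    # Stand-ins (not the real classes) so the sketch runs without the package.
    from enum import Enum

    class Core:
        pass

    class Operation(Enum):
        REGISTER_SYNC = 0
        REGISTER_AS_REPLICA_SYNC = 1
        PUT = 2
        PUT_SYNC = 3
        PUT_APPEND = 4
        NO_OP = 5


class event_handler(Core):
    # Signatures below are assumed; see the examples directory for the real ones.
    @staticmethod
    def operation(session, meta, **options):
        # Copy new files and re-copy updated files in full.
        return Operation.PUT_SYNC

    @staticmethod
    def to_resource(session, meta, **options):
        return "demoResc"  # placeholder resource name

    @staticmethod
    def max_retries(hdlr_mod, logger, meta):
        return 3  # retry a failed task up to three times

    @staticmethod
    def delay(hdlr_mod, logger, meta, retries):
        return 5  # seconds to wait between retries

    @staticmethod
    def post_data_obj_create(hdlr_mod, logger, session, meta, **options):
        logger.info("data object created")
```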
Operation mode
operation | new files | updated files |
---|---|---|
Operation.REGISTER_SYNC (default) | registers in catalog | updates size in catalog |
Operation.REGISTER_AS_REPLICA_SYNC | registers first or additional replica | updates size in catalog |
Operation.PUT | copies file to target vault, and registers in catalog | no action |
Operation.PUT_SYNC | copies file to target vault, and registers in catalog | copies entire file again, and updates catalog |
Operation.PUT_APPEND | copies file to target vault, and registers in catalog | copies only appended part of file, and updates catalog |
Operation.NO_OP | no action | no action |
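The difference between PUT_SYNC and PUT_APPEND on updated files comes down to how much of the source file is read again. The sketch below illustrates the append idea only, using local files; it is not the framework's implementation, and read_appended_bytes is a hypothetical helper.

```python
import os
import tempfile


def read_appended_bytes(source_path, size_already_copied):
    """Return only the bytes added to source_path since the previous copy."""
    current_size = os.path.getsize(source_path)
    if current_size <= size_already_copied:
        return b""  # nothing new to transfer
    with open(source_path, "rb") as f:
        f.seek(size_already_copied)  # skip the part that was already copied
        return f.read()


# Simulate a file that is ingested once and then grows.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"first sync\n")
    source = f.name
copied = os.path.getsize(source)
with open(source, "ab") as f:
    f.write(b"appended later\n")

appended = read_appended_bytes(source, copied)
print(appended)  # b'appended later\n'
os.remove(source)
```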
--event_handler usage examples can be found in the examples directory.
Character Mapping option
If an application should require that iRODS logical paths produced by the ingest process exclude subsets of the range of possible Unicode characters, we can add a character_map method that returns a dict object. For example:
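import re
from irods_capability_automated_ingest.core import Core

class event_handler(Core):
    @staticmethod
    def character_map():
        # Each key (here a compiled regex) maps to its replacement string.
        return {
            re.compile('[^a-zA-Z0-9]'): '_'
        }
    # ...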
The returned dictionary, in this case, indicates that the ingest process should replace all non-alphanumeric (as well as non-ASCII) characters with an underscore wherever they may occur in an otherwise normally generated logical path. The substitution process also applies to the intermediate (i.e. collection name) elements in a logical path, and a suffix is appended to affected path elements to avoid potential collisions with other remapped object names.
Each key of the returned dictionary indicates a character or set of characters needing substitution. Possible key types include:
- character
# substitute backslashes with underscores
'\\': '_'
- tuple of characters
# any character of the tuple is replaced by a Unicode small script x
('\\','#','-'): '\u2093'
- regular expression
# any character outside of range 0-255 becomes an underscore
re.compile('[\u0100-\U0010ffff]'): '_'
- callable accepting a character argument and returning a boolean
# ASCII codes above 'z' become ':'
(lambda c: ord(c) in range(ord('z')+1,128)): ':'
In the event that the order-of-substitution is significant, the method may instead return a list of key-value tuples.
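The four key types and the ordered list form can be exercised with a small stand-alone sketch. This is not the framework's implementation (in particular it omits the collision-avoiding suffix mentioned above); apply_character_map is a hypothetical helper that applies each key-value pair in turn to one path element.

```python
import re


def apply_character_map(name, char_map):
    """Apply a character map (dict or ordered list of pairs) to one path element."""
    pairs = char_map.items() if isinstance(char_map, dict) else char_map
    for key, replacement in pairs:
        if isinstance(key, re.Pattern):
            name = key.sub(replacement, name)  # regular expression key
        elif isinstance(key, str):
            name = name.replace(key, replacement)  # single character key
        elif isinstance(key, tuple):
            for ch in key:  # tuple of characters
                name = name.replace(ch, replacement)
        elif callable(key):
            # predicate key: replace every character for which it returns True
            name = "".join(replacement if key(c) else c for c in name)
    return name


ordered_map = [
    (re.compile("[\u0100-\U0010ffff]"), "_"),
    (("\\", "#", "-"), "\u2093"),
    (lambda c: ord(c) in range(ord("z") + 1, 128), ":"),
]
result = apply_character_map("a#b\u4e2d{c}", ordered_map)
print(result == "a\u2093b_:c:")  # True
```

Order matters here: if the tuple rule ran before the regular expression, the inserted U+2093 characters would themselves be replaced by underscores.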
UnicodeEncodeError
Any file whose filesystem path causes a UnicodeEncodeError exception to be raised during ingest (e.g. because it includes an unencodable UTF-8 sequence) will be automatically renamed using a base-64 sequence to represent the original, unmodified vault path.
Additionally, data objects that have had their names remapped, whether pro forma or via a UnicodeEncodeError, will be annotated with an AVU of the form
Attribute: "irods::automated_ingest::" + ANNOTATION_REASON
Value: A PREFIX plus the base64-converted "bad filepath"
Units: "python3.base64.b64encode(full_path_of_source_file)"
where:
- ANNOTATION_REASON is either "UnicodeEncodeError" or "character_map" depending on why the remapping occurred.
- PREFIX is either "irods_UnicodeEncodeError_" or blank(""), again depending on the re-mapping cause.
Note that the UnicodeEncodeError type of remapping is unconditional, whereas the character remapping is contingent on an event handler's character_map method being defined. Also, if a UnicodeEncodeError-style ingest is performed on a given object, this precludes character mapping being done for the object.
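The annotation described above can be pictured as an (attribute, value, units) triple. The sketch below assembles one from a source path, assuming the path is turned into bytes with os.fsencode before base64 encoding; the exact byte conversion used by the framework is not stated here, and remap_annotation is a hypothetical helper.

```python
import base64
import os

ATTRIBUTE_PREFIX = "irods::automated_ingest::"
UNITS = "python3.base64.b64encode(full_path_of_source_file)"


def remap_annotation(source_path, reason):
    """Return the (attribute, value, units) triple for a remapped data object."""
    if reason == "UnicodeEncodeError":
        value_prefix = "irods_UnicodeEncodeError_"
    elif reason == "character_map":
        value_prefix = ""
    else:
        raise ValueError("unknown annotation reason: %r" % (reason,))
    # Assumption: the path is converted to bytes with os.fsencode.
    encoded = base64.b64encode(os.fsencode(source_path)).decode("ascii")
    return ATTRIBUTE_PREFIX + reason, value_prefix + encoded, UNITS


attribute, value, units = remap_annotation("/data/in/report#1.txt", "character_map")
print(attribute)  # irods::automated_ingest::character_map
print(base64.b64decode(value))  # b'/data/in/report#1.txt'
```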
Manual Deployment
Configure python-irodsclient environment
python-irodsclient (PRC) is used by the Automated Ingest tool to interact with iRODS. The configuration and client environment files used for a PRC application apply here as well.
If you are using PAM authentication, remember to use the Client Settings File.
iRODSSessions are instantiated using an iRODS client environment file. The client environment file used can be controlled with the IRODS_ENVIRONMENT_FILE environment variable. If no such environment variable is set, the file is expected to be found at ${HOME}/.irods/irods_environment.json. A secure connection can be made by making the appropriate configurations in the client environment file.
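The lookup order for the client environment file can be sketched in a few lines. The function name client_environment_file is hypothetical; the environment variable and default location are the ones named above.

```python
import os


def client_environment_file(environ=os.environ):
    """Return the client environment file path, honoring IRODS_ENVIRONMENT_FILE."""
    default = os.path.join(os.path.expanduser("~"), ".irods", "irods_environment.json")
    return environ.get("IRODS_ENVIRONMENT_FILE", default)


print(client_environment_file({"IRODS_ENVIRONMENT_FILE": "/etc/irods/ingest_env.json"}))
# /etc/irods/ingest_env.json
print(client_environment_file({}))  # the default under the current user's home directory
```

With python-irodsclient installed, a path obtained this way can be handed to iRODSSession through its irods_env_file argument.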
Starting Redis Server
Install Redis server: https://redis.io/docs/latest/get-started
Starting the Redis server with package installation:
redis-server
Or, daemonized:
sudo service redis-server start
sudo systemctl start redis
The Redis GitHub page also describes how to build and run Redis: https://github.com/redis/redis?tab=readme-ov-file#running-redis
The Redis documentation also recommends an additional step:
Make sure to set the Linux kernel overcommit memory setting to 1. Add vm.overcommit_memory = 1 to /etc/sysctl.conf and then reboot or run the command sysctl vm.overcommit_memory=1 for this to take effect immediately.
This allows the Linux kernel to overcommit virtual memory even if this exceeds the physical memory on the host machine. See kernel.org documentation for more information.
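On a Linux host, the current value of the setting can be read back from /proc before starting an ingest. The following is a minimal sketch; the function name is made up, and the path parameter exists only so the check can be pointed at another file.

```python
import os


def overcommit_memory_setting(path="/proc/sys/vm/overcommit_memory"):
    """Read the kernel's vm.overcommit_memory value as an integer."""
    with open(path) as f:
        return int(f.read().strip())


if os.path.exists("/proc/sys/vm/overcommit_memory"):
    setting = overcommit_memory_setting()
    print("vm.overcommit_memory =", setting)
    if setting != 1:
        print("Redis recommends setting vm.overcommit_memory to 1")
else:
    print("no /proc/sys/vm/overcommit_memory here; nothing to check")
```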
Note: If running in a distributed environment, make sure Redis server accepts connections by editing the bind line in /etc/redis/redis.conf or /etc/redis.conf.
Setting up virtual environment
You may need to upgrade pip:
pip install --upgrade pip
Install virtualenv:
pip install virtualenv
Create a virtualenv with python3:
virtualenv -p python3 rodssync
Activate virtual environment:
source rodssync/bin/activate
Install this package
pip install irods_capability_automated_ingest
Set up environment for Celery:
export CELERY_BROKER_URL=redis://<redis host>:<redis port>/<redis db> # e.g. redis://127.0.0.1:6379/0
export PYTHONPATH=`pwd`
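The broker URL is built from the same three values as the Redis options listed earlier. As a small sketch (celery_broker_url is a hypothetical helper), the defaults from that table produce the example URL shown in the comment above:

```python
def celery_broker_url(redis_host="localhost", redis_port=6379, redis_db=0):
    """Format a Redis broker URL of the form redis://<host>:<port>/<db>."""
    return "redis://{}:{}/{}".format(redis_host, redis_port, redis_db)


print(celery_broker_url("127.0.0.1"))  # redis://127.0.0.1:6379/0
```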
Start celery worker(s):
celery -A irods_capability_automated_ingest.sync_task worker -l error -Q restart,path,file -c <num workers>
Note: Make sure queue names match those of the ingest job (default queue names shown here).
Using the sync script
Start sync job
python -m irods_capability_automated_ingest.irods_sync start <source dir> <destination collection>
List jobs
python -m irods_capability_automated_ingest.irods_sync list
Stop jobs
python -m irods_capability_automated_ingest.irods_sync stop <job name>
Watch jobs (same as using --progress)
python -m irods_capability_automated_ingest.irods_sync watch <job name>
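When the sync script is driven from another Python program, the start command can be assembled as an argument list. The sketch below is a hypothetical wrapper, not part of the package; it uses only the --synchronous and --progress flags named in this document and enforces the stated rule that --progress requires --synchronous. The example paths are placeholders.

```python
def irods_sync_start_command(source_dir, destination_collection,
                             synchronous=False, progress=False):
    """Assemble an `irods_sync start` command line as an argument list."""
    if progress and not synchronous:
        raise ValueError("--progress requires --synchronous")
    cmd = [
        "python", "-m", "irods_capability_automated_ingest.irods_sync",
        "start", source_dir, destination_collection,
    ]
    if synchronous:
        cmd.append("--synchronous")
    if progress:
        cmd.append("--progress")
    return cmd


# Placeholder source directory and destination collection.
start_cmd = irods_sync_start_command(
    "/data/landing_zone", "/tempZone/home/rods/landing_zone",
    synchronous=True, progress=True,
)
print(" ".join(start_cmd))
```

The list can be handed to subprocess.run; nothing is executed by the sketch itself.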
Run tests
Note: The tests start and stop their own Celery workers, and they assume a clean Redis database.
python -m irods_capability_automated_ingest.test.test_irods_sync
See docker/ingest-test/README.md for how to run tests with Docker Compose.
Hashes for irods-capability-automated-ingest-0.5.0.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | d1b1cc8d36adb50438f56b38fb71a0ec44ac9b9c6f3bcb1a64d8e5b20289bedf |
MD5 | 6627e39d79f9d4984cafda5fd7224dc6 |
BLAKE2b-256 | 20d9328a51a5cf4f77f481b436b90193913646faf2ca8706cf8a4c4787490102 |
Hashes for irods_capability_automated_ingest-0.5.0-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | d6351f72fd3905bdad97a387033a306f89416af5e9f414e15a79d1ba9ca07273 |
MD5 | 7a1d67339910867674cea5d6f5867ac5 |
BLAKE2b-256 | c4eb46356a0e687e92478d96e35d4f23d6f2889baef5cd96f0282c5c935290f8 |