Data aggregator
Project description
Lochness: Sync data from all over the cloud to a local directory
Lochness is a data management tool designed to periodically poll and download data from various data archives into a local directory. This is often referred to as building a "data lake" (hence the name).
Out of the box there is support for pulling data down from Beiwe, XNAT, REDCap, Dropbox, external hard drives, and more. Extending Lochness to support new services is also a fairly simple process.
Table of contents
Installation
Just use pip
pip install lochness
For most recent DPACC-lochness
pip install git+https://github.com/PREDICT-DPACC/lochness
For debugging
cd ~
git clone https://github.com/PREDICT-DPACC/lochness
pip install -r ~/lochness/requirements.txt
export PATH=${PATH}:~/lochness/scripts # add to ~/.bashrc
export PYTHONPATH=${PYTHONPATH}:~/lochness # add to ~/.bashrc
Running test
- Copy the token template, and add the information for each module.
cd lochness/tests
cp token_template_for_test_template.csv token_template_for_test.csv
- Run test
bash run_test.sh
Setup from a template
Creating the template
Setting up lochness from scratch could be slightly confusing in the beginning.
Try using the lochness_create_template.py
to create a starting point.
Create an example template to easily structure the lochness system
# ProNET
lochness_create_template.py \
--outdir /data/lochness_root \
--studies PronetLA PronetSL PronetWU \
--sources redcap xnat box mindlamp \
--email kevincho@bwh.harvard.edu \
--poll_interval 43200 \
--ssh_host erisone.partners.org \
--ssh_user kc244 \
--lochness_sync_send \
--s3
# PRESCIENT
lochness_create_template.py \
--outdir /data/lochness_root \
--studies PrescientAD PrescientME PrescientPE \
--sources RPMS mediaflux mindlamp \
--email kevincho@bwh.harvard.edu \
--poll_interval 43200 \
--ssh_host erisone.partners.org \
--ssh_user kc244 \
--lochness_sync_send \
--s3
# For more options: lochness_create_template.py -h
Making edits to the template
Running one of the commands above will create the structure below
/data/lochness_root/
├── 1_encrypt_command.sh
├── 2_sync_command.sh
├── PHOENIX
│ ├── GENERAL
│ │ ├── PronetLA
│ │ │ └── PronetLA_metadata.csv
│ │ ├── PronetSL
│ │ │ └── PronetSL_metadata.csv
│ │ └── PronetWU
│ │ └── PronetWU_metadata.csv
│ └── PROTECTED
│ ├── PronetLA
│ ├── PronetSL
│ └── PronetWU
├── config.yml
├── lochness.json
└── pii_convert.csv
-
Change information in
config.yml
andlochness.json
as needed. -
Either manually update the
PHOENIX/GENERAL/*/*_metadata.csv
or amend the field names in REDCap / RPMS sources correctly for lochness to automatically update the metadata files.Currently, lochness initializes the metadata using the following field names in REDCap and RPMS.
chric_subject_id
: the record ID field name- this field name must be in the REDCap or RPMS repository for the metadata to be updated by lochness.
chric_consent_date
: the field name of the consent date- this field name must be in the REDCap or RPMS repository for the metadata to be updated by lochness.
beiwe_id
: the field name of the BEIWE ID.xnat_id
: the field name of the XNAT ID.dropbox_id
: the field name of the Dropbox ID.box_id
: the field name of the Box ID.mediaflux_id
: the field name of the Mediaflux ID.mindlamp_id
: the field name of the Mindlamp ID.daris_id
: the field name of the DaRIS ID.rpms_id
: the field name of the RPMS ID.
- Encrypt the
lochness.json
by running
cd /data/lochness_root
bash 1_encrypt_command.sh
This encryption step creates a copy of encrypted keyrings to
/data/lochness_root/.lochness.enc
. To protect the sensitive keyring
information in json, remove the lochness.json
after running the encryption.
You can still extract keyring structure without sensitive information by running
lochness_check_config.py -ke /data/lochness_root/.lochness.enc
-
Set up REDCap Data Entry Trigger if using REDCap. Please see below "REDCap Data Entry Trigger capture" section.
-
Edit Personally identifiable information mapping table. Please seee below "Personally identifiable information removal from REDCap and RPMS data"
/data/lochness_root/pii_convert.csv
- Run the
sync.py
or use the example command in2_sync_command.sh
bash 2_sync_command.sh
To use lochness_to_lochness
transfer through aws s3
- Set up s3 bucket
- Install aws CLI
- Configure CLI with your s3 bucket information
$ aws configure
- Add your AWS information to
config.yml
AWS_BUCKET_NAME: ampscz-dev
AWS_BUCKET_ROOT: TEST_PHOENIX_ROOT
REDCap Data Entry Trigger capture
If your sources include REDCap and you would like to configure lochness to only pull new REDCap data, "Data Entry Trigger" needs to be set up in REDCap.
In REDCap,
- "Project Setup"
- "Enable optional modules and customizations"
- "Additional customizations"
- Check "Data Entry Trigger" and give address of the server including the port number e.g. http://pnl-t55-7.partners.org:9999
In order to use this functionality, the server where lochness is installed should be able to receieve HTTP POST signal from REDCap server. Which means it has to be either
- lochness server is inside the same firewall as REDCap server. Or
- lochness server has a open port that could listen to the REDCap POST signal.
After setting the "Data Entry Trigger" on REDCap settings, run below to update
the /data/data_entry_trigger_db.csv
real-time
# please specify the same port defined in the REDCap settings
listen_to_redcap.py --database_csv /data/data_entry_trigger_db.csv \
--port 9999
It would be useful to run listen_to_redcap.py
in background, maybe inside a
gnu screen
so it runs continuously without interference.
Personally identifiable information removal from REDCap and RPMS data
A path of csv file can be provided, which has information about how to process each PII fields.
For example
/data/personally_identifiable_process_mappings.csv
pii_label_string | process
-----------------|---------------
address | remove
date | change_date
phone_number | random_number
patient_name | random_string
subject_name | replace_with_subject_id
Any value from the field, with names that match to pii_label_string
rows,
the labelled PII processing method will be used to process the raw values
to remove or replace the PIIs.
Documentation
You can find all the documentation you will ever need here
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file ampscz-lochness-0.1.6.tar.gz
.
File metadata
- Download URL: ampscz-lochness-0.1.6.tar.gz
- Upload date:
- Size: 64.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.8.11
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 8ca17d01739ca9b34ba525408327eec61c47fa09b0307a3f548e25625f47ee40 |
|
MD5 | 7b3f87795319d1f4cdaa02548eaa5167 |
|
BLAKE2b-256 | 4d6dac1325445174daeb7dfd482229ebf9e116a36f4d9ccb1f0bcfc5df3f31d7 |