git annex special remote for Globus

git-annex-remote-globus adds to git-annex the ability to retrieve files which are available through Globus.

  • Note: the code is not yet distributed, so the usage information below is likely to change soon.

Requirements

  • DataLad
  • git-annex

Setup

  1. Clone this repository: git clone https://github.com/CONP-PCNO/git-annex-remote-globus.git

  2. In a different folder, install the following dataset: datalad install https://github.com/conpdatasets/FRDR-multimodal.git and then cd FRDR-multimodal. From now on, all commands are run from this dataset directory.

  3. Add the directory containing git-annex-remote-globus to your PATH.

  4. Create a virtual environment and install the requirements. Use the full path to requirements.txt so that it is not confused with a file of the same name in another repository: pip install -r path/to/git-annex-remote-globus/requirements.txt

  5. In the repository, run git-annex-remote-globus setup and follow the instructions to authenticate. Gmail and ORCID logins are supported.

  6. Add a remote for Globus. In the command below:

    • The git-annex remote is named globus
    • encryption must be set to none for now
    • The -d option can be added to enable verbose output
    • endpoint is the name of the Globus endpoint that hosts the dataset, so that Globus can find it
    • fileprefix is the location under which your files are stored in Globus
git annex initremote globus type=external externaltype=globus encryption=none endpoint=FRDR_Prod_2 fileprefix=/5/published/publication_170/submitted_data/

To debug, run the git annex initremote command with the --debug option.

The globus remote is now ready to use. See the Usage section below.
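
As a rough sketch of step 6, the initremote call can be wrapped in a small shell function so that the endpoint and file prefix are passed as arguments. The function name globus_initremote_cmd is hypothetical and not part of git-annex-remote-globus; it only prints the command so that it can be reviewed before it is run.

```shell
# Hypothetical helper (an assumption, not shipped with this project):
# print the initremote command for a given endpoint and file prefix.
# $1 = Globus endpoint name, $2 = file prefix on that endpoint
globus_initremote_cmd() {
    endpoint=$1
    fileprefix=$2
    printf 'git annex initremote globus type=external externaltype=globus encryption=none endpoint=%s fileprefix=%s\n' \
        "$endpoint" "$fileprefix"
}

# Example with the values used in this README:
#   globus_initremote_cmd FRDR_Prod_2 /5/published/publication_170/submitted_data/
```

Printing the command rather than executing it keeps the sketch safe to try outside a dataset; pipe the output to sh from the dataset root once it looks right.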

Options

Options specific to git-annex-remote-globus

  • fileprefix - The path to the folder that will be used for the remote
  • endpoint - Globus endpoint name where the dataset lives

General git-annex options

  • encryption - One of "none", "hybrid", "shared", or "pubkey". See encryption. (Must be set to none for now)
  • mac - The MAC algorithm. See encryption. (Not supported for now)
  • exporttree - Set to yes to make this special remote usable by git-annex-export. It will not be usable as a general-purpose special remote. (Not supported for now)
  • chunk - Enables chunking when storing large files. (Optional)

Using an existing remote (note on repository layout)

If the globus remote was previously initialized, it can be re-enabled by running git annex enableremote globus. See additional information here.

Choosing a Chunk Size

Choose your chunk size based on your needs. With a chunk size below the maximum file size that your storage provider supports for uploads and downloads, you won't need to worry about file size limits. Smaller chunks leak less information about the sizes of the files in your repository, require less RAM, and require less data to be re-transmitted when network connectivity is interrupted. Larger chunks require fewer round trips to and from your provider and may be faster. Additional discussion about chunk size can be found here.

Usage

To understand how the globus remote works, we can use one file as an example. The file content is not yet available locally: only a symlink generated by git-annex points to it. Run the following to display the symlink, whose target includes the file hash:

cd FRDR-multimodal
ls -l 2015_11_18_cortex/mask/mask.mat

The last command shows the symlink, whose target contains the MD5 hash of the file content. At this point, the globus remote does not know anything about this file, as you can see by running the following command:

git annex whereis 2015_11_18_cortex/mask/mask.mat

Globus is indeed not listed. We now need to tell the globus remote that the file exists, using a key derived from the file hash. When the globus remote was initialized, git-annex gave it a location ID, as it does for every remote it communicates with. We can find the Globus location ID by running the following command from the dataset root:

cat .git/config

This file shows the remote "globus" which we just initialized, along with the location ID (annex-uuid) that git-annex assigned to it. We now need to make this location aware of the file we want to retrieve.
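
Instead of reading the whole config file by eye, the location ID can be queried directly with git config. The sketch below assumes the remote is named globus, as in the setup above; the function name annex_remote_uuid is hypothetical.

```shell
# Hypothetical helper: print the annex UUID recorded for a remote.
# $1 = remote name (default: globus)
# $2 = path to the git config file (default: .git/config)
annex_remote_uuid() {
    remote=${1:-globus}
    config=${2:-.git/config}
    # git-annex stores the UUID under [remote "<name>"] as annex-uuid
    git config --file "$config" --get "remote.$remote.annex-uuid"
}

# From the dataset root:
#   annex_remote_uuid
```

The printed value is what the commands in the rest of this section refer to as the annex-uuid.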

To register the file key with that location, run:

git annex setpresentkey <file_hash> <annex-uuid> 1

For example, for the file we are working with, 2015_11_18_cortex/mask/mask.mat, we would run:

git annex setpresentkey MD5E-s572--1e5e0b0c5896d16ac14170c8f546d4e1.mat 056ae102-61ce-4417-9180-b45eecc45082 1

The 1 at the end records that the file with the given key is present in the globus remote. A 0 would remove that record.
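
The key passed to setpresentkey can be read from the symlink rather than copied by hand, since it is the last path component of the symlink target. This is a minimal sketch, assuming the file is a git-annex symlink as in this dataset; the function name annex_key_from_symlink is hypothetical.

```shell
# Hypothetical helper: print the git-annex key that a symlinked file points at.
# $1 = path to the symlink inside the dataset
annex_key_from_symlink() {
    # The target looks like .../.git/annex/objects/<dir>/<dir>/<key>/<key>
    target=$(readlink "$1") || return 1
    basename "$target"
}

# From the dataset root:
#   annex_key_from_symlink 2015_11_18_cortex/mask/mask.mat
# Inside a git-annex repository, `git annex lookupkey <file>` prints the same key.
```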

Now, to make sure globus knows about this file, we can run the command below.

git annex whereis 2015_11_18_cortex/mask/mask.mat

Next, we register a URL for the file key. This connects the dots by telling git-annex where in Globus the file is located, so that it can be retrieved. The Globus URL contains the endpoint name and the fileprefix:

git annex registerurl MD5E-s572--1e5e0b0c5896d16ac14170c8f546d4e1.mat globus://frdr_prod_2/5/published/publication_170/submitted_data/2015_11_18_cortex/mask/mask.mat
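
The URL above follows the pattern globus://<endpoint><fileprefix><path in dataset>. The sketch below assembles it from those three parts. The function name globus_url is hypothetical, and lowercasing the endpoint name is an assumption made only to match the example URL in this README.

```shell
# Hypothetical helper: build a globus:// URL for a file in the dataset.
# $1 = endpoint name, $2 = fileprefix, $3 = file path relative to the dataset root
globus_url() {
    # Assumption: the endpoint is lowercased, as in the example URL above
    endpoint=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    fileprefix=${2%/}   # drop one trailing slash, if present
    relpath=${3#/}      # drop one leading slash, if present
    printf 'globus://%s%s/%s\n' "$endpoint" "$fileprefix" "$relpath"
}

# Example with the values used in this README:
#   globus_url FRDR_Prod_2 /5/published/publication_170/submitted_data/ 2015_11_18_cortex/mask/mask.mat
```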

At this point we can finally obtain the file by running the command below:

git annex get 2015_11_18_cortex/mask/mask.mat

This is the point where the file becomes available on your machine.

You can run whereis again to check that:

git annex whereis 2015_11_18_cortex/mask/mask.mat

Moreover, you can always run the following command, again using the annex-uuid, to make sure the file in Globus has not been modified compared to your previously downloaded version:

git annex checkpresentkey MD5E-s572--1e5e0b0c5896d16ac14170c8f546d4e1.mat 056ae102-61ce-4417-9180-b45eecc45082

It will return Success if the file in Globus has not changed.
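
In a script, the result of checkpresentkey is easier to act on through its exit status. According to the git-annex documentation, the command exits with 0 when the key is verified present in the remote, 1 when it is verified absent, and 100 when the check could not be completed. The wrapper below is a minimal sketch under that assumption; the function name check_globus_key is hypothetical.

```shell
# Hypothetical helper: report whether a key is present in a remote.
# $1 = git-annex key, $2 = annex-uuid of the remote
check_globus_key() {
    key=$1
    uuid=$2
    # Capture the exit status without aborting under `set -e`
    git annex checkpresentkey "$key" "$uuid" && status=0 || status=$?
    case $status in
        0) echo "present" ;;
        1) echo "not present" ;;
        *) echo "could not verify (exit status $status)" ;;
    esac
    return "$status"
}

# From the dataset root:
#   check_globus_key MD5E-s572--1e5e0b0c5896d16ac14170c8f546d4e1.mat 056ae102-61ce-4417-9180-b45eecc45082
```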

Issues, Contributing

If you run into any problems, please check for issues on GitHub. Please submit a pull request or create a new issue for problems or potential improvements.

License
