Big Data Smart Socket client
Scientific datasets have grown so large that researchers often cannot store all of their data at the compute site where they process it. Data transfer has therefore become a key consideration in experimental design. In response, scientific data repositories such as NCBI have begun to offer dedicated data transfer machines and advanced transfer clients. Even so, many researchers keep to familiar but suboptimal practices, such as using slow transfer clients (a web browser or scp) or transferring data over wireless networks.
BDSS (Big Data Smart Socket) aims to alleviate this problem by shifting the burden of learning about alternative file mirrors, transfer clients, and tuning parameters from the end-user researcher to a group of “data curators”. It consists of three parts:
Components
- Metadata repository
  - Central database managed by data curators
  - Matches patterns of data file URLs and maps them to alternate sources
  - Includes information about the transfer tool to use to retrieve the data
- BDSS transfer client
  - Consumes information from metadata repository
  - Invokes transfer tools
  - Reports analytics to metadata repository
- Integration as a Galaxy data transfer tool
Get Started
Moving data with the BDSS client:
Setting up a new metadata repository
Examples
All examples here require a metadata repository configured to support them. The default metadata repository at https://bdss.bioinfo.wsu.edu/ supports these examples, and the necessary configuration is listed with each one.
NCBI Sequence Read Archive (SRA)
NCBI makes files available for transfer using Aspera Connect, a tool with “improved data transfer characteristics” compared with FTP or HTTP. If ascp is installed on your machine, BDSS can build the appropriate ascp command for you.
Without BDSS:
ascp -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh -T anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR039/SRR039885/SRR039885.sra ./
With BDSS:
bdss transfer -u 'ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR039/SRR039885/SRR039885.sra'
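The hand-written ascp invocation above has a fixed shape, so assembling it from a URL is mechanical. The following is a minimal Python sketch of that step, not BDSS's actual implementation: the function name `build_ascp_command` and its parameters are hypothetical, and the caller must supply the path to the Aspera key file.

```python
from urllib.parse import urlsplit


def build_ascp_command(url, key_path, username="anonftp",
                       disable_encryption=True, destination="./"):
    """Build an ascp argument list for the host and path found in `url`.

    Hypothetical helper for illustration; the defaults mirror the manual
    command shown above (anonymous user, -T to disable encryption).
    """
    parts = urlsplit(url)
    command = ["ascp", "-i", key_path]
    if disable_encryption:
        command.append("-T")
    # ascp addresses remote files as user@host:/path
    command.append(f"{username}@{parts.hostname}:{parts.path}")
    command.append(destination)
    return command
```

Given the FTP URL from this example and the key path from the manual command, the sketch produces the same arguments as the "Without BDSS" invocation. The resulting list could be passed to `subprocess.run`.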
Metadata repository configuration:
{
  "data_sources": [
    {
      "description": "",
      "label": "NCBI Sequence Read Archive with FTP",
      "test_files": [],
      "transfer_mechanism": {
        "options": {},
        "type": "curl"
      },
      "transforms": [
        {
          "for_destinations": [],
          "options": {
            "new_scheme": "aspera"
          },
          "target": "NCBI Sequence Read Archive with Aspera",
          "type": "change_scheme"
        }
      ],
      "url_matchers": [
        {
          "options": {
            "pattern": "^ftp://ftp\\.ncbi\\.nlm\\.nih\\.gov/sra"
          },
          "type": "regular_expression"
        }
      ]
    },
    {
      "description": "",
      "label": "NCBI Sequence Read Archive with Aspera",
      "test_files": [],
      "transfer_mechanism": {
        "options": {
          "disable_encryption": true,
          "username": "anonftp"
        },
        "type": "aspera"
      },
      "transforms": [],
      "url_matchers": [
        {
          "options": {
            "pattern": "^aspera://ftp\\.ncbi\\.nlm\\.nih\\.gov/sra"
          },
          "type": "regular_expression"
        }
      ]
    }
  ],
  "destinations": []
}
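To make the roles of `url_matchers` and `transforms` concrete, here is a minimal Python sketch of how a client might interpret the configuration above. It is an illustration under stated assumptions rather than BDSS's own code: the function names are hypothetical, only the `regular_expression` matcher and the `change_scheme` transform are handled, and the order in which BDSS actually tries the alternatives is not specified here.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Trimmed copy of the configuration above: only the fields this sketch reads.
DATA_SOURCES = [
    {
        "label": "NCBI Sequence Read Archive with FTP",
        "transfer_mechanism": {"type": "curl", "options": {}},
        "transforms": [
            {
                "type": "change_scheme",
                "options": {"new_scheme": "aspera"},
                "target": "NCBI Sequence Read Archive with Aspera",
            }
        ],
        "url_matchers": [
            {
                "type": "regular_expression",
                "options": {"pattern": r"^ftp://ftp\.ncbi\.nlm\.nih\.gov/sra"},
            }
        ],
    },
    {
        "label": "NCBI Sequence Read Archive with Aspera",
        "transfer_mechanism": {
            "type": "aspera",
            "options": {"disable_encryption": True, "username": "anonftp"},
        },
        "transforms": [],
        "url_matchers": [
            {
                "type": "regular_expression",
                "options": {"pattern": r"^aspera://ftp\.ncbi\.nlm\.nih\.gov/sra"},
            }
        ],
    },
]


def matches(source, url):
    """Return True if any regular_expression matcher of the source matches the URL."""
    return any(
        m["type"] == "regular_expression" and re.search(m["options"]["pattern"], url)
        for m in source["url_matchers"]
    )


def apply_transform(transform, url):
    """Apply a change_scheme transform; other transform types are not handled here."""
    if transform["type"] != "change_scheme":
        raise ValueError(f"unsupported transform type: {transform['type']}")
    return urlunsplit(urlsplit(url)._replace(scheme=transform["options"]["new_scheme"]))


def candidate_transfers(url, data_sources):
    """List (label, url, mechanism type) for each matching source and its transform targets."""
    by_label = {s["label"]: s for s in data_sources}
    candidates = []
    for source in data_sources:
        if not matches(source, url):
            continue
        candidates.append((source["label"], url, source["transfer_mechanism"]["type"]))
        for transform in source["transforms"]:
            target = by_label[transform["target"]]
            new_url = apply_transform(transform, url)
            candidates.append((target["label"], new_url, target["transfer_mechanism"]["type"]))
    return candidates
```

For the FTP URL in this example, the sketch yields two candidates: the original URL with the `curl` mechanism, and the same host and path under the `aspera` scheme with the `aspera` mechanism.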
JGI Genome Portal
To download files from the JGI Genome Portal, you must first authenticate. BDSS can prompt for credentials and handle storing your session cookies.
Without BDSS:
curl 'https://signon.jgi.doe.gov/signon/create' --data-urlencode 'login=USER_NAME' --data-urlencode 'password=USER_PASSWORD' -c cookies > /dev/null
curl 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=PhytozomeV10' -b cookies > get-directory
With BDSS:
bdss transfer -u 'http://genome.jgi.doe.gov/ext-api/downloads/get-directory?organism=PhytozomeV10'
JGI Genome Portal username?USER_NAME
JGI Genome Portal password?USER_PASSWORD
Metadata repository configuration:
{
  "data_sources": [
    {
      "description": "",
      "label": "JGI Genome Portal",
      "test_files": [],
      "transfer_mechanism": {
        "options": {
          "auth_url": "https://signon.jgi.doe.gov/signon/create",
          "password_field": "password",
          "password_prompt": "JGI Genome Portal password?",
          "username_field": "login",
          "username_prompt": "JGI Genome Portal username?"
        },
        "type": "session_authenticated_curl"
      },
      "transforms": [],
      "url_matchers": [
        {
          "options": {
            "pattern": "http:\\/\\/genome\\.jgi\\.doe\\.gov\\/ext-api"
          },
          "type": "regular_expression"
        }
      ]
    }
  ],
  "destinations": []
}
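The options of the `session_authenticated_curl` mechanism map directly onto the two manual curl commands shown for this example. The Python sketch below illustrates that mapping under assumptions: `build_session_curl_commands` is a hypothetical helper, not part of BDSS, and it only builds argument lists without contacting the portal.

```python
# Options copied from the JGI Genome Portal transfer_mechanism above.
JGI_OPTIONS = {
    "auth_url": "https://signon.jgi.doe.gov/signon/create",
    "password_field": "password",
    "password_prompt": "JGI Genome Portal password?",
    "username_field": "login",
    "username_prompt": "JGI Genome Portal username?",
}


def build_session_curl_commands(url, options, username, password, cookie_file="cookies"):
    """Return (login, download) curl argument lists for a cookie-based session.

    Hypothetical helper for illustration. Output handling (for example
    discarding the login response or saving the download) is left to the caller.
    """
    login = [
        "curl", options["auth_url"],
        "--data-urlencode", f"{options['username_field']}={username}",
        "--data-urlencode", f"{options['password_field']}={password}",
        "-c", cookie_file,  # write session cookies to this file
    ]
    # Reuse the stored cookies for the actual download request.
    download = ["curl", url, "-b", cookie_file]
    return login, download
```

In an interactive client, the username and password could be collected with `input(options["username_prompt"])` and `getpass.getpass(options["password_prompt"])` before calling the helper, so that the password is not echoed to the terminal.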
Hashes for bdss_client-1.0.1b4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 724bdbd8c2adc0461f0f7bc7f9ba055e858aea725ffe94a1ffdc0abddbab6656
MD5 | 46dbd99857110b839395f30224f99918
BLAKE2b-256 | 1ec623af8c8115ae2ee62bc9831f3fc8a30f132ee7760bab390d8052367151c4