Accessioning tool to submit genomics pipeline outputs to the ENCODE Portal
Project description
accession
Python module and command line tool to submit genomics pipeline analysis output files and metadata to the ENCODE Portal
Table of Contents
Installation
Install the module with pip:
$ pip install accession
Setting environmental variables
You will need ENCODE DCC credentials from the ENCODE Portal. Set them in your command line tool like so:
$ export DCC_API_KEY=XXXXXXXX
$ export DCC_SECRET_KEY=yyyyyyyyyyy
You will also need Google Application Credentials in your environment. Obtain and set your service account credentials:
$ export GOOGLE_APPLICATION_CREDENTIALS=<path_to_service_account_file>
Usage
$ accession --accession-metadata metadata.json \
--accession-steps steps.json \
--server dev \
--lab /labs/encode-processing-pipeline/ \
--award U41HG007000
Arguments
Metadata JSON
This file is an output of pipeline analysis run. The example file has metadata on all of the tasks and produced files.
Accession Steps
The accessioning steps configuration file specifies the task and file names in the output metadata JSON. Accessioning code will selectively submit the specified file keys to the ENCODE DCC Portal. Single step is configured in the following way:
{
"dcc_step_version": "/analysis-step-versions/kundaje-lab-atac-seq-trim-align-filter-step-v-1-0/",
"dcc_step_run": "atac-seq-trim-align-filter-step-run-v1",
"wdl_task_name": "filter",
"wdl_files": [
{
"filekey": "nodup_bam",
"output_type": "alignments",
"file_format": "bam",
"quality_metrics": ["cross_correlation", "samtools_flagstat"],
"derived_from_files": [{
"derived_from_task": "trim_adapter",
"derived_from_filekey": "fastqs",
"derived_from_inputs": "true"
}]
}
]
}
dcc_step_version
and dcc_step_run
must exist on the portal.
wdl_task_name
is the name of the task that has the files to be accessioned.
wdl_files
specifies the set of files to be accessioned.
filekey
is the variable that stores the file path in the metadata file.
output_type
, file_format
, and file_format_type
are the ENCODE specific metadata that are required by the Portal
quality_metrics
is a list of methods that will be called in the code to attach quality metrics to a file
possible_duplicate
specifies that there could be, in the case of optimal IDR peaks and conservative IDR peaks, files can have an identical content. If the possible duplicate flag is set, file is not accessioned.
derived_from_files
specifies the list of files the current file being accessioned derives from. These files need to have been accessioned before the current file can be submitted.
derived_from_inputs
is used when indicating that parent files were not produced during the task execution. Raw fastqs and genome references are examples of such files.
derived_from_output_type
is required in the case the parent file has a possible duplicate.
Server
prod
and dev
indicates the server where the files are being accessioned to. dev
points to test.encodedcc.org. The server parameter can be explicitly passed as test.encodedcc.org or encodeproject.org.
Lab and Award
These are unique identifiers that are expected to be already present on the ENCODE Portal.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for accession-0.0.9-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 16e7745f64ad81fd5a2b32aed792b0f1a9c10d53ea24d06e6854f01459b9094d |
|
MD5 | bf5c5658f3230cef283067fd09ae70da |
|
BLAKE2b-256 | d6636e686c22bec9f7d9d3e79c4a442b12f3e1913311005c3d4945a21a3f4601 |