A tool for genomic sequential analysis automation.

ASGARD

User Manual for Asgard

ASGARD is a configuration-file-driven tool created by the Costa Rica National High Technology Center to automate the identification of antibiotic resistance genes in bacteria such as Salmonella. It provides an easy-to-use interface for processing large batches of fastq files with little to no configuration. It also provides a CPU optimization algorithm that reduces processing time. The tool is based on the ARIBA software developed by Sanger-Pathogens.

SAGA is a compiled workflow of programs that enables the alignment, indexing and mapping of gene samples against a reference genome. Reference genomes from several databases can be downloaded as fasta files through the NCBI API. SAGA provides an easy-to-use way to select the reference genome and analyze a series of samples to obtain a phylogenetic tree using RAxML.

Usage

python asgard.py <options>

Required Arguments:

--config_dir: Path to the directory containing the configuration files. All .ajson configuration files contained in this directory will be executed in alphabetical order.
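
The behaviour described for --config_dir can be sketched in Python as follows. This is a minimal illustration, not ASGARD's actual code: the function names parse_args and list_configs are hypothetical, and the sketch only assumes that the argument is required and that .ajson files are processed in alphabetical order, as stated above.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; --config_dir is the only required argument."""
    parser = argparse.ArgumentParser(prog="asgard.py")
    parser.add_argument(
        "--config_dir",
        required=True,
        help="directory containing the .ajson configuration files",
    )
    return parser.parse_args(argv)


def list_configs(config_dir):
    """Return the .ajson files of the directory in alphabetical order."""
    return sorted(Path(config_dir).glob("*.ajson"))
```

Files with any other extension in the directory are ignored by this sketch.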

AJSON specification

ASGARD JSON (AJSON) files are an extension of the JavaScript Object Notation that adds references to internal and external properties of objects. Certain elements must be present in the configuration file for the program to work.

Syntax

Files in the config_dir directory with the .ajson extension are treated as configuration files for the execution of ASGARD.

The configuration file is read from top to bottom and any reference values are resolved in the same manner.

  • Internal Objects
    • Internal references are defined using double braces. The referenced property must be assigned before it is referenced. In this example the value of the color key inside the motorcycle object would be lightblue after the evaluation.
{
    "motorcycle": {
        "variant": "light",

        "color": "{{variant}}blue",

        "year": 2010
    }
}
  • External Objects
    • References to external objects are defined using double braces, with dots to navigate the object depth. All external references must start from the top object and are case sensitive. In this example the color of the helmet will match the color of the motorcycle.
{

    "motorcycle":{
        "color": "blue",
        "year": 2010
    },

    "helmet":{
        "color":"{{motorcycle.color}}"
    }
}

It is possible to create composite values from multiple references and strings.

A name/value pair must be defined before it is referenced so that the reference can be resolved properly.
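
The resolution rules above can be illustrated with a short Python sketch. It is not ASGARD's parser: the names REFERENCE, lookup and resolve are hypothetical, and the sketch assumes that a bare name refers to a key of the enclosing object while a dotted name is walked from the top object. Because values are resolved from top to bottom, a reference to a key that has not been defined yet raises a KeyError.

```python
import re

# Matches one reference such as {{variant}} or {{motorcycle.color}}.
REFERENCE = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def lookup(path, local, root):
    """Return the value of a reference.

    A bare name is looked up in the enclosing object; a dotted path is
    walked from the top object. Only already resolved keys are visible.
    """
    if "." not in path:
        return local[path]
    node = root
    for part in path.split("."):
        node = node[part]
    return node


def resolve(node, root=None, out=None):
    """Return a copy of node with all references resolved top to bottom."""
    out = {} if out is None else out
    root = out if root is None else root
    for key, value in node.items():
        if isinstance(value, dict):
            out[key] = {}
            resolve(value, root, out[key])
        elif isinstance(value, str):
            out[key] = REFERENCE.sub(
                lambda match: str(lookup(match.group(1), out, root)), value
            )
        else:
            out[key] = value
    return out


config = {
    "motorcycle": {"variant": "light", "color": "{{variant}}blue", "year": 2010},
    "helmet": {"color": "{{motorcycle.color}}"},
}
resolved = resolve(config)
print(resolved["helmet"]["color"])  # prints "lightblue"
```

The example combines the two snippets above: the helmet color is taken from the motorcycle color, which is itself a composite of a reference and a string.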

Object: constants
Contains non-changing configuration parameters that can be referenced by other objects. Properties inside the "constants" object must not contain external references.
Keys:
  • name: Name of the script that will be executed. This name is used to generate the output directory.
  • input_directory: Directory containing the forward and reverse fastq gz files. Each fastq file must have its pair in the same directory. Each pair is composed of a name and a suffix specified in the forward and reverse properties.
  • output_directory: Directory where the output of each configuration file will be created. Each execution creates a new directory with a unique name when it starts, and the resulting files are created inside this directory.
  • input_extension: All files in the input directory ending with the input extension are listed and used for the execution of the commands.
  • reference_accession: Accession number of the genome to be downloaded and analysed. The genome is downloaded as a fasta file using the NCBI efetch utility.
  • accessory_accession: Accession number of the genome to be appended to the reference_accession fasta file.
  • entrez_database: Database in which the fasta file will be searched for and from which it will be downloaded.
  • workers: Number of parallel jobs created for each command. Each time a task finishes, a new job is spawned with the next iteration.
  • forward: Suffix of the forward files in the input directory.
  • reverse: Suffix of the matching pair of the input fastq files.
  • iterator: Expandable bash expression that represents a list of files to iterate over with the workflow. This expression can be a composite value. Other wildcards can be used for the filename expansion.

Object: dynamic
Contains information that varies at run time, which enables the program to iterate through the files present in the input directory.
Keys:
  • prefix_regex: Regular expression that defines the pattern of a valid filename, without extension or suffix.
  • placeholder: Symbol used as a placeholder for the fastq file names before its evaluation at runtime.
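
How the input_directory, input_extension, forward, reverse and prefix_regex values could work together is sketched below in Python. This is an assumption-laden illustration rather than ASGARD's implementation: the function find_pairs is hypothetical, and the sketch assumes that a file name is built as prefix + suffix + extension.

```python
import re
from pathlib import Path


def find_pairs(input_directory, input_extension, forward, reverse,
               prefix_regex=None):
    """Map each sample prefix to its (forward, reverse) pair of files."""
    directory = Path(input_directory)
    forward_tail = forward + input_extension
    pairs = {}
    for forward_file in sorted(directory.glob("*" + forward_tail)):
        # Strip suffix and extension to obtain the sample prefix.
        prefix = forward_file.name[: -len(forward_tail)]
        if prefix_regex and not re.fullmatch(prefix_regex, prefix):
            continue  # the file name does not match the valid pattern
        reverse_file = directory / (prefix + reverse + input_extension)
        if not reverse_file.is_file():
            raise FileNotFoundError(
                f"missing reverse file for {forward_file.name}"
            )
        pairs[prefix] = (forward_file, reverse_file)
    return pairs
```

For instance, with the hypothetical values forward "_R1", reverse "_R2" and input_extension ".fastq.gz", a directory holding s1_R1.fastq.gz and s1_R2.fastq.gz would yield a single pair under the prefix s1.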

Execution Modes

Each command can be executed in different modes depending on the number of iterations required.

Object: execute
Each key and value pair describes the execution mode of one of the commands within the configuration file. The objects that describe the tasks of each command must have the same name as the key in the execute object. All commands, with their respective tasks, must be written after the execute object.
Execution modes:
  • single: The object is evaluated and executed a single time. Dynamic values should not be used in this command since they will not be evaluated.
  • iterate-parallel: The object is executed in a new process created by the subprocess library. The number of parallel processes is determined by the workers constant. Dynamic placeholders are evaluated when the new process is spawned. Filenames are substituted in no particular order.
  • iterate-sequential: The command object is executed once per iteration, but only one process runs at a time. Dynamic placeholders are evaluated the same way as in iterate-parallel.
  • false: The task is disabled and will be ignored.
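
The four execution modes can be sketched as a small Python dispatcher. This is only an illustration under stated assumptions: the names run and execute and the build_argv callback are hypothetical, and a thread pool is used here to cap the number of simultaneous subprocesses, which may differ from how ASGARD schedules its workers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run(argv):
    """Run one command in a subprocess and return its exit code."""
    return subprocess.run(argv, check=False).returncode


def execute(mode, build_argv, items, workers=2):
    """Dispatch a command according to its execution mode.

    build_argv(item) returns the argument list for one iteration; in the
    single mode it is called once with None.
    """
    if mode is False or mode == "false":
        return []  # the task is disabled
    if mode == "single":
        return [run(build_argv(None))]
    if mode == "iterate-sequential":
        return [run(build_argv(item)) for item in items]
    if mode == "iterate-parallel":
        # At most `workers` subprocesses are alive at the same time.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda item: run(build_argv(item)), items))
    raise ValueError(f"unknown execution mode: {mode!r}")
```

In this sketch the exit codes are returned in the order of the input items, even though the parallel jobs may finish in any order.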

Command Types

Objects declared at the root level are checked for the "command" property. If this property is defined, the program queues the object for execution in the same order in which it was read.

SAGA

  • Simple

These are simple commands designed to manipulate and download files and directories.

Command: create_file
Creates an empty text file at the path specified in the file parameter. An absolute path to the file is recommended.
Required parameters:
  • file: Path to the new file to be created.

Command: check_directory
Verifies that the directory exists. If it does not, the directory is created with the specified name. Recursive creation of directories is enabled.
Required parameters:
  • directory: Absolute path to be checked or created.

Command: entrez_download
Downloads a fasta file from the NCBI database using its accession number. An HTTPS GET request is used for the download.
Required parameters:
  • url: HTTPS URL of the fasta file in the NCBI database. Use the accession number defined in the constants object.
  • file: Path to the new file to be created.

Command: merge
Concatenates two or more text files into a new file. A new line is added between each file listed.
Required parameters:
  • files: JSON list of the absolute or relative paths to the files to be merged.
  • output_file: Path to the file to be created. If the file exists it will be overwritten.

Command: replace
Replaces all occurrences of a text value with a new string.
Required parameters:
  • file: Path to the file where the text fragment will be replaced.
  • old_data: Text to be replaced.
  • new_data: The new text that will replace the old text fragment.
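
The simple commands above map naturally onto a few lines of standard-library Python. The following sketch is an approximation, not the code shipped with ASGARD: the function names mirror the command names for readability, and efetch_url is a hypothetical helper that builds a URL for the public NCBI E-utilities efetch endpoint, with "nuccore" assumed as the default database.

```python
import os
from pathlib import Path
from urllib.parse import urlencode


def create_file(file):
    """Create the file if it is missing; existing content is left as is."""
    Path(file).touch()


def check_directory(directory):
    """Create the directory, including missing parents, if it is absent."""
    os.makedirs(directory, exist_ok=True)


def merge(files, output_file):
    """Concatenate text files, adding a new line between each of them."""
    contents = [Path(name).read_text() for name in files]
    Path(output_file).write_text("\n".join(contents))


def replace(file, old_data, new_data):
    """Replace every occurrence of old_data with new_data in the file."""
    path = Path(file)
    path.write_text(path.read_text().replace(old_data, new_data))


def efetch_url(accession, database="nuccore"):
    """Build an efetch URL that returns the record as fasta text."""
    query = urlencode(
        {"db": database, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + query
```

The URL returned by efetch_url could be used as the url parameter of entrez_download; the download itself is left out of the sketch so that it runs without network access.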

  • Complex

Complex commands are specified using a JSON array. Dynamically generated items are evaluated and then executed sequentially. These commands are run using the subprocess library of Python. If POSIX is being used, the path to the program must be the first parameter of the list.

Extra parameters can be added to the list; they are interpreted by the program being executed. If bash parameter expansion is needed, the "shell" property specifies whether the command should be run by the shell interpreter. Complex commands can also iterate over multiple files with similar names. To do so, use the placeholders defined in the "dynamic" object, which are replaced by the real values at runtime, and select the "iterate-parallel" or "iterate-sequential" execution mode.

Example:

In this case the samtools program must be accessible from the directory where ASGARD is being run. This can be achieved by setting the environment variables or by specifying the full path to the executable.

The values in the command list can be composite, constant, or strings.

"sam_view": {
    "extension": ".bam",
    "file": "{{dynamic.output_file}}{{extension}}",
    "output_pipeline": "{{file}}",
    "command": ["samtools","view","-bS","-q","15","{{bwa_mem.file}}"]

},
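
How a command list like the one in sam_view might be expanded and run is sketched below in Python. The sketch is illustrative only: the functions fill and run_command are hypothetical, the "%" placeholder symbol is an arbitrary choice for the example, and joining the list into one string for the shell is an assumption about how the "shell" property behaves.

```python
import subprocess


def fill(template, placeholder, value):
    """Replace the placeholder symbol in every item of a command list."""
    return [item.replace(placeholder, value) for item in template]


def run_command(argv, shell=False):
    """Run a command list, directly or through the shell interpreter."""
    if shell:
        # The shell receives one string, so wildcards and other
        # expansions are performed by the shell before the program runs.
        return subprocess.run(
            " ".join(argv), shell=True, capture_output=True, text=True
        )
    # Without the shell, the first item must name the program to execute.
    return subprocess.run(argv, capture_output=True, text=True)


template = ["samtools", "view", "-bS", "-q", "15", "%.sam"]
argv = fill(template, "%", "sample1")
print(argv)  # ['samtools', 'view', '-bS', '-q', '15', 'sample1.sam']
```

Running the filled list with run_command(argv) would require samtools to be installed and reachable, as noted above.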

Default Configuration files.

Two different configuration files are provided with the software, one for ASGARD and one for SAGA. These configuration files implement the following pipeline.

TODO

ASGARD

Task Command Parameters Description

SAGA

Task Command Parameters Description
