StrainFLAIR
StrainFLAIR (STRAIN-level proFiLing using vArIation gRaph) is a tool for strain identification and quantification that uses a variation graph representation of gene sequences. The input is a collection of complete genomes, draft genomes or metagenome-assembled genomes from which genes will be predicted. StrainFLAIR is divided into two main parts: first, an indexing step that stores clusters of reference genes in variation graphs, and then a query step that maps metagenomic reads onto these graphs to infer strain-level abundances in the queried sample.
StrainFLAIR is composed of several modules. Each module of the pipeline is described below.
Dependencies (all installed by the Install procedure)
Install
git clone https://github.com/kevsilva/StrainFLAIR
cd StrainFLAIR
conda env create -p Strain --file env.yml
conda activate ./Strain
pip install strainflair
StrainFLAIR pipeline
Usage
StrainFLAIR.sh is a pipeline that combines the indexing and query steps. Mapping must be done separately.
TODO
Full indexation and query example
TODO
StrainFLAIR modules
Module genes_prediction: prediction of protein-coding genes from each input sequence
From the input reference sequences, protein-coding genes are predicted using Prodigal. To reduce mapping bias at the extremities, predicted genes can be extended on both ends when the reference sequence they originate from is long enough to allow it.
Example: genes_prediction -s file_of_fasta_files.txt -o my_output_directory_name -l 75
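The extension of predicted genes can be illustrated with a minimal Python sketch. It is not the code used by genes_prediction: the function name `extend_gene`, the 0-based half-open coordinates and the `extension` parameter are assumptions made for illustration (the `-l 75` option in the example above may play the role of `extension`, but this is not confirmed here).

```python
def extend_gene(start, end, contig_length, extension):
    """Extend the interval [start, end) by `extension` bases on each side.

    Coordinates are assumed 0-based and half-open. The result is clamped
    to the contig, so a gene near an extremity is extended only as far as
    the reference sequence allows.
    """
    new_start = max(0, start - extension)
    new_end = min(contig_length, end + extension)
    return new_start, new_end


# A gene in the middle of a 1000 bp contig is extended on both sides.
print(extend_gene(100, 400, 1000, 75))  # (25, 475)
# A gene close to both extremities is clamped to the contig bounds.
print(extend_gene(30, 980, 1000, 75))   # (0, 1000)
```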
Module cd-hit-est: clustering of the predicted genes
Predicted genes are clustered into gene families using CD-HIT. Each resulting cluster groups genes that are similar according to the user-defined thresholds and parameters.
Example: cd-hit-est -i my_genes_not_extended.fasta -o clusters_files_name -c 0.95 -aS 0.90 -g 1 -d 0 -M 0 -T 0 -G 0
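CD-HIT writes the clusters to a `.clstr` text file in which each cluster starts with a `>Cluster N` line followed by one line per member. The sketch below shows one way to read such a file into a dictionary of gene families. The function name `parse_clstr` and the gene names in the toy input are hypothetical, and this is not the parser used by StrainFLAIR.

```python
import re
from collections import defaultdict

# A member line looks like: "0\t900nt, >strainA_gene1... *"
MEMBER = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Map each cluster number to the list of its member gene names."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif line and current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


# Toy input with made-up gene names.
example = """>Cluster 0
0\t900nt, >strainA_gene1... *
1\t897nt, >strainB_gene7... at +/98.55%
>Cluster 1
0\t450nt, >strainA_gene2... *
""".splitlines()

print(parse_clstr(example))
```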
Module graphs_construction and concat_graphs: building a variation graph representing the gene clusters
Each gene cluster (gene family) is converted into a variation graph. All variation graphs are then concatenated into a single one and indexed.
Example:
graphs_construction -s my_genes_extended.fasta -c cluster_file.clstr -o my_output_directory_name
concat_graphs -i my_input_directory_name -s 1000
vg view final_graph.vg > final_graph.gfa
vg prune final_graph.vg | vg index -g final_graph.gcsa -
vg index -x final_graph.xg final_graph.vg
vg snarls final_graph.vg > final_graph.snarls
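To make the graph representation more concrete, the sketch below reads a tiny hand-written GFA (the text format produced by `vg view`) and spells out the sequence of each path. It assumes that each gene of a cluster appears as a path (`P` line) through shared nodes (`S` lines). The toy graph, the path names and the helper functions are hypothetical, and only forward-oriented steps are handled.

```python
def read_gfa(lines):
    """Collect node sequences (S lines) and paths (P lines) from GFA text."""
    segments, paths = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            # Steps look like "1+,2+,4+"; drop the trailing orientation sign.
            paths[fields[1]] = [step[:-1] for step in fields[2].split(",")]
    return segments, paths


def spell_path(segments, steps):
    """Concatenate node sequences along a forward-oriented path."""
    return "".join(segments[node] for node in steps)


# Two gene variants that differ by a single base (node 2 versus node 3).
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tATG",
    "S\t2\tC",
    "S\t3\tT",
    "S\t4\tGGA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tgene_strainA\t1+,2+,4+\t*",
    "P\tgene_strainB\t1+,3+,4+\t*",
]

segments, paths = read_gfa(toy_gfa)
for name, steps in paths.items():
    print(name, spell_path(segments, steps))
```

Both genes share nodes 1 and 4 and differ only at the middle node, which is how a variation graph stores similar sequences without duplicating their common parts.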
Mapping reads onto a variation graph
Mapping of reads onto a variation graph is done using vg mpmap from the vg toolkit. The output must then be converted to the JSON format.
Example:
vg mpmap -x final_graph.xg -g final_graph.gcsa -s final_graph.snarls -f my_reads.fastq.gz -t 24 -M 10 -m -L 0 > mapping_output.gamp
vg view -j -K mapping_output.gamp > mapping_output.json
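Before running json2csv, it can be useful to inspect the JSON output. The sketch below lists the graph nodes touched by each read. It assumes one JSON object per line and the field names `name`, `subpath`, `path`, `mapping`, `position` and `node_id` as found in the vg multipath alignment schema; these should be checked against the actual output of the vg version in use. The function `nodes_per_read` and the toy record are hypothetical.

```python
import json


def nodes_per_read(json_lines):
    """Return {read name: sorted node ids} from line-delimited JSON records."""
    result = {}
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        nodes = set()
        # Field names below are assumptions about the vg JSON layout.
        for subpath in record.get("subpath", []):
            for mapping in subpath.get("path", {}).get("mapping", []):
                node_id = mapping.get("position", {}).get("node_id")
                if node_id is not None:
                    nodes.add(int(node_id))
        result[record.get("name", "")] = sorted(nodes)
    return result


# Toy record written by hand, not real vg output.
toy = [
    '{"name": "read1", "subpath": [{"path": {"mapping": ['
    '{"position": {"node_id": "1"}}, {"position": {"node_id": "3"}}]}}]}'
]
print(nodes_per_read(toy))
```

With a real file, the lines could be streamed with `open("mapping_output.json")` instead of the toy list.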
Module json2csv: gene-level abundances
Mapping results are processed with a dedicated algorithm that attributes abundances to the reference genes.
Example: json2csv -g final_graph.gfa -m mapping_output.json -p dict_clusters.pickle -o output_file_name
Module compute_strains_abundance: strain-level abundances
Gene-level abundances are converted into strain-level abundances. The abundance of a strain is set to zero if the proportion of its genes that are detected does not meet the threshold.
Example: compute_strains_abundance -i gene_level_table.csv -o my_output_directory -t proportion_detected_genes_threshold
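The thresholding rule can be sketched as follows. Only the rule "zero if the proportion of detected genes is below the threshold" comes from the description above. Treating a gene as detected when its abundance is positive, and summarising a strain by the mean abundance of its detected genes, are assumptions made for illustration and may differ from what compute_strains_abundance actually does.

```python
def strain_abundance(gene_abundances, threshold):
    """Summarise gene-level abundances of one strain into a single value.

    Returns 0.0 when the proportion of detected genes (abundance > 0,
    an assumption) is below `threshold`. Otherwise returns the mean
    abundance of the detected genes (also an assumption).
    """
    if not gene_abundances:
        return 0.0
    detected = [a for a in gene_abundances if a > 0]
    proportion = len(detected) / len(gene_abundances)
    if proportion < threshold:
        return 0.0
    return sum(detected) / len(detected)


# Half of the genes are detected: the strain passes a 0.5 threshold.
print(strain_abundance([2.0, 4.0, 0.0, 0.0], 0.5))  # 3.0
# Only a quarter of the genes are detected: the strain is set to zero.
print(strain_abundance([2.0, 0.0, 0.0, 0.0], 0.5))  # 0.0
```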
Contact
Kévin Da Silva: kevin.da-silva@inria.fr
Pierre Peterlongo: pierre.peterlongo@inria.fr