iKISS is a pipeline to detect kmers under selection.
Project description
Homepage: https://forge.ird.fr/diade/iKISS
About iKISS
iKISS (Kmer Inference sSelection) is a snakemake pipeline able to decompose reads into kmers and extract kmers under selection.
iKISS uses KmersGWAS https://github.com/voichek/kmersGWAS, pcadapt https://cran.r-project.org/web/packages/pcadapt/readme/README.html and lfmm https://bcm-uga.github.io/lfmm/articles/lfmm to select genomics regions under selection.
1. Install dependencies and clone iKISS
Check dependencies for iKISS : python and singularity
Install singularity and python3 in your local machine OR use module load to add singularity and python3 in your environment if you are working in a cluster :
module load system/python/3.8.12
module load system/singularity/3.6.0
iKISS is NOW available as a PyPI package (recommended)
python3 -m pip install ikiss
OR you can also install iKISS from git repository
python3 -m pip install ikiss@git+https://forge.ird.fr/diade/iKISS.git
#OR
git clone https://forge.ird.fr/diade/iKISS.git
cd iKISS
python3 -m pip install .
1.1 Installing in cluster mode
Install iKISS in cluster mode using singularity container from ikiss_utilities https://itrop.ird.fr/ikiss_utilities/
ikiss install_cluster --help
ikiss install_cluster --scheduler slurm --env singularity
1.2 Installing in local mode
ikiss install_local --help
ikiss install_local
2. Running a datatest
Running test with a datatest from iKISS_utilities in a repertory TEST
ikiss test_install --help
ikiss test_install -d TEST
2.1 In CLUSTER mode
Launching suggested command line done by iKISS, in CLUSTER mode :
Please run command line ‘ikiss create_cluster_config’ before the first run and modify theads, ram, node and computer ressources. iKISS do a copy of cluster_config.yaml file into your home “/home/$USER/.config/ikiss/cluster_config.yaml”
ikiss run_cluster --help
ikiss create_cluster_config
If singularity was selected in installation of iKISS, it could be needed to give argument –singularity-args "–bind $HOME" to Snakemake, by using :
ikiss run_cluster --help
ikiss run_cluster -c TEST/data_test_config.yaml --singularity-args "--bind $HOME"
# @IFB
ikiss run_cluster -c TEST/data_test_config.yaml --singularity-args "--bind /shared:/shared"
#you can also use snakemake parametters as --rerun-incomplete --nolock
Important Note : In i-Trop cluster, run iKISS using ONLY a node, data has to be in “/scratch” of chosen node. Use nodelist : nodeX parametter inside of cluster_config.yaml file.
2.2 In LOCAL mode
launching suggested command line done by iKISS, in LOCAL mode:
ikiss run_local --help
ikiss run_local -t 8 -c TEST/data_test_config.yaml --singularity-args "--bind $HOME"
In local mode, its possible to allocate threads to some rules using –set-threads snakemake argument such as
ikiss run_local -t 8 -c TEST/data_test_config.yaml --set-threads kmers_gwas_per_sample=4 mapping_kmers=2 filter_bam=2 kmer_position_from_bam=4 pcadapt=2 extract_kmers_from_bed=2
3. Running your data
3.1. Adapt config.yaml
Before to run iKISS, adapt config.yaml by using :
ikiss create_config
Adapt config.yaml file with path to fastq files (FASTQ) and outfile (OUTPUT) in the DATA section.
DATA:
FASTQ: './DATATEST/fastq'
OUTPUT: './OUTPUT-KISS/'
:warning if yours reads are ilumina paired, you need rename reads SAMPLE_R1.fastq.gz and SAMPLE_R2.fastq.gz. For single reads use SAMPLE_R1.fastq.gz
iKISS uses compressed ans decompressed fastq files.
3.1.1 WORKFLOW section
Parameter iKISS steps using the section WORKFLOW and parameter it with the PARAMS sections.
In WORKFLOW section:
KMERS_GWAS step has to be activated by default.
PCADAPT, LFMM, MAPPING or ASSEMBLY are optional. Active or deactivate these steps using true or false.
KMERS_GWAS convert reads in kmers, filter them and create a format ready to use in population genomics!
PCADAPT detects genetic markers (kmers here ^^) involved in biological adaptation and provides outlier detection based on Principal Component Analysis (PCA).
LFMM is used by iKISS for testing correlations between kmers and environmental data.
MAPPING_KMERS can optionally be used to align kmers to a genomic reference (if it is available ! ).
ASSEMBLY_KMERS can optionally assembly significant kmers obtained by pcadapt or lfmm
INTERSECT can optionally calculate how many kmers (if MAPPING_KMERS is activated ) or contigs(if ASSEMBLY_KMERS is activated) are found in FEATURES (gene by default)
WORKFLOW:
KMERS_MODULE : true
PCADAPT : true
LFMM : true
MAPPING_KMERS: true
ASSEMBLY_KMERS: true
INTERSECT: True
3.1.2 PARAMS section
In the PARAMS section, tools parameters can be modified and adapted.
=> 1. KMERS_MODULE
KMERS_GWAS module decompose reads into kmers and create a binary table of presence/absence of kmers. This table can be filter to use only most informative kmers into the populations. PLINK format outfiles are obtained in this module.
PARAMS:
KMERS_MODULE:
KMER_SIZE : 31
MAC : 2
P : 0.2
MAF : 0.05
B : 1000000 # nb kmers in each bed file
SPLIT_LIST_SIZE : 100000
MIN_LIST_SIZE : 50000
KMER_SIZE is the length of kmers (should be between 15-31)
MAC is the minor allele count (min allowed appearance of a kmer)
P is the minimum percent of appearance in each strand form
MAF is the minimum allele frequency
B is the number of kmers in each bed file
SPLIT_LIST_SIZE is the nb of kmers by bed file
MIN_LIST_SIZE indicates the minimal number of kmers allowed in the smaller bed file after splitting
=> 2. PCADAPT
PCADAPT detects kmers involved in biological adaptation and provides outlier detection based on Principal Component Analysis (PCA)
PARAMS:
PCADAPT:
K : 2
SAMPLES: "samples.txt"
CORRECTION: 'FDR'
ALPHA : 0.05
K : number K of principal components
SAMPLES : you need to generate a samples.txt file. This file contains two columns (tab delimitations) : accession_id and phenotype_value. It will be used by PCADAPT.
accession_id : contains exactly same name of samples in FASTQ.
phenotype_value (int): contains sample group (wild=1, cultivated=2 for example)
accession_id group
Clone12 2
Clone14 2
Clone16 2
Clone20 2
Clone2 1
Clone4 1
Clone8 1
CORRECTION: kmers outliers are obtained using a correction of BONFERONNI, BH or FDR model.
ALPHA: modify the alpha cutoff for outlier detection
=> 3. LFMM
LFMM is used by iKISS for testing correlations between kmers and environmental data.
PARAMS:
LFMM:
K : 2
PHENOTYPE_FILE: "pheno.txt"
PHENOTYPE_PCA_ANALYSIS : false
CORRECTION: 'BH'
ALPHA : 0.05
K are the latent factors used in LFMM association analyses
PHENOTYPE_FILE: an phenotype file is obligatory in LFMM analysis. You can give to iKISS PCA results, climate variables, etc.
A PCA can reveal some ‘structure’ in the genotype data and it could help you to fix K parameter.
PHENOTYPE_PCA_ANALYSIS
If PHENOTYPE_PCA_ANALYSIS is true, iKISS automatically run PCA using the file given by user in the PHENOTYPE_FILE key. This PHENOTYPE_FILE can be a PCA result for example.
If PHENOTYPE_PCA_ANALYSIS is false, iKISS use directly the PHENOTYPE_FILE as ‘phenotype’ to LFMM analysis. Kmers are used as ‘genotype’ data.
Here, a example of a phenotype file with climate variables
accession_id group b2.Mean_Diurnal_Range b3.Isothermality b4.Temp_Seasonality b5.Max_Temp_of_Warmest_Month b6.Min_Temp_of_Coldest_Month b7.Temp_Annual_Range b8.Mean_Temp_of
_Wettest_Quarter b9.Mean_Temp_of_Driest_Quarter b10.Mean_Temp_of_Warmest_Quarter b11.Mean_Temp_of_Coldest_Quarter b12.Annual_Precipitation b13.Precipitation_of_Wettest_Mo
nth b14.Precipitation_of_Driest_Month b15.Precipitation_Seasonality b16.Precipitation_of_Wettest_Quarter b17.Precipitation_of_Driest_Quarter b18.Precipitation_of_Warmest_Quarter b19.Precipitation_of_Coldest_Quarter
Clone12 2 99 68 1230 310 166 144 250 226 258 226 1462 249 3 68 573 17 549 17
Clone14 2 100 68 1235 301 155 146 241 217 248 217 1525 259 3 67 603 18 575 18
Clone16 2 93 65 1389 310 168 142 250 223 258 223 1416 264 0 73 579 8 544 8
Clone20 2 154 55 3955 403 123 280 296 234 315 214 118 62 0 184 107 0 45 0
Clone2 1 152 55 3617 403 128 275 287 242 316 220 173 80 0 167 153 0 18 0
Clone4 1 168 51 5719 414 86 328 315 201 322 181 20 12 0 166 18 0 17 0
Clone8 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
CORRECTION: kmers outliers are obtained using a correction of BONFERONNI, BH or FDR model.
ALPHA: modify the alpha cutoff for outlier detection
=> 4. MAPPING_KMERS
MAPPING_KMERS section in PARAMS can optionally be used to align kmers to a genomic reference. It could give a idea of selected regions in a genome.
PARAMS:
MAPPING_KMERS:
REF: "reference.fasta"
MODE : bwa-aln
INDEX_OPTIONS: ""
OPTIONS : "-n 0.04"
FILTER_FLAG : 4
FILTER_QUAL : 10
Use a reference file in the REF section.
Parametter MODE using bwa-aln or bwa-mem2
Set up the INDEX_OPTIONS according to the MODE you have chosen.
If bwa-mem2 leaf empty
If bwa-aln “-a bwtsw” or “”
Set options according of chosen mapper in the OPTIONS key.
If bwa-mem2 default parameters -A 1 -B 4;
If bwa-aln -n 0.04
Obtained bam could be filtered using FILTER_FLAG (-F 4 by default) and FILTER_QUAL (mapq>10 by defaut) params.
=> 5. ASSEMBLY_KMERS
ASSEMBLY_KMERS section in PARAMS can optionally be used to assembly significant kmers obtained by pcadapt or/and lfmm.
Contigs are assembled by iKISS using mergeTags from dekupl package https://github.com/Transipedia/dekupl-mergeTags.
Chose minimal overlap size “OVERLAP_SIZE” allowed to assembly kmers.
Feel free to filter contigs by size “FILTER_CONTIG_SIZE”.
Assembled contigs could be mapped activating MAPPING_CONTIGS. This mapping can be launch versus a REF reference file using bwa-mem2 by default. Reference file used in this step can be a different reference from MAPPING_KMERS options. Feel free of change parametters of mapping using MAPPING_OPTIONS
Assembled contigs could be used by blastn against a database, you can also try to annotate them!
PARAMS:
ASSEMBLY:
OVERLAP_SIZE : 15
FILTER_CONTIG_SIZE : 100
MAPPING_CONTIGS: True
# if MAPPING_CONTIGS is activate, ikiss maps contigs vs REF using bwamem2
REF: 'reference.fasta'
MAPPING_OPTIONS : ""
=> 6. INTERSECT
iKISS uses bedtools intersect to calculate how many kmers/contigs are mapped in FEATURES (gene by default).
These FEATURES are filtered from the annotation GFF fileb before use bedtools intersect.
iKISS filtered kmers/contigs by using FILTER_MAPQ_STATS and minimal kmers/contigs number FILTER_MIN_STATS by FEATURE.
PARAMS:
INTERSECT:
GFF : 'reference.gff'
FEATURE : 'gene'
FILTER_MAPQ_STATS: '15'
3.2. Adapt cluster_config.yaml
If you will run ikiss in cluster, adapt cluster_config.yaml :
ikiss edit_cluster_config
Inside cluster_config.yaml, adapt partition to your favorite cluster and change memory and cpu number in by __default__ key or in rules you need :
__default__:
cpus-per-task : 4
mem-per-cpu : 10G
partition : "normal"
nodelist: node19
output : 'slurm_logs/stdout/{rule}/{wildcards}.o'
error : 'slurm_logs/error/{rule}/{wildcards}.e'
job-name : '{rule}.{wildcards}'
kmers_gwas_per_sample:
cpus-per-task : 4
mem-per-cpu : 10G
RULES
Here you can quickly find iKISS snakemake rules list :
rule kmers_gwas_per_sample *
rule kmers_to_use
rule kmers_table
rule extract_kmers_from_bed
rule index_ref
rule index_ref_to_assembly
rule mapping_kmers
rule filter_bam
rule kmer_position_from_bam *
rule merge_kmer_position
rule samtools_merge
rule pcadapt *
rule merge_method
rule outliers_position
rule extracting_features_from_gff
rule kmers_bedtools_intersect
rule get_pca_from_phenotype
rule lfmm *
rule mergetags
rule mapping_contigs
rule contigs_bedtools_intersect
rule intersect_and_contigs
rule intersect_and_outliers
rule fastq_stats
rule report_ikiss
rule html_ikiss
rules with a * can be parallelised.
4. Running iKISS
Run iKISS by ikiss run_local or ikiss run_cluster as explained in “Running a datatest” section.
5. iKISS output
This is a overwiew of iKISS output directory:
OUTPUT-KISS/
config_corrected.yaml
0.FASTQ_STATS
└── fastq_stats.txt
1.KMERS_MODULE
├── Clone12
├── Clone14
├── Clone16
├── Clone2
├── Clone20
├── Clone4
└── Clone8
2.KMERS_TABLE
├── kmers_list_paths.txt
├── kmers_table.names
├── kmers_table.table
├── kmers_to_use
├── kmers_to_use.no_pass_kmers
├── kmers_to_use.shareness
├── kmers_to_use.stats.both
├── kmers_to_use.stats.only_canonical
└── kmers_to_use.stats.only_non_canonical
3.TABLE2BED
├── log
├── output_file.0.bed
├── output_file.0.bim
├── output_file.0.fam
├── output_file.1.bed
├── output_file.1.bim
├── output_file.1.fam
├── output_file.2.bed
├── output_file.2.bim
├── output_file.2.fam
├── output_file.3.bed
├── output_file.3.bim
├── output_file.3.fam
├── output_file.4.bed
├── output_file.4.bim
└── output_file.4.fam
4.EXTRACT_FASTA
├── output_file.0.fasta.gz
├── output_file.1.fasta.gz
├── output_file.2.fasta.gz
├── output_file.3.fasta.gz
└── output_file.4.fasta.gz
5.RANGES
├── output_file.0
├── output_file.1
├── output_file.2
├── output_file.3
└── output_file.4
6.LFMM
├── output_file.0_10_LFMM_outliers.csv
├── output_file.0_10_LFMM_pvalues.csv
├── output_file.0_10_LFMM.rplot.pdf
...
6.LFMM_PHENO
├── PCA_from_phenotype.csv
├── PCA_from_phenotype.html
└── PCA_from_phenotype.ipynb
6.PCADAPT
├── output_file.0_10_PCADAPT_outliers.csv
├── output_file.0_10_PCADAPT_pvalues.csv
├── output_file.0_10_PCADAPT.rplot.pdf
├── output_file.0_10_PCADAPT_scores.csv
...
7.MERGED_LFMM
├── merged_LFMM_outliers.csv
└── merged_LFMM_pvalues.csv
7.MERGED_PCADAPT
├── merged_PCADAPT_outliers.csv
└── merged_PCADAPT_pvalues.csv
8.MAPPING_KMERS
├── bam_files.txt
├── output_file.0_vs_reference.bam
├── output_file.0_vs_reference_FMQ.bam
├── output_file.0_vs_reference.sai
├── output_file.0_vs_reference_sorted.bam
├── output_file.0_vs_reference_sorted.bam.bai
├── output_file.0_vs_reference_sorted.bam.idxstats
├── output_file.0_vs_reference_sorted.bam.stats
...
9.KMERPOSITION
├── output_file.0_vs_reference_KMERPOSITION.txt
├── output_file.1_vs_reference_KMERPOSITION.txt
├── output_file.2_vs_reference_KMERPOSITION.txt
├── output_file.3_vs_reference_KMERPOSITION.txt
└── output_file.4_vs_reference_KMERPOSITION.txt
10.MERGE_KMERPOSITION
├── kmer_position_merged.txt
└── kmer_position_samtools_merge.bam
11.OUTLIERS_LFMM_POSITION
└── outliers_with_position.csv
11.OUTLIERS_PCADAPT_POSITION
└── outliers_with_position.csv
12.ASSEMBLY_OUTLIERS_LFMM
├── contigs_LFMM_vs_reference.bam
├── contigs_LFMM_vs_reference.sorted.bam
├── contigs_LFMM_vs_reference.sorted.bam.bai
├── contigs_LFMM_vs_reference.sorted.bam.idxstats
├── contigs_LFMM_vs_reference.sorted.bam.stats
├── outliers_LFMM_mergetags.csv
└── outliers_LFMM_mergetags.fasta
12.ASSEMBLY_OUTLIERS_PCADAPT
├── contigs_PCADAPT_vs_reference.bam
├── contigs_PCADAPT_vs_reference.sorted.bam
├── contigs_PCADAPT_vs_reference.sorted.bam.bai
├── contigs_PCADAPT_vs_reference.sorted.bam.idxstats
├── contigs_PCADAPT_vs_reference.sorted.bam.stats
├── outliers_PCADAPT_mergetags.csv
└── outliers_PCADAPT_mergetags.fasta
13.GFF_FEATURES
└── extracted.gff
14.CONTIGS_INTERSECT_LFMM
└── contigs_intersect_annotation.bed
14.CONTIGS_INTERSECT_PCADAPT
└── contigs_intersect_annotation.bed
14.KMERS_INTERSECT
└── kmers_bedtools_intersect_annotation.bed
15.CONTIGS_LFMM_INTERSECT
└── global_intersect_stats
15.CONTIGS_PCADAPT_INTERSECT
└── global_intersect_stats
15.OUTLIERS_LFMM_INTERSECT
├── global_intersect_stats
└── outliers_intersect_stats
15.OUTLIERS_PCADAPT_INTERSECT
├── global_intersect_stats
└── outliers_intersect_stats
REF
├── reference2.fasta
├── reference2.fasta.0123
├── reference2.fasta.amb
├── reference2.fasta.ann
├── reference2.fasta.bwt.2bit.64
├── reference2.fasta.pac
├── reference.fasta
├── reference.fasta.amb
├── reference.fasta.ann
├── reference.fasta.bwt
├── reference.fasta.pac
└── reference.fasta.sa
REPORT
├── iKISS_report.csv
├── iKISS_report.html
├── iKISS_report.ipynb
├── PCA_from_phenotype.html
└── PCA_from_phenotype.ipynb
BENCHMARK
LOGS
Note : we recommended to remove 1.KMER_GWAS repertory after analysis.
Contributeurs
Djamel Boubred (Bioinformatics Student at IRD) and Tram VI (Ph.D student IRD) have also contributed by debugging and test with rice and coffea datasets.
Sebastien Ravel has also contributed with the snakecdysis python package developpement.
Thanks
Thanks to Ndomassi Tando (i-Trop IRD) for his administration support.
The authors acknowledge the IRD i-Trop HPC (South Green Platform) from IRD Montpellier for providing HPC resources that contributed to this work. https://bioinfo.ird.fr/ - http://www.southgreen.fr
License
Licensed under MIT.
Intellectual property belongs to IRD and authors.
iKISS uses recycled code from the culebrONT project of SouthGreen platform https://culebront-pipeline.readthedocs.io/en/latest/. iKISS uses SnakEcdysis package https://snakecdysis.readthedocs.io/en/latest/package.html to perform installation and execution in local and cluster mode.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file ikiss-1.5.0.tar.gz
.
File metadata
- Download URL: ikiss-1.5.0.tar.gz
- Upload date:
- Size: 3.2 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 551f512cfd1d03b880f77110e216f924bbb22905db9e5a406b24c3dbbaf699b2 |
|
MD5 | ef87b126ec539a41262a69009e2706ed |
|
BLAKE2b-256 | 5a09d19f21b90d6350bcef81a4cf59badbd7bb8677cdd1f5452431d068194f53 |
File details
Details for the file ikiss-1.5.0-py3-none-any.whl
.
File metadata
- Download URL: ikiss-1.5.0-py3-none-any.whl
- Upload date:
- Size: 3.2 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.12
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 18359fe0394d47804679a0c53a0d7aca168b3817e0556d1def557871fa526b7a |
|
MD5 | 4d92ac494c1de9d2a6c4bdc23e4beaa3 |
|
BLAKE2b-256 | e31479d540fd25d0f40e66272dcac9f312098cd09d3220969ed61e441a660d9b |