Skip to main content

iKISS is a pipeline to detect kmers under selection.

Project description

ikiss Logo

PythonVersions SnakemakeVersions Singularity

Homepage: https://forge.ird.fr/diade/iKISS

About iKISS

iKISS (Kmer Inference sSelection) is a snakemake pipeline able to decompose reads into kmers and extract kmers under selection.

iKISS uses KmersGWAS https://github.com/voichek/kmersGWAS, pcadapt https://cran.r-project.org/web/packages/pcadapt/readme/README.html and lfmm https://bcm-uga.github.io/lfmm/articles/lfmm to select genomics regions under selection.

1. Install dependencies and clone iKISS

Check dependencies for iKISS : python and singularity

Install singularity and python3 in your local machine OR use module load to add singularity and python3 in your environment if you are working in a cluster :

module load system/python/3.8.12
module load system/singularity/3.6.0

iKISS is NOW available as a PyPI package (recommended)

python3 -m pip install ikiss

OR you can also install iKISS from git repository

python3 -m pip install ikiss@git+https://forge.ird.fr/diade/iKISS.git

#OR

git clone https://forge.ird.fr/diade/iKISS.git
cd iKISS
python3 -m pip install .

1.1 Installing in cluster mode

Install iKISS in cluster mode using singularity container from ikiss_utilities https://itrop.ird.fr/ikiss_utilities/

ikiss install_cluster --help
ikiss install_cluster --scheduler slurm --env singularity

1.2 Installing in local mode

ikiss install_local --help
ikiss install_local

2. Running a datatest

Running test with a datatest from iKISS_utilities in a repertory TEST

ikiss test_install --help
ikiss test_install -d TEST

2.1 In CLUSTER mode

Launching suggested command line done by iKISS, in CLUSTER mode :

Please run command line ‘ikiss create_cluster_config’ before the first run and modify theads, ram, node and computer ressources. iKISS do a copy of cluster_config.yaml file into your home “/home/$USER/.config/ikiss/cluster_config.yaml”

ikiss run_cluster --help
ikiss create_cluster_config

If singularity was selected in installation of iKISS, it could be needed to give argument –singularity-args "–bind $HOME" to Snakemake, by using :

ikiss run_cluster --help
ikiss run_cluster -c TEST/data_test_config.yaml --singularity-args "--bind $HOME"
# @IFB
ikiss run_cluster -c TEST/data_test_config.yaml --singularity-args "--bind /shared:/shared"
#you can also use snakemake parametters as --rerun-incomplete --nolock

Important Note : In i-Trop cluster, run iKISS using ONLY a node, data has to be in “/scratch” of chosen node. Use nodelist : nodeX parametter inside of cluster_config.yaml file.

2.2 In LOCAL mode

launching suggested command line done by iKISS, in LOCAL mode:

ikiss run_local --help
ikiss run_local -t 8 -c TEST/data_test_config.yaml --singularity-args "--bind $HOME"

In local mode, its possible to allocate threads to some rules using –set-threads snakemake argument such as

ikiss run_local -t 8 -c TEST/data_test_config.yaml --set-threads kmers_gwas_per_sample=4 mapping_kmers=2 filter_bam=2 kmer_position_from_bam=4 pcadapt=2 extract_kmers_from_bed=2

3. Running your data

3.1. Adapt config.yaml

Before to run iKISS, adapt config.yaml by using :

ikiss create_config

Adapt config.yaml file with path to fastq files (FASTQ) and outfile (OUTPUT) in the DATA section.

DATA:
   FASTQ: './DATATEST/fastq'
   OUTPUT: './OUTPUT-KISS/'

:warning if yours reads are ilumina paired, you need rename reads SAMPLE_R1.fastq.gz and SAMPLE_R2.fastq.gz. For single reads use SAMPLE_R1.fastq.gz

iKISS uses compressed ans decompressed fastq files.

3.1.1 WORKFLOW section

Parameter iKISS steps using the section WORKFLOW and parameter it with the PARAMS sections.

In WORKFLOW section:

KMERS_GWAS step has to be activated by default.

PCADAPT, LFMM, MAPPING or ASSEMBLY are optional. Active or deactivate these steps using true or false.

KMERS_GWAS convert reads in kmers, filter them and create a format ready to use in population genomics!

PCADAPT detects genetic markers (kmers here ^^) involved in biological adaptation and provides outlier detection based on Principal Component Analysis (PCA).

LFMM is used by iKISS for testing correlations between kmers and environmental data.

MAPPING_KMERS can optionally be used to align kmers to a genomic reference (if it is available ! ).

ASSEMBLY_KMERS can optionally assembly significant kmers obtained by pcadapt or lfmm

INTERSECT can optionally calculate how many kmers (if MAPPING_KMERS is activated ) or contigs(if ASSEMBLY_KMERS is activated) are found in FEATURES (gene by default)

WORKFLOW:
   KMERS_MODULE : true
   PCADAPT : true
   LFMM : true
   MAPPING_KMERS: true
   ASSEMBLY_KMERS: true
   INTERSECT: True

3.1.2 PARAMS section

In the PARAMS section, tools parameters can be modified and adapted.

=> 1. KMERS_MODULE

KMERS_GWAS module decompose reads into kmers and create a binary table of presence/absence of kmers. This table can be filter to use only most informative kmers into the populations. PLINK format outfiles are obtained in this module.

PARAMS:
   KMERS_MODULE:
      KMER_SIZE : 31
      MAC : 2
      P : 0.2
      MAF : 0.05
      B : 1000000 # nb kmers in each bed file
      SPLIT_LIST_SIZE : 100000
      MIN_LIST_SIZE : 50000

KMER_SIZE is the length of kmers (should be between 15-31)

MAC is the minor allele count (min allowed appearance of a kmer)

P is the minimum percent of appearance in each strand form

MAF is the minimum allele frequency

B is the number of kmers in each bed file

SPLIT_LIST_SIZE is the nb of kmers by bed file

MIN_LIST_SIZE indicates the minimal number of kmers allowed in the smaller bed file after splitting

=> 2. PCADAPT

PCADAPT detects kmers involved in biological adaptation and provides outlier detection based on Principal Component Analysis (PCA)

PARAMS:
   PCADAPT:
      K : 2
      SAMPLES: "samples.txt"
      CORRECTION: 'FDR'
      ALPHA : 0.05

K : number K of principal components

SAMPLES : you need to generate a samples.txt file. This file contains two columns (tab delimitations) : accession_id and phenotype_value. It will be used by PCADAPT.

accession_id : contains exactly same name of samples in FASTQ.

phenotype_value (int): contains sample group (wild=1, cultivated=2 for example)

accession_id group
Clone12      2
Clone14      2
Clone16      2
Clone20      2
Clone2       1
Clone4       1
Clone8       1

CORRECTION: kmers outliers are obtained using a correction of BONFERONNI, BH or FDR model.

ALPHA: modify the alpha cutoff for outlier detection

=> 3. LFMM

LFMM is used by iKISS for testing correlations between kmers and environmental data.

PARAMS:
   LFMM:
      K : 2
      PHENOTYPE_FILE: "pheno.txt"
      PHENOTYPE_PCA_ANALYSIS : false
      CORRECTION: 'BH'
      ALPHA : 0.05

K are the latent factors used in LFMM association analyses

PHENOTYPE_FILE: an phenotype file is obligatory in LFMM analysis. You can give to iKISS PCA results, climate variables, etc.

A PCA can reveal some ‘structure’ in the genotype data and it could help you to fix K parameter.

PHENOTYPE_PCA_ANALYSIS

  • If PHENOTYPE_PCA_ANALYSIS is true, iKISS automatically run PCA using the file given by user in the PHENOTYPE_FILE key. This PHENOTYPE_FILE can be a PCA result for example.

  • If PHENOTYPE_PCA_ANALYSIS is false, iKISS use directly the PHENOTYPE_FILE as ‘phenotype’ to LFMM analysis. Kmers are used as ‘genotype’ data.

Here, a example of a phenotype file with climate variables

accession_id        group   b2.Mean_Diurnal_Range   b3.Isothermality        b4.Temp_Seasonality     b5.Max_Temp_of_Warmest_Month    b6.Min_Temp_of_Coldest_Month    b7.Temp_Annual_Range    b8.Mean_Temp_of
_Wettest_Quarter    b9.Mean_Temp_of_Driest_Quarter  b10.Mean_Temp_of_Warmest_Quarter        b11.Mean_Temp_of_Coldest_Quarter        b12.Annual_Precipitation        b13.Precipitation_of_Wettest_Mo
nth b14.Precipitation_of_Driest_Month       b15.Precipitation_Seasonality   b16.Precipitation_of_Wettest_Quarter    b17.Precipitation_of_Driest_Quarter     b18.Precipitation_of_Warmest_Quarter    b19.Precipitation_of_Coldest_Quarter
Clone12     2       99      68      1230    310     166     144     250     226     258     226     1462    249     3       68      573     17      549     17
Clone14     2       100     68      1235    301     155     146     241     217     248     217     1525    259     3       67      603     18      575     18
Clone16     2       93      65      1389    310     168     142     250     223     258     223     1416    264     0       73      579     8       544     8
Clone20     2       154     55      3955    403     123     280     296     234     315     214     118     62      0       184     107     0       45      0
Clone2      1       152     55      3617    403     128     275     287     242     316     220     173     80      0       167     153     0       18      0
Clone4      1       168     51      5719    414     86      328     315     201     322     181     20      12      0       166     18      0       17      0
Clone8      1       NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA

CORRECTION: kmers outliers are obtained using a correction of BONFERONNI, BH or FDR model.

ALPHA: modify the alpha cutoff for outlier detection

=> 4. MAPPING_KMERS

MAPPING_KMERS section in PARAMS can optionally be used to align kmers to a genomic reference. It could give a idea of selected regions in a genome.

PARAMS:
   MAPPING_KMERS:
      REF: "reference.fasta"
      MODE : bwa-aln
      INDEX_OPTIONS: ""
      OPTIONS : "-n 0.04"
      FILTER_FLAG : 4
      FILTER_QUAL : 10

Use a reference file in the REF section.

Parametter MODE using bwa-aln or bwa-mem2

Set up the INDEX_OPTIONS according to the MODE you have chosen.

If bwa-mem2 leaf empty

If bwa-aln “-a bwtsw” or “”

Set options according of chosen mapper in the OPTIONS key.

If bwa-mem2 default parameters -A 1 -B 4;

If bwa-aln -n 0.04

Obtained bam could be filtered using FILTER_FLAG (-F 4 by default) and FILTER_QUAL (mapq>10 by defaut) params.

=> 5. ASSEMBLY_KMERS

ASSEMBLY_KMERS section in PARAMS can optionally be used to assembly significant kmers obtained by pcadapt or/and lfmm.

Contigs are assembled by iKISS using mergeTags from dekupl package https://github.com/Transipedia/dekupl-mergeTags.

Chose minimal overlap size “OVERLAP_SIZE” allowed to assembly kmers.

Feel free to filter contigs by size “FILTER_CONTIG_SIZE”.

Assembled contigs could be mapped activating MAPPING_CONTIGS. This mapping can be launch versus a REF reference file using bwa-mem2 by default. Reference file used in this step can be a different reference from MAPPING_KMERS options. Feel free of change parametters of mapping using MAPPING_OPTIONS

Assembled contigs could be used by blastn against a database, you can also try to annotate them!

PARAMS:
   ASSEMBLY:
      OVERLAP_SIZE : 15
      FILTER_CONTIG_SIZE : 100
      MAPPING_CONTIGS: True
      # if MAPPING_CONTIGS is activate, ikiss maps contigs vs REF using bwamem2
      REF: 'reference.fasta'
      MAPPING_OPTIONS : ""

=> 6. INTERSECT

iKISS uses bedtools intersect to calculate how many kmers/contigs are mapped in FEATURES (gene by default).

These FEATURES are filtered from the annotation GFF fileb before use bedtools intersect.

iKISS filtered kmers/contigs by using FILTER_MAPQ_STATS and minimal kmers/contigs number FILTER_MIN_STATS by FEATURE.

PARAMS:
   INTERSECT:
         GFF : 'reference.gff'
         FEATURE : 'gene'
         FILTER_MAPQ_STATS: '15'

3.2. Adapt cluster_config.yaml

If you will run ikiss in cluster, adapt cluster_config.yaml :

ikiss edit_cluster_config

Inside cluster_config.yaml, adapt partition to your favorite cluster and change memory and cpu number in by __default__ key or in rules you need :

__default__:
   cpus-per-task : 4
   mem-per-cpu : 10G
   partition : "normal"
   nodelist: node19
   output : 'slurm_logs/stdout/{rule}/{wildcards}.o'
   error : 'slurm_logs/error/{rule}/{wildcards}.e'
   job-name : '{rule}.{wildcards}'

kmers_gwas_per_sample:
   cpus-per-task : 4
   mem-per-cpu : 10G

RULES

Here you can quickly find iKISS snakemake rules list :

rule kmers_gwas_per_sample *
rule kmers_to_use
rule kmers_table
rule extract_kmers_from_bed
rule index_ref
rule index_ref_to_assembly
rule mapping_kmers
rule filter_bam
rule kmer_position_from_bam *
rule merge_kmer_position
rule samtools_merge
rule pcadapt *
rule merge_method
rule outliers_position
rule extracting_features_from_gff
rule kmers_bedtools_intersect
rule get_pca_from_phenotype
rule lfmm *
rule mergetags
rule mapping_contigs
rule contigs_bedtools_intersect
rule intersect_and_contigs
rule intersect_and_outliers
rule fastq_stats
rule report_ikiss
rule html_ikiss
  • rules with a * can be parallelised.

4. Running iKISS

Run iKISS by ikiss run_local or ikiss run_cluster as explained in “Running a datatest” section.

5. iKISS output

This is a overwiew of iKISS output directory:

OUTPUT-KISS/
   config_corrected.yaml
   0.FASTQ_STATS
   └── fastq_stats.txt
   1.KMERS_MODULE
   ├── Clone12
   ├── Clone14
   ├── Clone16
   ├── Clone2
   ├── Clone20
   ├── Clone4
   └── Clone8
   2.KMERS_TABLE
   ├── kmers_list_paths.txt
   ├── kmers_table.names
   ├── kmers_table.table
   ├── kmers_to_use
   ├── kmers_to_use.no_pass_kmers
   ├── kmers_to_use.shareness
   ├── kmers_to_use.stats.both
   ├── kmers_to_use.stats.only_canonical
   └── kmers_to_use.stats.only_non_canonical
   3.TABLE2BED
   ├── log
   ├── output_file.0.bed
   ├── output_file.0.bim
   ├── output_file.0.fam
   ├── output_file.1.bed
   ├── output_file.1.bim
   ├── output_file.1.fam
   ├── output_file.2.bed
   ├── output_file.2.bim
   ├── output_file.2.fam
   ├── output_file.3.bed
   ├── output_file.3.bim
   ├── output_file.3.fam
   ├── output_file.4.bed
   ├── output_file.4.bim
   └── output_file.4.fam
   4.EXTRACT_FASTA
   ├── output_file.0.fasta.gz
   ├── output_file.1.fasta.gz
   ├── output_file.2.fasta.gz
   ├── output_file.3.fasta.gz
   └── output_file.4.fasta.gz
   5.RANGES
   ├── output_file.0
   ├── output_file.1
   ├── output_file.2
   ├── output_file.3
   └── output_file.4
   6.LFMM
   ├── output_file.0_10_LFMM_outliers.csv
   ├── output_file.0_10_LFMM_pvalues.csv
   ├── output_file.0_10_LFMM.rplot.pdf
   ...
   6.LFMM_PHENO
   ├── PCA_from_phenotype.csv
   ├── PCA_from_phenotype.html
   └── PCA_from_phenotype.ipynb
   6.PCADAPT
   ├── output_file.0_10_PCADAPT_outliers.csv
   ├── output_file.0_10_PCADAPT_pvalues.csv
   ├── output_file.0_10_PCADAPT.rplot.pdf
   ├── output_file.0_10_PCADAPT_scores.csv
   ...
   7.MERGED_LFMM
   ├── merged_LFMM_outliers.csv
   └── merged_LFMM_pvalues.csv
   7.MERGED_PCADAPT
   ├── merged_PCADAPT_outliers.csv
   └── merged_PCADAPT_pvalues.csv
   8.MAPPING_KMERS
   ├── bam_files.txt
   ├── output_file.0_vs_reference.bam
   ├── output_file.0_vs_reference_FMQ.bam
   ├── output_file.0_vs_reference.sai
   ├── output_file.0_vs_reference_sorted.bam
   ├── output_file.0_vs_reference_sorted.bam.bai
   ├── output_file.0_vs_reference_sorted.bam.idxstats
   ├── output_file.0_vs_reference_sorted.bam.stats
   ...
   9.KMERPOSITION
   ├── output_file.0_vs_reference_KMERPOSITION.txt
   ├── output_file.1_vs_reference_KMERPOSITION.txt
   ├── output_file.2_vs_reference_KMERPOSITION.txt
   ├── output_file.3_vs_reference_KMERPOSITION.txt
   └── output_file.4_vs_reference_KMERPOSITION.txt
   10.MERGE_KMERPOSITION
   ├── kmer_position_merged.txt
   └── kmer_position_samtools_merge.bam
   11.OUTLIERS_LFMM_POSITION
   └── outliers_with_position.csv
   11.OUTLIERS_PCADAPT_POSITION
   └── outliers_with_position.csv
   12.ASSEMBLY_OUTLIERS_LFMM
   ├── contigs_LFMM_vs_reference.bam
   ├── contigs_LFMM_vs_reference.sorted.bam
   ├── contigs_LFMM_vs_reference.sorted.bam.bai
   ├── contigs_LFMM_vs_reference.sorted.bam.idxstats
   ├── contigs_LFMM_vs_reference.sorted.bam.stats
   ├── outliers_LFMM_mergetags.csv
   └── outliers_LFMM_mergetags.fasta
   12.ASSEMBLY_OUTLIERS_PCADAPT
   ├── contigs_PCADAPT_vs_reference.bam
   ├── contigs_PCADAPT_vs_reference.sorted.bam
   ├── contigs_PCADAPT_vs_reference.sorted.bam.bai
   ├── contigs_PCADAPT_vs_reference.sorted.bam.idxstats
   ├── contigs_PCADAPT_vs_reference.sorted.bam.stats
   ├── outliers_PCADAPT_mergetags.csv
   └── outliers_PCADAPT_mergetags.fasta
   13.GFF_FEATURES
   └── extracted.gff
   14.CONTIGS_INTERSECT_LFMM
   └── contigs_intersect_annotation.bed
   14.CONTIGS_INTERSECT_PCADAPT
   └── contigs_intersect_annotation.bed
   14.KMERS_INTERSECT
   └── kmers_bedtools_intersect_annotation.bed
   15.CONTIGS_LFMM_INTERSECT
   └── global_intersect_stats
   15.CONTIGS_PCADAPT_INTERSECT
   └── global_intersect_stats
   15.OUTLIERS_LFMM_INTERSECT
   ├── global_intersect_stats
   └── outliers_intersect_stats
   15.OUTLIERS_PCADAPT_INTERSECT
   ├── global_intersect_stats
   └── outliers_intersect_stats
   REF
   ├── reference2.fasta
   ├── reference2.fasta.0123
   ├── reference2.fasta.amb
   ├── reference2.fasta.ann
   ├── reference2.fasta.bwt.2bit.64
   ├── reference2.fasta.pac
   ├── reference.fasta
   ├── reference.fasta.amb
   ├── reference.fasta.ann
   ├── reference.fasta.bwt
   ├── reference.fasta.pac
   └── reference.fasta.sa
   REPORT
   ├── iKISS_report.csv
   ├── iKISS_report.html
   ├── iKISS_report.ipynb
   ├── PCA_from_phenotype.html
   └── PCA_from_phenotype.ipynb
   BENCHMARK
   LOGS

Note : we recommended to remove 1.KMER_GWAS repertory after analysis.

Authors

Julie Orjuela (IRD) develops iKISS

Yves Vigouroux (IRD) is the big boss with a lot of ideas and contributions!

Contributeurs

Djamel Boubred (Bioinformatics Student at IRD) and Tram VI (Ph.D student IRD) have also contributed by debugging and test with rice and coffea datasets.

Sebastien Ravel has also contributed with the snakecdysis python package developpement.

Thanks

Thanks to Ndomassi Tando (i-Trop IRD) for his administration support.

The authors acknowledge the IRD i-Trop HPC (South Green Platform) from IRD Montpellier for providing HPC resources that contributed to this work. https://bioinfo.ird.fr/ - http://www.southgreen.fr

License

Licensed under MIT.

Intellectual property belongs to IRD and authors.

iKISS uses recycled code from the culebrONT project of SouthGreen platform https://culebront-pipeline.readthedocs.io/en/latest/. iKISS uses SnakEcdysis package https://snakecdysis.readthedocs.io/en/latest/package.html to perform installation and execution in local and cluster mode.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ikiss-1.5.0.tar.gz (3.2 MB view hashes)

Uploaded Source

Built Distribution

ikiss-1.5.0-py3-none-any.whl (3.2 MB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page