A package to detect IBS regions

IBSpy

Python library to identify Identical By State (IBS) regions.

To build the KMC k-mer database used by the tests, run this command:

kmc -k31 -r -ci1 -fm data/test4B.jagger.fa data/test4B.jagger.kmc_k31 tmp

Installing IBSpy

The easiest way to install IBSpy is with pip3.

pip3 install IBSpy

If pip3 fails, you can clone the project and compile it with:

pip3 install cython biopython pyfaidx
python3 setup.py develop

Then you should have the IBSpy command available.

KMC3

If you want to use the KMC binding, install KMC and compile its Python API following the KMC instructions.

Then run the following commands to set up the path for it.

cd KMC/py_kmc_api
source set_path.sh 

Preparing the databases

IBSpy requires a k-mer database built from the sequencing files. Currently two formats are supported:

  1. Jellyfish: Follow the instructions on its website.
  2. kmerGWAS: Uses an ad hoc file format that contains only the k-mers, sorted, in a binary representation. This option is faster than the Jellyfish version, but creating the k-mer table is less straightforward. The manual is here.
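
As a rough illustration of what such a database holds, the sketch below counts k-mers in a short sequence. It is not how Jellyfish, kmerGWAS or KMC work internally; the helper names (canonical, count_kmers) are hypothetical, and storing each k-mer in canonical form (the smaller of the k-mer and its reverse complement) is an assumption that depends on how the database was built.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequence: str, k: int) -> Counter:
    """Count canonical k-mers, skipping any k-mer with a non-ACGT base (e.g. N)."""
    counts = Counter()
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


if __name__ == "__main__":
    # Toy sequence; real databases are built from sequencing reads with k=31.
    for kmer, n in sorted(count_kmers("ACGTACGT", k=4).items()):
        print(kmer, n)
```

The real tools store the same kind of information far more compactly, which is why IBSpy reads their databases instead of counting k-mers itself.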

Running unit tests

To make sure that your changes haven't broken the core of IBSpy, run the unit tests:

python3 setup.py test

Running IBSpy

IBSpy has relatively few options. You can list them with the --help option.

IBSPy --help
usage: IBSPy [-h] [-w WINDOW_SIZE] [-k KMER_SIZE] [-d DATABASE] [-r REFERENCE]
             [-z] [-o OUTPUT] [-f {kmerGWAS,jellyfish}]

optional arguments:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        window size to analyze
  -k KMER_SIZE, --kmer_size KMER_SIZE
                        Kmer size of the database
  -d DATABASE, --database DATABASE
                        Kmer database
  -r REFERENCE, --reference REFERENCE
                        The reference with the position of the kmers
  -z, --compress        When an ouput file is present, it is compressed as .gz
  -o OUTPUT, --output OUTPUT
                        Output file. If missing, the ouptut is sent to stdout
  -f {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}, --database_format {kmerGWAS,kmerGWAS_mmap,jellyfish,kmc3}
                        Database format 

To generate the table with the number of observed k-mers and variants from a kmerGWAS k-mer database, run the following command:

IBSpy --output "kmer_windows_LineXXX.tsv.gz" --database kmers_with_strand --reference arinaLrFor.fa --window_size 50000 --compress --database_format kmerGWAS

For KMC3, the database is the name used while creating the database, not the filename.
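
To make the idea of a per-window table concrete, here is a minimal sketch of counting observed k-mers and variations along a reference. It is not IBSpy's implementation. The function name count_windows, the row layout, and the definition of a variation as a run of consecutive reference k-mers missing from the database are all assumptions made for illustration. For brevity it also ignores k-mers that span a window boundary and does not canonicalize k-mers.

```python
def count_windows(reference: str, kmer_db: set, k: int, window_size: int):
    """Return one (start, end, total_kmers, observed_kmers, variations) row per window.

    Assumption: a "variation" is a run of consecutive k-mers absent from kmer_db.
    """
    rows = []
    for start in range(0, len(reference), window_size):
        window = reference[start:start + window_size]
        total = observed = variations = 0
        in_gap = False  # True while we are inside a run of missing k-mers
        for i in range(len(window) - k + 1):
            total += 1
            if window[i:i + k] in kmer_db:
                observed += 1
                in_gap = False
            else:
                if not in_gap:
                    variations += 1  # a new run of missing k-mers starts here
                in_gap = True
        rows.append((start, start + len(window), total, observed, variations))
    return rows


if __name__ == "__main__":
    # Toy data: a 12 bp reference, 3-mers, 6 bp windows.
    database = {"AAA", "ACC", "GGG"}
    for row in count_windows("AAAACCCCGGGG", database, k=3, window_size=6):
        print("\t".join(str(value) for value in row))
```

On real data the reference is a FASTA file, k is 31, and windows are tens of kilobases wide, as in the command above.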

Running IBSplot

List the IBSplot options with --help.

IBSplot --help
usage: IBSplot [-h] [-i IBSPY_COUNTS] [-w WINDOW_SIZE] [-f FILTER_COUNTS]
               [-n N_COMPONENTS] [-c COVARIANCE_TYPE] [-s STITCH_NUMBER]
               [-o OUTPUT] [-r REFERENCE] [-q QUERY] [-p PLOT_OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -i IBSPY_COUNTS, --IBSpy_counts IBSPY_COUNTS
                        tvs file genetared by IBSpy output
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Windows size to count variations within
  -f FILTER_COUNTS, --filter_counts FILTER_COUNTS
                        Filter number of variaitons above this threshold to
                        compute GMM model, default=None
  -n N_COMPONENTS, --n_components N_COMPONENTS
                        Number of componenets for the GMM model, default=3
  -c COVARIANCE_TYPE, --covariance_type COVARIANCE_TYPE
                        type of covariance used for GMM model, default="full"
  -s STITCH_NUMBER, --stitch_number STITCH_NUMBER
                        Consecutive "outliers" in windows to stitch, default=3
  -o OUTPUT, --output OUTPUT
                        tsv file with variations count by windows and summary
                        statistics
  -r REFERENCE, --reference REFERENCE
                        genome reference name
  -q QUERY, --query QUERY
                        query sample
  -p PLOT_OUTPUT, --plot_output PLOT_OUTPUT
                        histograms and ascatter files in .PDF format

IBSplot uses the output table generated by IBSpy as described above (e.g., "kmer_windows_LineXXX.tsv.gz"). It can regroup the variant counts into larger windows. The example command in this section uses 400,000 bp windows to compute a GMM model and generate the plots.

To generate the plots and the table with variant counts categorized by the GMM model as IBS or non-IBS, run the following command. The description of the GMM model is here.

# minimal arguments
IBSplot --IBSpy_counts "kmeribs-Wheat_Jagger-Flame.tsv.gz" --window_size 400000 --output gmm_ibs.tsv.gz --reference Jagger --query Flame --plot_output gmm_plots.pdf
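
The regrouping of small IBSpy windows into larger ones can be pictured with the sketch below. It is an illustration only, not IBSplot's code. The function name merge_windows and the (seqname, start, end, variations) row layout are hypothetical; check the header of your IBSpy output for the actual column names.

```python
def merge_windows(rows, window_size: int):
    """Sum variation counts of small windows into bins of window_size bp.

    rows: iterable of (seqname, start, end, variations) tuples (assumed layout).
    """
    merged = {}
    for seqname, start, _end, variations in rows:
        # Each small window is assigned to the large bin containing its start.
        bin_start = (start // window_size) * window_size
        key = (seqname, bin_start)
        merged[key] = merged.get(key, 0) + variations
    return [
        (seqname, bin_start, bin_start + window_size, total)
        for (seqname, bin_start), total in sorted(merged.items())
    ]


if __name__ == "__main__":
    # Toy 50,000 bp windows regrouped into 400,000 bp windows.
    small = [
        ("chr1", 0, 50000, 3),
        ("chr1", 50000, 100000, 1),
        ("chr1", 400000, 450000, 7),
        ("chr2", 0, 50000, 2),
    ]
    for row in merge_windows(small, 400000):
        print("\t".join(str(value) for value in row))
```

Choosing a window size that is a multiple of the IBSpy window size keeps every small window inside exactly one large bin.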

You can also add some or all of the following options to the IBSplot command to tune the GMM model parameters and find the best separation of IBS and non-IBS regions for the reference and query sample used:

IBSplot --filter_counts 1000 --n_components 3 --covariance_type 'full' --stitch_number 3
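
One way to read the --stitch_number option is sketched below. This is an assumption about the behaviour, not IBSplot's actual code: short runs of non-IBS windows that are flanked on both sides by IBS windows are treated as outliers and relabeled as IBS when the run is no longer than the threshold. The function name stitch and the boolean labels are hypothetical.

```python
def stitch(labels, stitch_number: int):
    """Relabel short non-IBS runs flanked by IBS windows as IBS.

    labels: list of booleans, True for an IBS window, in genomic order.
    Assumption: a run is stitched when its length is <= stitch_number.
    """
    stitched = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i]:
            i += 1
            continue
        # Find the end of the current run of non-IBS windows: [i, j).
        j = i
        while j < n and not labels[j]:
            j += 1
        flanked = i > 0 and j < n  # IBS windows on both sides of the run
        if flanked and (j - i) <= stitch_number:
            for position in range(i, j):
                stitched[position] = True
        i = j
    return stitched


if __name__ == "__main__":
    windows = [True, False, False, True, False, False, False, False, True, False]
    print(stitch(windows, stitch_number=3))
```

Under this reading, a larger value merges more isolated outlier windows into the surrounding IBS block, at the risk of hiding short genuine non-IBS segments.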

Download files

Source Distribution

IBSpy-0.3.1.tar.gz (6.2 MB)
