isONclust

De novo clustering of long-read transcriptome reads.

isONclust
========

isONclust is a tool for clustering PacBio Iso-Seq reads or Oxford Nanopore reads, where each cluster represents all reads that came from one gene. Detailed information is available in the [preprint](link).

isONclust is distributed as a Python package supported on Linux / OSX with Python 2.7 and 3.4-3.6 (including 3.5-dev and 3.6-dev). [![Build Status](https://travis-ci.org/ksahlin/isONclust.svg?branch=master)](https://travis-ci.org/ksahlin/isONclust)

Table of Contents
=================

* [INSTALLATION](#INSTALLATION)
  * [Using conda](#Using-conda)
  * [Using pip](#Using-pip)
  * [Downloading source from GitHub](#Downloading-source-from-github)
    * [Dependencies](#Dependencies)
  * [Testing installation](#testing-installation)
* [USAGE](#USAGE)
  * [Iso-Seq](#Iso-Seq)
  * [Oxford Nanopore](#Oxford-Nanopore)
  * [Output](#Output)
  * [Parameters](#Parameters)
* [CREDITS](#CREDITS)
* [LICENCE](#LICENCE)

INSTALLATION
----------------

### Using conda

Coming soon.

### Using pip

`pip` is Python's official package installer. This section assumes you have `python` (v2.7 or >=3.4) and a recent version of `pip` installed; `pip` is bundled with most Python distributions. If you do not have `pip`, it can be installed [from here](https://pip.pypa.io/en/stable/installing/) and upgraded with `pip install --upgrade pip`.
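As a quick sanity check before installing, the version requirement stated above (v2.7 or >=3.4) can be expressed as a short script. This is only a minimal sketch: the helper name `is_supported` is hypothetical and is not part of isONclust or `pip`.

```python
import sys


def is_supported(version_info):
    """Return True for Python 2.7 or any Python >= 3.4.

    Hypothetical helper; it only encodes the requirement stated in this README.
    """
    major, minor = version_info[0], version_info[1]
    return (major, minor) == (2, 7) or (major, minor) >= (3, 4)


if __name__ == "__main__":
    # Report whether the interpreter running this script meets the requirement.
    print("supported" if is_supported(sys.version_info) else "unsupported")
```

Running the script with the interpreter you intend to use for `pip install` reports whether that interpreter falls in the supported range.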

With `python` and `pip` available, create a file `requirements.txt` with contents copied from [this file](https://github.com/ksahlin/isONclust/blob/master/requirements.txt). Then run the following in a terminal:

```
pip install --requirement requirements.txt isONclust
```

`pip` will install the dependencies automatically for you. isONclust has been built with Python 2.7 and 3.4-3.6 on Linux systems using [Travis](https://travis-ci.org/). For a customized installation of the latest master branch, see below.

### Downloading source from GitHub

#### Dependencies

Make sure the dependencies listed below are installed (each entry links to its installation instructions). Versions in parentheses are suggested, as isONclust has not been tested with earlier versions of these libraries. However, isONclust may also work with earlier versions.
* [parasail](https://github.com/jeffdaily/parasail-python)
* [pysam](http://pysam.readthedocs.io/en/latest/installation.html) (>= v0.11)

With these dependencies installed, run:

```sh
git clone https://github.com/ksahlin/isONclust.git
cd isONclust
./isONclust
```

### Testing installation

You can verify a successful installation by running isONclust on this [small dataset](https://github.com/ksahlin/isONclust/tree/master/test/sample_alz_2k.fastq). Simply download the test dataset and run:

```
isONclust pipeline --fastq [test/sample_alz_2k.fastq] --outfolder [output path]
```

USAGE
-------

isONclust can be used with either Iso-Seq or ONT reads. It takes either a fastq file or a ccs.bam file as input.

### Iso-Seq

isONclust works with full-length non-chimeric (flnc) reads that have quality values assigned to bases. If you already have such a fastq file generated for your reads, isONclust can be run as

```
isONclust pipeline --isoseq --fastq <reads.fastq> --outfolder </path/to/output>
```

If not, flnc reads can be generated as follows. Raw PacBio subreads need to be processed with `ccs` using the `--polish` option (to get quality values), followed by `lima` and `isoseq3 cluster` to get the flnc reads. The flnc file is generated at the very beginning of the `isoseq3 cluster` algorithm and can be used as soon as it is created (no need to wait for isoseq3 to finish). See the full documentation on generating flnc reads at [isoseq3](https://github.com/PacificBiosciences/IsoSeq3). After these three commands have been run, isONclust can be run as follows:
```
isONclust --isoseq --ccs <ccs.bam> --flnc <flnc.bam> --outfolder </path/to/output>
```
Here `<ccs.bam>` is the file generated by `ccs` and `<flnc.bam>` is the file generated by `isoseq3 cluster`. The argument `--isoseq` simply means `--k 15 --w 50`. These values can instead be set manually, in which case the `--isoseq` flag can be omitted. Specify the number of cores with `--t`.

### Oxford Nanopore
isONclust needs a fastq file generated by an Oxford Nanopore basecaller.

```
isONclust pipeline --ont --fastq <reads.fastq> --outfolder </path/to/output>
```
The argument `--ont` simply means `--k 13 --w 30`. These values can instead be set manually, in which case the `--ont` flag can be omitted. Specify the number of cores with `--t`.
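The relationship between the preset flags and the underlying `--k` / `--w` values can be sketched as follows. The preset values and defaults come from this README; the parser and the function `resolve_k_w` are hypothetical illustrations and are not isONclust's actual argument handling. In particular, this sketch assumes a preset flag overrides any manually supplied `--k` / `--w`, which the source does not state.

```python
import argparse

# Preset values as documented in this README.
PRESETS = {
    "isoseq": {"k": 15, "w": 50},
    "ont": {"k": 13, "w": 30},
}


def resolve_k_w(argv):
    """Return the (k, w) pair implied by a list of command-line arguments.

    Hypothetical helper: assumes a preset flag takes precedence over
    explicit --k / --w values.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--ont", action="store_true")
    parser.add_argument("--isoseq", action="store_true")
    parser.add_argument("--k", type=int, default=15)  # documented default
    parser.add_argument("--w", type=int, default=50)  # documented default
    args = parser.parse_args(argv)

    if args.ont:
        preset = PRESETS["ont"]
    elif args.isoseq:
        preset = PRESETS["isoseq"]
    else:
        # No preset given: use the manual (or default) values.
        return args.k, args.w
    return preset["k"], preset["w"]


if __name__ == "__main__":
    print(resolve_k_w(["--ont"]))
    print(resolve_k_w(["--k", "11", "--w", "20"]))
```

In other words, passing `--ont` is shorthand for `--k 13 --w 30`, and passing `--isoseq` is shorthand for `--k 15 --w 50`.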

### Output

The final high-quality transcripts are written to the file `final_candidates.fa` in the output folder. If only one or two reads came from a transcript that is sufficiently different from other reads (exon difference), it is written to the file `not_converged.fa`. This file may also contain erroneous CCS reads such as chimeras. The output folder also contains a file `cluster_info.tsv` that shows, for each read, which candidate in `final_candidates.fa` it was assigned to.

### Parameters

```
optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --fastq FASTQ         Path to consensus fastq file(s) (default: False)
  --flnc FLNC           The flnc reads generated by the isoseq3 algorithm (BAM
                        file) (default: False)
  --ccs CCS             Path to consensus BAM file(s) (default: False)
  --t NR_CORES          Number of cores allocated for clustering (default: 8)
  --ont                 Clustering of ONT transcript reads. (default: False)
  --isoseq              Clustering of PacBio Iso-Seq reads. (default: False)
  --k K                 Kmer size (default: 15)
  --w W                 Window size (default: 50)
  --min_shared MIN_SHARED
                        Minmum number of minimizers shared between read and
                        cluster (default: 5)
  --mapped_threshold MAPPED_THRESHOLD
                        Minmum mapped fraction of read to be included in
                        cluster. The density of minimizers to classify a
                        region as mapped depends on quality of the read.
                        (default: 0.7)
  --aligned_threshold ALIGNED_THRESHOLD
                        Minmum aligned fraction of read to be included in
                        cluster. Aligned identity depends on the quality of
                        the read. (default: 0.4)
  --min_fraction MIN_FRACTION
                        Minmum fraction of minimizers shared compared to best
                        hit, in order to continue mapping. (default: 0.8)
  --min_prob_no_hits MIN_PROB_NO_HITS
                        Minimum probability for i consecutive minimizers to be
                        different between read and representative and still
                        considered as mapped region, under assumption that
                        they come from the same transcript (depends on read
                        quality). (default: 0.1)
  --outfolder OUTFOLDER
                        A fasta file with transcripts that are shared between
                        samples and have perfect illumina support. (default:
                        None)
```
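To make the roles of `--k`, `--w`, and `--min_shared` more concrete, the sketch below shows one common way of computing minimizers and counting how many two sequences share. It is a generic illustration of the minimizer technique, not isONclust's implementation: the function names are hypothetical, the window is assumed to span `w` consecutive k-mers, and ties are assumed to be broken by taking the leftmost lexicographically smallest k-mer.

```python
def minimizers(seq, k, w):
    """Return a sorted list of (position, k-mer) minimizers of seq.

    For every window of w consecutive k-mers, the lexicographically
    smallest k-mer (leftmost on ties) is selected. Generic sketch only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Use at least one window so short sequences still get a minimizer.
    for start in range(max(len(kmers) - w + 1, 1)):
        window = kmers[start:start + w]
        if not window:
            break  # sequence shorter than k: no k-mers at all
        smallest = min(window)
        picked.add((start + window.index(smallest), smallest))
    return sorted(picked)


def shared_minimizers(read, representative, k, w):
    """Count distinct minimizer k-mers present in both sequences."""
    a = {kmer for _, kmer in minimizers(read, k, w)}
    b = {kmer for _, kmer in minimizers(representative, k, w)}
    return len(a & b)


if __name__ == "__main__":
    print(minimizers("ACGTACGT", k=3, w=2))
    print(shared_minimizers("ACGTACGT", "ACGTACGT", k=3, w=2))
```

Under these assumptions, a larger `w` selects fewer minimizers per read, and a threshold such as `--min_shared` would be compared against a count like the one returned by `shared_minimizers`.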

CREDITS
----------------

Please cite [1] when using isONclust.

1. Kristoffer Sahlin, Paul Medvedev (2018) "De novo clustering of long-read transcriptome data using a greedy, quality-value based algorithm", bioRxiv [Link]().

LICENCE
----------------

GPL v3.0, see [LICENSE.txt](https://github.com/ksahlin/IsoCon/blob/master/LICENCE.txt).

Release history

0.0.6.1

Apr 3, 2021

0.0.6

Dec 14, 2019

0.0.5

Oct 13, 2019

0.0.4

Feb 28, 2019

0.0.3

Jan 15, 2019

0.0.2

Dec 23, 2018

This version

0.0.1

Nov 5, 2018

Download files

Source Distribution

isONclust-0.0.1.tar.gz (546.4 kB view hashes)

Uploaded Nov 5, 2018 Source

Built Distribution

isONclust-0.0.1-py2.py3-none-any.whl (552.2 kB view hashes)

Uploaded Nov 5, 2018 Python 2 Python 3

Hashes for isONclust-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`e28a02b6c2cc52e7ccb3d23563cac71b419a3238d2ad34385b2d121954dad647`
MD5	`b108e5651e80d42301b7823b08c5dc9a`
BLAKE2b-256	`f31c5da5b4b4fd5fb325716acf0ecff832061ed96da2da5673418d801520bbaa`

Hashes for isONclust-0.0.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`d4dd8a3dff2e885a96bee5a080504fd20d51f8be36ca7fb41d53157427b85e44`
MD5	`a6662a910ea897f5540ba9d0a64fe59a`
BLAKE2b-256	`c59e6ba6db056665f9485040f142c6320bcae49b7459d20a97990b4ca18a10b0`