Small RNA-Seq count analysis starting from bedgraph traces

COALISPR

A collection of Python scripts for quick, selective counting and visualisation of high throughput (small RNA) sequencing data.

Coalispr - COunt ALIgned SPecified Reads - relies on Pandas, Numpy, Pysam and Matplotlib. Retrieval of read counts is fairly fast when bam files with alignments of collapsed sequences, obtained with pyCRAC, are parsed. Seaborn is used for presenting count-data.

coalispr logo

The input for Coalispr consists of bedgraph files. Bedgraphs show the frequency with which high-throughput sequencing reads are aligned to genome locations. Usually, these alignments are saved in bam files and converted to bedgraph data for visual analysis in a genome browser (IGB, IGV, Ensembl). This collection of scripts compares bedgraphs to classify reads, which are then selected from bam files by their coordinates and counted.
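As an illustration of the input format, the following minimal sketch reads bedgraph records (chromosome, start, end, value) from a text stream. It is not Coalispr's own parser; the function name read_bedgraph and the example data are made up for this sketch.

```python
import io

def read_bedgraph(handle):
    """Yield (chrom, start, end, value) tuples from a bedgraph text stream."""
    for line in handle:
        line = line.strip()
        # Skip blank lines, comments and the optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)

# Made-up example data for one strand of one sample.
example = io.StringIO(
    "track type=bedGraph name=sample1_plus\n"
    "chr1\t100\t122\t5\n"
    "chr1\t122\t130\t2\n"
    "chr2\t40\t61\t7.5\n"
)
records = list(read_bedgraph(example))
```

Each record keeps the interval and its value, which is all that later binning and comparison steps need.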

Bio-informatics: Integrate negative controls to get the good data

This package has been developed for the analysis of small RNA datasets obtained by sequencing size-selected cDNA libraries produced from co-purified or total RNA samples.

The presence of unspecific signals, which vary per experiment, forms a major hindrance to systematic analysis of small RNA datasets. On top of that, the counting of reads that represent genuine siRNAs can be problematic. For protein coding messenger RNAs (mRNA) read counts are usually obtained by reference to known features described in a GTF file. GTFs tend to poorly annotate (uncommon) non-coding RNAs or mRNAs that appear unproductive (i.e. come from a predicted pseudo-gene); often these are ignored, especially when they derive from a repeated locus in a genome. Thus, most GTFs are no help in identifying siRNAs and their targets, often transposon-linked transcripts.

Coalispr only uses genomic coordinates, not features, for counting. The program relies on data from mock experiments to distinguish specific from unspecific signals, applying a standard experimental heuristic to bioinformatic analysis:

The overall idea behind this application is that the output of negative biological controls is common to all samples and relates to the kind of experimental methods used. These negative controls show which part of the experimental output is not informative. Therefore, removing this unspecific output from all samples gives signals that are specific, both for the positive and for the mutant samples. Coalispr systematizes this clearing up into a traceable and transparent procedure; it embraces the bio in bioinformatics.

Essential for biological experiments are the 'positive' and 'negative' controls. The 'negative' control is a sample that is not expected to provide an informative answer; a negative control shows the noise in the experiment. In contrast, a 'positive' control should give output that is specific, i.e. adheres to the current knowledge available for the biological system under study. It shows that the experimental conditions produced a useful answer. Assessing the effect of a mutation means checking what happens to the specific output (relative to that of the controls).

Comparing bedgraphs

Bedgraphs are simple tables with read-count values for chromosomal positions. A framework like Pandas is ideal to assess and vectorize such data. File sizes loaded into the program are reduced at various stages. This combination speeds Coalispr up.

  • Reads are collected by their mid-point in genomic regions of a settable size with their values summed over each bin. A common index enables direct comparison of bedgraph values between different samples.
  • Comparisons are done per chromosome (they differ in length) and for each strand separately.
  • Intermediate outputs are stored as pickle files and used for interactive visualization by means of Matplotlib.
  • Boundaries of contiguous regions with either unspecific or specific reads are mapped and stored in tsv files.
  • Retrieval of reads from coordinate-sorted bam files with Pysam is on the basis of these segment-definitions. This round gathers the counts and other characteristics of mapped reads and saves these in tsv files.
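The binning and the common index described in the list above can be sketched as follows. Coalispr does this with Pandas; this sketch uses plain Python so that it runs without dependencies. The bin size of 50 and the names bin_by_midpoint and align_samples are assumptions for illustration, and one chromosome strand is treated at a time.

```python
from collections import defaultdict

BINSIZE = 50  # hypothetical bin width in nt; Coalispr's actual setting may differ

def bin_by_midpoint(records, binsize=BINSIZE):
    """Sum bedgraph values per (chrom, bin); each interval goes to the bin of its mid-point."""
    binned = defaultdict(float)
    for chrom, start, end, value in records:
        midpoint = (start + end) // 2
        binned[(chrom, midpoint // binsize)] += value
    return dict(binned)

def align_samples(samples):
    """Put binned samples on one shared, sorted index; bins absent in a sample become 0.0."""
    index = sorted(set().union(*(s.keys() for s in samples.values())))
    table = {name: [s.get(key, 0.0) for key in index] for name, s in samples.items()}
    return index, table

# Made-up records for a single strand of two samples.
mutant = bin_by_midpoint(
    [("chr1", 100, 122, 5.0), ("chr1", 122, 130, 2.0), ("chr1", 210, 230, 4.0)]
)
control = bin_by_midpoint([("chr1", 105, 125, 3.0)])
index, table = align_samples({"mutant": mutant, "control": control})
```

With a shared index, the values of every sample line up bin by bin, so a sample can be compared directly with a negative control.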

A major time-gain is obtained by using the pyCRAC package for collapsing reads before alignment. The pyFastqDuplicateRemover.py in pyCRAC.scripts takes identical sequences together and stores their count in the name of their unique, collapsed representative.
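The idea of collapsing can be illustrated with a small sketch. This is not the pyCRAC implementation: the naming scheme "<rank>_<count>" and the function names are assumptions chosen for this example, and the header format written by pyFastqDuplicateRemover.py may differ.

```python
from collections import Counter

def collapse_reads(sequences):
    """Return one (name, sequence) pair per unique sequence, with its count in the name."""
    counts = Counter(sequences)
    # Hypothetical naming scheme "<rank>_<count>"; pyCRAC's real header format may differ.
    return [
        (f"{rank}_{count}", seq)
        for rank, (seq, count) in enumerate(counts.most_common(), start=1)
    ]

def count_from_name(name):
    """Recover the original read count from a collapsed read name."""
    return int(name.rsplit("_", 1)[1])

# Six made-up reads collapse to three unique sequences.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
collapsed = collapse_reads(reads)
total = sum(count_from_name(name) for name, _seq in collapsed)
```

Only the unique sequences need to be aligned, while the counts in the names allow the original read numbers to be restored afterwards.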

Bam files with alignments of collapsed reads are noisier (with a bias towards single-mapped reads) but still allow unspecific and specific segments to be separated, and they are much smaller. Having collapsed reads in the bam files enormously speeds up the counting of aligned reads, one of the major bottlenecks in this kind of analysis. The default is a mixed approach: segment boundaries are first determined from bedgraphs based on aligned uncollapsed reads. Then, with these boundaries, collapsed reads are retrieved for fast counting, which results in outputs that reproduce the original read counts.
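A minimal sketch of this counting step is given below. Coalispr retrieves the reads from bam files with Pysam; here plain tuples (name, chrom, start, end) stand in for alignments so that no bam file is needed. The read-name format "<rank>_<count>" and the rule of assigning a read by its mid-point are assumptions for this example.

```python
def count_in_segments(alignments, segments):
    """Sum original read counts of collapsed alignments whose mid-point lies in a segment."""
    totals = {seg: 0 for seg in segments}
    for name, chrom, start, end in alignments:
        # Assumed name format "<rank>_<count>": the count is the original number of reads.
        copies = int(name.rsplit("_", 1)[1])
        midpoint = (start + end) // 2
        for seg in segments:
            seg_chrom, seg_start, seg_end = seg
            if chrom == seg_chrom and seg_start <= midpoint < seg_end:
                totals[seg] += copies
                break
    return totals

# Made-up segment boundaries (from uncollapsed bedgraphs) and collapsed alignments.
segments = [("chr1", 100, 150), ("chr1", 200, 250)]
alignments = [
    ("1_3", "chr1", 104, 126),
    ("2_2", "chr1", 118, 139),
    ("3_1", "chr1", 212, 233),
    ("4_5", "chr2", 110, 131),
]
totals = count_in_segments(alignments, segments)
```

Four collapsed alignments are inspected instead of eleven individual reads, yet the totals per segment are those of the uncollapsed data.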

After obtaining the counts, comparison to a features file can be done by means of the segment boundaries. The program comes with a collection of scripts to visualize various aspects of the sequencing libraries that are analyzed.
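How segment boundaries could be matched to annotated features is sketched below for a GTF file. This is an illustration, not Coalispr's code; the function names and example lines are made up. GTF coordinates are 1-based and inclusive, so they are converted to the 0-based, half-open convention used for the segments in this sketch.

```python
def parse_gtf_line(line):
    """Return (chrom, start, end, strand, feature) from one GTF line, 0-based half-open."""
    fields = line.rstrip("\n").split("\t")
    chrom, _source, feature, start, end, _score, strand = fields[:7]
    # GTF coordinates are 1-based and inclusive; convert to 0-based half-open.
    return chrom, int(start) - 1, int(end), strand, feature

def overlapping(segment, features):
    """List features on the same chromosome and strand that overlap the segment."""
    chrom, seg_start, seg_end, strand = segment
    return [
        f for f in features
        if f[0] == chrom and f[3] == strand and f[1] < seg_end and seg_start < f[2]
    ]

# Made-up annotation lines and one segment with specific reads.
gtf_lines = [
    'chr1\tsrc\tgene\t101\t180\t.\t+\tgene_id "g1";',
    'chr1\tsrc\tgene\t300\t400\t.\t-\tgene_id "g2";',
]
features = [parse_gtf_line(line) for line in gtf_lines]
hits = overlapping(("chr1", 150, 200, "+"), features)
```

Because counting is done on coordinates first, a segment without any hit still keeps its counts; the annotation is only consulted afterwards.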

Setup and Install

The source for this project is available here. Please see the docs for more information.

The scripts rely on

  • an experiment-file, which is a tabbed file describing the experiments (for example coalispr.resources.experiment_table.tsv).
  • a configuration-file with settings and constants, constant.py, needed for parsing the experiment-file and running the program.

This package is best installed locally in user space, not system wide, or in a virtual environment (with Python sources in 'env/'). Then, any code file is adaptable, while altered or added scripts can be directly run without re-installation. For local installation, after extraction of the source archive (tar -xvzf <archive.tgz>), go to the project folder with the setup.py file and run in a terminal (as user): python3 -m pip install --editable . (Note the single dot.) A script callable from the command line, coalispr, will be installed in the home folder (~/.local/bin/coalispr) or in the virtual environment install folder (env/bin/coalispr). Alternatively, run python3 -m coalispr instead of coalispr (or the GUI version, via another script, coalispr_gui).

Coalispr has various commands with multiple options. The 'help' command coalispr -h provides an overview.

Change directory (say to the one with sequencing data), then use the command coalispr init to set up the work environment and name the session/experiment. Text files that together form constant.py (shipped in coalispr.resources.constant_in) are copied to a config folder. The session-specific configuration file needs to be edited in order to get a usable constant.py, which is generated via the command: coalispr setexp.

Configuration settings

Negative data in high-throughput sequencing are reads with shared mapping positions and comparable peak intensities that are common to all samples. Coalispr extracts these reads from the data using the negative controls as reference. The procedure to specify reads as either 'specific' or 'unspecific' is guided by particular configuration settings: UNSPECLOG10 (sets a threshold for reads with some overlap to reads in the negative control), LOG2BG and USEGAP, for demarcation of read segments (clusters with contiguous reads). Obvious peaks shared by all samples could indicate a ncRNA.
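To make the role of such thresholds concrete, here is a small sketch of how binned signals could be labelled and grouped into segments. The setting names come from Coalispr, but their values and the exact semantics used here (a log10 fold threshold over the negative control, a log2 background cut-off, and a tolerated gap in bins) are assumptions for illustration; the program's own rules may differ.

```python
import math

# Hypothetical values; the names come from Coalispr's configuration, but the
# exact semantics and defaults used by the program may differ from this sketch.
UNSPECLOG10 = 0.8  # assumed: minimal log10 fold over the negative control
LOG2BG = 4         # assumed: log2 of the minimal signal treated as a read peak
USEGAP = 1         # assumed: number of empty bins tolerated inside a segment

def classify_bin(sample, control):
    """Label one bin as 'background', 'specific' or 'unspecific'."""
    if sample < 2 ** LOG2BG:
        return "background"
    if control == 0:
        return "specific"
    fold = math.log10(sample / control)
    return "specific" if fold >= UNSPECLOG10 else "unspecific"

def merge_bins(bins, gap=USEGAP):
    """Merge bin numbers into (first, last) segments, bridging gaps of at most `gap` bins."""
    merged = []
    for b in sorted(bins):
        if merged and b - merged[-1][1] - 1 <= gap:
            merged[-1][1] = b
        else:
            merged.append([b, b])
    return [tuple(seg) for seg in merged]

# Made-up binned values (bin number -> summed signal) for one strand.
sample = {2: 400.0, 3: 350.0, 5: 90.0, 9: 120.0, 10: 6.0}
control = {2: 2.0, 3: 0.0, 5: 80.0, 9: 100.0}
labels = {b: classify_bin(v, control.get(b, 0.0)) for b, v in sample.items()}
specific = merge_bins(b for b, lab in labels.items() if lab == "specific")
unspecific = merge_bins(b for b, lab in labels.items() if lab == "unspecific")
```

In this example, bins where the sample is only marginally above the negative control end up as unspecific, while bins with a strong excess or no control signal form a specific segment.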

coalispr diagram

Licence

The project is licensed under the European Union Public Licence (EUPL)

Background

The developer has been a wet-lab scientist whose aim was to complement visual analysis of siRNA datasets in a genome browser with a computational approach, allowing for a systematic comparison of a large number of datasets in one go, including negative controls.

Development of the application was triggered after reviewing a meta-analysis of sequencing data linked to Argonaute proteins. The bioinformatic outcomes were highly suspicious, i.e. they could have been the product of analyzing 'negative' data. Therefore, it was alarming that negative controls, when available in the datasets, had not been checked for relevant overlap with the (positive) data the main conclusions were based on. The aim of Coalispr is to make this overlap explicit and thereby identify reads that can be genuinely informative. When bioinformatics loses sight of bio it easily becomes a propagator of fake news.
