
Small RNA-Seq analysis starting from bedgraph files


COALISPR

A collection of Python command-line scripts for quick, selective counting and visualisation of high throughput (small RNA) sequencing data.

Coalispr - COunt ALIgned SPecified Reads - relies on Pandas, Numpy, Pysam and Matplotlib. Retrieval of read counts is fairly fast when bam files with alignments of collapsed sequences, obtained with pyCRAC, are parsed. Seaborn is used for presenting count-data.


The input for Coalispr consists of bedgraph files. Bedgraphs show the frequency with which high-throughput sequencing reads are aligned to genome locations. Usually, these alignments are saved in bam files and converted to bedgraph data for visual analysis in a genome browser (IGB, IGV, Ensembl). This collection of scripts compares bedgraphs to classify reads, which are then selected from bam files by their coordinates and counted.
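To make the input format concrete, here is a minimal sketch of loading a bedgraph into Pandas. It assumes the common four-column, tab-separated layout (chromosome, start, end, value) with optional track, browser or comment lines. The function name `read_bedgraph` and the column names are made up for this example and are not part of Coalispr.

```python
import io
import pandas as pd

# Assumed column layout of a bedgraph line: chromosome, start, end, value.
BEDGRAPH_COLUMNS = ["chrom", "start", "end", "value"]

def read_bedgraph(handle):
    """Load bedgraph lines into a DataFrame, skipping track/browser/comment lines."""
    lines = [
        line for line in handle
        if line.strip() and not line.startswith(("track", "browser", "#"))
    ]
    return pd.read_csv(
        io.StringIO("".join(lines)),
        sep="\t",
        header=None,
        names=BEDGRAPH_COLUMNS,
        dtype={"chrom": str, "start": int, "end": int, "value": float},
    )

# A tiny in-memory bedgraph stands in for a real file.
example = io.StringIO(
    "track type=bedGraph name=demo\n"
    "chr1\t100\t120\t5\n"
    "chr1\t120\t150\t2\n"
    "chr2\t10\t30\t7\n"
)
bedgraph = read_bedgraph(example)
print(bedgraph)
```

For a file on disk, the same function could be called with an opened file handle instead of the `io.StringIO` object.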

Bio-informatics: Integrate negative controls to get the good data

This package has been developed for the analysis of small RNA datasets obtained by sequencing size-selected cDNA libraries produced from co-purified or total RNA samples.

The presence of unspecific signals, which vary per experiment, forms a major hindrance to systematic analysis of small RNA datasets. On top of that, the counting of reads that represent genuine siRNAs can be problematic. For protein coding messenger RNAs (mRNA) read counts are usually obtained by reference to known features described in a GTF file. GTFs tend to poorly annotate (uncommon) non-coding RNAs or mRNAs that appear unproductive (i.e. come from a predicted pseudo-gene); often these are ignored, especially when they derive from a repeated locus in a genome. Thus, most GTFs are no help in identifying siRNAs and their targets, often transposon-linked transcripts.

Coalispr only uses genomic coordinates, not features, for counting. The program relies on data from mock experiments to distinguish specific from unspecific signals, applying a standard experimental heuristic to bioinformatic analysis:

The overall idea behind this application is that the output of negative biological controls is common to all samples and relates to the kind of experimental methods used. These negative controls show which part of the experimental output is not informative. Therefore, removing this unspecific output from all samples gives signals that are specific, both for the positive and for mutant samples. Coalispr systemises this clearing up into a traceable and transparent procedure; it embraces the bio in bioinformatics.

Essential for biological experiments are the 'positive' and 'negative' controls. The 'negative' control is a sample that is not expected to provide an informative answer; a negative control shows the noise in the experiment. In contrast, a 'positive control' should give output that is specific, i.e. adheres to the current knowledge available for the biological system under study. It tells that the experimental conditions produced a useful answer. To assess the effect of a mutation is to check what happens to the specific output (relative to that of the controls).

Comparing bedgraphs

Bedgraphs are simple tables with read-count values for chromosomal positions. A framework like Pandas is ideal to assess and vectorize such data. File sizes loaded into the program are reduced at various stages. This combination speeds Coalispr up.

  • Reads are collected by their mid-point in genomic regions of a settable size with their values summed over each bin. A common index enables direct comparison of bedgraph values between different samples.
  • Comparisons are done per chromosome (they differ in length) and for each strand separately.
  • Intermediate outputs are stored as pickle files and used for interactive visualization by means of Matplotlib.
  • Boundaries of contiguous regions with either unspecific or specific reads are mapped and stored in tsv files.
  • Retrieval of reads from coordinate-sorted bam files with Pysam is on the basis of these segment-definitions. This round gathers the counts and other characteristics of mapped reads and saves these in tsv files.
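The binning and comparison steps listed above can be sketched roughly as follows. This is an illustration only, for one chromosome and one strand. The bin width `BIN_SIZE` and the helper names `bin_by_midpoint` and `compare_samples` are assumptions for this example and do not come from the Coalispr code base.

```python
import pandas as pd

BIN_SIZE = 50  # hypothetical bin width in nucleotides; the real setting may differ

def bin_by_midpoint(bedgraph, bin_size=BIN_SIZE):
    """Sum bedgraph values per bin; each interval goes to the bin holding its mid-point."""
    midpoint = (bedgraph["start"] + bedgraph["end"]) // 2
    bin_start = (midpoint // bin_size) * bin_size
    return bedgraph["value"].groupby(bin_start).sum()

def compare_samples(samples, bin_size=BIN_SIZE):
    """Place binned values of several samples side by side on one shared index."""
    binned = {name: bin_by_midpoint(df, bin_size) for name, df in samples.items()}
    # Building a DataFrame from Series aligns them on the union of their bins.
    return pd.DataFrame(binned).fillna(0).sort_index()

# Made-up data for one chromosome and one strand.
mutant = pd.DataFrame(
    {"start": [100, 130, 410], "end": [122, 152, 432], "value": [4.0, 6.0, 3.0]}
)
negative = pd.DataFrame(
    {"start": [105, 900], "end": [127, 922], "value": [5.0, 2.0]}
)
table = compare_samples({"mutant": mutant, "negative": negative})
print(table)
```

Because both samples share the bin index, rows of `table` can be compared directly, which is the property the common index is meant to provide.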

A major gain in time is obtained by using the pyCRAC package for collapsing reads before alignment. The pyFastqDuplicateRemover.py in pyCRAC.scripts groups identical sequences and stores their count in the name of their unique, collapsed representative.
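The principle of collapsing can be illustrated with a few lines of plain Python. This sketch does not reproduce pyFastqDuplicateRemover.py: the name format `seq<rank>_<count>` and both function names are invented here, and the format pyCRAC actually writes may differ.

```python
from collections import Counter

def collapse_reads(sequences):
    """Return one (name, sequence) record per unique sequence.

    The name format 'seq<rank>_<count>' is made up for this sketch.
    """
    counts = Counter(sequences)
    return [
        (f"seq{rank}_{count}", sequence)
        for rank, (sequence, count) in enumerate(counts.most_common(), start=1)
    ]

def count_from_name(name):
    """Recover the number of original reads from a collapsed read name."""
    return int(name.rsplit("_", 1)[1])

reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
collapsed = collapse_reads(reads)
for name, sequence in collapsed:
    print(name, sequence)

# The counts stored in the names add up to the number of input reads.
total = sum(count_from_name(name) for name, _ in collapsed)
```

Six input reads become three records to align, while the original total remains recoverable from the names.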

Bam files with alignments of collapsed reads are noisier (biased towards single-mapped reads) but are much smaller and still allow unspecific and specific segments to be told apart. Having collapsed reads in the bam files enormously speeds up the counting of aligned reads, one of the major bottlenecks in this kind of analysis. The default is a mixed approach: segment boundaries are first determined from bedgraphs based on aligned uncollapsed reads. Then, with these boundaries, collapsed reads are retrieved for fast counting, which results in outputs that reproduce the original read counts.
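The following sketch shows why counting collapsed reads within known segment boundaries can give back the original read counts. To stay self-contained, alignments are given as plain tuples rather than fetched from a bam file with Pysam. The `id_count` name format, the mid-point rule for assigning a read to a segment and the function names are assumptions of this example.

```python
def read_count(name):
    """Number of original reads encoded in a collapsed read name (assumed 'id_count' format)."""
    return int(name.rsplit("_", 1)[1])

def count_in_segments(alignments, segments):
    """Sum encoded read counts for alignments whose mid-point lies inside a segment.

    alignments: iterable of (name, start, end) for one chromosome and strand
    segments: list of (seg_start, seg_end) half-open intervals
    """
    totals = {segment: 0 for segment in segments}
    for name, start, end in alignments:
        midpoint = (start + end) // 2
        for seg_start, seg_end in segments:
            if seg_start <= midpoint < seg_end:
                totals[(seg_start, seg_end)] += read_count(name)
                break
    return totals

# Boundaries as they might come from comparing bedgraphs of uncollapsed reads.
segments = [(100, 200), (400, 450)]
collapsed_alignments = [
    ("seq1_3", 110, 132),
    ("seq2_2", 150, 171),
    ("seq3_1", 410, 431),
    ("seq4_5", 600, 622),  # outside every segment, so not counted
]
totals = count_in_segments(collapsed_alignments, segments)
print(totals)
```

Only four alignment records are visited, yet the totals correspond to the six original reads that fall within the two segments.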

After obtaining the counts, comparison to a features file can be done by means of the segment boundaries. The program comes with a collection of scripts to visualize various aspects of the sequencing libraries that are analyzed.

Setup and Install

The source for this project is available here. Please see the docs for more information.

The scripts rely on

  • an experiment-file, which is a tabbed file describing the experiments (for example coalispr.resources.experiment_table.tsv).
  • a configuration-file with settings and constants, constant.py, needed for parsing the experiment-file and running the program.

This package is best installed locally in user space, not system wide, or in a virtual environment (with Python sources in 'env/'). Then, any code file is adaptable, while altered or added scripts can be directly run without re-installation.

For local installation, after extraction of the source archive (tar -xvzf <archive.tgz>), go to the project folder with the setup.py file and run in a terminal (as user): python3 -m pip install --editable . (note the single dot; it stands for 'current directory'). A script callable from the command line, coalispr, will be installed in the home folder (~/.local/bin/coalispr) or in the virtual environment install folder (env/bin/coalispr). Alternatively, run python3 -m coalispr instead of coalispr.

Coalispr has various commands with multiple options. The 'help' command coalispr -h provides an overview.

Use the command coalispr init to set up the work environment and name the session/experiment. Text files that form constant.py (shipped in coalispr.resources.constant_in) are copied to a config folder. The session-specific configuration file needs to be edited in order to get a usable constant.py, which is generated via the command: coalispr setexp.

Configuration settings

Negative data in high-throughput sequencing are reads with shared mapping positions and comparable peak intensities that are common to all samples. Coalispr extracts these reads from the data using the negative controls as reference. The procedure to specify reads as either 'specific' or 'unspecific' is guided by particular configuration settings: UNSPECLOG10, which sets a threshold for reads with some overlap to reads in the negative control, and LOG2BG and USEGAP, which are used for demarcation of read segments (clusters with contiguous reads). Obvious peaks shared by all samples could indicate a ncRNA.
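How such settings might interact is sketched below. The semantics assigned to UNSPECLOG10, LOG2BG and USEGAP are guesses made for illustration, as are their values, the pseudocount and the function names; the definitions in Coalispr's constant.py may well differ.

```python
import numpy as np
import pandas as pd

# Hypothetical values and meanings; not taken from Coalispr's configuration.
UNSPECLOG10 = 1.0  # assumed: minimum log10 fold over the negative control for 'specific'
LOG2BG = 2.0       # assumed: bins with values below 2**LOG2BG are treated as background
USEGAP = 1         # assumed: runs separated by at most this many other bins are merged

def classify_bins(sample, negative):
    """Label each bin 'specific', 'unspecific' or 'background' (assumed rule)."""
    labels = pd.Series("background", index=sample.index)
    signal = sample >= 2 ** LOG2BG
    # A pseudocount of 1 avoids division by zero where the negative control is empty.
    fold = np.log10((sample + 1) / (negative + 1))
    labels[signal & (fold >= UNSPECLOG10)] = "specific"
    labels[signal & (fold < UNSPECLOG10)] = "unspecific"
    return labels

def find_segments(labels, kind, gap=USEGAP):
    """Return (first, last) bin positions of runs of `kind`, bridging up to `gap` other bins."""
    positions = [i for i, label in enumerate(labels) if label == kind]
    runs = []
    for pos in positions:
        if runs and pos - runs[-1][1] <= gap + 1:
            runs[-1][1] = pos  # extend the current run
        else:
            runs.append([pos, pos])  # start a new run
    return [tuple(run) for run in runs]

# Made-up binned values for one chromosome and strand.
sample = pd.Series([0, 120, 90, 0, 150, 30, 0, 0, 200], dtype=float)
negative = pd.Series([0, 2, 1, 0, 3, 28, 0, 0, 4], dtype=float)

labels = classify_bins(sample, negative)
specific = find_segments(labels, "specific")
unspecific = find_segments(labels, "unspecific")
print(specific, unspecific)
```

In this toy data the bin at position 5 has a similar value in the sample and the negative control and is therefore labelled unspecific, while the single empty bin at position 3 is bridged so that positions 1 to 4 form one specific segment.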

coalispr diagram

Licence

The project is licensed under the European Union Public Licence (EUPL)

Background

The developer has been a wet-scientist whose aim was to complement visual analysis of siRNA datasets in a genome browser with a computational approach, allowing for a systematic comparison of a large number of datasets in one go, including negative controls.

Development of the application was triggered after reviewing a meta-analysis of sequencing data linked to Argonaute proteins. The bioinformatic outcomes were highly suspicious, i.e. they could have been the product of analyzing 'negative' data. It was therefore alarming that negative controls, when available in the datasets, had not been checked for relevant overlap with the (positive) data the main conclusions were based on. The aim of Coalispr is to make this overlap explicit and thereby identify reads that can be genuinely informative. When bioinformatics loses sight of bio it easily becomes a propagator of fake news.
