finds mutants in your scRNA-seq experiment

cerebra

What is cerebra?

This tool allows you to quickly extract meaningful variant information from a DNA or RNA sequencing experiment. If you're interested in learning what mutations are present in your DNA/RNA samples, variant callers such as GATK HaplotypeCaller can be used to generate variant call format (.vcf) files following a sequencing experiment. However, a single sequencing run can generate on the order of 10^8 unique vcf entries, only a small portion of which contain meaningful biological signal. Drawing conclusions from .vcf files therefore remains a substantial challenge. cerebra provides a fast and intuitive framework for summarizing vcf entries across samples. It comprises four modules that do the following:

    1) remove germline mutations from samples of interest
    2) count the total number of mutations in a given sample
    3) report amino acid level SNPs and indels for each sample
    4) report the ratio of total to variant reads at each mutation site

cerebra gets its name from the eponymous X-Men character, who had the ability to detect mutant individuals among the general public.

If you're working with tumor data, it might be a good idea to limit the mutational search space to only known cancer variants. Therefore cerebra implements an optional method for restricting to variants also found in the COSMIC database.
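Conceptually, both the germline filter and the optional COSMIC restriction can be thought of as set operations over variant keys. The following is a minimal sketch of that idea under stated assumptions, not cerebra's actual implementation: the Variant tuple, the function name filter_variants and the example coordinates are all hypothetical.

```python
from typing import NamedTuple, Optional, Set


class Variant(NamedTuple):
    """Hypothetical variant key; not a cerebra data structure."""
    chrom: str
    pos: int
    ref: str
    alt: str


def filter_variants(sample: Set[Variant],
                    germline: Set[Variant],
                    known: Optional[Set[Variant]] = None) -> Set[Variant]:
    """Drop variants shared with the germline sample, then optionally
    keep only those that also appear in a known-variant set
    (for example, one derived from COSMIC)."""
    remaining = sample - germline      # set difference: remove germline calls
    if known is not None:
        remaining &= known             # set intersection: restrict to known variants
    return remaining


# Made-up example data, for illustration only.
sample = {
    Variant("1", 10, "A", "T"),
    Variant("1", 20, "G", "C"),
    Variant("2", 5, "C", "A"),
}
germline = {Variant("1", 10, "A", "T")}
known = {Variant("1", 20, "G", "C")}

print(sorted(filter_variants(sample, germline)))
print(sorted(filter_variants(sample, germline, known)))
```

Keying on chromosome, position, reference allele and alternate allele is one reasonable choice for deciding that two vcf entries describe the same variant; a real pipeline may need to normalize indel representations first.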

NOTE: this framework was developed for, but is certainly not limited to, single-cell RNA sequencing data.

  • Free software: MIT license

What makes cerebra different from traditional vcf parsers?

Python libraries (e.g. PyVCF and vcfpy) exist for extracting information from vcf files, and GATK has its own tool for the task. In fact, we integrate vcfpy into our tool. What makes cerebra different is that it reports the RNA transcript and amino acid change associated with each variant. GATK VariantsToTable produces a file that looks like:

CHROM    POS    ID      QUAL    AC
1        10     .       50      1
1        20     rs10    99      10

Such a table contains only genomic (i.e. DNA-level) coordinates. Often the next question is: which specific gene and protein-level mutation is each variant associated with? cerebra queries a reference genome (.fa) and annotation (.gtf) to match each DNA-level variant with its associated gene, probable transcript and probable amino-acid level mutation, and reports these matches as a table.

cerebra adheres to HGVS sequence variant nomenclature in reporting peptide-level variants.
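As an illustration of that nomenclature, HGVS writes a protein-level substitution as the prefix p., the reference amino acid, its position and the new amino acid, for example p.Leu858Arg. The helper below is a hypothetical sketch of that format for simple substitutions only. It is not part of cerebra, the function name hgvs_substitution is an assumption, and the lookup table covers only a handful of residues.

```python
# Partial one-letter to three-letter amino acid lookup (illustrative only).
THREE_LETTER = {
    "E": "Glu",
    "G": "Gly",
    "L": "Leu",
    "R": "Arg",
    "V": "Val",
}


def hgvs_substitution(ref_aa: str, position: int, alt_aa: str) -> str:
    """Format a single amino acid substitution in HGVS protein style.

    Hypothetical helper: handles substitutions only, not indels,
    frameshifts or stop codons.
    """
    return f"p.{THREE_LETTER[ref_aa]}{position}{THREE_LETTER[alt_aa]}"


print(hgvs_substitution("L", 858, "R"))  # p.Leu858Arg
```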

Installation

To install the latest version from PyPI you'll first need to install a few system-specific dependencies.

For OSX:

sudo pip install setuptools
brew update
brew install openssl
brew install zlib

For Debian/Ubuntu:

sudo apt-get install autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev

Following that, you can install directly from PyPI.

pip install cerebra

If you prefer working with virtual environments, you can clone from GitHub and install with pip.

git clone https://github.com/czbiohub/cerebra.git
cd cerebra
conda create -n cerebra python=3.7
conda activate cerebra
pip install -e . 

Usage

cerebra should now be installed as a command-line executable. Running cerebra at the prompt should print help information:

Usage: cerebra  <command>

  high-throughput summarizing of vcf entries following a sequencing
  experiment

Options:
  -h, --help  Show this message and exit.

Commands:
  count-mutations    count total number of mutations in each sample
  find-aa-mutations  report amino-acid level SNPs and indels in each sample
  germline-filter    filter out common SNPs/indels between germline samples...

Features

count-mutations: count total number of mutations in each sample
find-aa-mutations: report amino-acid level SNPs and indels in each sample
germline-filter: filter out common SNPs/indels between germline samples and samples of interest

If you have access to germline vcfs for each of your samples, then the place to start is germline-filter. You'll have to give it a simple metadata file (.csv) that maps germline sample names to experimental sample names. For example, if you have five experimental samples and two germline samples, your metadata csv might look like:

cell_id,patient_id
sample1,gl_sample1
sample2,gl_sample1
sample3,gl_sample2
sample4,gl_sample2
sample5,gl_sample2

Note the two headers, cell_id and patient_id. germline-filter will produce a set of germline filtered vcfs, which you can then use for count-mutations or find-aa-mutations. If you do not have access to germline vcfs, then just proceed directly to count-mutations or find-aa-mutations.
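To make the mapping concrete, here is a minimal sketch that reads a metadata file with those two headers and groups experimental samples by their germline sample. It uses only the Python standard library and is not how cerebra itself parses the file; the function name group_cells_by_germline is hypothetical, and the in-memory text stands in for a real .csv on disk.

```python
import csv
import io
from collections import defaultdict


def group_cells_by_germline(handle):
    """Return {germline sample name: [experimental sample names]}.

    Hypothetical helper: expects the headers cell_id and patient_id.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["patient_id"]].append(row["cell_id"])
    return dict(groups)


# Same content as the example metadata file above.
metadata = io.StringIO(
    "cell_id,patient_id\n"
    "sample1,gl_sample1\n"
    "sample2,gl_sample1\n"
    "sample3,gl_sample2\n"
    "sample4,gl_sample2\n"
    "sample5,gl_sample2\n"
)

groups = group_cells_by_germline(metadata)
for germline_name, cells in groups.items():
    print(germline_name, cells)
```

Running a check like this before germline-filter can catch a misspelled header or a sample that was assigned to the wrong germline vcf.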

Authors

This work was produced by Lincoln Harris, Rohan Vanheusden, Olga Botvinnik and Spyros Darmanis of the Chan Zuckerberg Biohub. For questions, please contact lincoln.harris@czbiohub.org.

Contributing

We welcome bug reports, feature requests and other contributions. Please submit a well-documented report on our issue tracker. For substantial changes, please fork this repo and submit a pull request for review.

Feel free to clone, but NOTE that this project is still a work in progress.
