Package for estimating UMI counts in Transcript Tag Counting data.

# umis


**umis** provides tools for estimating expression from RNA-Seq protocols that
sequence the end tags of transcripts and incorporate molecular tags to
correct for amplification bias.

There are four steps in this process.

1. Formatting reads
2. Filtering noisy cellular barcodes
3. Pseudo-mapping to cDNAs
4. Counting molecular identifiers

## 1. Formatting reads

We want to strip all non-biological segments out of the sequenced reads for
the sake of mapping, while keeping this information for later use. The
non-biological segments we consider are the cellular barcode (CB) and the
molecular barcode (MB). So that the optional CB and the MB can be extracted
later, they are put in the read header, in the following format.

    @HWI-ST808:130:H0B8YADXX:1:1101:2088:2222:CELL_GGTCCA:UMI_CCCT
    AGGAAGATGGAGGAGAGAAGGCGGTGAAAGAGACCTGTAAAAAGCCACCGN
    +
    @@@DDBD>=AFCF+<CAFHDECII:DGGGHGIGGIIIEHGIIIGIIDHII#

The command `umis fastqtransform` transforms a read (or a pair of reads) to
this format based on a _transform file_. The transform file is a JSON file
that holds a Python-flavored regular expression for each read, written to
extract the necessary components of the reads.
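
As a rough illustration of what such a regular expression does, the sketch below applies a Python pattern with named groups to one raw FASTQ record and rebuilds the record in the header format shown above. The read layout (a 6-base cellular barcode followed by a 4-base UMI), the group names `name`, `CB`, `MB`, `seq` and `qual`, and the `transform_record` helper are all assumptions made for this example. They are not taken from a transform file shipped with umis.

```python
import re

# Hypothetical read layout: 6-base cellular barcode, 4-base UMI, then the tag.
# The group names are illustrative, not a documented umis schema.
READ_PATTERN = re.compile(
    r"(?P<name>@[^\n ]+)[^\n]*\n"                      # read name, rest of header ignored
    r"(?P<CB>[ACGTN]{6})(?P<MB>[ACGTN]{4})(?P<seq>[ACGTN]*)\n"
    r"\+[^\n]*\n"                                      # separator line
    r"(?P<qual>[^\n]*)\n?"                             # quality string
)


def transform_record(record: str) -> str:
    """Move the barcode and UMI from the sequence into the read header."""
    match = READ_PATTERN.fullmatch(record)
    if match is None:
        raise ValueError("record does not match the expected layout")
    parts = match.groupdict()
    # Drop the quality values that belonged to the barcode and UMI bases.
    skip = len(parts["CB"]) + len(parts["MB"])
    header = f"{parts['name']}:CELL_{parts['CB']}:UMI_{parts['MB']}"
    return "\n".join([header, parts["seq"], "+", parts["qual"][skip:]]) + "\n"


raw = (
    "@HWI-ST808:130:H0B8YADXX:1:1101:2088:2222 1:N:0\n"
    "GGTCCACCCTAGGAAGATGG\n"
    "+\n"
    "IIIIIIIIII@@@DDBD>=A\n"
)
transformed = transform_record(raw)
print(transformed, end="")
```

Only the biological part of the read is left in the sequence and quality lines, so the record can be passed to a mapper without losing the barcode information.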

## 2. Filtering noisy cellular barcodes
Not all cellular barcodes identified in the transformation will be real. Some
will be low abundance barcodes that do not represent an actual cell. Others
will be barcodes that don't come from a set of known barcodes. The `umis cb_filter`
command can be used to filter a transformed FASTQ file, dropping unknown
barcodes. The `--nedit` option can be supplied to correct barcodes that are within
`--nedit` edits of a known barcode. After barcode filtering,
the `umis cb_histogram` command will generate a file of counts for
each cellular barcode. This file can be used to find a count cut-off for barcodes
that are high abundance for downstream quantitation.
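
The idea behind barcode correction and the abundance cut-off can be sketched in a few lines of Python. This is a simplified illustration, not the code used by `umis cb_filter` or `umis cb_histogram`. It assumes Hamming distance as the edit measure and drops barcodes that match more than one known barcode. The helper names and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Set


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(barcode: str, known: Set[str], nedit: int) -> Optional[str]:
    """Return the matching known barcode, or None if the barcode is dropped."""
    if barcode in known:
        return barcode
    candidates = [
        k for k in known
        if len(k) == len(barcode) and hamming(barcode, k) <= nedit
    ]
    # Assumption: ambiguous barcodes (several candidates) are dropped.
    return candidates[0] if len(candidates) == 1 else None


def barcode_histogram(barcodes: Iterable[str], known: Set[str], nedit: int) -> Counter:
    """Count reads per barcode after correction, skipping dropped barcodes."""
    corrected = (correct_barcode(b, known, nedit) for b in barcodes)
    return Counter(b for b in corrected if b is not None)


known_barcodes = {"GGTCCA", "TTAGCC"}
observed = ["GGTCCA", "GGTCCA", "GGTCCT", "TTAGCC", "ACGTAC"]

histogram = barcode_histogram(observed, known_barcodes, nedit=1)
min_count = 2  # hypothetical cut-off chosen from the histogram
high_abundance = {b for b, n in histogram.items() if n >= min_count}
print(histogram, high_abundance)
```

Here `GGTCCT` is one mismatch away from `GGTCCA` and is corrected, while `ACGTAC` matches no known barcode and is dropped.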

## 3. Pseudo-mapping to cDNAs

This is done with a pseudo-aligner, either Kallisto or RapMap. The SAM (or BAM) output
from the tool needs to be saved, since it is the input to the counting step.

## 4. Counting molecular identifiers

The final step is to infer which cDNA was the origin of the tag a UMI was
attached to. We use the pseudo-alignments to the cDNAs, and consider a tag
assigned to a cDNA as partial _evidence_ for a (cDNA, UMI) pairing. For
actual counting, we only count unique UMIs for (gene, UMI) pairings with
sufficient evidence.
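
The evidence-based counting described above can be illustrated with a small sketch. It assumes that a read compatible with several targets contributes an equal fraction of evidence to each, and that a pairing is counted once its summed evidence reaches a threshold. The `count_umis` function, the `min_evidence` default and the input layout are assumptions for this example and may differ from what `umis tagcount` does internally.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Alignment = Tuple[str, str, List[str]]  # (cell barcode, UMI, compatible targets)


def count_umis(
    alignments: Iterable[Alignment], min_evidence: float = 1.0
) -> Dict[Tuple[str, str], int]:
    """Count unique UMIs per (cell, target) with enough summed evidence."""
    evidence = defaultdict(float)
    for cell, umi, targets in alignments:
        if not targets:
            continue  # unmapped read, no evidence for anything
        weight = 1.0 / len(targets)  # split the read evenly across its targets
        for target in targets:
            evidence[(cell, target, umi)] += weight

    counts = defaultdict(int)
    for (cell, target, _umi), total in evidence.items():
        if total >= min_evidence:
            counts[(cell, target)] += 1  # each UMI is counted at most once
    return dict(counts)


reads = [
    ("GGTCCA", "CCCT", ["geneA"]),           # unique hit, evidence 1.0
    ("GGTCCA", "CCCT", ["geneA"]),           # PCR duplicate, same UMI
    ("GGTCCA", "AAGT", ["geneA", "geneB"]),  # 0.5 to each target
    ("GGTCCA", "AAGT", ["geneA", "geneB"]),  # second read brings each to 1.0
    ("GGTCCA", "TTTG", ["geneB", "geneC"]),  # 0.5 to each, below the threshold
]
counts = count_umis(reads)
print(counts)
```

The duplicated `CCCT` reads add up to a single count for `geneA`, which is how counting unique UMIs corrects for amplification bias.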

To count, use the command `umis tagcount`. This requires a SAM or BAM file as input.

By default, the read name will be used to obtain the cell barcode and UMI sequences. Optionally,
when using the `--parse_tags` option, the `CR` and `UM` BAM tags will be used to
extract the cell barcode and UMI, respectively.
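
For the default case, recovering the barcode and UMI from a read name in the format produced in step 1 might look like the sketch below. The regular expression and the `parse_read_name` helper are written for this example under the assumption that the tags sit at the end of the read name. They are not the parser used by `umis tagcount`.

```python
import re
from typing import Optional, Tuple

# The cell barcode is optional; the UMI is required and closes the read name.
READ_NAME_TAGS = re.compile(
    r"(?::CELL_(?P<CB>[ACGTN]+))?:UMI_(?P<MB>[ACGTN]+)$"
)


def parse_read_name(name: str) -> Tuple[Optional[str], str]:
    """Return (cell barcode or None, UMI) parsed from a transformed read name."""
    match = READ_NAME_TAGS.search(name)
    if match is None:
        raise ValueError(f"no UMI tag found in read name: {name!r}")
    return match.group("CB"), match.group("MB")


with_cell = parse_read_name(
    "@HWI-ST808:130:H0B8YADXX:1:1101:2088:2222:CELL_GGTCCA:UMI_CCCT"
)
without_cell = parse_read_name("@read42:UMI_ACGT")
print(with_cell, without_cell)
```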

The recommended workflow is to map reads to cDNA, in which case the target name in the BAM
will be a transcript ID. If the reads have been mapped to a genome (e.g. with STAR), `tagcount`
can use the optional `GX` BAM tag to get the gene name. In this case, use the option `--gene_tags`.

## kallisto
The quantitation used in `umis` handles reads that could come from multiple
transcripts by assigning a fractional count to each transcript and then
filtering for a minimum count at the end. Many single-cell analyses use
something similar to this type of counting, but it has drawbacks
(see
[this paper](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0970-8)).
For more principled UMI quantification,
see [Kallisto](https://github.com/pachterlab/kallisto). kallisto needs the files
in a certain format: each cellular barcode has its own FASTQ file and a file
that lists the UMI for each read. The `umis kallisto` command can reformat your
FASTQ files into that format.
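
To make the target layout concrete, the sketch below splits transformed reads into one FASTQ file and one UMI list per cellular barcode. The file names (`<barcode>.fastq` and `<barcode>.umi`), the `split_by_barcode` helper and the in-memory record tuples are assumptions for illustration. The files written by `umis kallisto` may be named and organised differently.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Iterable, List, Tuple

TAGS = re.compile(r":CELL_(?P<CB>[ACGTN]+):UMI_(?P<MB>[ACGTN]+)$")
Record = Tuple[str, str, str]  # (read name, sequence, quality)


def split_by_barcode(records: Iterable[Record], out_dir: str) -> List[str]:
    """Write <barcode>.fastq and <barcode>.umi files; return the barcodes seen."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    grouped = defaultdict(list)
    for name, seq, qual in records:
        match = TAGS.search(name)
        if match is None:
            continue  # skip reads without a barcode and UMI in the name
        grouped[match["CB"]].append((name, seq, qual, match["MB"]))

    for barcode, reads in grouped.items():
        fastq_file = out_path / f"{barcode}.fastq"
        umi_file = out_path / f"{barcode}.umi"
        with open(fastq_file, "w") as fq, open(umi_file, "w") as umi:
            for name, seq, qual, mb in reads:
                fq.write(f"{name}\n{seq}\n+\n{qual}\n")
                umi.write(f"{mb}\n")  # one UMI per read, in the same order
    return sorted(grouped)


records = [
    ("@r1:CELL_GGTCCA:UMI_CCCT", "AGGAAGATGG", "@@@DDBD>=A"),
    ("@r2:CELL_TTAGCC:UMI_AAGT", "CTGTAAAAAG", "IIIEHGIIIG"),
    ("@r3:CELL_GGTCCA:UMI_TTTG", "GAGACCTGTA", "GGGHGIGGII"),
]

with tempfile.TemporaryDirectory() as tmp:
    barcodes = split_by_barcode(records, tmp)
    umi_lines = (Path(tmp) / "GGTCCA.umi").read_text().splitlines()
    fastq_lines = (Path(tmp) / "GGTCCA.fastq").read_text().splitlines()

print(barcodes, umi_lines)
```

Keeping the UMI list in the same order as the reads in the FASTQ file is what lets the two files be matched up line by line.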

