A package for aligning RNA-seq data without reference biases

RNA-APoGee

RNA-APoGee (RNA Alignment to Personal Genomes) is a package for aligning RNA-seq data while minimizing reference biases. It can also be used to align RNA-seq data to haplotype-resolved variants. Currently, RNA-APoGee relies on the Olego aligner, although other aligners could be used instead.

Pre-requisites:

  • RNA-APoGee has only been tested on Linux and requires Python 3.
  • Olego must be installed and on your PATH.
  • samtools must also be installed.

Installation

pip install RNA-APoGee

Command line utilities

Alignment involves two steps:

  1. Generating a "personalized" genome that has the variants of the individual embedded into the reference genome.
  2. Aligning against the reference and the personal genome (or against two haplotypes) and then merging the two sets of alignments to pick the best alignment for each read.

Generating a personal genome

create_genomes creates versions of an input FASTA with sample-specific SNVs replacing reference bases.
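
The substitution can be pictured with a minimal Python sketch. This is not the create_genomes implementation: the function name apply_snvs, the tuple layout of the variants, and the in-memory string handling are assumptions made for illustration. Only the filter (FILTER is PASS and genotype quality greater than min_gq) is taken from the --fasta option description.

```python
def apply_snvs(sequence, snvs, min_gq):
    """Return a copy of `sequence` with qualifying SNVs substituted.

    `snvs` is an iterable of (pos, ref, alt, filter_field, gq) tuples,
    where `pos` is 1-based as in a VCF. This layout is hypothetical.
    """
    bases = list(sequence)
    for pos, ref, alt, filter_field, gq in snvs:
        # Keep only variants that pass the filter and exceed the GQ cutoff.
        if filter_field != "PASS" or gq <= min_gq:
            continue
        # Only single-nucleotide variants are substituted.
        if len(ref) != 1 or len(alt) != 1:
            continue
        idx = pos - 1  # VCF positions are 1-based
        if bases[idx].upper() != ref.upper():
            raise ValueError(f"REF mismatch at position {pos}")
        bases[idx] = alt
    return "".join(bases)


variants = [
    (2, "C", "T", "PASS", 50),  # substituted
    (4, "T", "G", "q10", 99),   # skipped: FILTER is not PASS
    (6, "C", "A", "PASS", 10),  # skipped: GQ not greater than min_gq
]
print(apply_snvs("ACGTACGT", variants, min_gq=20))
```

Because only single bases are replaced, the personal genome keeps the same coordinates as the base FASTA.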

If you have phased variants, you can create two VCFs, one per haplotype, and then call create_genomes twice, once for each haplotype, to create two versions of the reference. (Unfortunately, the script currently ignores the phasing of the variants.)

create_genomes --fasta FASTA
               --vcf VCF
               --outdir OUTDIR
               [--samples SAMPLES]
               [--min_gq MIN_GQ]
               [--chunk CHUNK]

  --fasta FASTA      FASTA file that will be used as the base for generating
                     personal genomes. For each sample in the input VCF, an
                     individual genome will be created by substituting the
                     sample's SNVs into this base FASTA. SNVs will be
                     considered only if the FILTER field is PASS, and the
                     genotype quality is greater than <min_gq>.

  --vcf VCF          VCF with variant calls. Can have multiple samples.

  --outdir OUTDIR    Personal genome for sample <sample> will be in
                     <outdir>/<sample>.fa

  --samples SAMPLES  (Optional) Comma-separated list of samples from the
                     input VCF. If provided, only the personal genomes for
                     these samples will be created; otherwise personal
                     genomes for all samples in the input VCF will be
                     created.

  --min_gq MIN_GQ    (Optional) Minimum genotype quality to consider a variant.

  --chunk CHUNK      (Optional) How many bases to keep in memory. Reduce if
                     running out of memory.
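
If you drive create_genomes from a Python pipeline, the call might be wrapped as in the sketch below. The helper names build_create_genomes_cmd and run_create_genomes and the example file names are hypothetical; only the flags listed above are used.

```python
import subprocess


def build_create_genomes_cmd(fasta, vcf, outdir, samples=None, min_gq=None, chunk=None):
    """Build the argument list for create_genomes (hypothetical helper)."""
    cmd = ["create_genomes", "--fasta", fasta, "--vcf", vcf, "--outdir", outdir]
    if samples:
        # --samples expects a comma-separated list of sample names.
        cmd += ["--samples", ",".join(samples)]
    if min_gq is not None:
        cmd += ["--min_gq", str(min_gq)]
    if chunk is not None:
        cmd += ["--chunk", str(chunk)]
    return cmd


def run_create_genomes(**kwargs):
    """Run create_genomes and raise if it exits with a non-zero status."""
    subprocess.run(build_create_genomes_cmd(**kwargs), check=True)


# Example file and sample names are placeholders.
print(build_create_genomes_cmd("ref.fa", "calls.vcf", "personal", samples=["S1"], min_gq=30))
```

Passing the arguments as a list, rather than a shell string, avoids quoting problems with paths that contain spaces.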

Aligning against the reference and the personal genome

apogee aligns RNA-seq data to a personalized genome. Each read (or read pair, for paired-end data) is aligned against two FASTAs (corresponding to two haplotypes, or to a reference with and without an individual's variants). Then, for each read (or read pair), the best alignment across the two FASTAs is chosen. The order in which the two references are given (i.e. which one is specified as ref_fasta and which as alt_fasta) does not matter.

Note that many intermediate files are created. If tmp_dir is specified, all intermediate files are stored there, with a prefix matching the prefix of the output BAM; in this case, it is up to you to delete that directory. If tmp_dir is not specified, a temporary directory is created in the same directory as the output BAM and deleted afterwards (so all intermediate files are lost).
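
The per-read choice between the two alignments can be sketched in Python as follows. This is an illustration only, not apogee's actual merging code: the criterion for the "best" alignment is not specified here, so the sketch takes the scoring function as a parameter, and the example uses a hypothetical mismatch count ("nm") as the score. The function name pick_best and the dictionary layout are likewise assumptions.

```python
def pick_best(ref_alns, alt_alns, score):
    """Choose, for each read, the higher-scoring of its two alignments.

    `ref_alns` and `alt_alns` map read names to alignment records
    (any object). `score` returns a number; higher is better.
    Reads aligned against only one FASTA keep that alignment.
    """
    best = {}
    for name in ref_alns.keys() | alt_alns.keys():
        candidates = [
            aln
            for aln in (ref_alns.get(name), alt_alns.get(name))
            if aln is not None
        ]
        # On ties, max() keeps the first candidate; that tie-break is
        # an arbitrary choice made for this sketch.
        best[name] = max(candidates, key=score)
    return best


# Hypothetical records: "nm" is the number of mismatches.
ref_alns = {"r1": {"src": "ref", "nm": 2}, "r2": {"src": "ref", "nm": 0}}
alt_alns = {"r1": {"src": "alt", "nm": 0}, "r3": {"src": "alt", "nm": 1}}
best = pick_best(ref_alns, alt_alns, score=lambda aln: -aln["nm"])
for name in sorted(best):
    print(name, best[name]["src"])
```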

apogee --fq1 FQ1
       --ref_fasta REF_FASTA
       --alt_fasta ALT_FASTA
       --bam BAM
       [--fq2 FQ2]
       [--tmp_dir TMP_DIR]
       [--threads THREADS]

  --fq1 FQ1              FASTQ file with all reads (for single-end) or read1
                         reads (for paired-end)
  --fq2 FQ2              (Optional) FASTQ file with read2 reads
  --ref_fasta REF_FASTA
                         First FASTA against which to align
  --alt_fasta ALT_FASTA
                         Second FASTA against which to align
  --bam BAM              Output BAM
  --tmp_dir TMP_DIR      (Optional) Directory for intermediate files
  --threads THREADS      (Optional) Number of threads for alignment [1]
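
As a sketch of how the two steps connect, the Python snippet below builds an apogee command that aligns against a base FASTA and the personal genome written by create_genomes to <outdir>/<sample>.fa. The helper names personal_genome_path and build_apogee_cmd and all file names are hypothetical; only the flags documented above are used.

```python
import os
import subprocess


def personal_genome_path(outdir, sample):
    """Path where create_genomes writes the genome for `sample`."""
    return os.path.join(outdir, sample + ".fa")


def build_apogee_cmd(fq1, ref_fasta, alt_fasta, bam, fq2=None, tmp_dir=None, threads=None):
    """Build the argument list for apogee (hypothetical helper)."""
    cmd = ["apogee", "--fq1", fq1, "--ref_fasta", ref_fasta,
           "--alt_fasta", alt_fasta, "--bam", bam]
    if fq2 is not None:
        cmd += ["--fq2", fq2]  # paired-end data
    if tmp_dir is not None:
        # Intermediate files are kept here; deleting them is up to the caller.
        cmd += ["--tmp_dir", tmp_dir]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


def run_apogee(**kwargs):
    """Run apogee and raise if it exits with a non-zero status."""
    subprocess.run(build_apogee_cmd(**kwargs), check=True)


# Example file and sample names are placeholders.
cmd = build_apogee_cmd(
    fq1="S1_R1.fastq",
    fq2="S1_R2.fastq",
    ref_fasta="ref.fa",
    alt_fasta=personal_genome_path("personal", "S1"),
    bam="S1.bam",
    threads=4,
)
print(" ".join(cmd))
```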
