Skip to main content

Alignment in a python wrapper.

Project description

xAlign: Hassle-free transcript quantification

xAlign is an efficient python package to align FASTQ files against any Ensembl reference genomes. The currently supported alignment algorithms are kallisto (https://pachterlab.github.io/kallisto/) and Salmon (https://salmon.readthedocs.io/en/latest/salmon.html). The package contains modules for Ensemble ID mapping to gene symbols via the mygene.info python package and SRA download capabilities. When using this package please cite the corresponding alignment algorithm.

Installation

pip3 install git+https://github.com/MaayanLab/xalign.git

Requirements

The alignment algorithms require a minimum of around 5GB of memory to run. When downloading SRA files, make sure that there is sufficient available disk space. xalign is currently only working on Linux operating systems.

Usage

The recommended usage is xalign.align_folder() if there are multiple FASTQ files. These FASTQ files can be aligned one by one, and gene level counts can be aggregated using the function xalign.ensembl.agg_gene_counts()

Align a single FASTQ file in single-read mode

To align a single RNA-seq file we first download an example SRA file and save it in the folder data/example_1 relative to the working directory. The function xalign.align_fastq() will generate the required cDNA index from the Ensembl reference genome when the index is not already built. result is a dataframe with transcript IDs, gene counts, and TPM.

When the alignment is run against a new species, the initial setup will take a few minutes to complete because building a new index and creating gene mapping files are required.

import xalign

xalign.sra.load_sras(["SRR14457464"], "data/example_1")

result = xalign.align_fastq("homo_sapiens", "data/example_1/SRR14457464.fastq", t=8)

Align a single FASTQ file in paired-end mode

To align a single RNA-seq file in paired-end mode we first download an example SRA file and save it in folder data/example_2 relative to the working directory. If the SRA file is a paired-end sample, two files will be generated with the two suffixes _1 and _2. The function xalign.align_fastq() will generate the required cDNA index from the Ensembl reference genome when the index is not already built. result is a dataframe with transcript IDs, gene counts, and TPM.

When the alignment is run against a new species, the initial setup will take a couple of minutes to built the index and to create the gene mapping files.

import xalign

# the sample is paired-end and will result in two files (SRR15972519_1.fastq, SRR15972519_2.fastq)
xalign.sra.load_sras(["SRR15972519"], "data/example_2")

result = xalign.align_fastq("homo_sapiens", ["data/example_2/SRR15972519_1.fastq", "data/example_2/SRR15972519_2.fastq"], t=8)

Align FASTQ files in a directory

xalign can automatically align all files in a given folder, instead of calling xalign.align_fastq() multiple times. In this case xalign.align_folder() will automatically detect whether the folder contains paired- or single-end samples and group the samples accordingly without manual input. The output will be two dataframes. gene_count will contain gene level counts that can be aggregated for different gene identifiers (symbol:default, ensembl_id, entrezgene_id). Transcripts that can not be mapped to corresponding identifiers are discarded. transcript_count contains the read counts at transcript level.

import xalign

# this will download multiple GB of samples
xalign.sra.load_sras(["SRR15972519", "SRR15972520", "SRR15972521"], "data/example_3")

gene_count, transcript_count = xalign.align_folder("homo_sapiens", "data/example_3", t=8, overwrite=False)

Mapping transcript counts to gene-level counts

When FASTQ files are aligned individually using xalign.align_fastq() the output is in transcript-level. To aggregate counts to gene-level the function xalign.ensembl.agg_gene_counts() can be used.

import xalign

xalign.sra.load_sras(["SRR14457464"], "data/example_4")

result = xalign.align_fastq("homo_sapiens", "data/example_4/SRR15972519.fastq", t=8)

# identifier can be symbol/ensembl_id/entrezgene_id
gene_counts = xalign.ensembl.agg_gene_counts(result, "homo_sapiens", identifier="symbol")

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xalign-0.1.75.tar.gz (6.1 MB view details)

Uploaded Source

File details

Details for the file xalign-0.1.75.tar.gz.

File metadata

  • Download URL: xalign-0.1.75.tar.gz
  • Upload date:
  • Size: 6.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.6

File hashes

Hashes for xalign-0.1.75.tar.gz
Algorithm Hash digest
SHA256 bf97923e61c24fd31d57d230fc82aab64f07e7bdec7f5ee0fc92f45eaea07128
MD5 910e7ea38398a621e3ddfa711ffaf8f6
BLAKE2b-256 41986e3e282b2af675116a2800e8e9cab63999a1374a01b1608b12b0092ef365

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page