
A Python implementation of tximport to transform transcript-level counts into gene-level counts

Project description

pytximport


pytximport is a Python package for efficient gene count estimation based on transcript quantification files produced by pseudoalignment/quasi-mapping tools such as kallisto or salmon. pytximport is a port of the popular tximport Bioconductor R package.
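To illustrate the core idea of gene-level summarization, the following minimal sketch sums transcript-level counts per gene with pandas. The data, the identifiers and the helper name `summarize_to_genes` are made up for illustration. It is not pytximport's actual implementation, which also has to deal with abundances, transcript lengths and multiple samples.

```python
import pandas as pd


def summarize_to_genes(transcript_counts: pd.Series, tx2gene: pd.DataFrame) -> pd.Series:
    """Sum transcript-level counts per gene (hypothetical helper, illustration only)."""
    # Look up the gene of every transcript, in the order of the count vector.
    gene_ids = tx2gene.set_index("transcript_id")["gene_id"].reindex(transcript_counts.index)
    # Group the transcripts by their gene and add up the counts.
    return transcript_counts.groupby(gene_ids.to_numpy()).sum()


# Toy map with the column names transcript_id and gene_id.
tx2gene = pd.DataFrame(
    {
        "transcript_id": ["tx1", "tx2", "tx3"],
        "gene_id": ["geneA", "geneA", "geneB"],
    }
)
# Toy transcript-level counts for a single sample.
transcript_counts = pd.Series([10.0, 5.0, 7.0], index=["tx1", "tx2", "tx3"])

gene_counts = summarize_to_genes(transcript_counts, tx2gene)
print(gene_counts)
```

In this toy example, tx1 and tx2 both belong to geneA, so their counts are added together, while geneB keeps the count of its single transcript.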

Installation

pip install pytximport

Quick Start

You can either import the tximport function in your Python files:

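import pandas as pd
from pytximport import tximport

# Example inputs: salmon quantification files and a transcript_id/gene_id map.
file_paths = ["./sample_1.sf", "./sample_2.sf"]
transcript_gene_mapping = pd.read_csv("./tx2gene_map.tsv", sep="\t")

results = tximport(
    file_paths,
    "salmon",
    transcript_gene_mapping,
)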

Or use it from the command line:

pytximport -i ./sample_1.sf -i ./sample_2.sf -t salmon -m ./tx2gene_map.tsv -o ./output_counts.csv

Common options are:

  • -i: The input files.
  • -t: The input type, e.g., salmon, kallisto or tsv.
  • -m: The map to match transcript ids to their gene ids. Expected column names are transcript_id and gene_id.
  • -o: The output path.
  • -c: The count transform to apply. Leave out for none, other options include scaled_tpm, length_scaled_tpm and dtu_scaled_tpm.
  • -gl: Whether the input is already gene-level counts. Provide this flag when importing gene counts from RSEM.
  • -tx: Whether to return transcript-level counts without gene summarization.
  • -id: The column name containing the transcript ids, in case it differs from the typical naming standards for the configured input file type.
  • -counts: The column name containing the transcript counts, in case it differs from the typical naming standards for the configured input file type.
  • -length: The column name containing the transcript lengths, in case it differs from the typical naming standards for the configured input file type.
  • -tpm: The column name containing the transcript abundance, in case it differs from the typical naming standards for the configured input file type.
  • --help: Display all configuration options.
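As a rough illustration of what a count transform such as length_scaled_tpm does, here is a minimal NumPy sketch that follows the calculation described for the lengthScaledTPM option of the R tximport package: abundances are weighted by the average transcript length across samples and then rescaled to each sample's original library size. The function name `length_scaled_tpm` and the toy matrices are assumptions made for this example, and pytximport's internal code may differ.

```python
import numpy as np


def length_scaled_tpm(counts: np.ndarray, abundance: np.ndarray, length: np.ndarray) -> np.ndarray:
    """Illustrative length-scaled TPM transform for transcripts x samples arrays."""
    # Weight each transcript's abundance by its average length across samples.
    new_counts = abundance * length.mean(axis=1, keepdims=True)
    # Rescale every sample so its total matches the original library size.
    scale = counts.sum(axis=0) / new_counts.sum(axis=0)
    return new_counts * scale


# Made-up data: 3 transcripts (rows) in 2 samples (columns).
counts = np.array([[100.0, 200.0], [300.0, 100.0], [600.0, 700.0]])
abundance = np.array([[2.0e5, 3.0e5], [3.0e5, 1.0e5], [5.0e5, 6.0e5]])
length = np.array([[1000.0, 1100.0], [2000.0, 2000.0], [1500.0, 1400.0]])

scaled = length_scaled_tpm(counts, abundance, length)
print(scaled.sum(axis=0))
```

The per-sample totals of the transformed matrix equal the per-sample totals of the original counts, so only the distribution of counts across transcripts changes. Leaving out the length weighting gives the corresponding sketch for scaled_tpm.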

Documentation

Detailed documentation is available at: https://pytximport.readthedocs.io.

Development status

pytximport is still in development and has not yet reached version 1.0.0 in the SemVer versioning scheme. While it should work for most use cases and we regularly compare outputs against the R implementation, expect breaking changes. If you encounter any problems, please open a GitHub issue. If you are a Python developer, we welcome pull requests that implement missing features, add more extensive unit tests or fix bugs.

Motivation

The tximport package has become a mainstay in the bulk RNA sequencing community and has been used in hundreds of scientific publications. However, its accessibility has remained limited since it requires the R programming language and cannot be used from within Python scripts or the command line. Other tools of the bulk RNA sequencing analysis stack, such as DESeq2 (in the form of PyDESeq2), decoupler and liana, all have Python versions. Additionally, pseudoalignment tools like salmon and kallisto can be installed via conda and used from the command line. tximport thus constitutes the missing link in many common analysis workflows. pytximport fills this gap and allows these workflows to be run entirely in Python, which is preinstalled on most development machines, and from the command line.

Citation

Please cite both the original publication and this Python implementation:

  • Charlotte Soneson, Michael I. Love, Mark D. Robinson. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences, F1000Research, 4:1521, December 2015. doi: 10.12688/f1000research.7563.1
  • Kuehl, M., & Puelles, V. (2024). pytximport: Gene count estimation from transcript quantification files in Python (Version 0.7.0) [Computer software]. https://github.com/complextissue/pytximport

License

The software is provided under the GNU General Public License version 3. Please consult LICENSE for further information.

Differences

Generally, outputs from pytximport correspond to the outputs from tximport within the accuracy allowed by multiple floating point operations and small implementation differences in its dependencies when using the same configuration. If you observe larger discrepancies, please open an issue.
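One way to check such agreement is to compare the two count matrices with a relative tolerance. The sketch below uses made-up matrices in place of real tximport and pytximport outputs, and the tolerance of 1e-6 is an arbitrary assumption for illustration, not a documented guarantee.

```python
import numpy as np

# Made-up gene count matrices standing in for the outputs of the two tools.
counts_r = np.array([[15.0, 20.5], [7.0, 3.25]])
counts_py = counts_r * (1.0 + 1e-9)  # simulate tiny floating-point deviations

# assert_allclose raises an AssertionError if any entry deviates too much.
np.testing.assert_allclose(counts_py, counts_r, rtol=1e-6)

# Report the largest relative deviation between the two matrices.
max_rel_diff = np.max(np.abs(counts_py - counts_r) / np.abs(counts_r))
print(f"maximum relative difference: {max_rel_diff:.2e}")
```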

While the outputs are nearly identical for the same configuration, some differences between the packages remain:

  • pytximport can be used from the command line.
  • pytximport supports AnnData format outputs (set output_type to anndata), enabling seamless integration with the scverse.
  • Argument order and argument defaults may differ between the implementations.
  • Additional features:
    • When ignore_transcript_version is set, the transcript version will be stripped not only from the quantification file but also from the provided transcript-to-gene mapping.
    • When biotype_filter is set, all transcripts that do not contain any of the provided biotypes will be removed prior to all other steps.
    • When save_path is configured, a count matrix will be saved as a .csv file.
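The effect of ignoring transcript versions on both sides can be sketched as follows. The helper `strip_version` and its regular expression are hypothetical and only illustrate why stripping the version from both the quantification file and the map matters; how pytximport removes the version internally is not shown here.

```python
import pandas as pd


def strip_version(ids: pd.Series) -> pd.Series:
    """Drop a trailing '.<digits>' version suffix (hypothetical helper)."""
    return ids.str.replace(r"\.\d+$", "", regex=True)


# Example Ensembl-style ids: the quantification file and the map carry
# different (or no) version suffixes for the same transcripts.
quant_ids = pd.Series(["ENST00000456328.2", "ENST00000450305.2"])
map_ids = pd.Series(["ENST00000456328.1", "ENST00000450305"])

# Compare the ids before and after removing the version suffix.
matches_with_version = (quant_ids == map_ids).all()
matches_without_version = (strip_version(quant_ids) == strip_version(map_ids)).all()
print(matches_with_version, matches_without_version)
```

With the versions in place, none of the ids match between the two sources. After stripping the suffix on both sides, every transcript finds its entry in the map.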

Building the documentation locally

The documentation can be built locally by navigating to the docs folder and running: make html. This requires that the development requirements of the package, as well as the package itself, are installed in the same virtual environment and that pandoc is available, e.g. by running brew install pandoc on macOS.

Data sources

The quantification files used for the unit tests are partly adapted from tximportData, which in turn uses a subsample of the GEUVADIS data: Lappalainen, T., Sammeth, M., Friedländer, M. R., 't Hoen, P. A., Monlong, J., Rivas, M. A., ... & Dermitzakis, E. T. (2013). Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501(7468), 506-511.

Other test and example files, such as those used in the vignette, are based on the following work: Braun, F., Abed, A., Sellung, D., Rogg, M., Woidy, M., Eikrem, O., ... & Huber, T. B. (2023). Accumulation of α-synuclein mediates podocyte injury in Fabry nephropathy. The Journal of Clinical Investigation, 133(11).

