
Python wrapper around kallisto | bustools for scRNA-seq analysis


kb-python


kb-python is a Python package for processing single-cell RNA-sequencing data. It wraps the kallisto | bustools single-cell RNA-seq command line tools in order to unify multiple processing workflows.

kb-python was first developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi while in Lior Pachter's lab at Caltech. If you use kb-python in a publication, please cite the following (additional references are listed in the Cite section):

Melsted, P., Booeshaghi, A.S., et al.
Modular, efficient and constant-memory single-cell RNA-seq preprocessing.
Nat Biotechnol 39, 813–818 (2021).
https://doi.org/10.1038/s41587-021-00870-2

Installation

The latest release can be installed with

pip install kb-python

The development version can be installed with

pip install git+https://github.com/pachterlab/kb_python

There are no prerequisite packages to install. The kallisto and bustools binaries are included with the package.

Usage

kb consists of five subcommands:

$ kb
usage: kb [-h] [--list] <CMD> ...
positional arguments:
  <CMD>
    info      Display package and citation information
    compile   Compile `kallisto` and `bustools` binaries from source
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files
    extract   Extract reads that were pseudoaligned to specific genes/transcripts (or extract all reads that were / were not pseudoaligned)

kb ref: generate a pseudoalignment index

The kb ref command takes in a species annotation file (GTF) and associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before kb count. Internally, kb ref extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with kallisto index.

kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME> <GENOME_ANNOTATION>
  • <GENOME> refers to a genome file (FASTA).
    • For example, the zebrafish genome is hosted by Ensembl.
  • <GENOME_ANNOTATION> refers to a genome annotation file (GTF)
    • For example, the zebrafish genome annotation file is hosted by Ensembl.
  • Note: The latest genome annotation and genome file for every species on Ensembl can be found with the gget command-line tool.

Prebuilt indices are available at https://github.com/pachterlab/kallisto-transcriptome-indices

Examples

# Index the transcriptome from genome FASTA (genome.fa.gz) and GTF (annotation.gtf.gz)
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz
# An example for downloading a prebuilt reference for mouse
$ kb ref -d mouse -i index.idx -g t2g.txt
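
The two invocations above can also be assembled from a script. The following is a minimal sketch, not part of kb-python itself: the helper name `kb_ref_command` and its parameters are hypothetical, and it only uses the `-i`, `-g`, `-f1`, and `-d` flags shown in the examples. It builds the argument list and leaves running it to the caller.

```python
import shlex
from typing import List, Optional


def kb_ref_command(
    index: str = "index.idx",
    t2g: str = "t2g.txt",
    transcriptome: Optional[str] = None,
    genome: Optional[str] = None,
    annotation: Optional[str] = None,
    download: Optional[str] = None,
) -> List[str]:
    """Assemble a `kb ref` argument list (hypothetical helper, not a kb-python API)."""
    base = ["kb", "ref"]
    if download is not None:
        # Prebuilt reference: only -d, -i and -g appear in the example above.
        return base + ["-d", download, "-i", index, "-g", t2g]
    if not (transcriptome and genome and annotation):
        raise ValueError("need a transcriptome output path, a genome FASTA and a GTF")
    # Build from a genome FASTA and GTF, writing the transcriptome via -f1.
    return base + ["-i", index, "-g", t2g, "-f1", transcriptome, genome, annotation]


if __name__ == "__main__":
    build = kb_ref_command(
        transcriptome="transcriptome.fa",
        genome="genome.fa.gz",
        annotation="annotation.gtf.gz",
    )
    print(shlex.join(build))
    print(shlex.join(kb_ref_command(download="mouse")))
    # To actually run it (requires kb-python to be installed):
    #   import subprocess; subprocess.run(build, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.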

kb count: pseudoalign and count reads

The kb count command takes in the pseudoalignment index (built with kb ref) and sequencing reads generated by a sequencing machine to generate a count matrix. Internally, kb count runs numerous kallisto and bustools commands comprising a single-cell workflow for the specified technology that generated the sequencing reads.

kb count -i index.idx -g t2g.txt -o out/ -x <TECHNOLOGY> <FASTQ FILE[s]>
  • <TECHNOLOGY> refers to the assay that generated the sequencing reads.
    • For a list of supported assays run kb --list
  • <FASTQ FILE[s]> refers to the list of FASTQ files generated by the sequencer
    • Different assays will have a different number of FASTQ files
    • Different assays will place the different features in different FASTQ files
      • For example, sequencing a 10xv3 library on a NextSeq Illumina sequencer usually results in two FASTQ files.
      • The R1.fastq.gz file (colloquially called "read 1") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI).
      • The R2.fastq.gz file (colloquially called "read 2") contains the cDNA associated with the cell barcode-UMI pair in read 1.
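
To make the 10xv3 read 1 layout concrete, here is a small illustrative sketch that splits a read 1 sequence into its cell barcode and UMI. It is not how kb count processes reads internally. It assumes the 16 bp barcode comes first and is followed directly by the 12 bp UMI; the function name and the example sequence are made up for illustration.

```python
from typing import NamedTuple

BARCODE_LEN = 16  # 10xv3 cell barcode length, as stated above
UMI_LEN = 12      # 10xv3 UMI length, as stated above


class Read1(NamedTuple):
    barcode: str
    umi: str


def split_10xv3_read1(sequence: str) -> Read1:
    """Split a read 1 sequence into (barcode, umi).

    Assumption: barcode occupies the first 16 bases and the UMI the next 12.
    """
    if len(sequence) < BARCODE_LEN + UMI_LEN:
        raise ValueError(
            f"read 1 is {len(sequence)} bp, expected at least {BARCODE_LEN + UMI_LEN}"
        )
    return Read1(
        barcode=sequence[:BARCODE_LEN],
        umi=sequence[BARCODE_LEN:BARCODE_LEN + UMI_LEN],
    )


if __name__ == "__main__":
    # Made-up 28 bp sequence: 16 bp barcode followed by 12 bp UMI.
    read = split_10xv3_read1("ACGTACGTACGTACGT" + "TTTTGGGGCCCC")
    print(read.barcode, read.umi)
```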

Examples

# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz
$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz
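
Because kb ref must be run before kb count, a script that drives the workflow may want to check that the index and transcript-to-gene mapping exist before launching the count step. The sketch below shows one way to do that. The helper `kb_count_command` is hypothetical and uses only the `-i`, `-g`, `-o`, and `-x` flags from the example above.

```python
import os
import tempfile
from typing import List, Sequence


def kb_count_command(
    index: str,
    t2g: str,
    out_dir: str,
    technology: str,
    fastqs: Sequence[str],
) -> List[str]:
    """Assemble a `kb count` argument list (hypothetical helper, not a kb-python API)."""
    # kb ref produces the index and the transcript-to-gene mapping; fail early
    # with a clear message if either is missing.
    for path in (index, t2g):
        if not os.path.exists(path):
            raise FileNotFoundError(f"{path} not found; run `kb ref` first")
    if not fastqs:
        raise ValueError("at least one FASTQ file is required")
    return ["kb", "count", "-i", index, "-g", t2g, "-o", out_dir, "-x", technology, *fastqs]


if __name__ == "__main__":
    # Demonstration with empty placeholder files standing in for kb ref output.
    with tempfile.TemporaryDirectory() as workdir:
        index = os.path.join(workdir, "index.idx")
        t2g = os.path.join(workdir, "t2g.txt")
        for placeholder in (index, t2g):
            open(placeholder, "w").close()
        cmd = kb_count_command(index, t2g, "out/", "10xv3", ["read1.fastq.gz", "read2.fastq.gz"])
        print(" ".join(cmd))
        # To actually run it (requires kb-python and real input files):
        #   import subprocess; subprocess.run(cmd, check=True)
```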

kb info: display package and citation information

The kb info command prints out package information including the version of kb-python, kallisto, and bustools along with their installation location.

$ kb info
kb_python 0.29.1 ...
kallisto: 0.51.1 ...
bustools: 0.44.1 ...
...

kb compile: compile kallisto and bustools binaries from source

The kb compile command grabs the latest kallisto and bustools source and compiles the binaries. Note: this is not required to run kb-python.

Use cases

kb-python facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions.

$ pip install kb-python gget ffq

# Goal: quantify publicly available scRNAseq data
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder

# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file
# containing barcode_sequence<tab>barcode_name
$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt
$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz
# -> count matrix in out/ folder

Submitted by @sbooeshaghi.
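
The feature barcode use case above expects a tab-delimited file of barcode_sequence<tab>barcode_name. As a minimal sketch, such a file could be written as follows. The helper name, the barcode sequences, and the feature names are all made up for illustration, and the restriction to A/C/G/T is an added sanity check rather than a documented kb-python requirement.

```python
import csv
from typing import Dict


def write_feature_barcodes(features: Dict[str, str], path: str) -> None:
    """Write barcode_sequence<TAB>barcode_name lines, one feature per line.

    `features` maps barcode sequence -> barcode name (hypothetical helper).
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sequence, name in features.items():
            sequence = sequence.upper()
            # Sanity check chosen for this sketch: plain nucleotides only.
            if not sequence or set(sequence) - set("ACGT"):
                raise ValueError(f"invalid barcode sequence: {sequence!r}")
            writer.writerow([sequence, name])


if __name__ == "__main__":
    # Made-up barcodes and names, for illustration only.
    write_feature_barcodes(
        {"AACAAGACCCTTGAG": "feature_A", "TACCCTTGGGATGGT": "feature_B"},
        "feature_barcodes.txt",
    )
```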

Do you have a cool use case for kb-python? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.

Tutorials

For a list of tutorials that use kb-python please see https://www.kallistobus.tools/.

Documentation

Developer documentation is hosted on Read the Docs.

Contributing

Thank you for wanting to improve kb-python! If you believe you've found a bug, please submit an issue.

If you have a new feature you'd like to add to kb-python, please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check the correctness of those changes.

Cite

If you use kb-python in a publication, please cite the following papers:

kb-python & kallisto and/or bustools

@article{sullivan2023kallisto,
  title={kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq},
  author={Sullivan, Delaney K and Min, Kyung Hoi and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Luebbert, Laura and Holley, Guillaume and Moses, Lambda and Gustafsson, Johan and Bray, Nicolas L and Pimentel, Harold and Booeshaghi, A Sina and others},
  journal={bioRxiv},
  pages={2023--11},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

bustools

@article{melsted2021modular,
  title={\href{https://doi.org/10.1038/s41587-021-00870-2}{Modular, efficient and constant-memory single-cell RNA-seq preprocessing}},
  author={Melsted, P{\'a}ll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi Joseph and da Veiga Beltrame, Eduardo and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Gehring, Jase and Pachter, Lior},
  author+an={1=first;2=first,highlight},
  journal={Nature biotechnology},
  year={2021},
  month={4},
  day={1},
  doi={https://doi.org/10.1038/s41587-021-00870-2}
}

kallisto

@article{bray2016near,
  title={Near-optimal probabilistic RNA-seq quantification},
  author={Bray, Nicolas L and Pimentel, Harold and Melsted, P{\'a}ll and Pachter, Lior},
  journal={Nature biotechnology},
  volume={34},
  number={5},
  pages={525--527},
  year={2016},
  publisher={Nature Publishing Group}
}

kITE

@article{booeshaghi2024quantifying,
  title={Quantifying orthogonal barcodes for sequence census assays},
  author={Booeshaghi, A Sina and Min, Kyung Hoi and Gehring, Jase and Pachter, Lior},
  journal={Bioinformatics Advances},
  volume={4},
  number={1},
  pages={vbad181},
  year={2024},
  publisher={Oxford University Press}
}

BUS format

@article{melsted2019barcode,
  title={The barcode, UMI, set format and BUStools},
  author={Melsted, P{\'a}ll and Ntranos, Vasilis and Pachter, Lior},
  journal={Bioinformatics},
  volume={35},
  number={21},
  pages={4472--4473},
  year={2019},
  publisher={Oxford University Press}
}

kb-python was inspired by Sten Linnarsson’s loompy fromfq command (http://linnarssonlab.org/loompy/kallisto/index.html).

