Python wrapper around kallisto | bustools for scRNA-seq analysis

kb-python

kb-python is a Python package for processing single-cell RNA-sequencing data. It wraps the kallisto | bustools single-cell RNA-seq command line tools in order to unify multiple processing workflows.

kb-python was developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi while in Lior Pachter's lab at Caltech. If you use kb-python in a publication please cite*:

Melsted, P., Booeshaghi, A.S., et al. 
Modular, efficient and constant-memory single-cell RNA-seq preprocessing. 
Nat Biotechnol  39, 813–818 (2021). 
https://doi.org/10.1038/s41587-021-00870-2

Installation

The latest release can be installed with

pip install kb-python

The development version can be installed with

pip install git+https://github.com/pachterlab/kb_python

There are no prerequisite packages to install. The kallisto and bustools binaries are included with the package.

Usage

kb consists of four subcommands

$ kb
usage: kb [-h] [--list] <CMD> ...
positional arguments:
  <CMD>
    info      Display package and citation information
    compile   Compile `kallisto` and `bustools` binaries from source
    ref       Build a kallisto index and transcript-to-gene mapping
    count     Generate count matrices from a set of single-cell FASTQ files

kb ref: generate a pseudoalignment index

The kb ref command takes in a species annotation file (GTF) and associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before kb count. Internally, kb ref extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with kallisto index.

kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME> <GENOME_ANNOTATION>
  • <GENOME> refers to a genome file (FASTA).
    • For example, the zebrafish genome is hosted by Ensembl and can be downloaded from there.
  • <GENOME_ANNOTATION> refers to a genome annotation file (GTF).
    • For example, the zebrafish genome annotation file is also hosted by Ensembl and can be downloaded from there.
  • Note: The latest genome and genome annotation files for every species on Ensembl can be found with the gget command-line tool.

Prebuilt indices are available at https://github.com/pachterlab/kallisto-transcriptome-indices

Examples

# Index the transcriptome from genome FASTA (genome.fa.gz) and GTF (annotation.gtf.gz)
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz
# An example for downloading a prebuilt reference for mouse
$ kb ref -d mouse -i index.idx -g t2g.txt
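Because kb ref must be run before kb count, a script that chains the two steps can first confirm that the reference files exist. The following is a minimal sketch, not part of kb-python: the helper name ref_outputs_ready is hypothetical, and the file names index.idx and t2g.txt are simply the ones used in the examples above.

```shell
# Hypothetical helper (not part of kb-python): return 0 only if every
# file passed as an argument exists and is non-empty.
ref_outputs_ready() {
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f (run kb ref first)" >&2
      return 1
    fi
  done
  return 0
}

# File names match the kb ref examples above.
if ref_outputs_ready index.idx t2g.txt; then
  echo "reference files found; kb count can be run"
else
  echo "reference files not found; run kb ref before kb count"
fi
```

The check only looks at whether the files are present and non-empty; it does not validate their contents.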

kb count: pseudoalign and count reads

The kb count command takes in the pseudoalignment index (built with kb ref) and sequencing reads generated by a sequencing machine to generate a count matrix. Internally, kb count runs numerous kallisto and bustools commands comprising a single-cell workflow for the specified technology that generated the sequencing reads.

kb count -i index.idx -g t2g.txt -o out/ -x <TECHNOLOGY> <FASTQ FILE[s]>
  • <TECHNOLOGY> refers to the assay that generated the sequencing reads.
    • For a list of supported assays run kb --list
  • <FASTQ FILE[s]> refers to the list of FASTQ files generated by the sequencer
    • Different assays will have a different number of FASTQ files
    • Different assays will place the different features in different FASTQ files
      • For example, sequencing a 10xv3 library on an Illumina NextSeq sequencer usually results in two FASTQ files.
      • The R1.fastq.gz file (colloquially called "read 1") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI).
      • The R2.fastq.gz file (colloquially called "read 2") contains the cDNA associated with the cell barcode-UMI pair in read 1.
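
To make the 10xv3 read 1 layout described above concrete, here is a small sketch that splits a single read 1 sequence into its cell barcode and UMI. The sequence is made up for illustration and is not real data. kb count handles this parsing internally, so the snippet is only meant to show where each feature sits in the read.

```shell
# Hypothetical 28-base read 1 sequence (made up for illustration).
read1_seq="AAACCCAAGAAACACTTTGCATCGGATA"

# 10xv3 layout as described above: bases 1-16 are the cell barcode,
# bases 17-28 are the 12-base UMI.
barcode=$(printf '%s\n' "$read1_seq" | cut -c1-16)
umi=$(printf '%s\n' "$read1_seq" | cut -c17-28)

echo "barcode: $barcode"
echo "umi:     $umi"
```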

Examples

# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz
$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz

kb info: display package and citation information

The kb info command prints out package information including the version of kb-python, kallisto, and bustools along with their installation location.

$ kb info
kb_python 0.28.0 ...
kallisto: 0.50.1 ...
bustools: 0.43.1 ...
...

kb compile: compile kallisto and bustools binaries from source

The kb compile command grabs the latest kallisto and bustools source and compiles the binaries. Note: this is not required to run kb-python.

Use cases

kb-python facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions.

$ pip install kb-python gget ffq

# Goal: quantify publicly available scRNAseq data (the jq JSON processor must be installed separately)
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder

# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file
# containing barcode_sequence<tab>barcode_name
$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt
$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz
# -> count matrix in out/ folder

Submitted by @sbooeshaghi.

Do you have a cool use case for kb-python? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.
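
The feature barcode use case above expects feature_barcodes.txt to be a tab-delimited file of barcode_sequence<tab>barcode_name. As a rough sketch of that layout, the snippet below writes a two-line file and checks that each line has exactly two tab-separated fields. The barcode sequences and names are made up for illustration and should be replaced with the ones from your own assay.

```shell
# Hypothetical feature barcodes: sequences and names are made up for illustration.
printf 'AACAAGACCCTTGAG\tfeature_1\n' > feature_barcodes.txt
printf 'TACCCGTAATAGCGT\tfeature_2\n' >> feature_barcodes.txt

# Sanity check: every line should have exactly two tab-separated fields.
if awk -F '\t' 'NF != 2 { bad = 1 } END { exit bad + 0 }' feature_barcodes.txt; then
  echo "feature_barcodes.txt looks well-formed"
else
  echo "feature_barcodes.txt has malformed lines" >&2
fi
```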

Tutorials

For a list of tutorials that use kb-python please see https://www.kallistobus.tools/.

Documentation

Developer documentation is hosted on Read the Docs.

Contributing

Thank you for wanting to improve kb-python! If you believe you've found a bug, please submit an issue.

If you have a new feature you'd like to add to kb-python please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check for the correctness of those changes.

Cite

If you use kb-python in a publication, please cite the following papers:

kb-python & bustools

@article{melsted2021modular,
  title={\href{https://doi.org/10.1038/s41587-021-00870-2}{Modular, efficient and constant-memory single-cell RNA-seq preprocessing}},
  author={Melsted, P{\'a}ll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi Joseph and da Veiga Beltrame, Eduardo and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Gehring, Jase and Pachter, Lior},
  author+an={1=first;2=first,highlight},
  journal={Nature biotechnology},
  year={2021},
  month={4},
  day={1},
  doi={https://doi.org/10.1038/s41587-021-00870-2}
}

kallisto

@article{bray2016near,
  title={Near-optimal probabilistic RNA-seq quantification},
  author={Bray, Nicolas L and Pimentel, Harold and Melsted, P{\'a}ll and Pachter, Lior},
  journal={Nature biotechnology},
  volume={34},
  number={5},
  pages={525--527},
  year={2016},
  publisher={Nature Publishing Group}
}

BUS format

@article{melsted2019barcode,
  title={The barcode, UMI, set format and BUStools},
  author={Melsted, P{\'a}ll and Ntranos, Vasilis and Pachter, Lior},
  journal={Bioinformatics},
  volume={35},
  number={21},
  pages={4472--4473},
  year={2019},
  publisher={Oxford University Press}
}

kb-python was inspired by Sten Linnarsson’s loompy fromfq command (http://linnarssonlab.org/loompy/kallisto/index.html)
