Python wrapper around kallisto | bustools for scRNA-seq analysis
kb-python
kb-python is a Python package for processing single-cell RNA-sequencing data. It wraps the kallisto | bustools single-cell RNA-seq command-line tools to unify multiple processing workflows.
kb-python was first developed by Kyung Hoi (Joseph) Min and A. Sina Booeshaghi while in Lior Pachter's lab at Caltech. If you use kb-python in a publication, please cite:
Melsted, P., Booeshaghi, A.S., et al.
Modular, efficient and constant-memory single-cell RNA-seq preprocessing.
Nat Biotechnol 39, 813–818 (2021).
https://doi.org/10.1038/s41587-021-00870-2
Installation
The latest release can be installed with
pip install kb-python
The development version can be installed with
pip install git+https://github.com/pachterlab/kb_python
There are no prerequisite packages to install. The kallisto and bustools binaries are included with the package.
Usage
kb consists of five subcommands:
$ kb
usage: kb [-h] [--list] <CMD> ...
positional arguments:
<CMD>
info Display package and citation information
compile Compile `kallisto` and `bustools` binaries from source
ref Build a kallisto index and transcript-to-gene mapping
count Generate count matrices from a set of single-cell FASTQ files
extract Extract reads that were pseudoaligned to specific genes/transcripts (or extract all reads that were / were not pseudoaligned)
kb ref: generate a pseudoalignment index
The kb ref command takes in a species annotation file (GTF) and associated genome (FASTA) and builds a species-specific index for pseudoalignment of reads. This must be run before kb count. Internally, kb ref extracts the coding regions from the GTF and builds a transcriptome FASTA that is then indexed with kallisto index.
kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa <GENOME> <GENOME_ANNOTATION>
- <GENOME> refers to a genome file (FASTA).
- <GENOME_ANNOTATION> refers to a genome annotation file (GTF).
- Note: The latest genome annotation and genome file for every species on Ensembl can be found with the gget command-line tool.
Prebuilt indices are available at https://github.com/pachterlab/kallisto-transcriptome-indices
Examples
# Index the transcriptome from genome FASTA (genome.fa.gz) and GTF (annotation.gtf.gz)
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa genome.fa.gz annotation.gtf.gz
# An example for downloading a prebuilt reference for mouse
$ kb ref -d mouse -i index.idx -g t2g.txt
kb count: pseudoalign and count reads
The kb count command takes in the pseudoalignment index (built with kb ref) and sequencing reads generated by a sequencing machine to generate a count matrix. Internally, kb count runs numerous kallisto and bustools commands comprising a single-cell workflow for the specified technology that generated the sequencing reads.
kb count -i index.idx -g t2g.txt -o out/ -x <TECHNOLOGY> <FASTQ FILE[s]>
- <TECHNOLOGY> refers to the assay that generated the sequencing reads.
  - For a list of supported assays run kb --list
- <FASTQ FILE[s]> refers to the list of FASTQ files generated.
  - Different assays will have a different number of FASTQ files
  - Different assays will place the different features in different FASTQ files
  - For example, sequencing a 10xv3 library on a NextSeq Illumina sequencer usually results in two FASTQ files.
    - The R1.fastq.gz file (colloquially called "read 1") contains a 16 basepair cell barcode and a 12 basepair unique molecular identifier (UMI).
    - The R2.fastq.gz file (colloquially called "read 2") contains the cDNA associated with the cell barcode-UMI pair in read 1.
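To make the 10xv3 read 1 layout described above concrete, here is a minimal Python sketch that splits a read 1 sequence into its cell barcode and UMI. It assumes the 16-base barcode comes first and the 12-base UMI follows immediately; the function name, the type, and the example sequence are hypothetical and are not part of kb-python, which handles this step internally.

```python
from typing import NamedTuple

CELL_BARCODE_LENGTH = 16  # 10xv3 cell barcode length, as stated above
UMI_LENGTH = 12           # 10xv3 UMI length, as stated above


class Read1Fields(NamedTuple):
    cell_barcode: str
    umi: str


def split_10xv3_read1(sequence: str) -> Read1Fields:
    """Split a 10xv3 read 1 sequence into cell barcode and UMI.

    Assumption: the barcode occupies the first 16 bases and the UMI the
    next 12. Any bases after position 28 are ignored.
    """
    needed = CELL_BARCODE_LENGTH + UMI_LENGTH
    if len(sequence) < needed:
        raise ValueError(
            f"read 1 has {len(sequence)} bases, expected at least {needed}"
        )
    barcode = sequence[:CELL_BARCODE_LENGTH]
    umi = sequence[CELL_BARCODE_LENGTH:needed]
    return Read1Fields(barcode, umi)


# Made-up 28-base sequence, for illustration only.
example = split_10xv3_read1("AAACCCAAGAAACACT" + "GGTTCCAAGGTT")
print(example.cell_barcode, example.umi)
```

Other assays place these features at different positions or in different files, so the two length constants above only apply to 10xv3.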
Examples
# Quantify 10xv3 reads read1.fastq.gz and read2.fastq.gz
$ kb count -i index.idx -g t2g.txt -o out/ -x 10xv3 read1.fastq.gz read2.fastq.gz
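If you drive kb from a Python script, the two steps above can be assembled as argument lists before being handed to subprocess. The sketch below only builds and prints the commands, using the flags shown in this README (-i, -g, -f1, -o, -x); the helper function names are hypothetical and are not part of kb-python.

```python
import shlex
from typing import List, Sequence


def kb_ref_command(index: str, t2g: str, transcriptome: str,
                   genome: str, annotation: str) -> List[str]:
    """Build the `kb ref` argument list used in the examples above."""
    return ["kb", "ref", "-i", index, "-g", t2g,
            "-f1", transcriptome, genome, annotation]


def kb_count_command(index: str, t2g: str, out_dir: str,
                     technology: str, fastqs: Sequence[str]) -> List[str]:
    """Build the `kb count` argument list used in the examples above."""
    if not fastqs:
        raise ValueError("at least one FASTQ file is required")
    return ["kb", "count", "-i", index, "-g", t2g,
            "-o", out_dir, "-x", technology, *fastqs]


# kb ref must come first: kb count consumes the index and t2g file it writes.
commands = [
    kb_ref_command("index.idx", "t2g.txt", "transcriptome.fa",
                   "genome.fa.gz", "annotation.gtf.gz"),
    kb_count_command("index.idx", "t2g.txt", "out/", "10xv3",
                     ["read1.fastq.gz", "read2.fastq.gz"]),
]

for command in commands:
    print(shlex.join(command))
    # To actually run a step: subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when file paths contain spaces.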
kb info: display package and citation information
The kb info command prints out package information including the versions of kb-python, kallisto, and bustools along with their installation locations.
$ kb info
kb_python 0.29.1 ...
kallisto: 0.51.1 ...
bustools: 0.44.1 ...
...
kb compile: compile kallisto and bustools binaries from source
The kb compile command grabs the latest kallisto and bustools source and compiles the binaries. Note: this is not required to run kb-python.
Use cases
kb-python facilitates fast and uniform pre-processing of single-cell sequencing data to answer relevant research questions.
$ pip install kb-python gget ffq
# Goal: quantify publicly available scRNAseq data
$ kb ref -i index.idx -g t2g.txt -f1 transcriptome.fa $(gget ref --ftp -w dna,gtf homo_sapiens)
$ kb count -i index.idx -g t2g.txt -x 10xv3 -o out $(ffq --ftp SRR10668798 | jq -r '.[] | .url' | tr '\n' ' ')
# -> count matrix in out/ folder
# Goal: quantify 10xv2 feature barcode data, feature_barcodes.txt is a tab-delimited file
# containing barcode_sequence<tab>barcode_name
$ kb ref -i index.idx -g f2g.txt -f1 features.fa --workflow kite feature_barcodes.txt
$ kb count -i index.idx -g f2g.txt -x 10xv2 -o out/ --workflow kite --h5ad R1.fastq.gz R2.fastq.gz
# -> count matrix in out/ folder
Submitted by @sbooeshaghi.
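The feature barcode use case above expects feature_barcodes.txt to be a tab-delimited file of barcode_sequence<tab>barcode_name rows. As a rough sketch, such a file could be written from Python as follows. The barcode sequences and feature names are made up, the helper function is hypothetical, and the absence of a header row is an assumption based on the two-column description above.

```python
import csv
import tempfile
from pathlib import Path
from typing import Mapping

VALID_BASES = set("ACGTN")


def write_feature_barcodes(path: Path, features: Mapping[str, str]) -> None:
    """Write one barcode_sequence<tab>barcode_name row per feature.

    `features` maps barcode sequence -> barcode name. No header row is
    written (an assumption; only the two columns are described above).
    """
    rows = []
    for sequence, name in features.items():
        # Validate everything before touching the output file.
        if not sequence or set(sequence.upper()) - VALID_BASES:
            raise ValueError(f"not a nucleotide sequence: {sequence!r}")
        rows.append([sequence.upper(), name])
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Hypothetical barcodes and names, for illustration only.
features = {"AACAAGACCCTTGAG": "feature_A", "TACCCGTAATAGCGT": "feature_B"}

with tempfile.TemporaryDirectory() as tmp:
    barcode_file = Path(tmp) / "feature_barcodes.txt"
    write_feature_barcodes(barcode_file, features)
    written = barcode_file.read_text()

print(written, end="")
```

In real use, write the file to your working directory instead of a temporary one and pass its path to kb ref --workflow kite as shown above.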
Do you have a cool use case for kb-python? Submit a PR (including the goal, code snippet, and your username) so that we can feature it here.
Tutorials
For a list of tutorials that use kb-python, please see https://www.kallistobus.tools/.
Documentation
Developer documentation is hosted on Read the Docs.
Contributing
Thank you for wanting to improve kb-python! If you believe you've found a bug, please submit an issue.
If you have a new feature you'd like to add to kb-python, please create a pull request. Pull requests should contain a message detailing the exact changes made, the reasons for the change, and tests that check for the correctness of those changes.
Cite
If you use kb-python in a publication, please cite the following papers:
kb-python & kallisto and/or bustools
@article{sullivan2023kallisto,
title={kallisto, bustools, and kb-python for quantifying bulk, single-cell, and single-nucleus RNA-seq},
author={Sullivan, Delaney K and Min, Kyung Hoi and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Luebbert, Laura and Holley, Guillaume and Moses, Lambda and Gustafsson, Johan and Bray, Nicolas L and Pimentel, Harold and Booeshaghi, A Sina and others},
journal={bioRxiv},
pages={2023--11},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
bustools
@article{melsted2021modular,
title={\href{https://doi.org/10.1038/s41587-021-00870-2}{Modular, efficient and constant-memory single-cell RNA-seq preprocessing}},
author={Melsted, P{\'a}ll and Booeshaghi, A. Sina and Liu, Lauren and Gao, Fan and Lu, Lambda and Min, Kyung Hoi Joseph and da Veiga Beltrame, Eduardo and Hj{\"o}rleifsson, Kristj{\'a}n Eldj{\'a}rn and Gehring, Jase and Pachter, Lior},
author+an={1=first;2=first,highlight},
journal={Nature biotechnology},
year={2021},
month={4},
day={1},
doi={10.1038/s41587-021-00870-2}
}
kallisto
@article{bray2016near,
title={Near-optimal probabilistic RNA-seq quantification},
author={Bray, Nicolas L and Pimentel, Harold and Melsted, P{\'a}ll and Pachter, Lior},
journal={Nature biotechnology},
volume={34},
number={5},
pages={525--527},
year={2016},
publisher={Nature Publishing Group}
}
kITE
@article{booeshaghi2024quantifying,
title={Quantifying orthogonal barcodes for sequence census assays},
author={Booeshaghi, A Sina and Min, Kyung Hoi and Gehring, Jase and Pachter, Lior},
journal={Bioinformatics Advances},
volume={4},
number={1},
pages={vbad181},
year={2024},
publisher={Oxford University Press}
}
BUS format
@article{melsted2019barcode,
title={The barcode, UMI, set format and BUStools},
author={Melsted, P{\'a}ll and Ntranos, Vasilis and Pachter, Lior},
journal={Bioinformatics},
volume={35},
number={21},
pages={4472--4473},
year={2019},
publisher={Oxford University Press}
}
kb-python was inspired by Sten Linnarsson’s loompy fromfq command (http://linnarssonlab.org/loompy/kallisto/index.html).