A fast xenograft read sorter based on space-efficient k-mer hashing

These details have not been verified by PyPI

Project description

xengsort: Fast lightweight accurate xenograft sorting

This tool, xengsort, uses 3-way bucketed Cuckoo hashing to efficiently solve the xenograft sorting problem.

For a description of the method an evaluation on several datasets, please see our article in "Algorithms for Molecular Biology", or the initial WABI 2020 publication. (There was also a preprint on bioRxiv.) BibTeX entrys for citation can be found at the end of this README file.

In case of problems,

please read the tutorial and/or usage guide in this file carefully,
check the troubleshooting section below if your problem is solved there,
and file an issue in the issue tracker if this does not help or you have discovered an error in the documentation.

See CHANGELOG.md for recent changes. Thank you!

Tutorial: Classification of human-captured mouse exomes

We here provide an example showing how to run xengsort on a human-captured mouse exome (part of one of the datasets described in the paper). This tutorial uses the workflow management system Snakemake to run the example. This is, however, not necessary to use xengsort. Please refer to the Usage Guide below to learn how to run xengsort on your own data manually.

Our software xengsort is provided as a Python package. For efficiency, it uses just-in-time compilation provided by the numba package. We here use the package manager conda to manage Python packages and to use a separate environment for xengsort.

Install the conda package manager (miniconda)

Go to https://docs.conda.io/en/latest/miniconda.html and download the Miniconda installer: Choose Python 3.10 (or higher), your operating system, and preferably the 64 bit version. Follow the instructions of the installer and append the conda executable to your PATH (even if the installer does not recommend it). You can let the installer do it, or do it manually by editing your .bashrc or similar file under Linux or MacOS, or by editing the environment varialbes under Windows. To verify that the installation works, open a new terminal and execute

conda --version # ideally 22.9.xx or higher
python --version # ideally 3.10.xx or higher

Obtain or update xengsort

Our software is currently obtained by cloning this public git repository:

git clone https://gitlab.com/genomeinformatics/xengsort.git

We may release a bioconda package later.

If you need to update xengsort later, you can do so by just executing

git pull

within the cloned directory tree.

Create and activate a conda environment

To run our software, a conda environment with the required libraries needs to be created. A list of needed libraries is provided in the environment.yml file in the cloned repository; it can be used to create a new environment:

cd xengsort  # the directory of the cloned repository
conda env create

which will create an environment named xengsort with the required dependencies, using the provided environment.yml file in the same directory.

A more explicit alternative is to run

conda create --name xengsort -c conda-forge -c bioconda --file requirements.txt

which will do the same thing using the packages mentioned in requirements.txt.

Note: While xengsort works on Windows/Mac without problems, some of the required packages may currently not exist for Windows/Mac.

Setting up the environment may take some time, as conda searches and downloads packages. After all dependencies are resolved, you activate the environment and install the package from the repository into this environment. Make sure that you are in the root directory of the cloned repository (where this README.md file or the CHANGELOG.md file is) and run

conda activate xengsort  # activate environment
pip install -e .  # install xengsort package using pip

Run the Snakemake example workflow

You can in principle remove snakemake-minimal and sra-toolkit from the list of packages. However, you will not be able to automatically run the provided example workflow or download sequence datasets from SRA.

We provide an example Snakemake workflow (for publicly available mouse exomes), which downloads all needed reference FASTA files and exome FASTQ files, generates a hash table as an index and classifies the reads.

To run this workflow,

you will need space for the downloaded datasets and result files, so snakemake should be executed in a separate working directory with lots of space for the datset.
you will additionally need the snakemake-minimal package. Make sure you are in the working directory, that your conda environment called xengsort is active, and then additionally install Snakemake:
```
conda install -c bioconda  snakemake-minimal
```
ensure that xengsort's Snakefile is present in the working directory. It can be symbolically linked, such as:
```
cd /path/to/workdir
ln -s /path/to/xengsort/Snakefile Snakefile
```

Now Snakemake can be run as follows:

snakemake -n  # dry-run: What will happen?
snakemake -j 16  --use-conda  # execute with 16 threads

The -n (or --dry-run) option first performs a dry run and prints out what will be done. In the real invocation, the -j 16 option uses up to 16 CPU cores to execute 16 separate jobs in parallel or one job that uses 16 threads, or any combination.

You can examine the (commented) Snakefile to see what happens in each step:

Reference sequences (human and mouse genome and transcriptome) will be downloaded from Ensembl in FASTA format.
The k-mer index (for k=25) will be built from the FASTA files (xengsort index).
An example dataset (a human-captured mouse exome) will be downloaded from SRA using sra-toolkit and saved in FASTQ format (paired-end reads).
The paired-end reads in the dataset will be classified according to species (xengsort classify)

This will run for quite some time (especially downloads and index creation), best leave it overnight. The results will appear in the results/ directory.

To run your own human/mouse xenograft analysis, you can continue to use the same index. All reference files, including the index in ZARR format (.zarr) are in the ref/ directory. You can follow the structure of the Snakefile to create your own custom workflow. Additional information on how to use xengsort is given in the usage guide below.

Usage Guide

xengsort is a multi-command tool with several subcommands (like git), in particular

xengsort index builds an index (a bucketed 3-way Cuckoo hash table)
xengsort classify classifies a sequences sample (FASTQ files) using an existing index

It is a good idea to run xengsort index --help to see all available options. Using --help works on any subcommand.

How to build an index

To build an index for xengsort, several parameters must be provided, which are described in the following.

First, a file name for the index must be chosen. The index is saved as a ZARR directory, so the extension should be .zarr. We will use myindex.zarr/ here.

Second, two reference genomes (host and graft) must be provided (in FASTA files). We assume that they are given as host.fa.gz and graft.fa.gz. The corresponding options are -H or --host and -G or --graft. Each option can take several arguments as files. The provided FASTA files must be uncompressed streams, but you can use zcat (or gzcat depending on your OS) as in the following example for compressed files. This will run the decompression in independent processes:

xengsort index  -H <(zcat host.fa.gz) -G <(zcat graft.fa.gz)  -n 4_500_000_000  [OPTIONS]  myindex.zarr

This <(zcat ...) construction creates file-like streams with the uncompressed data that are directly fed into xengsort without storing them explicitly on disk, so using this construction saves considerable disk space. However, it does consume CPU time, because several zcat processes run in addition to xengsort.

You can speed this up even further if you have pigz installed (a parallel implementation of gzip) and are not afraid to use even more CPU power: Using pigz -cd -p4 will run pigz to decompress (-d) to stdout (-c) with 4 threads (-p4). While the decompression is not paralellized per se, file I/O is, and so may save some additional time.

xengsort index  -H <(pigz -cd -p4 host.fa.gz) -G <(pigz -cd -p4 graft.fa.gz)  -n 4_500_000_000  [OPTIONS]  myindex.zarr

We must specify the size of the hash table:

-n or --nobjects: number of k-mers that will be stored in the hash table. This depends on the used refernece genomes and must be estimated beforehand! As a precise estimate of the number of different k-mers can be difficult, you can err on the safe side and provide a generously large estimate, examine the final (low) load factor and then rebuild the index with a smaller -n parameter to aechieve the desired load. There are also some tools that quickly estimate the number of distinct k-mers in large files, such as ntCard or KmerEstimate. As a guide: Human and mouse genome and transcriptome together comprise around 4.5 billion 25-mers, as shown in the examples above. This option must be specified; there is no default!

We may further specify additional properties of the hash table:

-p or --pagesize indicates how many elements can be stored in one bucket (or page). This is 4 by default.
--fill between 0.0 and 1.0 describes the desired fill rate or load factor of the hash table. Together with -n, the number of slots in the table is calculated as ceil(n/fill). In our experiments we used 0.88. (The number of buckets is then the smallest odd integer that is at least ceil(ceil(n/fill)/p).)
--aligned or --unaligned: indicates whether each bucket should consume a number of bits that is a power of 2. Using --aligned ensures that each bucket stays within the same cache line, but may waste space (padding bits), yielding faster speed but possibliy (much!) larger space requirements. With --unaligned, no bits are used for padding and buckets may cross cache line boundaries. This is slightly slower, but may save a little or a lot of space (depending on the bucket size in bits). The default is --unaligned, because the speed decrease is small and the memory savings can be significant.
--hashfunctions defines the parameters for the hash functions used to store the key-value pairs. If the parameter is unspecified, different random functions are chosen each time. The hash functions can be specified using a colon separated list: --hashfunctions linear945:linear9123641:linear349341847. It is recommended to have them chosen randomly unless you need strictly reproducible behavior, in which case the example given here is recommended.

The final important parameter is about paralellization:

-T defines how many threads are used to calculate weak k-mers.

The following hash table options should not be modified from their default values and are for internal development only:

--type 3FCVbb: defines the type of the hash table: how many hash functions are used (3 in this case) and how the key-value pair is stored inside a bucket. At the moment only 3FCVbb is supported.
-P or --parameters can be used to combine all parameters stated above into one string (see the --help text)

How to classify

To classify a FASTQ sample (one or several single-end or paired-end files), make sure you are in an environment where xengsort and its dependencies are installed. Then run the xengsort classify command with a previously built index, such as

xengsort classify --index myindex.zarr  --fastq single.fq  --out myresults

for single-end reads, or

xengsort classify --index myindex.zarr  --fastq paired.1.fq  --pairs paired.2.fq  --out myresults

for paired-end reads.

The parameter --out or equivalently --prefix is required and defines the prefix for all output files; this can be a combination of path and file prefix, such as /path/to/sorted/samplename. Typically, there will be 5 output files for each of the first and the second read pair:

{prefix}.host.1.fq: host reads
{prefix}.graft.1.fq.: graft reads
{prefix}.both.1.fq: reads that could originate from both
{prefix}.neither.1.fq: reads that originate from neither host nor graft
{prefix}.ambiguous.1.fq: (few) ambiguous reads that cannot be classified,

and similarly with .2.fq. For single-end reads, there is only .fq (no numbers).

Further parameters and options are:

-T defines how many threads are used for classification (4 to 8 is recommended).
--filter: With this flag, only the graft reads are output.
--count: With this flag, only the number of reads in each category is counted, but the reads are not sorted.
--quick: With this flag, the quick mode is used (see our article)

Please report a GitLab issue in case of problems.

Happy sorting!

Troubleshooting

Snakemake throws errors when attempting to download the example dataset from SRA.

This may happen for several reasons:

The dataset is quite large, over 20 GB. Make sure you have sufficient space (also temporary space) and sufficient disk quota (on a shared system) to store all of the files.
Another source of trouble may be that for some reason and older version of sra-tools gets called. We use the fasterq-dump tool to download the data, which did not exist in older verisons. To use a clean environment with a recent version of sra-tools, you can run snakemake as snakemake -j 16 --use-conda, which will generate a separate environment only for the download (this takes additional time, but may resolve the problem).
In earlier versions, the dataset was downloaded from a different location which is no longer available. Please check that you are using at least v1.1.0 (xengsort --version). If not, please do a fresh clone and follow the tutorial to set up a new environment (delete the old one first) and run pip install -e ., as explained above in the tutorial.

Indexing stops after a few seconds with a failure.

The most likely cause of this is that you have not specified the size of the hash table (-n parameter). (From version 1.0.1, the -n parameter is required, and the error cannot occur anymore. Please update to the latest version!) For the human and mouse genomes this is approximately -n 4_500_000_000 (4.5 billion). It is unfortunately a limitation of the implementation that the hash table size has to be specified in advance. If uncertain, use a generous overestimate (e.g., add the sizes of the two genomes) and look in the output for the line that starts with choice nonzero. This is the exact number of k-mers stored. You can use this value to re-index everything in a second iteration.

The xengsort classify step throws an error about NoneType vs. str.

This can happen if you do not specify the --out or --prefix parameter. (From version 1.0.2, this parameter is required, and the error cannot occur anymore. Please update to the latest version!)

Citation

If you use xengsort, please cite the article in "Algorithms for Molecular Biology". The BibTeX entry is provided here:

@Article{pmid33810805,
   Author="Jens Zentgraf and Sven Rahmann",
   Title="Fast lightweight accurate xenograft sorting",
   Journal="Algorithms Mol Biol",
   Year="2021",
   Volume="16",
   Number="1",
   Pages="2",
   Month="Apr"
}

In addition, you may also cite the WABI 2020 proceedigs paper:

@InProceedings{ZentgrafRahmann2020xengsort,
  author =	{Jens Zentgraf and Sven Rahmann},
  title =	{Fast Lightweight Accurate Xenograft Sorting},
  booktitle =	{20th International Workshop on Algorithms in Bioinformatics (WABI 2020)},
  pages =	{4:1--4:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-161-0},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{172},
  editor =	{Carl Kingsford and Nadia Pisanti},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2020/12793},
  URN =		{urn:nbn:de:0030-drops-127933},
  doi =		{10.4230/LIPIcs.WABI.2020.4},
  annote =	{Keywords: xenograft sorting, alignment-free method, Cuckoo hashing, k-mer}
}

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

2.0.8

Sep 10, 2024

2.0.7

Sep 9, 2024

2.0.5

Jan 4, 2024

2.0.4

Dec 19, 2023

2.0.3

Dec 8, 2023

2.0.2

Nov 17, 2023

2.0.1

Sep 18, 2023

2.0.0

Aug 31, 2023

1.5.0.3

Aug 17, 2023

1.5.0.2

Aug 17, 2023

This version

1.5.0.1

Aug 17, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

xengsort-1.5.0.1.tar.gz (47.3 kB view details)

Uploaded Aug 17, 2023 Source

Built Distribution

xengsort-1.5.0.1-py3-none-any.whl (41.5 kB view details)

Uploaded Aug 17, 2023 Python 3

File details

Details for the file xengsort-1.5.0.1.tar.gz.

File metadata

Download URL: xengsort-1.5.0.1.tar.gz
Upload date: Aug 17, 2023
Size: 47.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for xengsort-1.5.0.1.tar.gz
Algorithm	Hash digest
SHA256	`9866a88e080a53ab6645d084fe80e9cdd90263861ca3a47d440f13c4bfdfc587`
MD5	`8f35a4a4e7dd6876e6ef1b6891baf86b`
BLAKE2b-256	`ed7a9662af0d56bc68042a9594608c295393ec5dfd13d3e3d0b07860ca382951`

See more details on using hashes here.

File details

Details for the file xengsort-1.5.0.1-py3-none-any.whl.

File metadata

Download URL: xengsort-1.5.0.1-py3-none-any.whl
Upload date: Aug 17, 2023
Size: 41.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for xengsort-1.5.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`759ce65fba6a395d7ae63cc1762c9df6cd506d69c5cad45ce7ed31847e49d5aa`
MD5	`cd3a48183efb6d91b017cc36a7472355`
BLAKE2b-256	`f87ccb1531c8eb61668660c9f427d9dc8cc4bc29a0bddaabcd17c6826326c065`