Alignment-free scRNA-seq expression estimation

Arcane: Alignment-free single cell RNA-seq gene expression estimation

Arcane is a lightweight, alignment-free tool for scRNA-seq quantification.

In case of problems, please file an issue in the issue tracker.

See CHANGELOG.md for recent changes. Thank you!


Usage Guide

arcane is a multi-command tool with several subcommands (like git), in particular:

  • arcane index builds an index (a bucketed 3-way Cuckoo hash table)
  • arcane express processes a sample and generates a count matrix

It is a good idea to run arcane express --help to see all available options.
Using --help works on any subcommand.

Installation guide

Our software can be obtained by cloning this public git repository:

https://gitlab.com/rahmannlab/arcane

To run our software, a conda environment with the required libraries needs to be created.
A list of needed libraries is provided in the environment.yml file in the cloned repository;
it can be used to create a new environment:

cd arcane  # the directory of the cloned repository
conda env create

which will create an environment named arcane with the required dependencies, using the provided environment.yml file in the same directory.

After all dependencies have been downloaded, activate the environment and install the package from the repository into it.
Make sure that you are in the root directory of the cloned repository (where this README.md file or the CHANGELOG.md file is) and run

conda activate arcane  # activate environment
pip install -e .  # install arcane package using pip

Prebuilt index

We provide two prebuilt indices, one for human and one for mouse data. Each index contains all gapped k-mers (k=31, w=43, mask ####_#_##_###_#_###_###_###_#_###_##_#_####) of all sequences defined by the GTF file filtered by CellRanger. The human and mouse indices can be downloaded here.
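
To make the relation between the mask, k and w concrete, here is a minimal Python sketch. It assumes the usual reading of such masks, where '#' marks a significant position and '_' a position that is ignored; the helper names mask_positions and gapped_kmers are hypothetical and not part of arcane.

```python
MASK = "####_#_##_###_#_###_###_###_#_###_##_#_####"


def mask_positions(mask: str) -> list[int]:
    """Return the offsets of the significant ('#') positions in the mask."""
    if set(mask) - {"#", "_"}:
        raise ValueError("a mask may only contain '#' and '_'")
    if not mask or mask[0] != "#" or mask[-1] != "#":
        raise ValueError("a mask should start and end with '#'")
    return [i for i, c in enumerate(mask) if c == "#"]


def gapped_kmers(seq: str, mask: str):
    """Yield the gapped k-mer of every window of width w = len(mask)."""
    positions = mask_positions(mask)
    w = len(mask)
    for start in range(len(seq) - w + 1):
        # Keep only the characters at significant positions of the window.
        yield "".join(seq[start + p] for p in positions)


positions = mask_positions(MASK)
k, w = len(positions), len(MASK)  # k = number of '#', w = total width
print(k, w)  # 31 43
```

With the toy mask "#_#" (k=2, w=3), the sequence ACGTA yields the gapped 2-mers AG, CT and GA.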

How to classify

To compute the count matrix of a sample (R1 and R2), make sure you are in an environment where arcane and its dependencies are installed (see Installation guide).
In addition, you need an index: either download a prebuilt one (two files, here myindex.filter and myindex.info) or build your own (see How to build a custom index).
Then run the arcane express command with this index, e.g.,

arcane express --index myindex --R1 $R1-files --R2 $R2-files --out outfolder -c v3

assuming that your sample was generated with 10x Genomics Chromium chemistry v3 (10x v2 to v4 are currently supported).

The parameter --out is required and defines the prefix for all output files; this can be a combination of path and file prefix, such as /path/to/sorted/samplename.

Use

arcane express --help

to get a full list of optional parameters.
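
When many samples need to be processed, the command line shown above can be assembled programmatically. The following Python sketch only uses the options that appear in this guide (--index, --R1, --R2, --out, -c); the function name express_command and the FASTQ file names are hypothetical placeholders.

```python
import shlex


def express_command(index, r1, r2, out_prefix, chemistry="v3"):
    """Build the argument list for one `arcane express` call."""
    return [
        "arcane", "express",
        "--index", index,
        "--R1", r1,
        "--R2", r2,
        "--out", out_prefix,  # prefix for all output files
        "-c", chemistry,      # 10x chemistry version
    ]


cmd = express_command(
    "myindex",
    "sample_R1.fastq.gz",  # hypothetical file name
    "sample_R2.fastq.gz",  # hypothetical file name
    "results/sample",
)
print(shlex.join(cmd))
# To actually run it (requires arcane to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.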

How to build a custom index

To build an index for arcane, several parameters must be provided, which are described in the following.

First, a file name and a path for the index must be chosen. The index is stored in two files. We will use myindex to store the index in the current folder.

Second, a reference file and an associated annotation file are required. Both should have been filtered by CellRanger. We also need a name and the k-mer size or the mask with which the index should be built.

arcane filter --fasta REFERENCE --gtf ANNOTATION --mask '####_#_##_###_#_###_###_###_#_###_##_#_####' --name NAME --prefix outfolder

This creates a new FASTA file arcane_NAME_ref.fa.gz in the folder given by --prefix (here outfolder/arcane_NAME_ref.fa.gz).

To build the index you have to run:

arcane index --index myindex --ref REFERENCE --mask '####_#_##_###_#_###_###_###_#_###_##_#_####' -n 2_000_000_000

We must specify the size of the hash table:

  • -n or --nobjects: number of k-mers that will be stored in the hash table. This depends on the reference used and must be estimated beforehand! Because a precise estimate of the number of distinct k-mers can be difficult to obtain, you can be on the safe side and provide a generously large estimate, examine the final (low) load factor, and then rebuild the index with a smaller -n parameter to achieve the desired load. There are also tools that quickly estimate the number of distinct k-mers in large files, such as ntCard or KmerEstimate. As a guide: the human genome consists of roughly 2.5 billion 25-mers. This option must be specified; there is no default!
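
For small references, the number of distinct k-mers can simply be counted exactly. The Python sketch below is an illustration only: it counts contiguous k-mers over A, C, G, T as they occur in the input, and it is an assumption that this matches how arcane counts keys (canonical or gapped k-mers may give a different number). For large references, the memory needed by the set makes the estimators named above the better choice.

```python
def count_distinct_kmers(sequences, k):
    """Count distinct k-mers over A, C, G, T in an iterable of sequences."""
    valid = set("ACGT")
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows that contain N or other ambiguity codes.
            if set(kmer) <= valid:
                seen.add(kmer)
    return len(seen)


# Toy input: ACG, CGT, GTA and TAC are the only valid distinct 3-mers.
n_estimate = count_distinct_kmers(["ACGTACGT", "ACGNACG"], 3)
print(n_estimate)  # 4
```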

We may further specify additional properties of the hash table:

  • -b or --bucketsize indicates how many elements can be stored in one bucket (or page). This is 4 by default.

  • --fill between 0.0 and 1.0 describes the desired fill rate or load factor of the hash table. Together with -n, the number of slots in the table is calculated as ceil(n/fill). In our experiments we used 0.88. (The number of buckets is then the smallest odd integer that is at least ceil(ceil(n/fill)/p), where p is the bucket (page) size set by -b.)

  • --aligned or --unaligned: indicates whether each bucket should consume a number of bits that is a power of 2. Using --aligned ensures that each bucket stays within the same cache line, but may waste space (padding bits), yielding faster speed but possibly (much!) larger space requirements. With --unaligned, no bits are used for padding and buckets may cross cache line boundaries. This is slightly slower, but may save a little or a lot of space (depending on the bucket size in bits). The default is --unaligned, because the speed decrease is small and the memory savings can be significant.

  • --hashfunctions defines the parameters for the hash functions used to store the key-value pairs. If the parameter is unspecified, different random functions are chosen each time. The hash functions can be specified using a colon separated list: --hashfunctions linear945:linear9123641:linear349341847. It is recommended to have them chosen randomly unless you need strictly reproducible behavior, in which case the example given here is recommended.
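
The size arithmetic described for -n, --fill and -b, and the padding effect of --aligned, can be sketched in Python as follows. This is a sketch under stated assumptions, not arcane's actual code: p is read as the bucket size, and the number of bits per slot (27 in the example) is a made-up value chosen only to show the rounding.

```python
import math


def table_geometry(n, fill=0.88, bucketsize=4):
    """Return (slots, buckets) as described for -n, --fill and -b."""
    if not 0.0 < fill <= 1.0:
        raise ValueError("fill must be in (0.0, 1.0]")
    slots = math.ceil(n / fill)
    buckets = -(-slots // bucketsize)  # integer ceiling division
    if buckets % 2 == 0:
        buckets += 1  # smallest odd integer that is at least the quotient
    return slots, buckets


def bucket_bits(slot_bits, bucketsize=4, aligned=False):
    """Bits used by one bucket; --aligned rounds up to a power of 2."""
    bits = slot_bits * bucketsize
    if aligned:
        bits = 1 << (bits - 1).bit_length()  # next power of 2 >= bits
    return bits


print(table_geometry(2_000_000_000))  # (2272727273, 568181819)
print(bucket_bits(27, aligned=False), bucket_bits(27, aligned=True))  # 108 128
```

In this made-up example, aligning a 108-bit bucket to 128 bits costs about 19% extra space, which illustrates why --unaligned can save a lot of memory.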

Most of the parameters can also be provided in a config file (.yaml):

  • --cfg or --config defines the path to the config file.

Load index into shared memory

To load the index into shared memory, so that several arcane instances can run in parallel without each one adding the index size to the memory footprint, run

arcane load --name myindex

where myindex is the path to a prebuilt index.

To remove the index from shared memory, run

arcane remove --name myindex

Reproduce results from the paper

To reproduce all results, check the workflow folder for more information.

Download files

Source Distribution

sc_arcane-0.1.1.2.tar.gz (122.1 kB view details)

Uploaded Source

Built Distribution

sc_arcane-0.1.1.2-py3-none-any.whl (139.7 kB view details)

Uploaded Python 3

File details

Details for the file sc_arcane-0.1.1.2.tar.gz.

File metadata

  • Download URL: sc_arcane-0.1.1.2.tar.gz
  • Upload date:
  • Size: 122.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for sc_arcane-0.1.1.2.tar.gz
Algorithm Hash digest
SHA256 eb2ecb6b059978d989c7aafceae9ddd1923e82235c152e30af9055c0d4347645
MD5 1dac842e8ce57edd54ea1937cd3cd5f2
BLAKE2b-256 b57b8b15fffc2205178a4ec51ba85bd5af85761b9a31eeef9565f69cc3023c71

File details

Details for the file sc_arcane-0.1.1.2-py3-none-any.whl.

File metadata

  • Download URL: sc_arcane-0.1.1.2-py3-none-any.whl
  • Upload date:
  • Size: 139.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for sc_arcane-0.1.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 ab81faaa74e556691f69442310069aa144c73bd2f180cac3a845aa9c374b2c30
MD5 7ff75f06ebe6d5a150d799e3856ca48e
BLAKE2b-256 41422deb7f285e92b2eefd9ff7b3219bef392443a294ab26fa2950f764ec0152
