UMI counting tool for SmartSeq3 libraries

umite

UMI extraction & counting for Smart‑seq3 scRNA‑seq libraries

umite unites two tools that turn raw Smart‑seq3 FASTQ files into gene‑by‑cell count tables for downstream analysis:

umite_scheme

Step	Tool	Purpose
①	umiextract	Detect and label UMI-containing reads in a FASTQ file. Optional fuzzy UMI matching increases the number of detected UMIs.
②	(external)	Align reads with a splice-aware aligner such as STAR, then sort reads by read name using samtools.
③	umicount	Parse aligned reads and assign reads/UMIs to genes and exons, while deduplicating and (optionally) error‑correcting UMIs.

Additional information is available in our paper

Usage

Installation

git clone https://github.com/leoforster/umite.git
cd umite
pip install -e .

If this causes issues, try python -m pip install -e . to install via the local Python interpreter instead. You can also install the pytest suite by calling pip install -e .[dev].

umite requires Python≥3.7 and the packages HTSeq, regex and RapidFuzz, which are automatically installed in the example above. For the alignment step, you will need samtools and ideally the STAR aligner.

Pipeline

The repository ships with minimal_umite_run.sh, a minimal script for running the umite pipeline.

① Detect UMI-containing reads with umiextract

In this example, we have paired-end Smart-seq3 libraries from two cells: cellA and cellB. We enable error-tolerant detection of UMI-containing reads with --fuzzy_umi and pass the paired-end FASTQs as input. Libraries can be processed in parallel using the --cores argument.

umiextract \
    -1 cellA_R1.fastq.gz cellB_R1.fastq.gz \
    -2 cellA_R2.fastq.gz cellB_R2.fastq.gz \
    --umilen 8 \
    --fuzzy_umi 	# enable error-tolerant UMI detection

This will output modified FASTQs for each cell, e.g. cellA_R1_umiextract.fastq.gz. These files are essentially identical to the input, except that detected UMIs are trimmed from the read sequences and appended to the read names.
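The trimming step described above can be pictured with a minimal Python sketch. It is not umite's implementation: it only does exact matching with the standard library, using the default anchor, UMI length and trailing sequence from the umiextract option table. The read-name suffix format (`_<UMI>`) and the choice to drop everything up to and including the trailing sequence are assumptions made for illustration.

```python
import re

# Defaults taken from the umiextract option table below.
ANCHOR = "ATTGCGCAATG"
TRAILING = "GGG"
UMI_LEN = 8

# Anchor, then exactly UMI_LEN bases captured as the UMI, then the trailing sequence.
UMI_PATTERN = re.compile(f"{ANCHOR}([ACGTN]{{{UMI_LEN}}}){TRAILING}")


def extract_umi(name, seq, qual):
    """Return (name, seq, qual) with the UMI trimmed and moved to the read name.

    Reads without an exact anchor+UMI+trailing match are returned unchanged.
    The "_<UMI>" suffix is a hypothetical naming scheme, not umite's format.
    """
    match = UMI_PATTERN.search(seq)
    if match is None:
        return name, seq, qual
    umi = match.group(1)
    end = match.end()  # index of the first base after the trailing sequence
    return f"{name}_{umi}", seq[end:], qual[end:]


# Toy read: anchor + 8 bp UMI + GGG + cDNA sequence.
toy_seq = "ATTGCGCAATG" + "ACGTACGT" + "GGG" + "TTTTCCCC"
toy_read = extract_umi("read1", toy_seq, "I" * len(toy_seq))
```

Fuzzy matching (--fuzzy_umi) would relax the exact anchor and trailing match; that part is deliberately left out of this sketch.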

② Read alignment

Here we use STAR; note that --genomeDir requires a pre-existing genome index.

STAR \
  --genomeDir /path/to/STAR/index \
  --readFilesIn cellA_R1_umiextract.fastq.gz \
                 cellA_R2_umiextract.fastq.gz \
  --readFilesCommand zcat \
  --outSAMtype BAM Unsorted

Then sort the BAM files by read name, so that paired reads are adjacent in the file:

samtools sort -n -o cellA_sorted.bam Aligned.out.bam
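To illustrate why read-name sorting matters, here is a small Python sketch that groups adjacent records by read name. It works on plain (read_name, payload) tuples rather than real BAM records, and the function name is hypothetical; it only shows the general idea that a single streaming pass can recover mates when they sit next to each other.

```python
from itertools import groupby


def iter_read_groups(records):
    """Yield (read_name, [payloads]) for runs of adjacent records sharing a name.

    `records` is any iterable of (read_name, payload) tuples in file order.
    """
    for name, group in groupby(records, key=lambda rec: rec[0]):
        yield name, [payload for _, payload in group]


# Name-sorted order: both mates of a pair are adjacent.
name_sorted = [("r1", "R1 aln"), ("r1", "R2 aln"), ("r2", "R1 aln"), ("r2", "R2 aln")]
# Any other order may separate mates, so a streaming pass sees them as singletons.
other_order = [("r1", "R1 aln"), ("r2", "R1 aln"), ("r1", "R2 aln"), ("r2", "R2 aln")]

pairs = list(iter_read_groups(name_sorted))
singletons = list(iter_read_groups(other_order))
```

With the name-sorted input each group holds both mates, whereas the second ordering yields four single-record groups.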

③ Quantify counts with umicount

Finally, quantify counts per gene per cell for UMI-containing reads (U) and internal reads (R) using umicount. This step can process multiple BAMs in parallel by setting --cores, and requires a GTF file containing genome annotations (e.g. from Ensembl). In the example, --mm_count_primary causes the primary alignment to be counted for multimapping reads, and --UMI_correct enables gene-wise UMI correction, which collapses sequencing errors using directional Hamming distances. Other options are detailed below.

umicount \
    --bams cellA_sorted.bam cellB_sorted.bam \
    --gtf Mus_musculus.GRCm39.102.gtf \
    --mm_count_primary \
    --UMI_correct

This will produce the following output files:

umite.U.tsv   # unique UMI counts (per gene × cells)
umite.R.tsv   # internal-read (i.e. non-UMI) counts
umite.D.tsv   # counts for UMI-duplicates (for QC)
log.txt           # summary of processing & statistics (optional, enabled with --logfile)

The output count matrices contain samples (cells) in rows, with one column per gene parsed from the GTF. The columns begin with read categories (e.g. _unmapped, _multimapping, _ambiguous) that report the fate of every read from the umiextract FASTQ according to the following schema:

umite_read_categories
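As a rough sketch of how such a table could be consumed downstream, the Python snippet below separates the read-category columns (those starting with an underscore) from the gene columns. The toy table, its values, the gene names and the `cell` header of the first column are all made up for illustration; check the header of your own umite.U.tsv before relying on this layout.

```python
import csv
import io

# Toy stand-in for a count table; values and the "cell" header are invented.
toy_tsv = (
    "cell\t_unmapped\t_multimapping\t_ambiguous\tGeneA\tGeneB\n"
    "cellA\t5\t2\t1\t10\t0\n"
    "cellB\t3\t0\t4\t7\t12\n"
)


def split_counts(handle):
    """Split each row into read-category counts and gene counts, keyed by cell."""
    reader = csv.DictReader(handle, delimiter="\t")
    id_col = reader.fieldnames[0]  # assumed: the first column holds the cell name
    categories, genes = {}, {}
    for row in reader:
        cell = row[id_col]
        values = {k: int(v) for k, v in row.items() if k != id_col}
        categories[cell] = {k: v for k, v in values.items() if k.startswith("_")}
        genes[cell] = {k: v for k, v in values.items() if not k.startswith("_")}
    return categories, genes


categories, genes = split_counts(io.StringIO(toy_tsv))
```

For a real file, replace the `io.StringIO` object with an open file handle.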

Command‑line reference

Run umiextract -h or umicount -h for the full list of options.

umiextract

Flag	Description	Default
`-1 / --read1`	R1 FASTQ files (space‑separated)	required
`-2 / --read2`	R2 FASTQ files (same order as R1)	–
`-d / --output_dir`	Where to write processed FASTQs	`.`
`-c / --cores`	Parallel workers (one sample per core)	`4`
`-l / --logfile`	Path to log file	`sys.stdout`
`--umilen`	UMI length in bp	`8`
`--anchor`	Pre‑UMI anchor (TSO) sequence	`ATTGCGCAATG`
`--trailing`	Post‑UMI trailing sequence	`GGG`
`--search_region`	Sequence cutoff to search for UMI	`-1` (whole read)
`--fuzzy_umi`	Enable mismatch/indel‑tolerant search	off
`--anchor_mismatches`	Max mismatches in anchor	`2`
`--anchor_indels`	Max indels in anchor	`1`
`--trailing_hamming_threshold`	Max Hamming distance in trailing	`2`
`--min_seqlen`	Minimum remaining sequence after trimming UMI	`-1`
`--only_umi`	Drop reads that lack a detectable UMI	off

umicount

Flag	Description	Default
`-f / --bams`	Read‑name–sorted BAM files	required
`-d / --output_dir`	Output directory	`.`
`-c / --cores`	Parallel workers (one BAM per core)	all cores
`-l / --logfile`	Path to log file	`sys.stdout`
`-g / --gtf`	Ensembl‑style GTF annotation	required (see below)
`--tmp_dir`	Directory to save temporary files	`--output_dir`
`--no_dedup`	Skip deduplication and report all UMI-reads	off
`--mm_count_primary`	Count primary alignment for multimapping reads	off
`--multiple_primary_action`	When a read has multiple primary alignments: `warn`, `raise`, or `skip`	`warn`
`--min_read_mapQ`	Min mapQ to keep read	`0`
`--combine_unspliced`	If set, don't distinguish intronic and exonic reads	off
`--UMI_correct`	Enable gene‑wise UMI collapse by Hamming distance	off
`--hamming_threshold`	Hamming threshold for merging UMIs	`1`
`--count_ratio_threshold`	Only merge UMIs if one has 2*threshold times as many counts	`2`
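The idea behind --UMI_correct, --hamming_threshold and --count_ratio_threshold can be sketched in a few lines of Python. This is a simplified, single-step variant written for illustration, not umicount's algorithm: the function names are hypothetical, and the merge rule used here (the parent needs at least `count_ratio` times the counts of the UMI being merged) is an assumption rather than the tool's exact criterion.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(counts, hamming_threshold=1, count_ratio=2):
    """Merge likely sequencing-error UMIs into higher-count neighbours.

    `counts` maps UMI sequence -> read count for a single gene.
    Returns a dict of surviving UMIs with their merged read counts.
    Assumed rule: merge only if parent count >= count_ratio * child count.
    """
    # Visit UMIs from most to least abundant so potential parents come first.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    merged = {}
    for umi in ordered:
        parent = next(
            (p for p in merged
             if hamming(p, umi) <= hamming_threshold
             and counts[p] >= count_ratio * counts[umi]),
            None,
        )
        if parent is None:
            merged[umi] = counts[umi]      # keep as its own UMI
        else:
            merged[parent] += counts[umi]  # treat as an error copy of the parent
    return merged


toy_counts = {"AAAAAAAA": 10, "AAAAAAAT": 2, "CCCCCCCC": 4, "CCCCCCCA": 3}
collapsed = collapse_umis(toy_counts)
```

In the toy input, AAAAAAAT is one mismatch away from a much more abundant UMI and is merged into it, while CCCCCCCA stays separate because its neighbour CCCCCCCC is not abundant enough relative to it.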

Because GTF parsing can take several minutes, umicount can parse the GTF file once and dump the contents to a pickle file. Using --gtf together with --GTF_dump writes the parsed GTF data to a pickle file, which can then be passed to umicount with --GTF_skip_parse instead of --gtf. This is useful when running repeated quantifications; in general, however, --gtf is the better option. Here is a minimal example of this functionality:

umicount \
	--gtf examplefile.gtf \
	--GTF_dump umite_GTF_dump.pkl

umicount \
	--bams example.bam \
	--GTF_skip_parse umite_GTF_dump.pkl
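The parse-once, reuse-later pattern behind these two commands can be sketched in Python as follows. This is a generic illustration with a hypothetical helper and a toy parser, not umicount's code or its pickle layout. As with any pickle file, only load dumps that you created yourself or otherwise trust.

```python
import os
import pickle
import tempfile


def load_or_parse(source_path, cache_path, parse):
    """Return parsed data, reusing a pickle cache when one already exists."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    data = parse(source_path)          # the slow step, done only once
    with open(cache_path, "wb") as fh:
        pickle.dump(data, fh)
    return data


calls = []


def toy_parse(path):
    """Stand-in for an expensive GTF parser; records each invocation."""
    calls.append(path)
    return {"GeneA": [(100, 200)]}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "annotation.pkl")
    first = load_or_parse("annotation.gtf", cache, toy_parse)   # parses and dumps
    second = load_or_parse("annotation.gtf", cache, toy_parse)  # loads the dump
```

The second call returns the same data without invoking the parser again.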

Release history

0.1.1 (Jan 6, 2026): this version
0.1.0 (Nov 14, 2025): yanked
