FastSTR: A high-performance tool for short tandem repeat (STR) detection and analysis from genome assemblies.

🧬 FastSTR

FastSTR — Ultra-fast and accurate identification of Short Tandem Repeats (STRs) from long-read DNA sequences. Developed for genome-wide STR detection, consensus construction, and comparative STR analysis.


📘 Table of Contents

  1. Overview
  2. Installation
  3. Quick Start
  4. Command Line Options
  5. Input & Output
  6. Usage
  7. Performance
  8. Citation
  9. License
  10. Changelog

🌍 Overview

FastSTR is a novel and efficient tool for de novo detection of short tandem repeats (STRs) in genomic sequences. It combines fast motif recognition with accurate sequence alignment to achieve both high precision and completeness in STR identification. FastSTR is optimized for large-scale genomic datasets and enables rapid detection of repetitive elements without relying on predefined motif libraries or fixed repeat-length thresholds.

Compared to classical tools such as TRF, T-REKS, and TRASH, FastSTR offers:

  • High-speed parallel processing — Processes genomic fragments in parallel, achieving up to 10× faster runtime.
  • 🧠 Context-aware motif recognition — Uses an N-gram + Markov model to identify representative motifs without predefined motif libraries.
  • 🧩 Segmented global alignment — Efficiently handles ultra-long or complex STRs while maintaining base-level precision.
  • 🔍 Smart interval merging — Applies an interval-gain decision strategy to accurately resolve overlapping STRs.
  • 🧬 Enhanced detection in complex regions — Identifies confounding or nested repeat regions (e.g., centromeric satellites) with a novel density-based concentration test.
  • 💾 Lightweight & scalable — Requires few dependencies, easy to install and run, and supports multiple operating systems.

⚙️ Installation

Option 1: Install via pip

pip install faststr

Option 2: Install via conda (coming soon)

The conda package is not yet available. Once it is published, the intended command is:

conda install -c bioconda faststr

Option 3: Local installation (development)

git clone https://github.com/yourname/faststr.git
cd faststr
pip install -e .

🚀 Quick Start

Basic Command

faststr [--strict | --normal | --loose] [--default] genome.fa

Example

faststr --strict --default genome.fa

This runs FastSTR in strict alignment mode with the default model preset to identify STRs in genome.fa.
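When FastSTR is called from a pipeline script, it can help to assemble the command line in one place. The sketch below is only an illustration: the helper name build_faststr_command is hypothetical, it uses only flags listed in this README (--strict/--normal/--loose, --default, -p, -f), and it assumes the short options may follow the FASTA path as in the usage examples.

```python
import shlex
from typing import List, Optional

# Alignment modes documented in this README.
MODES = ("strict", "normal", "loose")


def build_faststr_command(
    fasta: str,
    mode: str = "normal",
    cores: Optional[int] = None,
    out_dir: Optional[str] = None,
) -> List[str]:
    """Return an argv list for faststr (hypothetical helper, not part of FastSTR)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    cmd = ["faststr", f"--{mode}", "--default", fasta]
    if cores is not None:
        cmd += ["-p", str(cores)]   # -p: number of CPU cores
    if out_dir is not None:
        cmd += ["-f", out_dir]      # -f: output directory
    return cmd


cmd = build_faststr_command("genome.fa", mode="strict", cores=8)
print(shlex.join(cmd))
# To run it for real (requires faststr on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with unusual file names.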


⚡ Command Line Options

| Argument | Type | Default | Description |
|---|---|---|---|
| match | int | 2 | Match score |
| mismatch | int | 5 | Mismatch score |
| gap_open | int | 7 | Gap opening penalty |
| gap_extend | int | 3 | Gap extension penalty |
| p_indel | int | 15 | Indel percentage threshold |
| p_match | int | 80 | Match percentage threshold |
| score | int | 50 | Alignment score threshold |
| quality_control | bool | False | Enable read-level quality control |
| DNA_file | str | | Path to DNA FASTA input |
| -f | str | | Output directory |
| -s | int | 1 | Start index |
| -e | int | 0 | End index |
| -l | int | 15000 | Sub-read length |
| -o | int | 1000 | Overlap length |
| -p | int | 1 | Number of CPU cores |
| -b | float | 0.045 | Motif coverage threshold |
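To give a feel for how the scoring parameters and thresholds interact, here is a minimal sketch using the default values from the table. It is not FastSTR's implementation. It assumes that mismatch, gap_open, and gap_extend are subtracted as penalties, that the first column of each gap is charged gap_open and every further gapped column gap_extend, and that both percentages are taken over all alignment columns. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Scoring:
    # Default values copied from the options table.
    match: int = 2
    mismatch: int = 5
    gap_open: int = 7
    gap_extend: int = 3
    p_indel: int = 15   # maximum indel percentage (assumed meaning)
    p_match: int = 80   # minimum match percentage (assumed meaning)
    score: int = 50     # minimum alignment score (assumed meaning)


def alignment_score(s: Scoring, matches: int, mismatches: int,
                    gap_opens: int, gap_extends: int) -> int:
    """Affine-gap score: reward matches, penalise mismatches and gaps."""
    return (s.match * matches
            - s.mismatch * mismatches
            - s.gap_open * gap_opens
            - s.gap_extend * gap_extends)


def passes_thresholds(s: Scoring, matches: int, mismatches: int,
                      gap_opens: int, gap_extends: int) -> bool:
    """Check score, match percentage, and indel percentage together."""
    columns = matches + mismatches + gap_opens + gap_extends
    if columns == 0:
        return False
    pct_match = 100 * matches / columns
    pct_indel = 100 * (gap_opens + gap_extends) / columns
    return (alignment_score(s, matches, mismatches, gap_opens, gap_extends) >= s.score
            and pct_match >= s.p_match
            and pct_indel <= s.p_indel)


defaults = Scoring()
# 40 matches, 2 mismatches, one gap of length 2: 80 - 10 - 7 - 3 = 60
print(alignment_score(defaults, 40, 2, 1, 1))
print(passes_thresholds(defaults, 40, 2, 1, 1))
```

Under these assumptions a short or noisy alignment fails on the score threshold first: 20 matches with 4 mismatches scores only 20.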

🧠 Alignment Modes

| Mode | Description |
|---|---|
| --strict | High precision, recommended for curated assemblies |
| --normal | Balanced mode, suitable for most datasets |
| --loose | High sensitivity, tolerant of mismatches |

🧬 Model Presets

| Preset | Description |
|---|---|
| --default | Standard scoring model |
| --sensitive (future) | Optimized for noisy long reads |
| --speed (future) | Optimized for large-scale detection |

📥 Input & Output

Input

  • DNA sequences in FASTA format

Output

| File Pattern | Description |
|---|---|
| *detail.dat | Contains all STR positions and motifs, quality statistics for each STR, and STR counts per chromosome. |
| *align.dat | Detailed alignment of all STRs against reference STRs, including mismatches and indels. |
| *.csv | Merged STR intervals with representative motifs and summary statistics for each interval. |
| *.log | Processing logs. |
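After a run, the output directory can be sorted into these four groups by file pattern. The sketch below uses only the patterns listed above. The function name and the example file names are hypothetical, and the column layout of the files is not assumed.

```python
from fnmatch import fnmatch
from typing import Dict, Iterable, List

# File patterns from the output table; the keys are labels chosen for this example.
OUTPUT_PATTERNS = {
    "detail": "*detail.dat",
    "align": "*align.dat",
    "summary": "*.csv",
    "log": "*.log",
}


def classify_outputs(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group file names by the first output pattern they match."""
    groups: Dict[str, List[str]] = {kind: [] for kind in OUTPUT_PATTERNS}
    for name in sorted(names):
        for kind, pattern in OUTPUT_PATTERNS.items():
            if fnmatch(name, pattern):
                groups[kind].append(name)
                break
    return groups


# Hypothetical file names; in practice pass os.listdir(output_dir).
example = ["genome_detail.dat", "genome_align.dat", "genome.csv", "genome.log", "notes.txt"]
print(classify_outputs(example))
```

Files that match none of the patterns, such as notes.txt here, are simply left out of the result.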

🧪 Usage

1️⃣ Identify STRs in a genome

faststr --normal --default human_genome.fa

2️⃣ Use multiple cores

faststr --strict --default genome.fa -p 8
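Parallel runs work on fragments of the input, and the -l (sub-read length) and -o (overlap length) options control how those fragments are cut. The following sketch shows one common way to tile a sequence into overlapping windows with the default values. How FastSTR tiles sequences internally is not documented here, so the function name, the 0-based half-open coordinates, and the handling of the last window are all assumptions.

```python
from typing import Iterator, Tuple


def sub_read_windows(seq_len: int, length: int = 15000,
                     overlap: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) windows covering [0, seq_len) with a fixed overlap."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap
    start = 0
    while start < seq_len:
        end = min(start + length, seq_len)
        yield start, end
        if end == seq_len:   # last window reached the end of the sequence
            break
        start += step


for window in sub_read_windows(40000):
    print(window)
# (0, 15000)
# (14000, 29000)
# (28000, 40000)
```

The overlap means a repeat that spans a window boundary appears whole in at least one window, provided it is shorter than the overlap. Longer repeats need to be stitched back together afterwards.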

📈 Performance

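| Dataset | Genome Size | Tool | Runtime | Recall | Precision |
|---|---|---|---|---|---|
| Human (T2T) | 2.94 G | TRF | 18 h 31 min | - | - |
| Human (T2T) | 2.94 G | FastSTR | 1 h 13 min | 0.950 | 0.994 |
| Mouse (GRCm39) | 2.57 G | TRF | 1 h 41 min | - | - |
| Mouse (GRCm39) | 2.57 G | FastSTR | 38 min | 0.966 | 0.997 |
| Zebrafish (GRCz11) | 1.58 G | TRF | 2 h 51 min | - | - |
| Zebrafish (GRCz11) | 1.58 G | FastSTR | 25 min | 0.945 | 0.998 |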

Note: TRF output is used as the ground truth, so recall and precision are not reported for TRF itself. FastSTR was run on 72 CPUs.
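The speedups implied by the table can be worked out directly from the reported runtimes. The short script below does that arithmetic and also derives an F1 score from each recall/precision pair. The F1 values are not reported in the source and are computed here only for convenience. The ratios are wall-clock ratios, so they include the effect of FastSTR running on 72 CPUs.

```python
def minutes(hours: int, mins: int) -> int:
    """Convert an 'h min' runtime to minutes."""
    return hours * 60 + mins


# (TRF runtime, FastSTR runtime) in minutes, from the performance table.
RUNTIMES = {
    "Human (T2T)": (minutes(18, 31), minutes(1, 13)),
    "Mouse (GRCm39)": (minutes(1, 41), minutes(0, 38)),
    "Zebrafish (GRCz11)": (minutes(2, 51), minutes(0, 25)),
}

# (recall, precision) for FastSTR, from the performance table.
ACCURACY = {
    "Human (T2T)": (0.950, 0.994),
    "Mouse (GRCm39)": (0.966, 0.997),
    "Zebrafish (GRCz11)": (0.945, 0.998),
}


def speedup(trf_min: float, faststr_min: float) -> float:
    """Wall-clock ratio of TRF runtime to FastSTR runtime."""
    return trf_min / faststr_min


def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


for dataset, (trf, fast) in RUNTIMES.items():
    recall, precision = ACCURACY[dataset]
    print(f"{dataset}: {speedup(trf, fast):.1f}x faster, F1 = {f1(recall, precision):.3f}")
```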


📚 Citation

If you use FastSTR in your research, please cite:

Xingyu Liao et al.,
Efficient Identification of Short Tandem Repeats via Context-Aware Motif Discovery and Ultra-Fast Sequence Alignment,
Nat. Methods, 2025.


📄 License

This project is licensed under the MIT License.
See LICENSE for more details.


🧾 Changelog

v1.0.0 (2025)

  • Initial release of FastSTR
  • Supports three alignment modes and one default model
  • Implemented parallel computation
  • Added .csv, .dat, .log outputs
