FastSTR: A high-performance tool for short tandem repeat (STR) detection and analysis from genome assemblies.
🧬 FastSTR
FastSTR — Ultra-fast and accurate identification of Short Tandem Repeats (STRs) from long-read DNA sequences. Developed for genome-wide STR detection, consensus construction, and comparative STR analysis.
📘 Table of Contents
- Overview
- Installation
- Quick Start
- Command Line Options
- Input & Output
- Usage
- Performance
- Citation
- License
- Changelog
🌍 Overview
FastSTR is a novel and efficient tool for de novo detection of short tandem repeats (STRs) in genomic sequences. It combines fast motif recognition with accurate sequence alignment to achieve both high precision and completeness in STR identification. FastSTR is optimized for large-scale genomic datasets and enables rapid detection of repetitive elements without relying on predefined motif libraries or fixed repeat-length thresholds.
Compared to classical tools such as TRF, T-REKS, and TRASH, FastSTR offers:
- ⚡ High-speed parallel processing — Processes genomic fragments in parallel, achieving up to 10× faster runtime.
- 🧠 Context-aware motif recognition — Uses an N-gram + Markov model to identify representative motifs without predefined motif libraries.
- 🧩 Segmented global alignment — Efficiently handles ultra-long or complex STRs while maintaining base-level precision.
- 🔍 Smart interval merging — Applies an interval-gain decision strategy to accurately resolve overlapping STRs.
- 🧬 Enhanced detection in complex regions — Identifies confounding or nested repeat regions (e.g., centromeric satellites) with a novel density-based concentration test.
- 💾 Lightweight & scalable — Requires few dependencies, easy to install and run, and supports multiple operating systems.
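The fragment-based processing mentioned above can be pictured with a small sketch. It splits a sequence into overlapping windows, using the documented defaults for sub-read length (`-l`, 15000) and overlap length (`-o`, 1000). The function name `split_with_overlap` and the exact windowing rule are assumptions made for illustration; FastSTR's internal fragmenting may differ.

```python
from typing import Iterator, Tuple


def split_with_overlap(
    seq: str, length: int = 15000, overlap: int = 1000
) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, fragment) pairs that cover seq with overlapping windows.

    Hypothetical helper: illustrates the idea of sub-reads with overlap,
    not FastSTR's actual implementation.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if not 0 <= overlap < length:
        raise ValueError("overlap must satisfy 0 <= overlap < length")
    if not seq:
        return
    step = length - overlap  # distance between consecutive window starts
    start = 0
    while True:
        yield start, seq[start:start + length]
        if start + length >= len(seq):
            break  # this window already reaches the end of the sequence
        start += step


if __name__ == "__main__":
    fragments = list(split_with_overlap("ACGT" * 10, length=16, overlap=4))
    print([(start, len(frag)) for start, frag in fragments])
    # prints [(0, 16), (12, 16), (24, 16)]
```

Each fragment shares its last `overlap` bases with the start of the next one, so a repeat that straddles a window boundary still appears whole in at least one fragment, provided it is shorter than the overlap.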
⚙️ Installation
Option 1: Install via pip
pip install faststr
Option 2: Install via conda (coming soon)
conda install -c bioconda faststr
Option 3: Local installation (development)
git clone https://github.com/yourname/faststr.git
cd faststr
pip install -e .
🚀 Quick Start
Basic Command
faststr [--strict | --normal | --loose] [--default] genome.fa
Example
faststr --strict --default genome.fa
This runs FastSTR in strict mode using the default model to identify STRs in the genome.fa file.
⚡ Command Line Options
| Argument | Type | Default | Description |
|---|---|---|---|
| `match` | int | 2 | Match score |
| `mismatch` | int | 5 | Mismatch score |
| `gap_open` | int | 7 | Gap opening penalty |
| `gap_extend` | int | 3 | Gap extension penalty |
| `p_indel` | int | 15 | Indel percentage threshold |
| `p_match` | int | 80 | Match percentage threshold |
| `score` | int | 50 | Alignment score threshold |
| `quality_control` | bool | False | Enable read-level quality control |
| `DNA_file` | str | - | Path to DNA FASTA input |
| `-f` | str | - | Output directory |
| `-s` | int | 1 | Start index |
| `-e` | int | 0 | End index |
| `-l` | int | 15000 | Sub-read length |
| `-o` | int | 1000 | Overlap length |
| `-p` | int | 1 | Number of CPU cores |
| `-b` | float | 0.045 | Motif coverage threshold |
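To make the scoring parameters concrete, here is a minimal sketch of how an already-aligned pair of sequences could be scored and filtered with the defaults listed above. It rests on several assumptions that the table does not confirm: that `mismatch`, `gap_open`, and `gap_extend` are subtracted as penalties, that a gap of length k costs `gap_open + (k - 1) * gap_extend`, and that `p_match` and `p_indel` are percentages of alignment columns. The names `alignment_stats` and `passes_thresholds` are hypothetical.

```python
from typing import NamedTuple, Optional


class AlignmentStats(NamedTuple):
    score: int
    pct_match: float
    pct_indel: float


def alignment_stats(
    aligned_a: str,
    aligned_b: str,
    match: int = 2,
    mismatch: int = 5,
    gap_open: int = 7,
    gap_extend: int = 3,
) -> AlignmentStats:
    """Score two gapped strings of equal length ('-' marks a gap).

    Assumed convention, not taken from FastSTR: affine gaps where the first
    gap column costs gap_open and each further column costs gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment must not be empty")
    score = matches = gap_columns = 0
    gap_side: Optional[str] = None  # which sequence the current gap run is in
    for a, b in zip(aligned_a, aligned_b):
        side = "a" if a == "-" else "b" if b == "-" else None
        if side is not None:
            score -= gap_extend if side == gap_side else gap_open
            gap_columns += 1
        elif a == b:
            score += match
            matches += 1
        else:
            score -= mismatch
        gap_side = side
    columns = len(aligned_a)
    return AlignmentStats(
        score, 100.0 * matches / columns, 100.0 * gap_columns / columns
    )


def passes_thresholds(
    stats: AlignmentStats, score: int = 50, p_match: int = 80, p_indel: int = 15
) -> bool:
    """Apply the three documented thresholds (assumed to be combined with AND)."""
    return (
        stats.score >= score
        and stats.pct_match >= p_match
        and stats.pct_indel <= p_indel
    )


if __name__ == "__main__":
    perfect = alignment_stats("AC" * 30, "AC" * 30)
    print(perfect, passes_thresholds(perfect))
```

Under these assumptions, a perfect 60 bp dinucleotide repeat scores 120 and passes, while short or gap-heavy alignments fall below the score or percentage thresholds.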
🧠 Alignment Modes
| Mode | Description |
|---|---|
| `--strict` | High precision, recommended for curated assemblies |
| `--normal` | Balanced mode, suitable for most datasets |
| `--loose` | High sensitivity, tolerant of mismatches |
🧬 Model Presets
| Preset | Description |
|---|---|
| `--default` | Standard scoring model |
| (future) `--sensitive` | Optimized for noisy long reads |
| (future) `--speed` | Optimized for large-scale detection |
📥 Input & Output
Input
- DNA sequences in FASTA format
Output
| File Pattern | Description |
|---|---|
| `*detail.dat` | Contains all STR positions and motifs, quality statistics for each STR, and STR counts per chromosome. |
| `*align.dat` | Detailed alignment of all STRs against reference STRs, including mismatches and indels. |
| `*.csv` | Merged STR intervals with representative motifs and summary statistics for each interval. |
| `*.log` | Processing logs. |
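For downstream scripting, the output files can be grouped by the patterns listed above. The sketch below is a hypothetical helper, not part of FastSTR. Because the column names of the `*.csv` file are not documented here, it only reads the header row instead of assuming any particular fields.

```python
import csv
from pathlib import Path
from typing import Dict, List, Union

# File patterns taken from the output table; the dictionary keys are illustrative names.
OUTPUT_PATTERNS = {
    "detail": "*detail.dat",
    "align": "*align.dat",
    "intervals": "*.csv",
    "log": "*.log",
}


def collect_outputs(out_dir: Union[str, Path]) -> Dict[str, List[Path]]:
    """Group the files in out_dir by the documented output patterns."""
    root = Path(out_dir)
    return {
        kind: sorted(root.glob(pattern))
        for kind, pattern in OUTPUT_PATTERNS.items()
    }


def peek_csv_header(path: Union[str, Path]) -> List[str]:
    """Return the first row of a CSV file, or an empty list if the file is empty."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle), [])


if __name__ == "__main__":
    outputs = collect_outputs(".")
    for kind, paths in outputs.items():
        print(kind, [p.name for p in paths])
    for csv_path in outputs["intervals"]:
        print(csv_path.name, peek_csv_header(csv_path))
```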
🧪 Usage
1️⃣ Identify STRs in a genome
faststr --normal --default human_genome.fa
2️⃣ Use multiple cores
faststr --strict --default genome.fa -p 8
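The same commands can be launched from a Python script, for example when processing several assemblies in a loop. This is a minimal sketch that only uses flags shown in this README (`--strict`/`--normal`/`--loose`, `--default`, `-p`, `-f`) and mirrors the argument order of the examples above. The helper names `build_command` and `run_faststr` are hypothetical, and the sketch assumes the `faststr` executable is on `PATH`.

```python
import subprocess
from typing import List, Optional

MODES = ("strict", "normal", "loose")


def build_command(
    fasta: str, mode: str = "normal", cores: int = 1, out_dir: Optional[str] = None
) -> List[str]:
    """Assemble a faststr command line from the documented options."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    cmd = ["faststr", f"--{mode}", "--default", fasta, "-p", str(cores)]
    if out_dir is not None:
        cmd += ["-f", out_dir]  # -f is listed as the output directory
    return cmd


def run_faststr(
    fasta: str, mode: str = "normal", cores: int = 1, out_dir: Optional[str] = None
) -> subprocess.CompletedProcess:
    """Run faststr and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(fasta, mode, cores, out_dir), check=True)


if __name__ == "__main__":
    # Show the command without executing it.
    print(" ".join(build_command("genome.fa", mode="strict", cores=8)))
```

Passing the command as a list avoids shell quoting problems with file paths that contain spaces.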
📈 Performance
| Dataset | Genome Size | Tool | Runtime | Recall | Precision |
|---|---|---|---|---|---|
| Human (T2T) | 2.94 G | TRF | 18 h 31 min | - | - |
| | | FastSTR | 1 h 13 min | 0.950 | 0.994 |
| Mouse (GRCm39) | 2.57 G | TRF | 1 h 41 min | - | - |
| | | FastSTR | 38 min | 0.966 | 0.997 |
| Zebrafish (GRCz11) | 1.58 G | TRF | 2 h 51 min | - | - |
| | | FastSTR | 25 min | 0.945 | 0.998 |
Note: TRF output is used as the ground truth, so recall and precision are not reported for TRF itself. FastSTR was run on 72 CPUs.
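The wall-clock ratios implied by the table can be worked out directly from the reported runtimes. The short calculation below uses only the numbers above. It says nothing about per-core efficiency, since the table does not state how many CPUs TRF used.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H h M min' runtime to minutes."""
    return hours * 60 + minutes


# (TRF runtime, FastSTR runtime) in minutes, copied from the table.
RUNTIMES = {
    "Human (T2T)": (to_minutes(18, 31), to_minutes(1, 13)),
    "Mouse (GRCm39)": (to_minutes(1, 41), to_minutes(0, 38)),
    "Zebrafish (GRCz11)": (to_minutes(2, 51), to_minutes(0, 25)),
}

SPEEDUPS = {name: trf / fast for name, (trf, fast) in RUNTIMES.items()}

if __name__ == "__main__":
    for name, ratio in SPEEDUPS.items():
        print(f"{name}: {ratio:.1f}x")
    # prints:
    # Human (T2T): 15.2x
    # Mouse (GRCm39): 2.7x
    # Zebrafish (GRCz11): 6.8x
```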
📚 Citation
If you use FastSTR in your research, please cite:
Xingyu Liao et al.,
Efficient Identification of Short Tandem Repeats via Context-Aware Motif Discovery and Ultra-Fast Sequence Alignment,
Nat. Methods, 2025.
📄 License
This project is licensed under the MIT License.
See LICENSE for more details.
🧾 Changelog
v1.0.0 (2025)
- Initial release of FastSTR
- Supports three alignment modes and one default model
- Implemented parallel computation
- Added `.csv`, `.dat`, and `.log` outputs