Fast k-mer–based detection of any target sequence in paired single-cell FASTQ files
Project description
seqseeker
Fast k-mer–based detection of any target sequence in paired single-cell FASTQ files
seqseeker finds reads containing a user-specified nucleotide sequence (GFP,
mCherry, Cre, any transgene) directly in raw gzipped FASTQ files from 10x
Chromium experiments, attributes each hit to a cell barcode + UMI, and writes
output tables ready for Seurat.
No custom genome build. No Cell Ranger re-run. Per-sample runtimes of 5–15 minutes vs. hours for alignment.
Table of contents
- Installation
- Quick start
- CLI reference
- Python API
- R interface
- Bash wrapper
- Built-in sequences
- Output files
- Loading into Seurat
- Why not just build a custom genome?
- How it works
- Tuning guide
- Citation
Installation
Requirements: Python ≥ 3.10, gcc or cc (for compiling the C core), zlib.
pip install seqseeker
On first use seqseeker automatically compiles its C backend. You can also
pre-compile manually:
seqseeker compile
macOS
# Install Xcode Command Line Tools if you don't have gcc:
xcode-select --install
pip install seqseeker
Linux
sudo apt-get install build-essential zlib1g-dev # Debian/Ubuntu
# or
sudo yum install gcc zlib-devel # RHEL/CentOS
pip install seqseeker
R package
# Requires the Python package to be installed first (above).
# Install from GitHub:
if (!requireNamespace("remotes")) install.packages("remotes")
remotes::install_github("ethanfenton/seqseeker", subdir = "R/seqseeker")
Quick start
Python / CLI
# Search for eGFP using a built-in sequence
seqseeker search \
--sequence-name egfp \
--input /data/novogene/01.RawData \
--output results/gfp
# Any gene — pass a FASTA file
seqseeker search \
--sequence-file my_transgene.fasta \
--input /data/fastq/ \
--output results/transgene
# Raw sequence string
seqseeker search \
--sequence ATGGTGAGCAAGGGCGAGGAG... \
--input /data/fastq/ \
--output results/custom
# Only the 3' end (faster, specific)
seqseeker search \
--sequence-name egfp \
--region 3prime --region-len 200 \
--input /data/fastq/ \
--output results/gfp_3p
# Parallel processing
seqseeker search \
--sequence-name egfp \
--input /data/fastq/ \
--workers 0 # 0 = auto-detect CPUs
Python API
import seqseeker
results = seqseeker.search(
input_dirs=["/data/novogene/01.RawData",
"/data/novogene2/01.RawData"],
sequence_name="egfp",
output_dir="results/gfp",
workers=8,
)
print(f"Total hits: {results['n_hits']:,}")
print(f"Unique barcodes:{results['n_unique_barcodes']:,}")
R
library(seqseeker)
results <- seqseeker(
input = c("/data/novogene/01.RawData",
"/data/novogene2/01.RawData"),
sequence_name = "egfp",
output = "results/gfp",
workers = 8L
)
# Add to Seurat
seurat_obj <- seqseeker_add_to_seurat(seurat_obj, results)
FeaturePlot(seurat_obj, features = "seq_hits")
CLI reference
seqseeker <command> [options]
Commands:
search Search FASTQ files for a target sequence
compile Pre-compile the C search binary
list-sequences Show built-in sequences
seqseeker search options:
Input/output:
--input/-i DIR [DIR ...] FASTQ directories (required)
--output/-o DIR Output directory [default: seqseeker_output]
Target sequence (exactly one required):
--sequence/-q ACGT... Raw nucleotide string
--sequence-file/-f FILE FASTA file
--sequence-name/-n NAME Built-in name (egfp, mcherry, venus, tdtomato)
Region selection:
--region/-R full|5prime|3prime Part of target to index [default: full]
--region-len/-L N Length for 5'/3' slice [default: 200]
k-mer tuning:
--kmer-size/-k K k-mer length [default: 20, range: 8–31]
--min-kmers/-m N Min hits to call read positive [default: 1]
10x read layout:
--barcode-len/-b N Barcode length in R1 [default: 16]
--umi-len/-u N UMI length in R1 [default: 12]
Parallelism:
--workers/-w N Parallel processes [default: 1, 0=auto]
Filtering:
--sample-filter SAMPLE ... Process only these sample IDs
--report-interval N Progress every N reads [default: 1000000]
Advanced:
--binary PATH Path to pre-compiled kmer_search binary
Python API
seqseeker.search(
input_dirs, # str | list[str | Path]
output_dir = "seqseeker_output",
*,
sequence = None, # raw nucleotide string
sequence_file = None, # FASTA file path
sequence_name = None, # built-in name
region = "full", # "full" | "5prime" | "3prime"
region_len = 200,
kmer_size = 20,
barcode_len = 16,
umi_len = 12,
min_kmers = 1,
workers = 1, # 0 = auto
sample_filter = None,
report_interval = 1_000_000,
binary_path = None,
) -> dict
Returns a dict with keys:
output_dir, raw_csv, per_cell_csv, positions_csv,
n_hits, n_unique_barcodes.
R interface
# Search
results <- seqseeker(input, output, sequence_name, ...)
# Load existing results
results <- seqseeker_load_results("results/gfp")
# Add to Seurat
seurat_obj <- seqseeker_add_to_seurat(seurat_obj, results,
barcode_suffix = "-1",
meta_col = "gfp_reads")
# List built-ins
seqseeker_list_sequences()
Bash wrapper
The bash/seqseeker script is a thin wrapper that delegates to the Python
package. After pip install seqseeker the seqseeker entry point is already
on your PATH. The bash script is useful in cluster environments where you want
an explicit shell command without activating a Python environment:
# Set a custom Python path
SEQSEEKER_PYTHON=/path/to/env/bin/python3 seqseeker search ...
Built-in sequences
Name Length Description
--------------------------------------------------------------
egfp 720 bp Enhanced GFP (canonical, non-codon-optimised)
venus 720 bp Venus YFP (T203Y/S65G/V68L/S72A mutations)
mcherry 711 bp mCherry red fluorescent protein
tdtomato 1386 bp tdTomato (tandem dimer Tomato)
Run seqseeker list-sequences to see the current list.
For a codon-optimised variant of eGFP (common in AAV constructs), use
--sequence-file pointing to your construct FASTA rather than the built-in.
Output files
All outputs land in --output (default: seqseeker_output/):
| File | Description |
|---|---|
hits_raw.csv |
One row per positive read; sample, lane, barcode, UMI, hit count, position, strand, read name, R2 sequence |
hits_per_cell.csv |
One row per unique barcode; n_reads, n_umi, samples seen |
kmer_positions.csv |
Histogram of first k-mer match position along the target (useful for diagnosing partial integrations) |
_per_pair/ |
Intermediate per-FASTQ-pair CSVs (safe to delete after aggregation) |
Loading into Seurat
# Method 1: via seqseeker R package
library(seqseeker)
results <- seqseeker_load_results("results/gfp")
seurat_obj <- seqseeker_add_to_seurat(seurat_obj, results)
# Method 2: base R
hits <- read.csv("results/gfp/hits_per_cell.csv")
hits$bc <- paste0(hits$barcode, "-1")
seurat_obj$gfp_reads <- hits$n_reads[match(colnames(seurat_obj), hits$bc)]
seurat_obj$gfp_reads[is.na(seurat_obj$gfp_reads)] <- 0
Why not just build a custom genome?
See docs/why_kmer_search.md for the full argument. In short:
- ~100× faster — no STAR index build, no full alignment pass.
- Higher sensitivity — searches all raw reads, not just those that survive MAPQ filtering; catches partial integrations and junction reads.
- No false negatives from multi-mapping — transgene k-mers that share similarity with endogenous loci are still detected if any portion is unique.
- No genome required — works from the sequence string alone.
- Direct barcode attribution — output is immediately Seurat-ready.
How it works
- The target sequence (and its reverse complement) is encoded into a compact open-addressing hash table of 2-bit k-mers at startup (~1 ms).
- R2 reads are scanned with a rolling window; any k-mer hit increments a
counter. Once
--min-kmersthreshold is met the read is emitted. - The cell barcode and UMI are extracted from the corresponding R1 read
(first
barcode_len+ nextumi_lenbases). - Results are aggregated per barcode across all FASTQ pairs.
The C core processes ~50 M reads/min/core on a modern server; total time for a 200 M read 10x library is typically <5 min on 4 cores.
Tuning guide
| Goal | Recommended settings |
|---|---|
| Initial survey (maximum sensitivity) | k=20, min=1, region=full |
| Reduce noise after confirming signal | k=25, min=3 |
| Fast specific detection of 3' end | region=3prime, region-len=200 |
| Partial / truncated integration | k=15, min=1, region=full |
| Codon-optimised variant | --sequence-file construct.fasta |
Inspect kmer_positions.csv to see where hits cluster along the target — a
peak at position 600/720 (3' end of eGFP) is expected for normal expression;
a flat distribution suggests unspecific noise.
Citation
If you use seqseeker in published work, please cite:
Fenton E (2026). seqseeker: fast k-mer detection of target sequences in single-cell FASTQ files. GitHub: https://github.com/ethanfenton/seqseeker
License
MIT © 2026 Ethan Fenton
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file seqseeker-0.1.0.tar.gz.
File metadata
- Download URL: seqseeker-0.1.0.tar.gz
- Upload date:
- Size: 25.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
22924bf59cc8386064bf421a3805450b97ffec0c0099a3e27854216c402beb86
|
|
| MD5 |
0260cd40f7b6c5087d7e485092216c39
|
|
| BLAKE2b-256 |
ef724ddc1272d85337bf1e6f0be6528fdac4d30feeff9babf2810084aeeb77ef
|
File details
Details for the file seqseeker-0.1.0-py3-none-any.whl.
File metadata
- Download URL: seqseeker-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
061364c424f5eff8baefee54d662ba36ebe6d8af40a1fb084024224d46f9943e
|
|
| MD5 |
66befae79a18b28d7a0057cac230026e
|
|
| BLAKE2b-256 |
d176938c41deff6b1cf104c2bea43231b4661d19ef75c18762703bf4d39f487a
|