RNA FISH oligos/probes design tool.

License: MIT

eFISHent

Design RNA smFISH oligonucleotide probes from the command line. One command to install, one command to design probes.

Key features:

  • Pre-built genome indices for 7 organisms — no index building needed
  • Automatic gene sequence download from NCBI
  • Multi-layer off-target detection (genome alignment, transcriptome BLAST, repeat masking, expression weighting, and more)
  • Adaptive probe length to normalize Tm across the probe set
  • Protocol presets (smfish, merfish, dna-fish, etc.)
  • Automated probe validation with PASS/FLAG/FAIL recommendations

Installation

Tested on macOS and Linux with Python 3.10+. Works on HPC/cluster servers via SSH — no sudo, Docker, or conda needed. For Windows, use WSL.

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash

Restart your shell, then verify:

efishent --check

Installation options

With BLAST+ and transcriptome tools (for transcriptome-level off-target filtering):

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --with-blast

Custom install path:

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --prefix /path/to/install

Update:

efishent --update

Development install:

git clone https://github.com/BBQuercus/eFISHent.git
cd eFISHent/
./install.sh --deps-only
uv venv && source .venv/bin/activate
uv pip install -e .

Uninstall:

curl -LsSf https://raw.githubusercontent.com/BBQuercus/eFISHent/main/install.sh | bash -s -- --uninstall

Or simply: rm -rf ~/.local/efishent

Quick Start

The fastest way to design probes — genome indices are downloaded automatically:

efishent --genome hg38 --gene-name "MALAT1" --organism-name "homo sapiens" --preset smfish

That's it. This downloads the pre-built human genome index on first use and designs smFISH probes for MALAT1, a long transcript that typically yields many more candidate probes than a shorter one such as ACTB.

Available Genomes

Organism     Aliases
Human        hg38, GRCh38, human
Mouse        mm39, GRCm39, mouse
Zebrafish    danRer11, GRCz11, zebrafish
Rat          rn7, GRCr8, rat
Drosophila   dm6, BDGP6, fly
C. elegans   ce11, WBcel235, worm, elegans
Yeast        sacCer3, R64, yeast

efishent --list-genomes          # List all available genomes
efishent --download-genome hg38  # Pre-download for offline use

Indices are cached in ~/.local/efishent/indices/ by default. Override with --index-cache-dir /path/to/dir or the EFISHENT_INDEX_DIR environment variable.
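
As a rough illustration of how the three settings could interact, here is a minimal Python sketch. The precedence order (command-line flag first, then the environment variable, then the default) is an assumption made for the example, and `resolve_index_dir` is a hypothetical helper, not part of eFISHent.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location stated in this README.
DEFAULT_INDEX_DIR = Path("~/.local/efishent/indices").expanduser()


def resolve_index_dir(
    cli_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the index cache directory.

    Assumed precedence (not confirmed by the README):
    --index-cache-dir, then EFISHENT_INDEX_DIR, then the default.
    """
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = env.get("EFISHENT_INDEX_DIR")
    if env_value:
        return Path(env_value).expanduser()
    return DEFAULT_INDEX_DIR


print(resolve_index_dir(None, {"EFISHENT_INDEX_DIR": "/scratch/indices"}))
```

On a shared cluster, pointing the variable at scratch storage keeps large indices out of a quota-limited home directory.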

Specifying Your Target Gene

Three ways to provide the target sequence:

Method                 Example
Gene name + organism   --gene-name "MALAT1" --organism-name "homo sapiens"
Ensembl ID             --ensembl-id ENSG00000128272 --organism-name "homo sapiens"
FASTA file             --sequence-file ./my_gene.fasta

Using Your Own Genome

For organisms without a pre-built index, provide your own reference genome:

# Build indices once (can take 30-60 min for large genomes)
efishent --reference-genome <genome.fa> --build-indices True

# Design probes
efishent --reference-genome <genome.fa> --gene-name <gene> --organism-name <organism>

Downloading genomes and annotations

For any organism, download the genome FASTA and GTF annotation from Ensembl or UCSC. Prefer primary_assembly if available, otherwise toplevel. Unzip with gunzip.

Example for human (GRCh38):

# Reference genome
wget https://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# GTF annotation (for intergenic filtering, rRNA filtering, expression weighting)
wget https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.115.gtf.gz
gunzip Homo_sapiens.GRCh38.115.gtf.gz

Ensembl GTFs use gene_biotype while GENCODE uses gene_type — eFISHent supports both.
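
To make the difference concrete, the sketch below reads the biotype from a GTF line under either attribute name. It is an illustration only, not eFISHent's parser; the two sample lines are invented.

```python
import re
from typing import Optional

# Matches `gene_biotype "x"` (Ensembl) or `gene_type "x"` (GENCODE).
_BIOTYPE_RE = re.compile(r'\b(?:gene_biotype|gene_type) "([^"]+)"')


def gene_biotype(gtf_line: str) -> Optional[str]:
    """Return the gene biotype from one GTF line, or None if absent."""
    if gtf_line.startswith("#"):
        return None  # header/comment line
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None  # not a complete 9-column GTF record
    match = _BIOTYPE_RE.search(fields[8])
    return match.group(1) if match else None


# Invented example records in the two attribute styles.
ensembl = 'chr1\tensembl\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG01"; gene_biotype "rRNA";'
gencode = 'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG02.1"; gene_type "lncRNA";'

print(gene_biotype(ensembl), gene_biotype(gencode))
```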

Reference transcriptome (optional, for BLAST cross-validation):

gffread Homo_sapiens.GRCh38.115.gtf -g Homo_sapiens.GRCh38.dna.primary_assembly.fa -w transcriptome.fa

# Append rRNA sequences (18S/28S/5.8S are NOT in standard GTFs)
efetch -db nucleotide -id NR_003286.4 -format fasta >> transcriptome.fa  # 45S pre-rRNA
efetch -db nucleotide -id NR_023363.1 -format fasta >> transcriptome.fa  # 5S rRNA

The major rRNA genes exist in ~300 tandem copies in unassembled regions, so they're absent from standard GTFs. Including them in the transcriptome FASTA ensures the BLAST filter catches probes binding these abundant sequences.

Count table (optional, for expression-weighted filtering):

Download a normalized RNA-seq dataset (FPKM/TPM) for your cell line from GEO or Expression Atlas. The file needs Ensembl gene IDs in column 1 and normalized counts in column 2.
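
Public expression files often carry extra columns, so a small conversion step may be needed. The sketch below reduces a wider table to the two-column layout described above. The input column names, the tab separator, the missing header row, and the stripping of Ensembl version suffixes are all assumptions for the example; the expression values are invented.

```python
import csv
import io


def to_count_table(rows, id_column, value_column):
    """Yield (gene_id, value) pairs, dropping any version suffix from the ID."""
    for row in rows:
        gene_id = row[id_column].split(".")[0]  # ENSG00000128272.14 -> ENSG00000128272
        yield gene_id, float(row[value_column])


# Invented input with an extra column between the ID and the TPM value.
raw = (
    "gene_id\tlength\tTPM\n"
    "ENSG00000128272.14\t2022\t850.2\n"
    "ENSG00000075624.17\t1812\t3120.0\n"
)
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
table = list(to_count_table(reader, "gene_id", "TPM"))

# Write gene ID in column 1 and the normalized count in column 2.
out = io.StringIO()
csv.writer(out, delimiter="\t", lineterminator="\n").writerows(table)
print(out.getvalue())
```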

Presets

Use --preset to apply optimized parameters for common FISH protocols:

Preset      Description
smfish      Standard smFISH (20-24nt probes, adaptive length, 10% formamide)
merfish     MERFISH encoding probes (tight Tm, 30% formamide)
dna-fish    DNA FISH (longer probes, relaxed specificity)
strict      Maximum specificity (low k-mer tolerance, low-complexity filter)
relaxed     Maximum probe yield (permissive thresholds + rescue filters)
exogenous   Exogenous genes such as GFP, Renilla, and other reporters (no k-mer filter, strict BLAST)

Use --preset list to see details. Explicit arguments override preset values.
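
The override rule can be pictured as a dictionary merge. This is a minimal sketch, not the tool's code: the `smfish` values come from the table above, while the key names and the `resolve_parameters` helper are hypothetical.

```python
# Only the values stated in the preset table; the real preset likely sets more.
PRESETS = {
    "smfish": {
        "min_length": 20,
        "max_length": 24,
        "adaptive_length": True,
        "formamide_concentration": 10,
    },
}


def resolve_parameters(preset, explicit):
    """Start from the preset, then let explicitly given arguments win."""
    params = dict(PRESETS[preset])
    # None stands for "argument not given on the command line".
    params.update({key: value for key, value in explicit.items() if value is not None})
    return params


params = resolve_parameters("smfish", {"max_length": 26, "threads": None})
print(params)
```

Here `max_length` becomes 26 while the other preset values stay in place, which mirrors running `--preset smfish --max-length 26`.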

Workflow
flowchart TD
    A["Gene Sequence<br/><i>FASTA file or NCBI download</i>"] --> B["Generate Candidate Probes<br/><i>Sliding window (adaptive or fixed length)</i>"]

    B --> C["Basic Filtering<br/><i>TM, GC, homopolymers, low-complexity</i>"]

    C --> D["Genome Alignment<br/><i>Bowtie2 (default) or Bowtie</i><br/><i>+ repeat masking, intergenic, Tm scoring</i>"]

    D --> E{"Transcriptome<br/>provided?"}
    E -- Yes --> F["Transcriptome BLAST<br/><i>Off-target detection (TrueProbes params)</i>"]
    E -- No --> G["K-mer Filtering<br/><i>Jellyfish frequency count</i>"]
    F --> G

    G --> H["Secondary Structure<br/><i>deltaG prediction (RNAstructure)</i>"]

    H --> H2["Accessibility Scoring<br/><i>RNA folding (optional)</i>"]

    H2 --> I["Quality-Weighted Optimization<br/><i>Greedy or optimal (MILP) + gap filling + Tm refinement</i>"]

    I --> K["Validation Report<br/><i>Quality scores, off-target genes, recommendations</i>"]

    K --> J["Final Probe Set"]

    style A fill:#e1f5fe
    style J fill:#e8f5e9

  1. Candidate probes are generated from the input sequence using a sliding window. When --adaptive-length is enabled, probe lengths are adjusted based on local GC content to normalize Tm.
  2. Basic filtering removes probes failing sequence criteria: melting temperature, GC content, homopolymer runs, optionally low-complexity regions, and optionally G-quadruplex motifs (--filter-g-quadruplex).
  3. Probes are aligned to the reference genome using Bowtie2 (sensitive local alignment with OligoMiner/Tigerfish parameters). Optional filters refine off-target counting: repeat masking, intergenic filtering, thermodynamic scoring, and expression weighting.
  4. If a reference transcriptome is provided, probes are BLASTed against expressed transcripts to catch off-targets that genome alignment alone may miss (e.g., splice junctions).
  5. Short k-mers are counted using Jellyfish — probes with frequently occurring k-mers are discarded.
  6. Secondary structure is predicted using a nearest-neighbor thermodynamic model — probes with too-stable structures are filtered.
  7. If --accessibility-scoring is enabled, target RNA accessibility is scored using RNA folding predictions.
  8. Quality-weighted optimization selects non-overlapping probes maximizing coverage. A gap-filling pass covers remaining regions and Tm uniformity refinement swaps outlier probes.
  9. The output includes per-probe quality scores, off-target gene names, expression risk, and PASS/FLAG/FAIL recommendations.
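
Steps 1 and 2 can be sketched in a few lines of Python. This is a simplified illustration, not eFISHent's implementation: it uses a fixed-length sliding window, checks only GC content and homopolymer runs, and works on the target sequence directly (a real probe is the reverse complement). The GC bounds of 40-60% are placeholder values; the homopolymer limit of 5 matches the documented default.

```python
import re


def candidates(sequence, min_length, max_length):
    """Yield (start, subsequence) for every window of each allowed length."""
    for length in range(min_length, max_length + 1):
        for start in range(len(sequence) - length + 1):
            yield start, sequence[start:start + length]


def gc_percent(seq):
    """GC content of a sequence in percent."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def has_homopolymer(seq, max_run):
    """True if any base repeats more than max_run times in a row."""
    return re.search(r"(.)\1{%d,}" % max_run, seq) is not None


def basic_filter(sequence, min_length=20, max_length=24,
                 min_gc=40.0, max_gc=60.0, max_homopolymer=5):
    """Keep windows that pass the GC and homopolymer checks."""
    kept = []
    for start, seq in candidates(sequence.upper(), min_length, max_length):
        if not (min_gc <= gc_percent(seq) <= max_gc):
            continue
        # max_homopolymer == 0 disables the check, as with the CLI option.
        if max_homopolymer and has_homopolymer(seq, max_homopolymer):
            continue
        kept.append((start, seq))
    return kept


print(len(basic_filter("ACGT" * 6)))
```

The later steps (alignment, BLAST, k-mer counting, structure prediction) rely on external tools and are not reproduced here.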

Output

eFISHent produces three files per run:

File              Description
GENE_HASH.fasta   Final probes in FASTA format
GENE_HASH.csv     Detailed probe table (see columns below)
GENE_HASH.txt     Run parameters and command for reproducibility

The HASH is a unique identifier based on the parameters used — rerunning with the same parameters reuses cached results.
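
One common way to get such an identifier is to hash a canonical serialization of the parameters. The sketch below shows that idea; the README does not document the actual scheme, so the use of JSON, SHA-256, the 10-character digest, and the `run_id` helper are all assumptions.

```python
import hashlib
import json


def run_id(gene, params, digest_length=10):
    """Build a GENE_HASH-style name from a parameter dictionary."""
    # sort_keys makes the serialization independent of argument order.
    payload = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:digest_length]
    return f"{gene}_{digest}"


first = run_id("MALAT1", {"min_length": 20, "max_length": 24})
same = run_id("MALAT1", {"max_length": 24, "min_length": 20})
other = run_id("MALAT1", {"min_length": 20, "max_length": 26})
print(first, first == same, first == other)
```

Identical parameters map to the same name regardless of order, so a cached result can be found again, while any changed value produces a new name.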

Output CSV columns

Column              Description
name                Probe identifier
sequence            Probe nucleotide sequence
start, end          Position along the target gene
length              Probe length in nucleotides
GC                  GC content (%)
TM                  Predicted melting temperature (deg C)
deltaG              Secondary structure free energy (kcal/mol)
kmers               Maximum k-mer count in reference genome
count               Genome off-target hit count
txome_off_targets   Transcriptome off-target count (when --reference-transcriptome is used)
off_target_genes    Off-target gene names with hit counts, e.g., ACTG1(3), MYH9(1)
worst_match         Closest (most similar) off-target match, e.g., 95%/20bp/0mm
expression_risk     Expression risk for off-target genes, e.g., ACTG1:HIGH(850)
quality             Composite quality score (0-100)
recommendation      PASS, FLAG(reason), or FAIL
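
For downstream scripting, the table can be read with Python's standard csv module. The sketch below keeps only PASS probes and unpacks the off_target_genes format shown above. The sample rows are invented, only a subset of columns is included, and a comma delimiter is assumed.

```python
import csv
import io
import re

# Matches entries such as "ACTG1(3)" inside the off_target_genes cell.
GENE_HITS = re.compile(r"([^,()\s]+)\((\d+)\)")


def parse_off_target_genes(cell):
    """Turn 'ACTG1(3), MYH9(1)' into {'ACTG1': 3, 'MYH9': 1}."""
    return {name: int(hits) for name, hits in GENE_HITS.findall(cell or "")}


def passing_probes(handle):
    """Return the rows whose recommendation is exactly PASS."""
    return [row for row in csv.DictReader(handle) if row["recommendation"] == "PASS"]


# Invented sample; a real file would be opened with open("GENE_HASH.csv", newline="").
sample = io.StringIO(
    "name,sequence,quality,off_target_genes,recommendation\n"
    "probe-1,ACGTACGTACGTACGTACGT,92,,PASS\n"
    'probe-2,TTGCATTGCATTGCATTGCA,61,"ACTG1(3), MYH9(1)",FLAG(off-target)\n'
)
passing = passing_probes(sample)
print([row["name"] for row in passing])
print(parse_off_target_genes("ACTG1(3), MYH9(1)"))
```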

Probe Set Analysis

Analyze an existing probe set with comprehensive metrics and a PDF report:

efishent \
    --reference-genome <genome.fa> \
    --sequence-file <gene.fa> \
    --analyze-probeset <probes.fasta>

Analysis report contents

Plot                   Description
Lengths                Distribution of probe lengths
Melting temperatures   Boxplot of calculated Tm values
GC Content             Boxplot of GC percentages
G quadruplet           Count of G-quadruplet motifs per probe
K-mer count            Maximum k-mer frequency in genome
Free energy            Predicted secondary structure stability (deltaG)
Off target count       Number of off-target binding sites per probe
Binding affinity       Probe-to-probe similarity matrix (potential cross-hybridization)
Gene coverage          Visual map of probe positions along the target sequence

Parameters

Core Parameters

Parameter            Description
--reference-genome   Path to reference genome FASTA
--genome             Use a pre-built genome index (e.g., hg38, mm39, zebrafish)
--gene-name          Gene name for automatic sequence download from NCBI
--organism-name      Organism name (used with --gene-name or --ensembl-id)
--sequence-file      Path to target gene FASTA file
--preset             Parameter preset: smfish, merfish, dna-fish, strict, relaxed, exogenous
--threads            Number of threads for parallel processing
--is-plus-strand     Strand orientation of the gene of interest
--is-endogenous      Whether the gene is endogenous to the organism

Probe Design Parameters

Parameter                    Description
--min-length, --max-length   Probe length range in nucleotides
--spacing                    Minimum distance between probes
--min-tm, --max-tm           Melting temperature range
--min-gc, --max-gc           GC content range (%)
--formamide-concentration    Formamide concentration (%)
--na-concentration           Sodium ion concentration (mM)
--adaptive-length            Adjust probe length by local GC to normalize Tm
--max-homopolymer-length     Max homopolymer run (default: 5, 0 to disable)
--filter-low-complexity      Filter dinucleotide repeats and low entropy regions
--filter-g-quadruplex        Filter G-quadruplex motifs in target
--max-deltag                 Secondary structure free energy threshold
--target-regions             Target region: exon (default), intron, both, cds-only, utr-only
--accessibility-scoring      Score target RNA accessibility via RNA folding
--optimization-method        greedy (default, fast) or optimal (MILP, max coverage)
--optimization-time-limit    Time limit in seconds for optimal solver
--sequence-similarity        Max allowed inter-probe similarity (%) to avoid cross-hybridization
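
To give a feel for what a greedy selection with a spacing constraint does, here is a generic interval-scheduling sketch. It is not eFISHent's optimizer, which is also quality-weighted: it only picks as many non-overlapping probes as possible, and it assumes that spacing means the number of nucleotides between the end of one probe and the start of the next.

```python
def greedy_select(probes, spacing=2):
    """Pick non-overlapping probes, earliest end first.

    probes: iterable of (start, end) positions, end exclusive.
    spacing: assumed minimum gap between consecutive probes.
    """
    chosen = []
    next_free = None  # first position where the next probe may start
    for start, end in sorted(probes, key=lambda probe: (probe[1], probe[0])):
        if next_free is None or start >= next_free:
            chosen.append((start, end))
            next_free = end + spacing
    return chosen


# Invented candidate positions along a target.
probes = [(0, 20), (10, 30), (22, 42), (21, 41), (45, 65)]
print(greedy_select(probes, spacing=2))
```

An exact solver, such as the MILP behind the optimal method, can additionally weigh probe quality and coverage, at the cost of longer run times.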

Off-target filtering parameters

Genome alignment (default):

Parameter                  Description
--max-off-targets          Maximum genome hits per probe (default: 0)
--aligner                  bowtie2 (default) or bowtie (legacy)
--mask-repeats             Ignore off-targets in repetitive regions (uses dustmasker)
--intergenic-off-targets   Ignore off-targets outside annotated genes (requires --reference-annotation)
--off-target-min-tm        Min Tm (deg C) for an off-target hit to count. Set to the hybridization temperature to ignore thermodynamically unstable hits and rescue the affected probes (default: 0)
--filter-rrna              Remove probes hitting rRNA genes (requires --reference-annotation)

Transcriptome BLAST (optional):

Parameter                         Description
--reference-transcriptome         Transcriptome FASTA for BLAST cross-validation
--max-transcriptome-off-targets   Max transcriptome hits per probe (default: 0)
--blast-identity-threshold        Min % identity for BLAST hit (default: 75)
--min-blast-match-length          Min effective alignment length (default: max(18, 0.8 * min_probe_length))
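
The default for --min-blast-match-length is easiest to see with numbers. The function below evaluates the documented formula; whether the tool rounds the result to an integer is not stated, so no rounding is applied here.

```python
def default_min_blast_match_length(min_probe_length):
    """Documented default: max(18, 0.8 * min_probe_length)."""
    return max(18, 0.8 * min_probe_length)


for length in (20, 45):
    print(length, default_min_blast_match_length(length))
```

With 20 nt probes the floor of 18 applies, since 0.8 * 20 = 16. With 45 nt probes the proportional term takes over and gives 36.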

Expression weighting (optional):

Parameter                     Description
--reference-annotation        GTF annotation file
--encode-count-table          Normalized RNA-seq count table (FPKM/TPM)
--max-expression-percentage   Top expression percentile to exclude
--max-probes-per-off-target   Cap on probes hitting same off-target gene (default: 0 = disabled, recommended: 5)

Index and cache parameters

Parameter              Description
--build-indices        Build genome indices (bowtie2, jellyfish, BLAST)
--download-genome      Pre-download a genome index for offline use
--list-genomes         List available pre-built genomes
--index-cache-dir      Override index cache directory (default: ~/.local/efishent/indices/). Also settable via EFISHENT_INDEX_DIR
--kmer-length          K-mer length for Jellyfish filtering
--max-kmers            Max k-mer occurrences in genome before discarding probe
--save-intermediates   Keep all intermediate files for debugging

Examples

Full examples

smFISH with pre-built index (simplest):

efishent --genome hg38 --gene-name "MALAT1" --organism-name "homo sapiens" --preset smfish --threads 8

smFISH with full off-target filtering:

efishent \
    --reference-genome ./hg-38.fa \
    --reference-annotation ./hg-38.gtf \
    --reference-transcriptome ./transcriptome.fa \
    --gene-name "GAPDH" \
    --organism-name "homo sapiens" \
    --preset smfish \
    --mask-repeats True \
    --intergenic-off-targets True \
    --filter-rrna True \
    --max-probes-per-off-target 5 \
    --threads 8

Long probes (45-50nt) with optimal solver:

efishent \
    --reference-genome ./hg-38.fa \
    --gene-name "norad" \
    --organism-name "homo sapiens" \
    --is-plus-strand True \
    --optimization-method optimal \
    --min-length 45 \
    --max-length 50 \
    --formamide-concentration 45 \
    --threads 8

Exogenous gene (GFP, Renilla, etc.):

efishent \
    --reference-genome ./hg38.fa \
    --reference-transcriptome ./transcriptome.fa \
    --reference-annotation ./hg38.gtf \
    --sequence-file "./renilla.fasta" \
    --preset exogenous \
    --threads 8

Expression-weighted off-target filtering:

efishent \
    --reference-genome ./hg-38.fa \
    --reference-annotation ./hg-38.gtf \
    --ensembl-id ENSG00000128272 \
    --organism-name "homo sapiens" \
    --is-plus-strand False \
    --max-off-targets 5 \
    --encode-count-table ./count_table.tsv \
    --max-expression-percentage 20 \
    --threads 8

Rescue probes with thermodynamic and repeat masking filters:

efishent \
    --reference-genome ./hg-38.fa \
    --reference-annotation ./hg-38.gtf \
    --sequence-file ./my_gene.fasta \
    --mask-repeats True \
    --intergenic-off-targets True \
    --off-target-min-tm 37 \
    --threads 8

FAQ

Have questions? Open an issue on GitHub.
