PROBETHEUS

A Python package that detects overrepresented sequences in a FASTQ file and designs probes against them. It is intended for single-end sequencing data, such as immunoprecipitation and Ribo-seq experiments.

Installation

pip install probetheus

Features

  • Process single-end FASTQ files to find top represented sequences
  • Generate probes from top sequences with customizable lengths
  • Cluster sequences based on edit distance
  • Detect probe binding sites against reference sequences
  • Generate cumulative percentage plots with elbow point detection
  • Reanalyze results with custom elbow points
  • Subsample input files for faster analysis or testing
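
As an illustration of the first feature and of the cumulative percentages, here is a minimal sketch of how identical reads can be counted and ranked. It is not the package's own code: the function names `read_sequences` and `top_sequences` are hypothetical, and it assumes a well-formed four-line-per-record FASTQ stream.

```python
from collections import Counter
from itertools import islice


def read_sequences(handle):
    """Yield the sequence line (2nd of every 4 lines) from a FASTQ text stream."""
    for line in islice(handle, 1, None, 4):
        yield line.strip()


def top_sequences(handle, top_n=20):
    """Return (sequence, count, cumulative % of all reads) for the top_n sequences."""
    counts = Counter(read_sequences(handle))
    total = sum(counts.values())
    rows, running = [], 0
    for seq, count in counts.most_common(top_n):
        running += count
        rows.append((seq, count, 100.0 * running / total))
    return rows


# Usage on a real file (hypothetical path), reading the gzip stream as text:
#   import gzip
#   with gzip.open("input.fastq.gz", "rt") as handle:
#       rows = top_sequences(handle)
```

The cumulative column is the quantity a cumulative percentage plot would show: how much of the library the top-ranked sequences account for together.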

Usage

Processing FASTQ Files and Generating Probes

# Basic usage
probetheus process input.fastq.gz -o results.tsv

# With core sequence analysis
probetheus process input.fastq.gz -o results.tsv --find-core --core-length 25

# Process without length filtering
probetheus process input.fastq.gz -o results.tsv --skip-length-filter

# Check probe binding against reference
probetheus process input.fastq.gz -o results.tsv -r ref.fasta --max-binding-dist 2

# Process with subsampling (e.g., use 20% of reads)
probetheus process *.fastq.gz -o results.tsv --subsample 20
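
To make the idea of percentage subsampling concrete, here is a minimal sketch that keeps each FASTQ record independently with the requested probability. This is one common approach and an assumption about the general technique, not a description of how probetheus implements --subsample. The function name `subsample_fastq` and the `seed` parameter are hypothetical.

```python
import random
from itertools import islice


def subsample_fastq(lines, percent, seed=0):
    """Yield 4-line FASTQ records, keeping each with probability percent/100."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same reads
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input, or a truncated trailing record
        if rng.random() * 100 < percent:
            yield record
```

Because each record is kept or dropped independently, the number of reads retained is only approximately the requested percentage; an exact read count would need a different strategy such as reservoir sampling.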

Reanalyzing Results

After initial processing, you can reanalyze the results either with a different elbow point or by specifying the number of probes:

# Reanalyze with a new elbow point
probetheus reanalyze --input results.tsv --elbow 5 --output-prefix new_results

# Reanalyze by specifying number of probes
probetheus reanalyze --input results.tsv --probes 10 --output-prefix new_results

This will create:

  • new_results_reanalyzed.tsv: New results file with selected sequences
  • new_results_reanalyzed_cumulative.png: Updated cumulative plot
  • new_results_reanalyzed_probes.fasta: New probe sequences

Arguments

Process Command

  • --output, -o: Output table file
  • --min-length: Minimum sequence length (default: 20)
  • --max-length: Maximum sequence length (default: 50)
  • --top-n: Number of top sequences to use for probe generation (default: 20)
  • --probe-length: Length of generated probes (default: 25)
  • --min-probe-length: Minimum acceptable probe length (default: 20)
  • --edit-distance: Maximum edit distance for clustering (default: 1)
  • --find-core: Find core sequences
  • --core-length: Length for core sequence analysis (default: 25)
  • --min-core-occurrence: Minimum fraction of sequences a core must appear in (default: 0.1)
  • --reference, -r: Reference FASTA file to check probe binding
  • --max-binding-dist: Maximum edit distance allowed for probe binding (default: 2)
  • --subsample: Subsample percentage (1-100) of reads from each file
  • --reads: Number of reads to take from each file (overrides --subsample if provided)
  • --cpus: Number of CPU cores to use (default: 8, max: number of cores - 1)
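
The --edit-distance option refers to clustering sequences that differ by only a few edits. The sketch below shows one simple way this can be done: a Levenshtein distance function plus a greedy pass that assigns each sequence to the first cluster whose representative is close enough. The greedy strategy and the names `edit_distance` and `cluster` are assumptions for illustration; the package may cluster differently.

```python
def edit_distance(a, b):
    """Levenshtein distance using the standard two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def cluster(sequences, max_distance=1):
    """Greedily group sequences around the first representative within max_distance."""
    clusters = []  # list of (representative, members)
    for seq in sequences:
        for representative, members in clusters:
            if edit_distance(seq, representative) <= max_distance:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))  # no match: start a new cluster
    return clusters
```

If the input is ordered from most to least abundant, each cluster's representative is its most abundant member, which is usually the sequence one would want to design a probe against.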

Reanalyze Command

  • --input, -i: Input results.tsv file from previous analysis
  • --elbow, -e: New elbow point (number of sequences to keep)
  • --probes, -p: Number of probes to generate (alternative to --elbow)
  • --output-prefix, -o: Prefix for output files (optional)
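
For readers unfamiliar with elbow points, the sketch below shows one widely used heuristic: take the cumulative percentage curve and pick the rank whose point lies farthest from the straight line joining the curve's first and last points. Whether probetheus uses this particular heuristic is not stated here, so treat it as an assumption. The function name `find_elbow` is hypothetical.

```python
def find_elbow(cumulative):
    """Return the 1-based rank farthest from the chord joining the curve's end points."""
    n = len(cumulative)
    if n < 3:
        return n  # too few points for a meaningful elbow
    x0, y0 = 1, cumulative[0]
    x1, y1 = n, cumulative[-1]
    best_rank, best_gap = 1, 0.0
    for rank, y in enumerate(cumulative, start=1):
        # Unnormalised point-to-line distance; the constant denominator
        # is omitted because it does not change which rank is largest.
        gap = abs((y1 - y0) * (rank - x0) - (x1 - x0) * (y - y0))
        if gap > best_gap:
            best_rank, best_gap = rank, gap
    return best_rank


# Cumulative % of reads covered by the top 1, 2, 3, ... sequences (made-up values).
curve = [40, 70, 85, 88, 90, 91, 92]
elbow = find_elbow(curve)
print(elbow)  # 3
```

An elbow of 3 here means the first three sequences capture most of the gain, which corresponds to keeping three sequences. Overriding it by hand is useful when the automatic choice keeps too few or too many sequences for the experiment.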

License

This project is licensed under the MIT License - see the LICENSE file for details.
