A Python package to design probes against overrepresented sequences in a FASTQ file.

PROBETHEUS

A Python package that detects overrepresented sequences in a FASTQ file and designs probes against them. It is intended for single-end sequencing data, such as reads from immunoprecipitation experiments and Ribo-seq.

Installation

pip install probetheus

Features

  • Process single-end FASTQ files to find top represented sequences
  • Generate probes from top sequences with customizable lengths
  • Cluster sequences based on edit distance
  • Detect probe binding sites against reference sequences
  • Generate cumulative percentage plots with elbow point detection
  • Reanalyze results with custom elbow points
  • Subsample input files for faster analysis or testing
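To illustrate what clustering by edit distance can look like, here is a minimal sketch. It is not taken from the probetheus source: the function names `edit_distance` and `greedy_cluster` are hypothetical, and the greedy strategy (attach each sequence to the first more abundant representative within the allowed distance) is an assumption about one reasonable approach.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def greedy_cluster(counts: dict, max_dist: int = 1) -> dict:
    """Group sequences around their most abundant neighbour (hypothetical helper).

    Sequences are visited from most to least abundant. Each one joins the
    first existing representative within max_dist edits, or starts a new cluster.
    """
    clusters = {}
    for seq in sorted(counts, key=lambda s: (-counts[s], s)):
        for representative in clusters:
            if edit_distance(seq, representative) <= max_dist:
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters
```

The same `edit_distance` function could also be used to score a probe against windows of a reference sequence, which is the idea behind checking binding sites within a maximum distance.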

Usage

Processing FASTQ Files and Generating Probes

# Basic usage
probetheus process input.fastq.gz -o results.tsv

# With core sequence analysis
probetheus process input.fastq.gz -o results.tsv --find-core --core-length 25

# Process without length filtering
probetheus process input.fastq.gz -o results.tsv --skip_length_filter

# Check probe binding against reference
probetheus process input.fastq.gz -o results.tsv -r ref.fasta --max-binding-dist 2

# Process with subsampling (e.g., use 20% of reads)
probetheus process *.fastq.gz -o results.tsv --subsample 20
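For readers who want a feel for what the processing step does conceptually, the sketch below counts identical read sequences in a FASTQ file, with optional percentage subsampling and length filtering. It is an illustration only, not the package's implementation: `open_fastq`, `read_sequences`, and `count_sequences` are hypothetical names, and the per-read random sampling is an assumption about how a subsample percentage could be applied.

```python
import gzip
import random
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_sequences(handle):
    """Yield the sequence line (second line) of each 4-line FASTQ record."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:
            yield line.strip()


def count_sequences(handle, subsample_percent=100.0,
                    min_length=20, max_length=50, seed=0):
    """Count identical reads, keeping roughly subsample_percent of them."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    keep_fraction = subsample_percent / 100.0
    counts = Counter()
    for seq in read_sequences(handle):
        if rng.random() >= keep_fraction:
            continue                   # read not selected by the subsample
        if min_length <= len(seq) <= max_length:
            counts[seq] += 1
    return counts

# Example (file name is illustrative):
# with open_fastq("input.fastq.gz") as handle:
#     top = count_sequences(handle, subsample_percent=20).most_common(20)
```

The length bounds and the top 20 in the example mirror the defaults listed under Arguments.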

Reanalyzing Results

After initial processing, you can reanalyze the results either with a different elbow point or by specifying the number of probes:

# Reanalyze with a new elbow point
probetheus reanalyze --input results.tsv --elbow 5 --output-prefix new_results

# Reanalyze by specifying number of probes
probetheus reanalyze --input results.tsv --probes 10 --output-prefix new_results

This will create:

  • new_results_reanalyzed.tsv: New results file with selected sequences
  • new_results_reanalyzed_cumulative.png: Updated cumulative plot
  • new_results_reanalyzed_probes.fasta: New probe sequences
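The elbow point is the rank at which adding further sequences stops contributing much to the cumulative percentage of reads. This README does not describe how probetheus locates it, so the following is only a sketch of one common heuristic: pick the point on the cumulative curve that lies farthest above the straight line joining its first and last points. The function names are hypothetical.

```python
def cumulative_percentages(counts):
    """Cumulative share of reads (in percent) for counts sorted in descending order."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return []
    running = 0
    result = []
    for count in ordered:
        running += count
        result.append(100.0 * running / total)
    return result


def elbow_rank(cumulative):
    """Return the 1-based rank farthest above the chord from first to last point."""
    n = len(cumulative)
    if n < 3:
        return n
    first, last = cumulative[0], cumulative[-1]
    best_rank, best_gap = 1, 0.0
    for i, value in enumerate(cumulative):
        chord = first + (last - first) * i / (n - 1)  # height of the straight line at i
        gap = value - chord
        if gap > best_gap:
            best_rank, best_gap = i + 1, gap
    return best_rank
```

If the automatically chosen rank looks too strict or too lenient on the cumulative plot, that is the situation in which passing an explicit --elbow value to the reanalyze command is useful.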

Arguments

Process Command

  • --output, -o: Output table file
  • --min-length: Minimum sequence length (default: 20)
  • --max-length: Maximum sequence length (default: 50)
  • --top-n: Number of top sequences to use for probe generation (default: 20)
  • --probe-length: Length of generated probes (default: 25)
  • --min-probe-length: Minimum acceptable probe length (default: 20)
  • --edit-distance: Maximum edit distance for clustering (default: 1)
  • --find-core: Find core sequences
  • --core-length: Length for core sequence analysis (default: 25)
  • --min-core-occurrence: Minimum fraction of sequences a core must appear in (default: 0.1)
  • --reference, -r: Reference FASTA file to check probe binding
  • --max-binding-dist: Maximum edit distance allowed for probe binding (default: 2)
  • --subsample: Subsample percentage (1-100) of reads from each file
  • --reads: Number of reads to take from each file (overrides --subsample if provided)
  • --cpus: Number of CPU cores to use (default: 8, max: number of cores - 1)
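The --core-length and --min-core-occurrence options suggest that a core is a fixed-length subsequence shared by at least a given fraction of the top sequences. A minimal sketch of that idea, assuming a simple k-mer presence count, is shown below. The function `find_cores` is hypothetical and may differ from how probetheus actually selects cores.

```python
from collections import Counter


def find_cores(sequences, core_length=25, min_occurrence=0.1):
    """Return (k-mer, fraction) pairs for k-mers present in enough sequences.

    A k-mer is counted once per sequence, however often it repeats inside it.
    Results are sorted by descending fraction, then alphabetically.
    """
    if not sequences:
        return []
    present_in = Counter()
    for seq in sequences:
        kmers = {seq[i:i + core_length]
                 for i in range(len(seq) - core_length + 1)}
        present_in.update(kmers)       # each distinct k-mer adds 1 for this sequence
    threshold = min_occurrence * len(sequences)
    cores = [(kmer, n / len(sequences))
             for kmer, n in present_in.items() if n >= threshold]
    return sorted(cores, key=lambda item: (-item[1], item[0]))
```

Sequences shorter than `core_length` contribute no k-mers, so they lower the achievable fraction without raising an error.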

Reanalyze Command

  • --input, -i: Input results.tsv file from previous analysis
  • --elbow, -e: New elbow point (number of sequences to keep)
  • --probes, -p: Number of probes to generate (alternative to --elbow)
  • --output-prefix, -o: Prefix for output files (optional)

License

This project is licensed under the MIT License - see the LICENSE file for details.
