A Python package to design probes against overrepresented sequences in a FASTQ file.
PROBETHEUS
A Python package that detects overrepresented sequences in a FASTQ file and designs probes against them. It is intended for single-end sequencing data, such as immunoprecipitation experiments and Ribo-seq.
Installation
pip install probetheus
Features
- Process single-end FASTQ files to find top represented sequences
- Generate probes from top sequences with customizable lengths
- Cluster sequences based on edit distance
- Detect probe binding sites against reference sequences
- Generate cumulative percentage plots with elbow point detection
- Reanalyze results with custom elbow points
- Subsample input files for faster analysis or testing
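To illustrate what clustering by edit distance means, here is a minimal sketch. It is not taken from the probetheus source: the function names `edit_distance` and `cluster_by_edit_distance` are hypothetical, and the greedy strategy (most abundant sequence becomes the cluster representative) is an assumption about one reasonable way to do it.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cluster_by_edit_distance(counts: dict, max_dist: int = 1) -> dict:
    """Greedy clustering (an assumed strategy, not necessarily the package's).

    Sequences are visited from most to least abundant; each one joins the
    first existing representative within max_dist, or starts a new cluster.
    """
    clusters = {}
    for seq in sorted(counts, key=counts.get, reverse=True):
        for representative in clusters:
            if edit_distance(seq, representative) <= max_dist:
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# Hypothetical read counts, for illustration only
example_counts = {"ACGTACGTAC": 50, "ACGTACGTAA": 5,
                  "TTTTGGGGCC": 20, "TTTTGGGGCA": 2}
clusters = cluster_by_edit_distance(example_counts, max_dist=1)
for representative, members in clusters.items():
    print(representative, members)
```

With these example counts, the two low-abundance variants are absorbed into the clusters of their abundant neighbours, so only two representatives remain.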
Usage
Processing FASTQ Files and Generating Probes
# Basic usage
probetheus process input.fastq.gz -o results.tsv
# With core sequence analysis
probetheus process input.fastq.gz -o results.tsv --find-core --core-length 25
# Process without length filtering
probetheus process input.fastq.gz -o results.tsv --skip_length_filter
# Check probe binding against reference
probetheus process input.fastq.gz -o results.tsv -r ref.fasta --max-binding-dist 2
# Process with subsampling (e.g., use 20% of reads)
probetheus process *.fastq.gz -o results.tsv --subsample 20
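Conceptually, the processing step counts identical reads, optionally after length filtering and subsampling, and keeps the most frequent ones. The sketch below shows that idea in plain Python. It is an illustration under stated assumptions, not the package's implementation: `count_reads` and `top_sequences_from_file` are hypothetical names, and the per-read random draw is only one possible way to subsample.

```python
import gzip
import random
from collections import Counter
from itertools import islice


def count_reads(lines, min_length=20, max_length=50, subsample_pct=100.0, seed=0):
    """Count identical read sequences in an iterable of FASTQ lines."""
    rng = random.Random(seed)
    counts = Counter()
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))  # header, sequence, '+', qualities
        if len(record) < 4:
            break
        # Assumed subsampling scheme: keep each read with probability pct/100
        if rng.random() * 100 >= subsample_pct:
            continue
        seq = record[1].strip()
        if min_length <= len(seq) <= max_length:
            counts[seq] += 1
    return counts


def top_sequences_from_file(path, top_n=20, **kwargs):
    """Open a plain or gzipped FASTQ file and return its top_n sequences."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_reads(handle, **kwargs).most_common(top_n)


# Small in-memory demonstration with made-up reads
demo = []
for i, seq in enumerate(["A" * 25] * 3 + ["C" * 30] * 2 + ["G" * 10]):
    demo += [f"@read{i}\n", seq + "\n", "+\n", "I" * len(seq) + "\n"]
demo_counts = count_reads(demo)
print(demo_counts.most_common(2))
```

The 10-base read in the demonstration is dropped because it is shorter than the assumed minimum length of 20.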
Reanalyzing Results
After initial processing, you can reanalyze the results either with a different elbow point or by specifying the number of probes:
# Reanalyze with a new elbow point
probetheus reanalyze --input results.tsv --elbow 5 --output-prefix new_results
# Reanalyze by specifying number of probes
probetheus reanalyze --input results.tsv --probes 10 --output-prefix new_results
This will create:
- new_results_reanalyzed.tsv: New results file with selected sequences
- new_results_reanalyzed_cumulative.png: Updated cumulative plot
- new_results_reanalyzed_probes.fasta: New probe sequences
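For readers unfamiliar with elbow points on a cumulative percentage curve, the sketch below shows one common heuristic: pick the rank where the curve lies furthest above the straight line joining its first and last points. This README does not state which method probetheus uses, so treat the heuristic and the names `cumulative_percentages` and `elbow_index` as assumptions made for illustration.

```python
def cumulative_percentages(counts):
    """Cumulative share of reads (in percent) covered by the top-k sequences."""
    total = sum(counts)
    running, result = 0, []
    for count in sorted(counts, reverse=True):
        running += count
        result.append(100.0 * running / total)
    return result


def elbow_index(cumulative):
    """Return the 1-based rank furthest above the chord from first to last point.

    This is the 'maximum distance to chord' heuristic, chosen here as an
    example; it is not necessarily what the package implements.
    """
    n = len(cumulative)
    if n < 3:
        return n
    first, last = cumulative[0], cumulative[-1]
    best_rank, best_gap = 1, 0.0
    for rank, value in enumerate(cumulative, start=1):
        chord = first + (last - first) * (rank - 1) / (n - 1)
        gap = value - chord
        if gap > best_gap:
            best_rank, best_gap = rank, gap
    return best_rank


# Made-up read counts for seven sequences
example_counts = [500, 300, 100, 40, 30, 20, 10]
curve = cumulative_percentages(example_counts)
print(curve)
print("elbow at rank", elbow_index(curve))
```

If the automatically chosen rank keeps too few or too many sequences, that is the situation in which passing a different value to `--elbow` is useful.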
Arguments
Process Command
- --output, -o: Output table file
- --min-length: Minimum sequence length (default: 20)
- --max-length: Maximum sequence length (default: 50)
- --top-n: Number of top sequences to use for probe generation (default: 20)
- --probe-length: Length of generated probes (default: 25)
- --min-probe-length: Minimum acceptable probe length (default: 20)
- --edit-distance: Maximum edit distance for clustering (default: 1)
- --find-core: Find core sequences
- --core-length: Length for core sequence analysis (default: 25)
- --min-core-occurrence: Minimum fraction of sequences a core must appear in (default: 0.1)
- --reference, -r: Reference FASTA file to check probe binding
- --max-binding-dist: Maximum edit distance allowed for probe binding (default: 2)
- --subsample: Subsample percentage (1-100) of reads from each file
- --reads: Number of reads to take from each file (overrides --subsample if provided)
- --cpus: Number of CPU cores to use (default: 8, max: number of cores - 1)
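To make the meaning of a maximum binding edit distance concrete, the following sketch scans a reference for places where a probe matches with at most a given number of edits. It uses a standard approximate string matching dynamic programme (Sellers' algorithm) as a stand-in; whether probetheus works this way is not stated here. The function name `find_binding_sites` is hypothetical, and strand handling (for example reverse complements) is deliberately left out.

```python
def find_binding_sites(probe, reference, max_dist=2):
    """Return (end_position, distance) pairs for approximate probe matches.

    end_position is 1-based and marks where a matching substring of the
    reference ends; distance is the edit distance of the best alignment
    ending there. Only hits with distance <= max_dist are reported.
    """
    m = len(probe)
    column = list(range(m + 1))  # cost of aligning probe[:i] to empty text
    hits = []
    for end, base in enumerate(reference, start=1):
        new_column = [0]  # a match may start anywhere in the reference
        for i in range(1, m + 1):
            cost = 0 if probe[i - 1] == base else 1
            new_column.append(min(column[i] + 1,          # skip reference base
                                  new_column[i - 1] + 1,  # skip probe base
                                  column[i - 1] + cost))  # match or mismatch
        column = new_column
        if column[m] <= max_dist:
            hits.append((end, column[m]))
    return hits


# Made-up sequences: the probe occurs exactly at reference positions 5-14
example_reference = "TTTT" + "ACGTACGTAC" + "GGGG"
example_hits = find_binding_sites("ACGTACGTAC", example_reference, max_dist=2)
print(example_hits)
```

Neighbouring end positions around an exact hit are also reported with small non-zero distances, so a real pipeline would usually keep only the best hit per locus.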
Reanalyze Command
- --input, -i: Input results.tsv file from previous analysis
- --elbow, -e: New elbow point (number of sequences to keep)
- --probes, -p: Number of probes to generate (alternative to --elbow)
- --output-prefix, -o: Prefix for output files (optional)
License
This project is licensed under the MIT License - see the LICENSE file for details.