fragscan_ct
This Python package calculates fragment length ratios from a BAM file, using an input BED file and a reference genome file. It provides options for manipulating the input intervals and for applying GC content correction to the coverage analysis.
Features
- Fragment Ratio Calculation: The script calculates the ratio of short to long fragments based on the input BAM file.
- Interval Manipulation: Users can choose to merge or split the intervals in the input BED file, as well as pad the coordinates before binning.
- GC Content Correction: The script applies a LOWESS (Locally Weighted Scatterplot Smoothing) algorithm to correct the coverage based on the GC content of the fragments.
- Visualization: The script generates plots to visualize the fragment length distribution and the GC-corrected coverage.
- Output: The script generates a text file containing the calculated fragment counts, ratios, z-scores, and coverage information.
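To make the GC content correction feature more concrete, here is a minimal pure-Python sketch of LOWESS-style smoothing. It assumes coverage is modelled as a function of GC fraction and corrected by subtracting the fitted trend and adding back the median coverage. The function names (`lowess_fit`, `gc_correct`) and the correction formula are illustrative assumptions, not the package's implementation, which may rely on a library routine. Only the `frac` parameter mirrors the `--lowess-fraction` option.

```python
from statistics import median


def lowess_fit(x, y, frac=0.75):
    """Fit a tricube-weighted local linear regression at every x.

    Simplified sketch: no robustness iterations are performed.
    """
    n = len(x)
    k = max(2, min(n, int(round(frac * n))))  # neighbours used per local fit
    fitted = []
    for x0 in x:
        # The distance to the k-th nearest neighbour sets the local bandwidth.
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12
        # Tricube weights: points at or beyond the bandwidth get weight 0.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(gc, coverage, frac=0.75):
    """Remove the GC trend from coverage while keeping the median level.

    Assumption: corrected = observed - trend + median(observed).
    """
    trend = lowess_fit(gc, coverage, frac)
    centre = median(coverage)
    return [c - t + centre for c, t in zip(coverage, trend)]


# Made-up data: coverage rises linearly with GC content.
gc = [0.30, 0.40, 0.50, 0.60, 0.70]
coverage = [16.0, 18.0, 20.0, 22.0, 24.0]
corrected = gc_correct(gc, coverage)
print([round(c, 3) for c in corrected])
```

In this toy input the GC trend explains all of the variation, so the corrected values collapse to roughly the median coverage. A smaller `frac` makes each local fit follow fewer neighbours and therefore track the data more closely.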
Dependencies
The script requires the following Python 3 libraries:
- typer: For command-line interface
- pathlib: For handling file paths (part of the Python standard library)
- rich: For progress bar and console output
- plotly: For generating interactive plots
- pandas: For data manipulation
- numpy: For numerical operations
- scipy: For statistical functions
Usage
Main
❯ python fragscan_ct --help
Usage: fragscan_ct [OPTIONS] COMMAND [ARGS]...
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ generate-fragment-ratios The `generate_fragment_ratios` function generates a new TXT file by processing a BED file and calculating fragment length │
│ ratios from a BAM file. │
│ plot-fragment-ratios The `plot_fragment_ratios` function takes in a file of files or a list of input TXT files, reads the data from the files into │
│ a Pandas DataFrame, and plots a line plot of the "Ratio" column against the "Id" column, with different colors for each │
│ "Sample_Id". The resulting plot is saved as an HTML file. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Generate Ratios
The generate_fragment_ratios function calculates fragment ratios from a BAM file using input BED and reference genome files, with options for interval manipulation and GC correction.
❯ python fragscan_ct generate-fragment-ratios --help
Usage: fragscan_ct generate_fragment_ratios [OPTIONS]
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --reference-file -r FILE Input reference genome FASTA file to be used while traversing the BAM file [default: None] [required] │
│ * --input-bed -i FILE Input BED file to be used to traverse the BAM file [default: None] [required] │
│ * --input-bam -bam FILE Input BAM file to be used to calculate fragment length [default: None] [required] │
│ --output-txt -o TEXT Output TXT file after traversing the BAM file [default: fragment_counts.txt] │
│ * --sample-id -id TEXT Sample Identifier [default: None] [required] │
│ --merge-interval -m Merge interval in the BED file by splitting the 4th column with `:` and using the first value │
│ --split-interval -s Split the BED interval based on the BIN size specified in the `bin_size` option. │
│ --short-fragment-length -sfl <INTEGER INTEGER>... Define which fragments should be called as short fragment, provide two integers separated by a comma, the first value in the tuple is the lower bound of the fragment length range for short fragments, and the second │
│ value is the upper bound of the fragment length range for short fragments │
│ [default: 100, 150] │
│ --long-fragment-length -lfl <INTEGER INTEGER>... Define which fragments should be called as long fragment, provide two integers separated by a comma, the first value in the tuple is the lower bound of the fragment length range for long fragments, and the second │
│ value is the upper bound of the fragment length range for long fragments │
│ [default: 151, 220] │
│ --bin-size -b INTEGER Bin size to split the BED file, only used when `split_interval` is True [default: 50] │
│ --pad-size -p INTEGER Pad the coordinates with the given pad size in the BED file, before binning [default: 50] │
│ --lowess-fraction -l FLOAT When running lowess GC correction of coverage, the fraction of the data used when estimating each y-value [default: 0.75] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
The required inputs are:
- --reference-file: The reference genome FASTA file.
- --input-bed: The input BED file containing the genomic intervals of interest.
- --input-bam: The input BAM file containing the sequencing reads.
- --sample-id: The identifier for the sample being processed.
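The optional parameters allow you to customize the output, the fragment length ranges, the interval manipulation, and the GC content correction:
- --output-txt: The output TXT file (default: fragment_counts.txt).
- --short-fragment-length: The lower and upper bounds of the short fragment length range (default: 100, 150).
- --long-fragment-length: The lower and upper bounds of the long fragment length range (default: 151, 220).
- --merge-interval: Merges the intervals in the BED file by splitting the 4th column on `:` and using the first value.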
- --split-interval: Splits the intervals in the BED file based on the --bin-size parameter.
- --bin-size: The size of the bins used when --split-interval is enabled.
- --pad-size: The size of the padding applied to the coordinates in the BED file.
- --lowess-fraction: The fraction of data used for the LOWESS GC content correction.
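The interval options above can be illustrated with a small sketch. It assumes BED records are `(chrom, start, end, name)` tuples with half-open coordinates. The helper names (`pad_interval`, `split_interval`, `merge_by_name`), the clamping of padded starts at 0, the rule that merged intervals span the smallest start to the largest end, and the example names are all assumptions for illustration; the package may behave differently.

```python
def pad_interval(start, end, pad_size=50):
    """Extend an interval by pad_size on both sides, clamping the start at 0."""
    return max(0, start - pad_size), end + pad_size


def split_interval(start, end, bin_size=50):
    """Cut [start, end) into consecutive bins of at most bin_size bases."""
    return [(s, min(s + bin_size, end)) for s in range(start, end, bin_size)]


def merge_by_name(records):
    """Merge records whose 4th column shares the same prefix before ':'.

    Assumption: a merged interval runs from the smallest start to the
    largest end within its group.
    """
    merged = {}
    for chrom, start, end, name in records:
        key = (chrom, name.split(":")[0])
        if key in merged:
            s, e = merged[key]
            merged[key] = (min(s, start), max(e, end))
        else:
            merged[key] = (start, end)
    return [(chrom, s, e, name) for (chrom, name), (s, e) in merged.items()]


# Hypothetical BED records: two entries sharing the prefix "TP53".
records = [
    ("chr1", 1000, 1100, "TP53:exon1"),
    ("chr1", 1200, 1320, "TP53:exon2"),
]
print(merge_by_name(records))     # [('chr1', 1000, 1320, 'TP53')]
print(pad_interval(1000, 1320))   # (950, 1370)
print(split_interval(950, 1070))  # [(950, 1000), (1000, 1050), (1050, 1070)]
```

Note that the last bin can be shorter than `bin_size` when the interval length is not an exact multiple of it.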
Example Command:
python fragscan_ct generate-fragment-ratios \
--reference-file=hg38.fa \
--input-bed=target_regions.bed \
--input-bam=sample_data.bam \
--output-txt=fragment_counts.txt \
--sample-id=sample_1 \
--short-fragment-length=100,150 \
--long-fragment-length=151,220 \
--lowess-fraction=0.75
Output
The script generates a text file named fragment_counts.txt (or the value specified in the --output-txt option) containing the following information:
- Chromosome
- Start position
- End position
- Additional information from the BED file
- Strand
- Score
- Short fragment counts
- Long fragment counts
- Raw ratio
- Coverage for short fragments
- Coverage for long fragments
- GC content for short fragments
- GC content for long fragments
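As a rough illustration of how the short count, long count, and raw ratio columns relate, the sketch below classifies fragment lengths using the default ranges (100 to 150 for short, 151 to 220 for long) and divides the two counts. Treating both bounds as inclusive, returning NaN when there are no long fragments, and the function names are assumptions made for this example. In a BAM file, the fragment length would typically come from the absolute template length of a read pair.

```python
import math


def classify_fragment(length, short_range=(100, 150), long_range=(151, 220)):
    """Label a fragment length as 'short', 'long', or None (outside both ranges)."""
    if short_range[0] <= length <= short_range[1]:
        return "short"
    if long_range[0] <= length <= long_range[1]:
        return "long"
    return None


def fragment_ratio(lengths, short_range=(100, 150), long_range=(151, 220)):
    """Return (short count, long count, short/long ratio) for one interval."""
    labels = [classify_fragment(n, short_range, long_range) for n in lengths]
    short = labels.count("short")
    long_ = labels.count("long")
    # Assumption: report NaN rather than fail when no long fragments are seen.
    ratio = short / long_ if long_ else math.nan
    return short, long_, ratio


# Made-up fragment lengths for a single interval.
lengths = [95, 120, 140, 150, 151, 167, 180, 200, 221]
short, long_, ratio = fragment_ratio(lengths)
print(short, long_, ratio)  # 3 4 0.75
```

Fragments of 95 and 221 bases fall outside both ranges and are ignored, which is why only seven of the nine lengths contribute to the counts.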
Plot Ratios
The plot_fragment_ratios function takes in a file of files or a list of input TXT files, reads the data from the files into a Pandas DataFrame, and plots a line plot of the "Ratio" column against the "Id" column, with different colors for each "Sample_Id". The resulting plot is saved as an HTML file.
❯ python fragscan_ct plot-fragment-ratios --help
Usage: fragscan_ct plot-fragment-ratios
[OPTIONS]
╭─ Options ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --list -l PATH File of files, List of txt files to be used for plotting [default: None] │
│ --input-txt -i FILE Input TXT file that was generated using generate_fragment_counts [default: None] │
│ --output-prefix -o TEXT Output HTML file prefix for the line and box plot [default: fragment_counts] │
│ --help Show this message and exit. │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
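The input handling described above can be sketched with the standard library alone. The example assumes that a file of files holds one TXT path per line and that each TXT file is tab-delimited with a header row containing the "Id", "Ratio", and "Sample_Id" columns. The delimiter, the helper names (`read_file_of_files`, `collect_ratios`), and the demo data are assumptions; the package itself reads the data into a Pandas DataFrame.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def read_file_of_files(list_path):
    """Return the non-empty lines of a file of files as paths."""
    lines = Path(list_path).read_text().splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]


def collect_ratios(txt_paths):
    """Group (Id, Ratio) pairs by Sample_Id across all input TXT files."""
    series = defaultdict(list)
    for path in txt_paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                series[row["Sample_Id"]].append((row["Id"], float(row["Ratio"])))
    return dict(series)


# Demo with two tiny, made-up result files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    demo = {"sample_1": [0.8, 1.1], "sample_2": [0.9, 1.0]}
    for sample, ratios in demo.items():
        rows = ["Id\tRatio\tSample_Id"]
        rows += [f"bin{i}\t{r}\t{sample}" for i, r in enumerate(ratios)]
        (tmp / f"{sample}.txt").write_text("\n".join(rows) + "\n")
    fof = tmp / "inputs.list"
    fof.write_text("\n".join(str(tmp / f"{s}.txt") for s in demo) + "\n")
    series = collect_ratios(read_file_of_files(fof))

print(series)
```

Each entry of `series` corresponds to one line in the plot: the "Id" values go on the x-axis, the "Ratio" values on the y-axis, and the key supplies the per-sample color.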
Output
The command generates two interactive plots:
- Fragment length distribution
- GC-corrected coverage
These plots are saved as fragment_length_distribution.html and gc_corrected_coverage.html, respectively.
License
This project is licensed under the GPLv3 license.