AmpliconSeeK: a Python toolkit for detecting amplified genomic structures and candidate extrachromosomal DNA from sequencing data

AmpliconSeeK (ASK)

AmpliconSeeK (ASK) is a Python toolkit for detecting and reconstructing amplified genomic structures and candidate extrachromosomal DNA (ecDNA) from indexed alignment files, supporting both de novo discovery and targeted search of known ecDNA structures.

Current version: 0.1.1

Overview

Extrachromosomal DNA (ecDNA) is a dynamic form of oncogene amplification that contributes to cancer progression through high-copy gene dosage, regulatory rewiring, and cell-to-cell heterogeneity. AmpliconSeeK (ASK) is a computational framework for identifying ecDNA-associated amplicon structures from diverse high-throughput sequencing data, including WGS, WES, ChIP-seq, MNase-seq, ATAC-seq, scATAC-seq, and target-capture sequencing. ASK integrates copy-number signal from genomic bin counts with breakpoint-level evidence, including soft-clipped reads, split reads, supplementary alignments, breakpoint pairs, and junction sequences, to infer amplified segments and reconstruct candidate circular or linear amplicons. Candidate structures are annotated with genes, cancer genes, and super-enhancers and visualized with ASK-style amplicon plots.

ASK provides two main workflows:

| Workflow | Command | Description |
|---|---|---|
| De novo detection | ask | Detect amplified segments, breakpoint pairs, and candidate circular amplicons directly from a BAM file. |
| Targeted search | ask-search | Search a new BAM file for evidence supporting a known ecDNA structure. |

ASK can be applied to sequencing assays with genomic alignment signals, including WGS, WES, ChIP-seq, MNase-seq, ATAC-seq, scATAC-seq, and target-capture sequencing.

Software dependencies

  • ASK has been tested on macOS and Linux.
  • ASK uses indexed alignment files and standard Python packages.
  • Required Python packages include pysam, pandas, numpy, statsmodels, matplotlib, seaborn, scipy, and scikit-learn.

Installation

How to install Python and required packages

Install Miniconda by following https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Set up bioconda channels:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Create an environment with the required Python packages:

conda create -n ask --no-channel-priority pysam pandas numpy matplotlib statsmodels seaborn scipy scikit-learn

Activate the environment and install ASK:

conda activate ask
pip install ask-ecdna

ASK is now ready to run. Check that both commands are available:

ask --help
ask-search --help

Input data preparation

Required input data

ASK requires the following input data:

| Data Type | Required for ask | Required for ask-search | Description |
|---|---|---|---|
| BAM | Yes | Yes | Sorted and indexed alignment file |
| BAM index | Yes | Yes | .bai index file |
| Genome annotation | Recommended | Recommended | Gene annotation BED12 file |
| Cancer gene list | Optional | Optional | Cancer gene census file |
| Super-enhancer annotation | Optional | Optional | BED file for SE annotation |
| Known ecDNA structure | No | Yes | ASK circular table or manually prepared known structure |

BAM

The input alignment file should be sorted and indexed:

sample.bam
sample.bam.bai

For de novo detection, duplicate marking is recommended before running ASK.
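
Before launching a run, it can help to confirm that the index sits next to the BAM. The following is a minimal sketch using only the Python standard library; the helper name `find_bam_index` is hypothetical and not part of ASK, and it assumes the index follows the `sample.bam.bai` naming shown above.

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the path of the .bai index next to a BAM file, or None if it is missing.

    Assumption: the index is named by appending ".bai" to the BAM file name,
    as in sample.bam / sample.bam.bai.
    """
    bam = Path(bam_path)
    index = bam.with_name(bam.name + ".bai")
    return index if index.is_file() else None
```

If the function returns None, running `samtools index` on the BAM (as shown in the BAM preparation section) should create the missing file.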

Reference annotation data

ASK includes commonly used annotation files under data/. For example, with --genome hg38, ASK expects files such as:

data/hg38_refgene_process.bed12
data/se_hg38_sort.bed
data/Census_all_20200624_14_22_39.tsv

Custom annotation files can be provided manually:

--genefile /path/to/gene.bed12
--sefile /path/to/super_enhancer.bed
--cgfile /path/to/cancer_gene.tsv

The genome build used by the BAM file and annotation files should match.

Known ecDNA structure for search

ask-search accepts an ASK circular amplicon table:

*_ask_amplicon_circular.tsv

It also accepts a manually prepared known-structure table. At minimum, the table should contain:

AmpliconID Chrom Start End
circ_0 chr7 54830975 56117062

If segment-level order and strand are available, include them:

AmpliconID Chrom Start End Strand
circ_0 chr7 54830975 55200000 +
circ_0 chr7 55500000 56117062 +

ASK uses the known structure to derive reference breakpoint pairs for targeted search.
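
A manually prepared known-structure table can be generated programmatically. The sketch below writes the five columns listed above; it assumes the table is tab-separated with a header row (suggested by the .tsv extension but not stated explicitly), and the `Start < End` check is an added sanity check, not an ASK requirement. The function name `known_structure_tsv` is hypothetical.

```python
import csv
import io

# Column names follow the known-structure table described above.
FIELDS = ["AmpliconID", "Chrom", "Start", "End", "Strand"]


def known_structure_tsv(segments):
    """Serialize ordered segments into tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for seg in segments:
        # Sanity check added for this sketch only.
        if int(seg["Start"]) >= int(seg["End"]):
            raise ValueError(f"Start must be smaller than End: {seg}")
        writer.writerow(seg)
    return buffer.getvalue()


# The two segments from the example table above, in circle order.
segments = [
    {"AmpliconID": "circ_0", "Chrom": "chr7", "Start": 54830975, "End": 55200000, "Strand": "+"},
    {"AmpliconID": "circ_0", "Chrom": "chr7", "Start": 55500000, "End": 56117062, "Strand": "+"},
]
table_text = known_structure_tsv(segments)
```

Writing `table_text` to a file gives a table that can be passed to ask-search through `--circular`.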

ASK output filenames follow this convention:

{outprefix}_ask_{result_name}.tsv

For example:

sample_ask_amplicon_circular.tsv
sample_ask_breakpoint_pair.tsv
sample_ask_bin_count_norm.tsv
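
When scripting around ASK, the naming convention can be wrapped in a small helper. This is an illustrative sketch; `ask_output_path` is a hypothetical name, and the `ext` parameter is an assumption added to cover the non-TSV outputs such as `.bedgraph` and `.pdat` files.

```python
def ask_output_path(outprefix, result_name, ext="tsv"):
    """Build an ASK output file name following {outprefix}_ask_{result_name}.{ext}."""
    return f"{outprefix}_ask_{result_name}.{ext}"


circular_table = ask_output_path("sample", "amplicon_circular")
clip_track = ask_output_path("sample", "clip_count", ext="bedgraph")
```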

De novo ecDNA detection

How to run from BAM file

Run the example BAM file included in this repository:

cd /path/to/AmpliconSeeK

ask \
  -i exampledata/testdata.bam \
  -o exampledata/testdata/samplename \
  -g hg38 \
  --subseg \
  --juncread 5 \
  --SA_with_nm

Output

The command generates ASK-style output:

testdata/
├──  samplename_ask_amplicon_circular.tsv
├──  samplename_ask_amplicon_circular_stat.tsv
├──  samplename_ask_amplicon_linear.tsv
├──  samplename_ask_amplified_segment.tsv
├──  samplename_ask_bin_count.tsv
├──  samplename_ask_bin_count_norm.tsv
├──  samplename_ask_breakpoint.tsv
├──  samplename_ask_breakpoint_pair.tsv
├──  samplename_ask_breakpoint_pair_raw.tsv
├──  samplename_ask_breakpoint_seg.tsv
├──  samplename_ask_clip_count.bedgraph
├──  samplename_ask_cn_segmentation.tsv
├──  samplename_ask_sc_support_matrix.tsv
├──  samplename_ask_sc_normal_alignment_matrix.tsv
├──  samplename_ask_junctionseq
│   ├──  circ_0.tsv
│   ├──  circ_1.tsv
│   ├──  circ_2.tsv
│   └──  circ_3.tsv
├──  samplename_ask_plot
│   ├──  ampseg_0.pdf
│   ├──  circular_circ_0.pdf
│   ├──  circular_circ_1.pdf
│   ├──  circular_circ_2.pdf
│   └──  circular_circ_3.pdf
├──  samplename_ask_stats.tsv
├──  samplename_ask_step1.pdat
├──  samplename_ask_step2.pdat
├──  samplename_ask_step3.pdat
└──  samplename_ask_step4.pdat

The single-cell matrix files are generated only when real cell barcodes are detected in breakpoint-supporting reads.

Segmentation mode

ASK uses -d/--segmode to specify the input data type for copy-number segmentation; the default is standard for whole-genome-like data, while bias is recommended for coverage-biased assays such as ATAC-seq, scATAC-seq, ChIP-seq, WES, MNase-seq, and target-capture sequencing.

| Mode | Recommended data types | Description |
|---|---|---|
| standard | WGS and low-bias whole-genome-like data, including ChIP-input/input control data | Uses raw read counts in genomic bins. This is the default mode. |
| bias | Coverage-biased assays such as ATAC-seq, scATAC-seq, ChIP-seq, WES, MNase-seq, and target-capture sequencing | Uses sub-bin robust statistics and bias correction to reduce local coverage bias. |

For WGS or ChIP-input data, the default mode is usually sufficient:

ask -i sample.bam -o sample_ask/sample -g hg38 -d standard

For ATAC-seq, scATAC-seq, ChIP-seq, WES, MNase-seq, or target-capture data, use:

ask -i sample.bam -o sample_ask/sample -g hg38 -d bias
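
For batch processing, the mode choice can be encoded once and reused. The sketch below maps an assay label to the recommended `-d` value and assembles the command as an argument list. The assay labels and the function names are hypothetical conventions for this example; only the `-i`, `-o`, `-g`, and `-d` options shown above are used.

```python
# Assay labels mirror the recommendations in the segmentation-mode table.
# The exact label strings are an assumption of this sketch.
BIAS_ASSAYS = {"ATAC-seq", "scATAC-seq", "ChIP-seq", "WES", "MNase-seq", "target-capture"}
STANDARD_ASSAYS = {"WGS", "ChIP-input"}


def segmode_for(assay):
    """Return the recommended -d/--segmode value for an assay label."""
    if assay in BIAS_ASSAYS:
        return "bias"
    if assay in STANDARD_ASSAYS:
        return "standard"
    raise ValueError(f"No segmentation-mode recommendation for assay: {assay}")


def build_ask_command(bam, outprefix, assay, genome="hg38"):
    """Assemble an ask command line as a list of arguments."""
    return ["ask", "-i", bam, "-o", outprefix, "-g", genome, "-d", segmode_for(assay)]


command = build_ask_command("sample.bam", "sample_ask/sample", "ATAC-seq")
```

The resulting list can be passed to `subprocess.run(command, check=True)` once ASK is installed.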

How to prepare BAM file

Map FASTQ files to the genome:

# paired end
bwa_index=/path/to/hg38.fa
bwa mem -t 5 ${bwa_index} test_R1.fastq.gz test_R2.fastq.gz | samtools view -Shb - > test_unsorted.bam

# single end
bwa mem -t 5 ${bwa_index} test.fastq.gz | samtools view -Shb - > test_unsorted.bam

Sort and mark duplicates:

samtools fixmate --threads 5 -m test_unsorted.bam - \
    | samtools sort --threads 5 -T ./ - \
    | samtools markdup --threads 5 -T ./ -S -s - test.bam

Make index:

samtools index test.bam

Targeted ecDNA search

Use a known ecDNA structure and a new BAM. For the example data, first run the ask command above, then use its circular amplicon table as the known structure:

ask-search \
  --circular query_sample=exampledata/testdata/samplename_ask_amplicon_circular.tsv \
  --bam exampledata/testdata.bam \
  --genome hg38 \
  --min-junc-cnt 5 \
  -o exampledata/testdata_search/testdata_search

If running directly from the source tree:

python ask/ecDNA_search.py \
  --circular query_sample=exampledata/testdata/samplename_ask_amplicon_circular.tsv \
  --bam exampledata/testdata.bam \
  --genome hg38 \
  --min-junc-cnt 5 \
  -o exampledata/testdata_search/testdata_search

What search mode does

ask-search is a targeted workflow:

  1. Parse the known ecDNA structure.
  2. Derive reference breakpoint pairs from the known segments.
  3. Collect reads around relevant chromosomes and breakpoint neighborhoods.
  4. Match observed breakpoint-pair evidence to the reference breakpoint pairs.
  5. Reconstruct supported circular structures from the observed evidence.
  6. Report ASK-style outputs and Junction Concordance Score.
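
Step 2 can be illustrated with a small sketch. It is not ASK's implementation; it assumes a simple model in which segments are listed in circle order, a `+` segment is traversed from Start to End, a `-` segment from End to Start, and each junction joins the exit of one segment to the entry of the next, wrapping around at the end of the list. The function name `reference_junctions` is hypothetical.

```python
def reference_junctions(segments):
    """Derive junctions for a circle from ordered (chrom, start, end, strand) segments.

    Each junction is returned as ((chrom, exit_coord), (chrom, entry_coord)).
    Assumed model: '+' segments run Start -> End, '-' segments run End -> Start.
    """
    def entry(seg):
        chrom, start, end, strand = seg
        return (chrom, start if strand == "+" else end)

    def exit_(seg):
        chrom, start, end, strand = seg
        return (chrom, end if strand == "+" else start)

    n = len(segments)
    # The modulo closes the circle: the last segment connects back to the first.
    return [(exit_(segments[i]), entry(segments[(i + 1) % n])) for i in range(n)]


# Segments from the known-structure example table.
circ_0 = [
    ("chr7", 54830975, 55200000, "+"),
    ("chr7", 55500000, 56117062, "+"),
]
junctions = reference_junctions(circ_0)
```

Under this model a single-segment circle yields one head-to-tail junction joining its End back to its Start.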

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --circular | Yes | - | Known ecDNA structure in sample_id=known_ecDNA.tsv format |
| --bam | Yes | - | Query BAM file |
| -o, --outdir | Yes | - | Output directory |
| --outprefix | No | outdir/<bam-stem> | ASK-style output prefix |
| --genome | No | hg38 | Genome build for default annotation files |
| --target-genes | No | None | Optional comma-separated cancer genes used to filter reference structures |
| --window | No | 200 | Breakpoint-neighborhood search window in bp |
| --mapq | No | 20 | Minimum mapping quality |
| --nmmax | No | 1 | Maximum NM mismatch count |
| --min-junc-cnt | No | 1 | Minimum junction read count used before DFS circular reconstruction |
| --bpp-min-dist | No | 50 | Minimum same-chromosome breakpoint-pair distance in bp |
| --jcs-min-support | No | 5 | Minimum supporting reads required to validate one reference junction |
| --min-jcs | No | 0.5 | Circle-level JCS detection threshold |

Output

The command generates ASK-style search output:

ask_search/
├── known_breakpoint_seed.tsv
├── known_ecDNA_breakpoint_pairs.tsv
├── known_ecDNA_segments.tsv
├── sample_search_ask_alignment_sequence.tsv
├── sample_search_ask_amplicon_circular_new.tsv
├── sample_search_ask_amplicon_circular_stat_new.tsv
├── sample_search_ask_amplicon_linear.tsv
├── sample_search_ask_amplified_segment.tsv
├── sample_search_ask_bin_count.tsv
├── sample_search_ask_bin_count_norm.tsv
├── sample_search_ask_breakpoint_pair.tsv
├── sample_search_ask_breakpoint_pair_raw.tsv
├── sample_search_ask_breakpoint_seq.tsv
├── sample_search_ask_breakpoint.tsv
├── sample_search_ask_clip_count.bedgraph
├── sample_search_ask_cn_segmentation.tsv
├── sample_search_ask_jcs.tsv
├── sample_search_ask_sc_support_matrix.tsv
├── sample_search_ask_sc_normal_alignment_matrix.tsv
├── sample_search_ask_stats.tsv
├── sample_search_ask_step1.pdat
├── sample_search_ask_step2.pdat
├── sample_search_ask_step3.pdat
├── sample_search_ask_step4.pdat
├── sample_search_ask_junctionseq/
└── plot/

Output files

| File or Directory | Generated by | Description |
|---|---|---|
| *_ask_amplicon_circular.tsv | ask, ask-search | Candidate circular amplicon/ecDNA structures |
| *_ask_amplicon_circular_stat.tsv | ask, ask-search | Summary statistics for circular amplicons |
| *_ask_amplicon_linear.tsv | ask, ask-search | Candidate linear amplicon structures |
| *_ask_amplified_segment.tsv | ask, ask-search | Amplified genomic segments inferred from copy number signal |
| *_ask_breakpoint.tsv | ask, ask-search | Candidate breakpoint positions |
| *_ask_breakpoint_pair.tsv | ask, ask-search | Final breakpoint pairs used for amplicon reconstruction |
| *_ask_breakpoint_pair_raw.tsv | ask, ask-search | Raw breakpoint-pair candidates before final filtering |
| *_ask_breakpoint_seq.tsv | ask, ask-search | Breakpoint-associated sequence information |
| *_ask_alignment_sequence.tsv | ask, ask-search | Read-level alignment sequence evidence for breakpoint junctions |
| *_ask_junctionseq/ | ask, ask-search | Per-amplicon junction sequence files |
| *_ask_bin_count.tsv | ask, ask-search | Raw genomic bin counts |
| *_ask_bin_count_norm.tsv | ask, ask-search | Normalized bin counts for copy number estimation |
| *_ask_cn_segmentation.tsv | ask, ask-search | Copy number segmentation result |
| *_ask_clip_count.bedgraph | ask, ask-search | Soft-clipping evidence track |
| *_ask_sc_support_matrix.tsv | ask, ask-search | Single-cell junction-support matrix, generated only when barcodes are detected |
| *_ask_sc_normal_alignment_matrix.tsv | ask, ask-search | Single-cell normal-alignment matrix, generated only when barcodes are detected |
| *_ask_stats.tsv | ask, ask-search | Run-level summary statistics |
| *_ask_step1.pdat to *_ask_step4.pdat | ask, ask-search | Intermediate cache files |
| *_ask_jcs.tsv | ask-search | Junction Concordance Score summary |
| known_ecDNA_segments.tsv | ask-search | Parsed known ecDNA segments used as the search target |
| known_ecDNA_breakpoint_pairs.tsv | ask-search | Reference breakpoint pairs derived from the known structure |
| known_breakpoint_seed.tsv | ask-search | Breakpoint seed table used for targeted evidence collection |
| plot/ | ask, ask-search | Amplicon visualization figures |

File formats

Amplicon tables

*_ask_amplicon_circular.tsv and *_ask_amplicon_linear.tsv report reconstructed circular and linear amplicons. Each row is one segment assigned to an amplicon.

Column Description
Chrom Chromosome of the segment.
Start, End Segment genomic coordinates.
Strand Segment orientation in the reconstructed structure.
SplitCount Number of split/junction-supporting reads associated with the segment.
CN Segment copy number estimate.
AmpliconID Reconstructed amplicon identifier, such as circ_0 or line_0.
Gene Genes overlapping the segment.
CancerGene Cancer genes overlapping the segment.
SE Super-enhancer annotations overlapping the segment.
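
Because each row is one segment, per-amplicon summaries come from grouping rows by AmpliconID. The sketch below does this with the standard library. It assumes the file is tab-separated with a header row using the column names above, and it takes segment length as `End - Start`, which depends on a coordinate convention this document does not state. The toy rows are illustrative values, not real ASK output, and `summarize_amplicons` is a hypothetical name.

```python
import csv
import io
from collections import defaultdict


def summarize_amplicons(tsv_text):
    """Group amplicon-table rows by AmpliconID and sum segment counts and lengths."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = defaultdict(lambda: {"segments": 0, "length": 0})
    for row in reader:
        entry = summary[row["AmpliconID"]]
        entry["segments"] += 1
        # Assumption: length is End - Start (coordinate convention not documented here).
        entry["length"] += int(row["End"]) - int(row["Start"])
    return dict(summary)


# Toy table for illustration only; SplitCount and CN values are made up.
toy_table = (
    "Chrom\tStart\tEnd\tStrand\tSplitCount\tCN\tAmpliconID\n"
    "chr7\t54830975\t55200000\t+\t10\t20\tcirc_0\n"
    "chr7\t55500000\t56117062\t+\t12\t22\tcirc_0\n"
)
summary = summarize_amplicons(toy_table)
```

For a real run, the text would come from reading `*_ask_amplicon_circular.tsv` or `*_ask_amplicon_linear.tsv`; because rows are accessed by column name, the column order in the file does not matter.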

*_ask_amplicon_circular_stat.tsv summarizes each circular amplicon.

Column Description
AmpliconID Circular amplicon identifier.
Chrom1, Start, Chrom2, End Outer genomic span used to summarize the amplicon.
Seg_num Number of segments in the circle.
Length Total segment length.
SplitCount_sum, SplitCount_mean, SplitCount_std Junction-support read count summary across segments.
CN_sum, CN_mean, CN_std Copy number summary across segments.
FCleft_sum, FCright_sum Copy-number fold-change evidence at left and right boundaries.
invCNCV_sum, invCNCV_mean Inverse copy-number coefficient-of-variation score; larger values indicate smoother CN.
invSplitCV Inverse split-read coefficient-of-variation score.
Gene_num Number of genes overlapping the amplicon.
Cancergene_num Number of cancer genes overlapping the amplicon.
SE_num Number of super-enhancer annotations overlapping the amplicon.
FCleft_mean_1, FCright_mean_1 Mean boundary fold-change values used in scoring.
Score Final amplicon score.

Copy number tables

*_ask_bin_count.tsv contains raw bin-level read counts.

Column Description
Chrom Chromosome.
Coord Bin coordinate.
Count Read count in the bin.
CN Copy number estimate for the bin.

*_ask_bin_count_norm.tsv contains normalized bin counts.

Column Description
Chrom, Coord, Count, CN Same as *_ask_bin_count.tsv, after normalization.
Log2Ratio Log2 copy-number ratio used for segmentation.

*_ask_cn_segmentation.tsv reports copy-number segments.

Column Description
Chrom, Start, End Copy-number segment coordinates.
Count Segment-level read count summary.
CN Segment copy number estimate.
Log2Ratio Segment log2 copy-number ratio.

*_ask_amplified_segment.tsv reports amplified segments selected from copy-number and breakpoint evidence.

Column Description
Chrom, Start, End Amplified segment coordinates.
Count Segment read count summary.
CN Segment copy number estimate.
ClipLeft, ClipRight Clipped-read support at left and right segment boundaries.
Gene Genes overlapping the segment.
CancerGene Cancer genes overlapping the segment.

*_ask_breakpoint_seg.tsv stores breakpoint-derived segments used during graph construction.

Column Description
Chrom, Start, End Breakpoint-derived segment coordinates.
CN Copy number assigned to the segment.

Breakpoint tables

*_ask_breakpoint.tsv reports candidate breakpoint positions.

Column Description
Chrom Chromosome of the breakpoint.
Coord Breakpoint coordinate.
Clip Breakpoint side: L for left-clipped boundary or R for right-clipped boundary.
CleanBP Whether the breakpoint passes clean-breakpoint filtering.
ClipDepth Number of clipped reads supporting the breakpoint.
InDepth, OutDepth Local read-depth summaries inside and outside the breakpoint.

*_ask_breakpoint_pair_raw.tsv and *_ask_breakpoint_pair.tsv report breakpoint-pair evidence. The raw file contains candidate pairs before final reconstruction filtering; the final file contains pairs used in amplicon reconstruction.

Column Description
Chrom1, Coord1, Clip1 First breakpoint side.
Chrom2, Coord2, Clip2 Second breakpoint side.
Count Supporting read count for the breakpoint pair.
offset Junction offset. Negative values indicate overlap between the two breakpoint-side sequences; positive values indicate an insertion/gap.
Seq Junction sequence or support type such as PE_Support.
Readbarcode Cell barcodes supporting the breakpoint pair. Empty lists indicate no single-cell barcode was detected.
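
The sign convention of the offset column can be captured in a small helper. This sketch follows the description above for negative and positive values; treating an offset of zero as a junction with neither overlap nor gap is an assumption, and the returned labels and the name `classify_offset` are hypothetical.

```python
def classify_offset(offset):
    """Interpret a junction offset from a breakpoint-pair table."""
    if offset < 0:
        return "overlap"           # the two breakpoint-side sequences overlap
    if offset > 0:
        return "insertion_or_gap"  # extra sequence or a gap at the junction
    return "exact"                 # assumption: zero means neither overlap nor gap
```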

Single-cell matrix files

For single-cell assays, ASK automatically writes two barcode-level matrices when breakpoint-pair barcodes are detected. The two matrices always have the same rows and columns; missing values are filled with 0.

File Description
*_ask_sc_support_matrix.tsv Junction-support matrix. Each row is a breakpoint pair and each barcode column stores the number of junction-supporting reads in that cell.
*_ask_sc_normal_alignment_matrix.tsv Normal-alignment matrix. Each row is a breakpoint pair and each barcode column stores the number of normal reads spanning the breakpoint positions in that cell.

Both matrices use the following columns:

Column Description
JunctionID Breakpoint-pair coordinate identifier, formatted as `Chrom1:Coord1:Clip1
AmpliconID Amplicon containing the junction.
<barcode> One column per cell barcode. Values are read counts.

Normal-alignment reads must span the breakpoint coordinate and must not be supplementary, secondary, SA-tagged, or clipped near the breakpoint.
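
The criteria above can be sketched as a read-level predicate. This is not ASK's code; it is written against the attribute names of pysam's `AlignedSegment` (`is_secondary`, `is_supplementary`, `has_tag`, `reference_start`, `reference_end`, `cigartuples`) so it works on any object exposing them. The `clip_window` parameter, its default, the 0-based breakpoint coordinate, and the unmapped-read check are assumptions of this sketch.

```python
SOFT_CLIP, HARD_CLIP = 4, 5  # CIGAR operation codes as reported in pysam cigartuples


def is_normal_alignment(read, breakpoint, clip_window=10):
    """Return True if a read spans the breakpoint without junction-like evidence.

    `breakpoint` is assumed to be 0-based, matching pysam's half-open
    [reference_start, reference_end) interval. `clip_window` (hypothetical)
    defines how close a clip must be to count as "near the breakpoint".
    """
    if read.is_unmapped or read.is_secondary or read.is_supplementary:
        return False
    if read.has_tag("SA"):
        return False
    if not (read.reference_start <= breakpoint < read.reference_end):
        return False
    cigar = read.cigartuples or []
    clipped = (SOFT_CLIP, HARD_CLIP)
    # A leading clip sits at reference_start; a trailing clip at reference_end.
    if cigar and cigar[0][0] in clipped and abs(read.reference_start - breakpoint) <= clip_window:
        return False
    if cigar and cigar[-1][0] in clipped and abs(read.reference_end - breakpoint) <= clip_window:
        return False
    return True
```

With pysam installed, the predicate could be applied to reads returned by `AlignmentFile.fetch` around each breakpoint coordinate.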

Run summary

*_ask_stats.tsv is a plain-text run summary rather than a standard table. It records counts such as the number of amplified segments, breakpoint pairs, circular amplicons, and linear amplicons.

JCS file

*_ask_jcs.tsv is generated by targeted search mode.

Column Description
CircleID Reference circle identifier
total_reference_junctions Number of reference junctions derived from the known ecDNA structure
validated_junctions Number of reference junctions supported in the query BAM
total_support_reads Total supporting reads across validated junctions
JCS Junction Concordance Score
Detected Whether the circle passes the JCS threshold

JCS is computed as:

JCS = validated reference junctions / total reference junctions

By default, a reference junction is considered validated when it has at least five supporting reads, and a circle is marked detected when JCS > 0.5.
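
As a worked example of the formula, the sketch below computes the JCS columns from a list of per-junction supporting-read counts. It uses the defaults stated above (at least five reads per junction, detection when JCS > 0.5) and follows the strict inequality in the text. The function name `junction_concordance` and its input format are hypothetical, not part of the ASK API.

```python
def junction_concordance(support_counts, min_support=5, min_jcs=0.5):
    """Compute JCS fields from supporting-read counts, one count per reference junction."""
    total = len(support_counts)
    if total == 0:
        raise ValueError("At least one reference junction is required")
    validated = [count for count in support_counts if count >= min_support]
    jcs = len(validated) / total
    return {
        "total_reference_junctions": total,
        "validated_junctions": len(validated),
        "total_support_reads": sum(validated),  # reads across validated junctions only
        "JCS": jcs,
        "Detected": jcs > min_jcs,
    }


# Four reference junctions, two of which reach the five-read minimum.
result = junction_concordance([12, 7, 3, 0])
```

Here two of four junctions are validated, so JCS is 0.5 and the circle is not marked detected under a strict `JCS > 0.5` rule; a third validated junction would raise JCS to 0.75 and flip the call.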

Algorithm overview

ASK reconstructs amplified structures from coverage and breakpoint evidence:

  1. Alignment evidence extraction from an indexed BAM.
  2. Read counting in genomic bins.
  3. Copy number normalization and segmentation.
  4. Amplified segment detection.
  5. Breakpoint detection from clipping and supplementary-alignment evidence.
  6. Breakpoint-pair construction.
  7. Graph-based circular and linear amplicon reconstruction.
  8. Gene, cancer gene, and super-enhancer annotation.
  9. ASK-style visualization.

The targeted ask-search workflow follows the same evidence model but constrains the initial evidence collection using a known ecDNA structure.
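
Steps 2 and 3 can be pictured with a toy example. The sketch below counts read start positions in fixed-size bins and converts counts to log2 ratios against the median bin. It is a generic illustration, not ASK's normalization: the bin size, the median baseline, the pseudocount, and both function names are assumptions.

```python
import math
from collections import Counter
from statistics import median


def bin_counts(read_starts, bin_size):
    """Count reads per (chrom, bin_start) from (chrom, position) pairs."""
    counts = Counter()
    for chrom, pos in read_starts:
        counts[(chrom, pos // bin_size * bin_size)] += 1
    return counts


def log2_ratios(counts, pseudocount=1.0):
    """Log2 ratio of each bin against the median bin count (assumed baseline)."""
    baseline = median(counts.values())
    return {
        bin_key: math.log2((count + pseudocount) / (baseline + pseudocount))
        for bin_key, count in counts.items()
    }


# Toy read positions for illustration only.
reads = [("chr1", 10), ("chr1", 990), ("chr1", 1500), ("chr2", 5)]
counts = bin_counts(reads, bin_size=1000)
ratios = log2_ratios(counts)
```

Bins with a positive log2 ratio have more reads than the median bin, which is the kind of signal that segmentation and amplified-segment detection build on.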

Checkpointing and modular usage

ASK writes intermediate .pdat files:

File Stage
*_ask_step1.pdat Alignment evidence and bin counts
*_ask_step2.pdat Copy number and amplified segment detection
*_ask_step3.pdat Breakpoint-pair detection
*_ask_step4.pdat Amplicon reconstruction

These files are useful for debugging, rerunning downstream steps, and comparing parameter choices. When rerunning from scratch, use a fresh output prefix or remove incompatible intermediate files.
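
Removing stale checkpoints before a fresh run can be scripted. The sketch below finds and deletes the `*_ask_step*.pdat` files for a given output prefix using the naming convention shown above; the function names are hypothetical, and whether ASK reuses existing checkpoints automatically is not assumed here.

```python
import glob
from pathlib import Path


def checkpoint_files(outprefix):
    """List intermediate .pdat files written for an ASK output prefix."""
    prefix = Path(outprefix)
    # glob.escape protects prefixes containing characters such as '[' or '*'.
    pattern = glob.escape(prefix.name) + "_ask_step*.pdat"
    return sorted(prefix.parent.glob(pattern))


def clear_checkpoints(outprefix):
    """Delete the intermediate .pdat files and return the removed paths."""
    removed = checkpoint_files(outprefix)
    for path in removed:
        path.unlink()
    return removed
```

Only the checkpoint files are touched; result tables such as `*_ask_stats.tsv` under the same prefix are left in place.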

ASK can also be used modularly:

Use Case Suggested Entry Point
Start from BAM ask
Start from known ecDNA structure ask-search
Compare one reference ecDNA across samples Run ask-search once per query BAM
Replot existing ASK outputs Use the generated circular, linear, copy number, and bin-count tables
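
Comparing one reference ecDNA across samples amounts to one ask-search call per query BAM. The sketch below builds those command lines using only options documented in the parameter table (`--circular`, `--bam`, `--genome`, `-o`). The function name, the per-sample output directory layout, and the example file names are hypothetical.

```python
from pathlib import Path


def build_search_commands(known_structure, reference_id, bams, outroot, genome="hg38"):
    """Build one ask-search argument list per query BAM."""
    commands = []
    for bam in bams:
        # Assumption: one output directory per BAM, named after the BAM file stem.
        outdir = Path(outroot) / Path(bam).stem
        commands.append([
            "ask-search",
            "--circular", f"{reference_id}={known_structure}",
            "--bam", str(bam),
            "--genome", genome,
            "-o", str(outdir),
        ])
    return commands


commands = build_search_commands(
    "reference_ask_amplicon_circular.tsv",
    "reference",
    ["tumor_A.bam", "tumor_B.bam"],
    "search",
)
```

Each list can be run with `subprocess.run(command, check=True)`, after which the per-sample `*_ask_jcs.tsv` files can be compared.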

License

ASK is released under the MIT License. See LICENSE for details.

Contact

For questions and feedback, please open an issue on GitHub or contact Nana Wei (nnwei@shsmu.edu.cn).
