ecDNA Analysis Toolkit - Deep learning-based extrachromosomal DNA prediction

Maintainers

shixiangwang

otk: ecDNA Analysis Toolkit

otk (ecDNA Analysis Toolkit) is a deep learning-based tool for analyzing extrachromosomal DNA (ecDNA). At the gene level, it predicts whether each gene is an ecDNA cargo gene; at the sample level, it classifies the focal amplification type.

Core Features

Deep learning-based ecDNA cargo gene prediction
Sample-level focal amplification type classification
Support for analysis from BAM files or processed copy number data
Efficient command-line interface
GPU acceleration support

Technology Stack

Python 3.8+
PyTorch 2.0+
NumPy
Pandas
scikit-learn
Click (command-line interface)

Installation Guide

From Source

Clone the repository:

git clone https://github.com/WangLabCSU/otk.git
cd otk

Install with pip:

pip install -e .

Dependencies

The following dependencies will be installed automatically:

pandas>=2.0
numpy>=1.24
torch>=2.0
scikit-learn>=1.3
tqdm>=4.65
click>=8.1
matplotlib>=3.7
seaborn>=0.12
pyyaml>=6.0

Usage

otk provides two main command-line subcommands: train and predict.

Model Training

Use the otk train command to train the model:

otk train --config configs/model_config.yml --output models/ --gpu 0

Parameters:

--config, -c: Path to configuration file (default: configs/model_config.yml)
--output, -o: Output directory for trained models (default: models/)
--gpu, -g: GPU device ID to use (default: 0)

Model Prediction

Use the otk predict command for predictions:

otk predict --model models/best_model.pth --input data/test_data.csv --output predictions/ --gpu -1

Parameters:

--model, -m: Path to trained model (required)
--input, -i: Path to input data file (required)
--output, -o: Output directory for predictions (default: predictions/)
--gpu, -g: GPU device ID to use (default: -1, i.e., use CPU)

Data Format

Input Data Format

Input data should be in CSV format with the following columns:

Required identifier columns:

sample: Tumor sample ID
gene_id: Gene ID

Copy number features:

segVal: Total copy number of the gene
minor_cn: Minor-allele copy number of the gene
intersect_ratio: Proportion of overlap between copy number detection segment and gene region

Sample-level genomic features (same value for all genes in a sample):

purity: Tumor purity estimate
ploidy: Tumor genome ploidy estimate
AScore: Aneuploidy score
pLOH: Proportion of genome with loss of heterozygosity (LOH)
cna_burden: Proportion of genome with copy number alterations

Copy number signature features:

CN1 to CN19: 19 copy number signature activity estimates

Clinical features:

age: Patient age
gender: Patient gender (0/1 encoded)

Tumor type features (one-hot encoded, 24 cancer types):

type_BLCA, type_BRCA, type_CESC, type_COAD, type_DLBC, type_ESCA, type_GBM, type_HNSC
type_KICH, type_KIRC, type_KIRP, type_LGG, type_LIHC, type_LUAD, type_LUSC, type_OV
type_PRAD, type_READ, type_SARC, type_SKCM, type_STAD, type_THCA, type_UCEC, type_UVM

Gene frequency features:

freq_Linear: Prior estimated frequency of gene in linear focal amplifications
freq_BFB: Prior estimated frequency of gene in breakage-fusion-bridge (BFB) events
freq_Circular: Prior estimated frequency of gene in circular focal amplifications (ecDNA)
freq_HR: Prior estimated frequency of gene in heavily rearranged (HR) focal amplifications

Target column (for training data):

y: Binary label indicating whether the gene is detected as an ecDNA cargo gene (1) or not (0)
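A quick way to catch a malformed input file before training or prediction is to compare its header against the columns listed above. The sketch below is not part of otk: the helper name missing_columns and the grouping into constants are hypothetical, and only the column names themselves come from this section.

```python
from typing import Iterable, List

# Column names are copied from the list above; the grouping is illustrative.
ID_COLS = ["sample", "gene_id"]
CN_COLS = ["segVal", "minor_cn", "intersect_ratio"]
SAMPLE_COLS = ["purity", "ploidy", "AScore", "pLOH", "cna_burden"]
SIGNATURE_COLS = [f"CN{i}" for i in range(1, 20)]  # CN1 .. CN19
CLINICAL_COLS = ["age", "gender"]
TUMOR_TYPES = [
    "BLCA", "BRCA", "CESC", "COAD", "DLBC", "ESCA", "GBM", "HNSC",
    "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "OV",
    "PRAD", "READ", "SARC", "SKCM", "STAD", "THCA", "UCEC", "UVM",
]
TYPE_COLS = [f"type_{t}" for t in TUMOR_TYPES]
FREQ_COLS = ["freq_Linear", "freq_BFB", "freq_Circular", "freq_HR"]

EXPECTED_COLS = (ID_COLS + CN_COLS + SAMPLE_COLS + SIGNATURE_COLS
                 + CLINICAL_COLS + TYPE_COLS + FREQ_COLS)


def missing_columns(columns: Iterable[str], training: bool = False) -> List[str]:
    """Return the expected column names that are absent from `columns`.

    Hypothetical helper: set training=True to also require the target column y.
    """
    present = set(columns)
    expected = EXPECTED_COLS + (["y"] if training else [])
    return [name for name in expected if name not in present]
```

With pandas, the header alone can be read via pd.read_csv(path, nrows=0).columns and passed to missing_columns; an empty result means every listed column is present.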

Output Data Format

Prediction results are saved as a CSV file with the following columns:

Gene-level predictions:

sample: Tumor sample ID
gene_id: Gene ID
prediction_prob: Probability of being an ecDNA cargo gene (0-1)
prediction: Binary classification result (0 = not ecDNA cargo, 1 = ecDNA cargo)

Sample-level predictions:

sample_level_prediction_label: Sample-level focal amplification type classification:
- nofocal: No focal amplification detected
- noncircular: Non-circular focal amplification detected
- circular: Circular focal amplification (ecDNA) detected
sample_level_prediction: Numerical encoding of sample-level classification (0 = nofocal, 1 = noncircular, 2 = circular)

Note: Sample-level classification follows these rules:

If any gene in the sample is predicted as ecDNA cargo (prediction = 1), the sample is classified as circular
If no gene is predicted as ecDNA cargo but any gene has segVal > ploidy + 2, the sample is classified as noncircular
Otherwise, the sample is classified as nofocal
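The three rules above can be sketched in a few lines of plain Python. This is an illustration of the stated rules, not otk's own code: the function name classify_samples and the row-as-dictionary layout are assumptions, and it presumes that segVal and ploidy from the input file have already been joined onto the gene-level predictions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Numerical encoding as documented for sample_level_prediction.
LABEL_CODES = {"nofocal": 0, "noncircular": 1, "circular": 2}


def classify_samples(rows: Iterable[Mapping]) -> Dict[str, str]:
    """Return one label per sample from gene-level rows.

    Each row must provide the keys: sample, prediction, segVal, ploidy.
    """
    has_cargo = defaultdict(bool)    # any gene with prediction == 1
    has_high_cn = defaultdict(bool)  # any gene with segVal > ploidy + 2
    for row in rows:
        sample = row["sample"]
        has_cargo[sample] |= row["prediction"] == 1
        has_high_cn[sample] |= row["segVal"] > row["ploidy"] + 2

    labels = {}
    for sample in has_cargo:
        if has_cargo[sample]:
            labels[sample] = "circular"
        elif has_high_cn[sample]:
            labels[sample] = "noncircular"
        else:
            labels[sample] = "nofocal"
    return labels


example_rows = [
    {"sample": "S1", "prediction": 1, "segVal": 12.0, "ploidy": 2.0},
    {"sample": "S1", "prediction": 0, "segVal": 2.0, "ploidy": 2.0},
    {"sample": "S2", "prediction": 0, "segVal": 6.5, "ploidy": 2.1},
    {"sample": "S3", "prediction": 0, "segVal": 4.0, "ploidy": 2.0},
]
example_labels = classify_samples(example_rows)
```

In this toy input, S3 stays nofocal because the comparison is strict: a segVal of 4.0 is not greater than ploidy + 2 = 4.0.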

Model Architecture

otk supports multiple model architectures behind a unified interface:

Available Models

Model	Type	Description
xgb_new	XGBoost	Optimized with feature engineering
xgb_paper	XGBoost	Paper reproduction (11 features)
baseline_mlp	Neural Network	Simple MLP baseline
transformer	Neural Network	Transformer architecture
deep_residual	Neural Network	Deep residual network
optimized_residual	Neural Network	Optimized residual network
dgit_super	Neural Network	Deep gated interaction transformer
tabpfn	TabPFN	TabPFN ensemble

Unified Interface

All models inherit from BaseEcDNAModel and provide:

fit(X_train, y_train, X_val, y_val) - Training
predict_proba(X) - Probability prediction
predict(X) - Binary prediction
save(path) / load(path) - Persistence
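To make the shape of this interface concrete, here is a minimal stand-in written from the method list above. It is not the BaseEcDNAModel shipped with otk, whose signatures, defaults and return types may differ. The class names BaseEcDNAModelSketch and PriorRateModel, the 0.5 threshold and the JSON persistence are all assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod
from typing import List


class BaseEcDNAModelSketch(ABC):
    """Illustrative stand-in for the interface listed above (not otk's class)."""

    @abstractmethod
    def fit(self, X_train, y_train, X_val=None, y_val=None):
        """Train the model; validation data is optional in this sketch."""

    @abstractmethod
    def predict_proba(self, X) -> List[float]:
        """Return one probability per row of X."""

    def predict(self, X, threshold: float = 0.5) -> List[int]:
        # Binary prediction by thresholding; 0.5 is an assumed default.
        return [int(p >= threshold) for p in self.predict_proba(X)]

    def save(self, path: str) -> None:
        # Assumes all attributes are JSON-serializable.
        with open(path, "w") as fh:
            json.dump(self.__dict__, fh)

    @classmethod
    def load(cls, path: str):
        # Assumes the subclass can be constructed without arguments.
        model = cls()
        with open(path) as fh:
            model.__dict__.update(json.load(fh))
        return model


class PriorRateModel(BaseEcDNAModelSketch):
    """Toy model: predicts the training positive rate for every row."""

    def __init__(self):
        self.rate = 0.0

    def fit(self, X_train, y_train, X_val=None, y_val=None):
        self.rate = sum(y_train) / len(y_train)
        return self

    def predict_proba(self, X) -> List[float]:
        return [self.rate for _ in X]
```

Because every model exposes the same four operations, evaluation code can be written once against the base class and reused for each entry in the table of available models.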

Data Split

All models use the same data split (80/10/10), generated with seed=2026 for reproducibility.
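A seeded 80/10/10 split can be sketched as follows. The text does not say whether otk splits by row or by sample, whether it stratifies, or which random generator it uses, so the function below only illustrates the idea of a fixed-seed split. Treating the three parts as train, validation and test is an assumption, and the name split_80_10_10 is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_80_10_10(items: Sequence, seed: int = 2026) -> Tuple[List, List, List]:
    """Shuffle with a fixed seed, then cut into 80% / 10% / 10% parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    n_first = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_second = len(shuffled) // 10
    return (shuffled[:n_first],
            shuffled[n_first:n_first + n_second],
            shuffled[n_first + n_second:])


train_ids, val_ids, test_ids = split_80_10_10(range(100))
```

Splitting a list of sample IDs rather than individual rows keeps all genes of one tumor in the same part, which avoids leaking sample-level features between parts. Whether otk does this is not stated here.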

Training Script

Use the unified training script:

# Train single model
python train_unified.py --model xgb_new

# Train all models
python train_unified.py --all

Configuration File

Model configuration uses YAML format, with example configuration files located in configs/. You can modify parameters in the configuration files as needed, such as model architecture and training parameters.

Examples

Training Examples

# Train model with default configuration
otk train

# Train model with custom configuration file
otk train --config my_config.yml

Prediction Examples

# Make predictions using a trained model
otk predict --model models/best_model.pth --input test_data.csv

Performance Metrics

The following performance metrics are recorded during model training:

auPRC (Area under Precision-Recall Curve)
AUC (Area under ROC Curve)
F1 Score
Precision
Recall
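For reference, the five metrics above can be computed from labels and predicted probabilities with scikit-learn, which is already a dependency. This is a sketch rather than otk's evaluation code: the function name summarize_metrics and the 0.5 decision threshold are assumptions, and average_precision_score is used as one common estimator of auPRC.

```python
from typing import Dict, Sequence

from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)


def summarize_metrics(y_true: Sequence[int], y_prob: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Compute ranking metrics from probabilities and thresholded metrics from labels."""
    y_pred = [int(p >= threshold) for p in y_prob]  # assumed decision threshold
    return {
        "auPRC": float(average_precision_score(y_true, y_prob)),
        "AUC": float(roc_auc_score(y_true, y_prob)),
        "F1": float(f1_score(y_true, y_pred)),
        "Precision": float(precision_score(y_true, y_pred)),
        "Recall": float(recall_score(y_true, y_pred)),
    }


metrics = summarize_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

auPRC and AUC depend only on how the probabilities rank the genes, while F1, Precision and Recall change with the chosen threshold. With cargo genes being rare relative to all genes, auPRC is usually the more informative of the two ranking metrics.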

Contribution Guide

We welcome community contributions! If you have any questions or suggestions, please submit them through GitHub Issues.

Development Process

Fork the repository
Create a feature branch
Implement features or fix bugs
Run tests
Submit a Pull Request

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

If you use otk in your research, please cite the following paper:

Wang, S., Wu, C. Y., He, M. M., Yong, J. X., Chen, Y. X., Qian, L. M., ... & Zhao, Q. (2024). Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer. Nature Communications, 15(1), 1-17.

Contact

Project homepage: https://github.com/WangLabCSU/otk
Email: wangshx@csu.edu.cn

Release history

1.0.2: Apr 25, 2026
1.0.1: Apr 25, 2026
1.0.0: Apr 25, 2026 (this version)

Download files

Source Distribution

otk_ecdna-1.0.0.tar.gz (65.9 MB)

Uploaded Apr 25, 2026 Source

Built Distribution

otk_ecdna-1.0.0-py3-none-any.whl (64.3 MB)

Uploaded Apr 25, 2026 Python 3

File details

Details for the file otk_ecdna-1.0.0.tar.gz.

File metadata

Download URL: otk_ecdna-1.0.0.tar.gz
Upload date: Apr 25, 2026
Size: 65.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for otk_ecdna-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`1346a744a5248108e1bca62d9c6ee97865b7efaf1f6bc2eb05e90589a5b42497`
MD5	`d1b505323a89c9c2fa3fc0956bf97209`
BLAKE2b-256	`b6828a60b3edc3f41684648f0fb77f21348477c3e093bf0cac2fe5a42d56c9f9`

Provenance

The following attestation bundles were made for otk_ecdna-1.0.0.tar.gz:

Publisher: publish.yml on WangLabCSU/otk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: otk_ecdna-1.0.0.tar.gz
- Subject digest: 1346a744a5248108e1bca62d9c6ee97865b7efaf1f6bc2eb05e90589a5b42497
- Sigstore transparency entry: 1377154364
- Sigstore integration time: Apr 25, 2026
Source repository:
- Permalink: WangLabCSU/otk@92e22e88582ad282606517cbc8e40335e0448632
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/WangLabCSU
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@92e22e88582ad282606517cbc8e40335e0448632
- Trigger Event: push

File details

Details for the file otk_ecdna-1.0.0-py3-none-any.whl.

File metadata

Download URL: otk_ecdna-1.0.0-py3-none-any.whl
Upload date: Apr 25, 2026
Size: 64.3 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for otk_ecdna-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`384dc89b59f17d5e53be6797fa8e88e72c62241e4f59c25fc4a77407fc784fbf`
MD5	`468ccb5a360c4deb460e789f514794f7`
BLAKE2b-256	`d1fec1ca7e2da59edb9e9c5caa36253e65683868282108e85445509e154eca48`

Provenance

The following attestation bundles were made for otk_ecdna-1.0.0-py3-none-any.whl:

Publisher: publish.yml on WangLabCSU/otk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: otk_ecdna-1.0.0-py3-none-any.whl
- Subject digest: 384dc89b59f17d5e53be6797fa8e88e72c62241e4f59c25fc4a77407fc784fbf
- Sigstore transparency entry: 1377154443
- Sigstore integration time: Apr 25, 2026
Source repository:
- Permalink: WangLabCSU/otk@92e22e88582ad282606517cbc8e40335e0448632
- Branch / Tag: refs/tags/v1.0.0
- Owner: https://github.com/WangLabCSU
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@92e22e88582ad282606517cbc8e40335e0448632
- Trigger Event: push

otk-ecdna 1.0.0
