ecDNA Analysis Toolkit - Deep learning-based extrachromosomal DNA prediction
otk: ecDNA Analysis Toolkit
otk (ecDNA Analysis Toolkit) is a deep learning-based tool for analyzing extrachromosomal DNA (ecDNA). At the gene level, it predicts whether each gene is an ecDNA cargo gene; at the sample level, it classifies the focal amplification type.
Core Features
- Deep learning-based ecDNA cargo gene prediction
- Sample-level focal amplification type classification
- Support for analysis from BAM files or processed copy number data
- Efficient command-line interface
- GPU acceleration support
Technology Stack
- Python 3.8+
- PyTorch 2.0+
- NumPy
- Pandas
- scikit-learn
- Click (command-line interface)
Installation Guide
From Source
- Clone the repository:
git clone https://github.com/WangLabCSU/otk.git
cd otk
- Install with pip:
pip install -e .
Dependencies
The following dependencies will be installed automatically:
- pandas>=2.0
- numpy>=1.24
- torch>=2.0
- scikit-learn>=1.3
- tqdm>=4.65
- click>=8.1
- matplotlib>=3.7
- seaborn>=0.12
- pyyaml>=6.0
Usage
otk provides two main command-line subcommands: train and predict.
Model Training
Use the otk train command to train the model:
otk train --config configs/model_config.yml --output models/ --gpu 0
Parameters:
- --config, -c: Path to configuration file (default: configs/model_config.yml)
- --output, -o: Output directory for trained models (default: models/)
- --gpu, -g: GPU device ID to use (default: 0)
Model Prediction
Use the otk predict command for predictions:
otk predict --model models/best_model.pth --input data/test_data.csv --output predictions/ --gpu -1
Parameters:
- --model, -m: Path to trained model (required)
- --input, -i: Path to input data file (required)
- --output, -o: Output directory for predictions (default: predictions/)
- --gpu, -g: GPU device ID to use (default: -1, i.e., use CPU)
Data Format
Input Data Format
Input data should be in CSV format with the following columns:
Required identifier columns:
- sample: Tumor sample ID
- gene_id: Gene ID
Copy number features:
- segVal: Total gene copy number
- minor_cn: Minor gene copy number
- intersect_ratio: Proportion of overlap between copy number detection segment and gene region
Sample-level genomic features (same value for all genes in a sample):
- purity: Tumor purity estimate
- ploidy: Tumor genome ploidy estimate
- AScore: Aneuploidy score
- pLOH: Proportion of genome with loss of heterozygosity (LOH)
- cna_burden: Proportion of genome with copy number alterations
Copy number signature features:
- CN1 to CN19: 19 copy number signature activity estimates
Clinical features:
- age: Patient age
- gender: Patient gender (0/1 encoded)
Tumor type features (one-hot encoded, 24 cancer types):
- type_BLCA, type_BRCA, type_CESC, type_COAD, type_DLBC, type_ESCA, type_GBM, type_HNSC
- type_KICH, type_KIRC, type_KIRP, type_LGG, type_LIHC, type_LUAD, type_LUSC, type_OV
- type_PRAD, type_READ, type_SARC, type_SKCM, type_STAD, type_THCA, type_UCEC, type_UVM
Gene frequency features:
- freq_Linear: Prior estimated frequency of gene in linear focal amplifications
- freq_BFB: Prior estimated frequency of gene in breakage-fusion-bridge (BFB) events
- freq_Circular: Prior estimated frequency of gene in circular focal amplifications (ecDNA)
- freq_HR: Prior estimated frequency of gene in homologous recombination events
Target column (for training data):
- y: Binary label indicating whether the gene is detected as an ecDNA cargo gene (1) or not (0)
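The column groups above can be checked against a CSV header before running otk. The following is a minimal sketch using only the Python standard library. It assumes the signature columns are literally named CN1 through CN19, and `missing_columns` is a hypothetical helper for illustration, not part of the otk API.

```python
import csv
import io

# Column groups as listed in the input format description.
ID_COLS = ["sample", "gene_id"]
CN_COLS = ["segVal", "minor_cn", "intersect_ratio"]
SAMPLE_COLS = ["purity", "ploidy", "AScore", "pLOH", "cna_burden"]
# Assumption: the 19 signature columns are named CN1, CN2, ..., CN19.
SIGNATURE_COLS = [f"CN{i}" for i in range(1, 20)]
CLINICAL_COLS = ["age", "gender"]
CANCER_TYPES = [
    "BLCA", "BRCA", "CESC", "COAD", "DLBC", "ESCA", "GBM", "HNSC",
    "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "OV",
    "PRAD", "READ", "SARC", "SKCM", "STAD", "THCA", "UCEC", "UVM",
]
TYPE_COLS = [f"type_{t}" for t in CANCER_TYPES]
FREQ_COLS = ["freq_Linear", "freq_BFB", "freq_Circular", "freq_HR"]

FEATURE_COLS = (
    CN_COLS + SAMPLE_COLS + SIGNATURE_COLS + CLINICAL_COLS + TYPE_COLS + FREQ_COLS
)
REQUIRED_COLS = ID_COLS + FEATURE_COLS


def missing_columns(csv_text, training=False):
    """Return the required columns that are absent from the CSV header.

    Hypothetical helper: set training=True to also require the target column y.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    required = REQUIRED_COLS + (["y"] if training else [])
    present = set(header)
    return [col for col in required if col not in present]
```

Calling `missing_columns(open("data/test_data.csv").read())` on a well-formed prediction file should return an empty list; any names it returns are columns to add before prediction.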
Output Data Format
Prediction results are saved as a CSV file with the following columns:
Gene-level predictions:
- sample: Tumor sample ID
- gene_id: Gene ID
- prediction_prob: Probability of being an ecDNA cargo gene (0-1)
- prediction: Binary classification result (0 = not ecDNA cargo, 1 = ecDNA cargo)
Sample-level predictions:
- sample_level_prediction_label: Sample-level focal amplification type classification:
  - nofocal: No focal amplification detected
  - noncircular: Non-circular focal amplification detected
  - circular: Circular focal amplification (ecDNA) detected
- sample_level_prediction: Numerical encoding of sample-level classification (0 = nofocal, 1 = noncircular, 2 = circular)
Note: Sample-level classification follows these rules:
- If any gene in the sample is predicted as ecDNA cargo (prediction = 1), the sample is classified as circular
- If no gene is predicted as ecDNA cargo but any gene has segVal > ploidy + 2, the sample is classified as noncircular
- Otherwise, the sample is classified as nofocal
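The three rules above can be written as a short function. This is an illustrative sketch, not otk's own implementation: `classify_sample` is a hypothetical name, and each gene is assumed to be represented as a dict holding the prediction, segVal, and ploidy values for one row of a single sample.

```python
def classify_sample(genes):
    """Apply the sample-level rules to the gene rows of one sample.

    genes: list of dicts with keys "prediction", "segVal", and "ploidy".
    Returns (label, code) using the encoding
    0 = nofocal, 1 = noncircular, 2 = circular.
    """
    # Rule 1: any predicted ecDNA cargo gene makes the sample circular.
    if any(g["prediction"] == 1 for g in genes):
        return "circular", 2
    # Rule 2: no cargo gene, but some gene is amplified above ploidy + 2.
    if any(g["segVal"] > g["ploidy"] + 2 for g in genes):
        return "noncircular", 1
    # Rule 3: neither condition holds.
    return "nofocal", 0
```

Because the rules are checked in order, a sample with both a predicted cargo gene and a highly amplified non-cargo gene is still labeled circular.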
Model Architecture
otk supports multiple model architectures behind a unified interface:
Available Models
| Model | Type | Description |
|---|---|---|
| xgb_new | XGBoost | Optimized with feature engineering |
| xgb_paper | XGBoost | Paper reproduction (11 features) |
| baseline_mlp | Neural Network | Simple MLP baseline |
| transformer | Neural Network | Transformer architecture |
| deep_residual | Neural Network | Deep residual network |
| optimized_residual | Neural Network | Optimized residual network |
| dgit_super | Neural Network | Deep gated interaction transformer |
| tabpfn | TabPFN | TabPFN ensemble |
Unified Interface
All models inherit from BaseEcDNAModel and provide:
- fit(X_train, y_train, X_val, y_val) - Training
- predict_proba(X) - Probability prediction
- predict(X) - Binary prediction
- save(path) / load(path) - Persistence
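To show what such an interface looks like in practice, here is a minimal sketch of an abstract base class with the same method names, plus a toy subclass. It is not the real BaseEcDNAModel: the class names `EcDNAModelSketch` and `PriorRateModel`, the 0.5 decision threshold, and the JSON persistence format are all assumptions made for illustration.

```python
import json
from abc import ABC, abstractmethod


class EcDNAModelSketch(ABC):
    """Hypothetical stand-in mirroring the method names listed above."""

    @abstractmethod
    def fit(self, X_train, y_train, X_val=None, y_val=None):
        """Train the model and return self."""

    @abstractmethod
    def predict_proba(self, X):
        """Return one probability per input row."""

    def predict(self, X, threshold=0.5):
        # Assumption: binary labels come from thresholding the probability.
        return [1 if p >= threshold else 0 for p in self.predict_proba(X)]

    @abstractmethod
    def save(self, path):
        """Write the model state to path."""

    @classmethod
    @abstractmethod
    def load(cls, path):
        """Rebuild a model from path."""


class PriorRateModel(EcDNAModelSketch):
    """Toy model: predicts the positive rate seen in training for every row."""

    def __init__(self, rate=0.0):
        self.rate = rate

    def fit(self, X_train, y_train, X_val=None, y_val=None):
        self.rate = sum(y_train) / len(y_train)
        return self

    def predict_proba(self, X):
        return [self.rate for _ in X]

    def save(self, path):
        with open(path, "w") as fh:
            json.dump({"rate": self.rate}, fh)

    @classmethod
    def load(cls, path):
        with open(path) as fh:
            return cls(rate=json.load(fh)["rate"])
```

Code written against these five methods can swap one model for another without changes, which is the practical benefit of a shared base class.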
Data Split
All models use the same 80/10/10 data split with seed=2026 for reproducibility.
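A seeded split of this kind can be sketched as follows. The sketch assumes the three parts are train, validation, and test, and that rows are shuffled individually; the source does not say whether otk splits by row or by sample, and `split_indices` is a hypothetical helper rather than otk's own code.

```python
import random


def split_indices(n, seed=2026, fractions=(0.8, 0.1, 0.1)):
    """Shuffle row indices 0..n-1 with a fixed seed and cut them into three parts.

    Assumption: the parts are train / validation / test, in that order.
    """
    indices = list(range(n))
    # A dedicated Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(indices)
    n_train = int(n * fractions[0])
    n_val = int(n * fractions[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so no row is dropped
    return train, val, test
```

Calling the function twice with the same `n` and seed returns identical index lists, which is what lets every model be trained and evaluated on the same rows.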
Training Script
Use the unified training script:
# Train single model
python train_unified.py --model xgb_new
# Train all models
python train_unified.py --all
Configuration File
Model configuration uses YAML format, with example configuration files located in configs/. You can modify parameters in the configuration files as needed, such as model architecture and training parameters.
Examples
Training Examples
# Train model with default configuration
otk train
# Train model with custom configuration file
otk train --config my_config.yml
Prediction Examples
# Make predictions using a trained model
otk predict --model models/best_model.pth --input test_data.csv
Performance Metrics
The following performance metrics are recorded during model training:
- auPRC (Area under Precision-Recall Curve)
- AUC (Area under ROC Curve)
- F1 Score
- Precision
- Recall
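For reference, precision, recall, and F1 follow directly from the confusion counts of the binary predictions. The sketch below uses plain Python and a hypothetical helper name, `precision_recall_f1`; it is not how otk computes its metrics. auPRC and AUC need the predicted probabilities rather than the binary labels, and scikit-learn provides `average_precision_score` and `roc_auc_score` for them.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = ecDNA cargo)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Return 0.0 instead of dividing by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Since cargo genes are typically a small minority of all genes, precision-recall based metrics such as these and auPRC tend to be more informative than accuracy.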
Contribution Guide
We welcome community contributions! If you have any questions or suggestions, please submit them through GitHub Issues.
Development Process
- Fork the repository
- Create a feature branch
- Implement features or fix bugs
- Run tests
- Submit a Pull Request
License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
Citation
If you use otk in your research, please cite the following paper:
Wang, S., Wu, C. Y., He, M. M., Yong, J. X., Chen, Y. X., Qian, L. M., ... & Zhao, Q. (2024). Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer. Nature Communications, 15(1), 1-17.
Contact
- Project homepage: https://github.com/WangLabCSU/otk
- Email: wangshx@csu.edu.cn