Intron classification tool for identifying U2-type and U12-type introns using SVM
intronIC - (intron Interrogator and Classifier)
Version 2.0 - Refactored Edition with Streamlined ML Architecture
intronIC is a bioinformatics tool for extracting and classifying intron sequences as U12-type (minor) or U2-type (major) using a support vector machine trained on position-weight matrix scores.
Quick Start
Installation
pip install intronIC
Basic Usage
# Classify introns (default model loaded automatically)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Extract sequences only (no classification)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Train a custom model (optional - most users don't need this)
intronIC train -n my_model -p 8
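For scripted pipelines, the commands above can be assembled programmatically. The following is a minimal sketch and not part of intronIC: the helper name `build_intronic_command` is hypothetical, and it uses only the flags shown above (`-g`, `-a`, `-n`, `-p`) plus the optional `extract` subcommand.

```python
import shlex
from typing import List, Optional


def build_intronic_command(
    genome: str,
    annotation: str,
    name: str,
    processes: int = 8,
    subcommand: Optional[str] = None,
) -> List[str]:
    """Assemble an intronIC argument list (hypothetical helper).

    Only the flags shown in the usage examples are used. Pass
    subcommand="extract" to reproduce the sequence-only invocation.
    """
    if processes < 1:
        raise ValueError("processes must be a positive integer")
    cmd = ["intronIC"]
    if subcommand is not None:
        cmd.append(subcommand)
    cmd += ["-g", genome, "-a", annotation, "-n", name, "-p", str(processes)]
    return cmd


if __name__ == "__main__":
    cmd = build_intronic_command("genome.fa.gz", "annotation.gff3.gz", "species_name")
    # Print the shell-quoted command; to execute it, use subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps file names with spaces intact when the command is later passed to `subprocess.run`.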
Test Run
# Quick installation test using bundled test data
intronIC test -p 4
# Or show where test data is located
intronIC test --show-only
Documentation
For complete documentation, see the intronIC Wiki:
- Quick Start Guide - Installation, dependencies, resource usage
- Overview - Classification approach and scientific background
- Usage Info - Complete CLI reference
- Output Files - File formats and interpretation
- Technical Details - Algorithm and ML architecture
- Example Usage - Common workflows
- About - Background and motivation
What's New in Version 2.0
This refactored version maintains 100% algorithmic fidelity and CLI compatibility with the original intronIC while providing a modernized, maintainable codebase:
Key Improvements
- Corrected ML Architecture: Fixed double-scaling issue and train/test mismatch
  - Single scaling step via RobustScaler with centering (removes composition bias)
  - Configurable augmented features (5D standard or custom)
  - Two-stage optimization (C via balanced_accuracy, calibration via log-loss)
- Modular Architecture: Organized into logical packages instead of a single 6,000+-line file
- Enhanced Code Quality: Type hints throughout, immutable data structures, better error handling
- Bug Fixes: Corrected data leakage in z-score normalization, fixed type_id assignment
- Modern Tooling: Support for pixi and uv package managers
- Improved Documentation: Comprehensive wiki and inline documentation
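The scaling and model-selection choices listed above can be illustrated without any ML library. The sketch below is an illustration under stated assumptions, not intronIC's code: `robust_scale` mimics what a robust scaler with centering does (subtract the median, divide by the interquartile range), and `balanced_accuracy` and `log_loss` are plain-Python versions of the two metrics named for the two optimization stages. All function names and the toy data are hypothetical.

```python
import math
import statistics
from typing import List, Sequence


def robust_scale(values: Sequence[float]) -> List[float]:
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # avoid division by zero for constant features
    return [(v - median) / iqr for v in values]


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of per-class recall, so the rare class counts as much as the common one."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def log_loss(y_true: Sequence[int], p_pos: Sequence[float], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood of binary labels given P(label == 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip so log() stays finite
        total += -math.log(p) if t == 1 else -math.log(1 - p)
    return total / len(y_true)


if __name__ == "__main__":
    # The outlier (100.0) does not distort the scale of the other values
    print(robust_scale([1.0, 2.0, 3.0, 4.0, 100.0]))
    # A classifier that always predicts the majority class
    y_true = [0] * 8 + [1] * 2
    print(balanced_accuracy(y_true, [0] * 10))
```

In the toy data, always predicting the majority class is correct for 8 of 10 examples yet earns a balanced accuracy of only 0.5, which is why a class-balanced metric is a sensible target when U12-type introns make up a small fraction of the data.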
Key Features
- SVM-based classification with probability scores (0-100%)
- Default pretrained model loaded automatically - works for virtually all species
- Streaming mode (default) for ~85% memory reduction on large genomes
- Parallel processing for improved performance (-p 8 recommended)
- Fast runtimes: ~6-10 minutes for human genome with default settings
- Comprehensive metadata including phase, position, parent gene/transcript
Scientific Background
Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome, while a small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome. U12-type introns have:
- Highly conserved TCCTTAAC branch point motif
- Terminal dinucleotides: AT-AC (~25%) or GT-AG (~75%)
- Functional importance and evolutionary conservation
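As a small illustration of the sequence features listed above, the sketch below reads the terminal dinucleotides of an intron and looks for an exact copy of the TCCTTAAC branch point motif near the 3' end. This is not how intronIC classifies introns (it scores motifs with position-weight matrices rather than exact matching); the function names and the 40-nt search window are assumptions chosen for the example.

```python
from typing import Optional, Tuple

BRANCH_POINT_MOTIF = "TCCTTAAC"


def terminal_dinucleotides(intron: str) -> Tuple[str, str]:
    """Return the first two and last two bases of an intron sequence."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron sequence is too short")
    return seq[:2], seq[-2:]


def find_branch_point(intron: str, window: int = 40) -> Optional[int]:
    """Return the 0-based start of the last exact motif match that begins
    within `window` bases of the 3' end, or None if there is no match.
    The window size is an illustrative assumption."""
    seq = intron.upper()
    start = max(0, len(seq) - window)
    pos = seq.rfind(BRANCH_POINT_MOTIF, start)
    return pos if pos != -1 else None


if __name__ == "__main__":
    # Toy AT-AC intron with a branch point motif near the 3' end
    toy = "ATATCCTTT" + "G" * 30 + "TCCTTAACTTCTTGCAC"
    print(terminal_dinucleotides(toy))
    print(find_branch_point(toy))
```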
intronIC identifies U12-type introns using:
- PWM Scoring: Apply position-weight matrices to 5' splice site, branch point, and 3' splice site
- Normalization: Convert raw scores to z-scores, computed in a way that avoids data leakage
- SVM Classification: Linear SVM with balanced class weights outputs probability scores
For detailed algorithm description, see the Technical Details wiki page.
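The first two steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not intronIC's implementation: the 3-position matrix, the uniform 0.25 background, and the function names are all made up for the example. The z-score parameters are fit on a reference set only and then reused for new scores, which is one common way to avoid leaking information from the data being classified.

```python
import math
import statistics
from typing import Dict, List, Sequence, Tuple

# Hypothetical 3-position PWM: per-position base probabilities (each sums to 1)
TOY_PWM: List[Dict[str, float]] = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.4, "C": 0.4, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background frequency


def pwm_log_odds(seq: str, pwm: Sequence[Dict[str, float]]) -> float:
    """Sum of log2(p_motif / p_background) over the positions of `seq`."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match PWM length")
    return sum(math.log2(col[base] / BACKGROUND) for base, col in zip(seq.upper(), pwm))


def fit_zscore(reference: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean and population standard deviation from reference scores only."""
    return statistics.fmean(reference), statistics.pstdev(reference)


def apply_zscore(scores: Sequence[float], mean: float, std: float) -> List[float]:
    """Standardize scores with previously fitted parameters."""
    if std == 0:
        raise ValueError("reference scores have zero variance")
    return [(s - mean) / std for s in scores]


if __name__ == "__main__":
    reference = [pwm_log_odds(s, TOY_PWM) for s in ("ATA", "ATC", "CTG", "GGT")]
    mean, std = fit_zscore(reference)
    new_scores = [pwm_log_odds(s, TOY_PWM) for s in ("ATA", "TTT")]
    print(apply_zscore(new_scores, mean, std))
```

In a full pipeline, the standardized scores for the 5' splice site, branch point, and 3' splice site would then form the feature vector passed to the classifier.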
Citation
If you use intronIC in your research, please cite:
Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett. Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078. https://doi.org/10.1093/nar/gkaa464
Support
- Documentation: intronIC Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
git clone https://github.com/glarue/intronIC.git
cd intronIC
make install # Set up development environment
make test # Run tests
License
intronIC is released under the GNU General Public License v3.0.