Intron classification tool for identifying U2-type and U12-type introns using SVM
Project description
intronIC (intron Interrogator and Classifier)
Classify intron sequences as U12-type (minor spliceosome) or U2-type (major spliceosome). A 126-model multispecies RBF SVM ensemble scores each intron against position-weight matrices and outputs a calibrated probability (0-100%).
Quick Start
pip install intronIC
# Classify introns (loads default model automatically)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Extract sequences without classification
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Verify installation with bundled test data
intronIC test -p 4
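For scripted or batch runs, the commands above can be assembled programmatically. The following is a minimal sketch, assuming Python 3.8+ and that the `intronIC` executable is on `PATH`. The helper names `build_intronic_command` and `run_intronic` are hypothetical (they are not part of intronIC), and only options that appear in this README are used: `-g`, `-a`, `-n`, `-p`, `--threshold`, and the `extract` subcommand.

```python
import shlex
import subprocess
from typing import List, Optional


def build_intronic_command(
    genome: str,
    annotation: str,
    name: str,
    processes: int = 1,
    threshold: Optional[float] = None,
    extract: bool = False,
) -> List[str]:
    """Build an intronIC argument list (hypothetical helper, not part of intronIC)."""
    if processes < 1:
        raise ValueError("processes must be >= 1")
    cmd = ["intronIC"]
    if extract:
        # 'extract' pulls sequences without classifying them
        cmd.append("extract")
    cmd += ["-g", genome, "-a", annotation, "-n", name, "-p", str(processes)]
    if threshold is not None:
        # Assumption of this sketch: a threshold only makes sense when classifying
        if extract:
            raise ValueError("threshold does not apply to extraction")
        cmd += ["--threshold", str(threshold)]
    return cmd


def run_intronic(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_intronic_command(
        "genome.fa.gz", "annotation.gff3.gz", "species_name", processes=8
    )
    # Print the command instead of running it, so the sketch is safe to execute
    print(shlex.join(command))
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with file paths that contain spaces.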
What's New in v2.4
- Default model is now the v3 multispecies bundle: 3 seeds × 42 calibrated SVMs (126 total) trained on 41,333 introns across 90 species and 14 clades. Holdout F1 = 1.000 vs the v2.3 default's 0.9975, and ~54% lower production-equivalent FPR on U12-absent species.
- Default classification threshold lowered from 95 → 90, made safe by the v3 model's tighter calibration. Pass `--threshold 95` to restore prior behavior.
- `--streaming` (default) and `--in-memory` now produce bit-identical classifications. Mode choice affects only the runtime/memory tradeoff. Reference run on Homo sapiens GRCh38 + Ensembl 104, `-p 6`, ~227k scored introns: streaming ~16 min / 5.4 GB peak, in-memory ~15 min / 10.1 GB peak.
- Self-describing model bundles carry config + training metadata alongside the weights; see `docs/v3_bundle_schema.md`.
- v2.3 model bundles continue to load unchanged; old runs reproduce by passing `--model <v2.3-bundle.pkl>`.
- See CHANGELOG.md for full release history.
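To make the effect of the threshold change concrete, here is a minimal sketch of how a 0-100% score maps to a call at the new default (90) versus the previous one (95). The function names are hypothetical, and the inclusive comparison (`>=`) is an assumption of this sketch rather than a documented detail of intronIC.

```python
from typing import Iterable


def call_intron_type(probability_pct: float, threshold: float = 90.0) -> str:
    """Label an intron from its U12 probability (0-100%).

    Assumption: scores at or above the threshold are called U12-type.
    """
    if not 0.0 <= probability_pct <= 100.0:
        raise ValueError("probability must be between 0 and 100")
    return "U12" if probability_pct >= threshold else "U2"


def count_u12_calls(scores: Iterable[float], threshold: float = 90.0) -> int:
    """Count how many scores would be called U12-type at a given threshold."""
    return sum(call_intron_type(s, threshold) == "U12" for s in scores)


if __name__ == "__main__":
    example_scores = [99.5, 92.0, 88.0, 12.3]  # made-up scores for illustration
    print(count_u12_calls(example_scores, threshold=90.0))
    print(count_u12_calls(example_scores, threshold=95.0))
```

An intron scoring 92% is called U12-type under the v2.4 default but U2-type when `--threshold 95` is passed.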
What's New in v2.3
- 42-model RBF SVM ensemble on a streamlined 6D feature set
- Bayesian score adjustment suppresses false positives in species lacking a distinct U12-type intron population, using a species-level valley prior and per-intron ensemble agreement
- Species-specific U2-type background correction for cross-species composition bias
- Default threshold raised to 95% for higher-confidence calls (now lowered to 90 in v2.4)
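The idea behind a prior-based score adjustment can be illustrated with plain Bayes' rule. This is a sketch under stated assumptions, not intronIC's actual formula: it assumes the raw probability was calibrated for a 50/50 class balance, treats the species-level prior as a single number in [0, 1] (how intronIC derives its valley prior is not shown here), and summarizes ensemble agreement as the mean and population standard deviation of the sub-model outputs. The function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def ensemble_summary(sub_model_probs: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sigma) of sub-model probabilities in [0, 1].

    A small sigma means the sub-models agree; a large sigma means they do not.
    """
    mean = statistics.fmean(sub_model_probs)
    sigma = statistics.pstdev(sub_model_probs)
    return mean, sigma


def apply_prior(prob: float, prior: float) -> float:
    """Reweight a probability calibrated at a 0.5 prior to a different prior.

    Posterior odds = likelihood ratio * prior odds, where the likelihood
    ratio is prob / (1 - prob).
    """
    if not (0.0 <= prob <= 1.0 and 0.0 <= prior <= 1.0):
        raise ValueError("prob and prior must be in [0, 1]")
    numerator = prob * prior
    denominator = numerator + (1.0 - prob) * (1.0 - prior)
    if denominator == 0.0:
        raise ValueError("prob and prior contradict each other (0 vs 1)")
    return numerator / denominator


if __name__ == "__main__":
    mean, sigma = ensemble_summary([0.90, 0.80, 1.00])  # made-up sub-model outputs
    print(mean, sigma)
    # A low prior models a species with little evidence of U12-type introns
    print(apply_prior(mean, prior=0.1))
```

With a prior of 0.5 the score is unchanged; with a prior of 0.1, a raw probability of 0.9 drops to 0.5, which shows how a skeptical species-level prior suppresses borderline calls.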
Key Features
- Probability scores (0-100%) from a 126-model calibrated SVM ensemble (3 seeds × 42 sub-models, isotonic calibration)
- Pretrained model loaded automatically for cross-species analysis
- Streaming mode (default) roughly halves peak memory on large genomes (e.g., ~5.4 GB vs ~10.1 GB for full human at `-p 6`); bit-identical classifications
- Parallel scoring via `-p N` for linear speedup
- Comprehensive metadata: phase, position, parent gene/transcript
How It Works
Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome; a small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome. U12-type introns carry a conserved TCCTTAAC branch point motif and have either AT-AC (~25%) or GT-AG (~75%) terminal dinucleotides.
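The sequence features named above can be inspected with a few lines of Python. This is a simplified illustration only: intronIC scores these regions with position-weight matrices rather than exact string matches, and the `search_window` of 40 bases before the 3' end is an arbitrary assumption of this sketch. The function names are hypothetical.

```python
from typing import Optional

BRANCH_POINT_MOTIF = "TCCTTAAC"  # consensus motif quoted in the text above


def terminal_dinucleotides(intron_seq: str) -> str:
    """Return the terminal dinucleotides of an intron, e.g. 'GT-AG' or 'AT-AC'."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron sequence is too short")
    return f"{seq[:2]}-{seq[-2:]}"


def find_branch_point_motif(
    intron_seq: str,
    search_window: int = 40,
    motif: str = BRANCH_POINT_MOTIF,
) -> Optional[int]:
    """Return the 0-based index of the motif occurrence closest to the 3' end.

    Only the last `search_window` bases are searched. Returns None when the
    exact motif is absent from that window.
    """
    seq = intron_seq.upper()
    start = max(0, len(seq) - search_window)
    index = seq.rfind(motif, start)
    return index if index != -1 else None


if __name__ == "__main__":
    # Made-up intron: GT start, consensus motif near the 3' end, AG end
    toy_intron = "GT" + "A" * 10 + "TCCTTAAC" + "T" * 8 + "AG"
    print(terminal_dinucleotides(toy_intron))
    print(find_branch_point_motif(toy_intron))
```

Exact matching misses real U12-type introns whose branch point deviates from the consensus, which is one reason a probabilistic PWM score is used instead.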
intronIC identifies U12-type introns in five stages:
1. PWM scoring: score the 5' splice site, branch point, and 3' splice site against position-weight matrices
2. Background correction: blend species-specific nucleotide frequencies into U2-type PWMs to correct composition bias
3. Normalization: convert raw log-odds to z-scores via robust scaling
4. SVM classification: 126-model RBF SVM ensemble (v2.4 default; 3 seeds × 42 sub-models) produces per-intron probabilities and ensemble agreement (sigma)
5. Score adjustment: adjust probabilities using a species-level valley prior and an ensemble disagreement penalty
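The first three stages can be sketched in a few standard-library functions. This is an illustration under stated assumptions, not intronIC's implementation: the log-odds score is taken as the log2 ratio of a U12-type PWM to a U2-type PWM, background correction is a linear blend with a hypothetical `weight` parameter, and robust scaling is taken to mean centering on the median and dividing by the interquartile range. All function names and the toy matrices are made up for the example.

```python
import math
import statistics
from typing import Dict, List, Sequence

Column = Dict[str, float]  # base -> probability at one motif position


def blend_background(pwm: List[Column], species_freqs: Column, weight: float) -> List[Column]:
    """Mix species-level base frequencies into every PWM column.

    weight=0 leaves the PWM unchanged; weight=1 replaces it with the background.
    """
    return [
        {base: (1.0 - weight) * col[base] + weight * species_freqs[base] for base in "ACGT"}
        for col in pwm
    ]


def log_odds_score(seq: str, u12_pwm: List[Column], u2_pwm: List[Column]) -> float:
    """Sum per-position log2(P_U12 / P_U2); positive values favor U12-type."""
    if not (len(seq) == len(u12_pwm) == len(u2_pwm)):
        raise ValueError("sequence and PWMs must have the same length")
    return sum(
        math.log2(u12_pwm[i][base] / u2_pwm[i][base])
        for i, base in enumerate(seq.upper())
    )


def robust_z_scores(values: Sequence[float]) -> List[float]:
    """Center on the median and scale by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; cannot scale")
    return [(v - median) / iqr for v in values]


if __name__ == "__main__":
    # Toy two-position matrices; real PWMs span the full splice-site motif
    u12 = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    species = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
    u2 = blend_background([uniform, uniform], species, weight=0.5)
    raw = [log_odds_score(s, u12, u2) for s in ("AT", "AA", "CG", "TT", "GC")]
    print(robust_z_scores(raw))
```

Median and IQR are used here instead of mean and standard deviation because a small U12-type population would otherwise skew the scaling of a mostly U2-type score distribution.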
See Technical Details for the full algorithm description.
Documentation
Full documentation lives in the intronIC Wiki:
- Quick Start — Installation, dependencies, resource usage
- Overview — Classification approach and scientific background
- Output Files — File formats and score interpretation
- Technical Details — Algorithm, features, score adjustment
- Usage Info — Complete CLI reference
- Example Usage — Common workflows
- Changelog — Release notes and version history
Citation
If you use intronIC in your research, please cite:
Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066-7078. doi:10.1093/nar/gkaa464
Support
- intronIC Wiki — Documentation
- GitHub Issues — Bug reports
- GitHub Discussions — Questions and ideas
Contributing
See CONTRIBUTING.md for guidelines.
git clone https://github.com/glarue/intronIC.git
cd intronIC
make install # Set up development environment
make test # Run tests
License
Download files
File details
Details for the file intronic-2.4.2.tar.gz.
File metadata
- Download URL: intronic-2.4.2.tar.gz
- Upload date:
- Size: 55.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 96594b1b07ddc8dd51983291aca6b88428bf1ccb90c6ed740ed309ee493c80cd |
| MD5 | ed21dce83ec2d3ed8f4cdad1fb1c786d |
| BLAKE2b-256 | df9579e0dee8bed149a263954015f1b444419452185ff50c2995897bd567f9c1 |
Provenance
The following attestation bundles were made for intronic-2.4.2.tar.gz:
Publisher: publish.yml on glarue/intronIC
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: intronic-2.4.2.tar.gz
  - Subject digest: 96594b1b07ddc8dd51983291aca6b88428bf1ccb90c6ed740ed309ee493c80cd
- Sigstore transparency entry: 1500024633
- Sigstore integration time:
- Permalink: glarue/intronIC@660154e2cedd53522036c7b6014097c0f2380ae5
- Branch / Tag: refs/tags/v2.4.2
- Owner: https://github.com/glarue
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@660154e2cedd53522036c7b6014097c0f2380ae5
- Trigger Event: release
File details
Details for the file intronic-2.4.2-py3-none-any.whl.
File metadata
- Download URL: intronic-2.4.2-py3-none-any.whl
- Upload date:
- Size: 55.5 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6f64acfa481d382d0207a086ca254bb55ea2c1af5eb9a1ea4c82829ad0a4d05 |
| MD5 | 27485eec3e140f4740a7a42520cbcfec |
| BLAKE2b-256 | 862e31b3d2c8ca7ce09a872e9911595bf941c6137c9660dbf3ea21a98cc010db |
Provenance
The following attestation bundles were made for intronic-2.4.2-py3-none-any.whl:
Publisher: publish.yml on glarue/intronIC
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: intronic-2.4.2-py3-none-any.whl
  - Subject digest: b6f64acfa481d382d0207a086ca254bb55ea2c1af5eb9a1ea4c82829ad0a4d05
- Sigstore transparency entry: 1500024697
- Sigstore integration time:
- Permalink: glarue/intronIC@660154e2cedd53522036c7b6014097c0f2380ae5
- Branch / Tag: refs/tags/v2.4.2
- Owner: https://github.com/glarue
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@660154e2cedd53522036c7b6014097c0f2380ae5
- Trigger Event: release