
Intron classification tool for identifying U2-type and U12-type introns using SVM

intronIC (intron Interrogator and Classifier)

Classify intron sequences as U12-type (minor spliceosome) or U2-type (major spliceosome). Each intron is scored against position-weight matrices, and a 126-model multispecies RBF SVM ensemble converts those scores into a calibrated probability (0-100%).


Quick Start

pip install intronIC
# Classify introns (loads default model automatically)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Extract sequences without classification
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Verify installation with bundled test data
intronIC test -p 4
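
For scripted pipelines, the same invocations can be assembled from Python. The sketch below only builds the argument list from the flags shown above (`-g`, `-a`, `-n`, `-p`, and the `extract` subcommand); the helper name `build_intronic_command` is hypothetical and not part of intronIC.

```python
import shlex

def build_intronic_command(genome, annotation, name, processes=1, extract_only=False):
    """Assemble an intronIC command line using only the flags from Quick Start."""
    cmd = ["intronIC"]
    if extract_only:
        cmd.append("extract")  # sequence extraction without classification
    cmd += ["-g", genome, "-a", annotation, "-n", name, "-p", str(processes)]
    return cmd

cmd = build_intronic_command(
    "genome.fa.gz", "annotation.gff3.gz", "species_name", processes=8
)
print(shlex.join(cmd))
# To run it, pass the list to subprocess.run(cmd, check=True).
```

Building a list rather than a single string avoids shell quoting problems when file paths contain spaces.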

What's New in v2.4

  • Default model is now the v3 multispecies bundle: 3 seeds × 42 calibrated SVMs (126 total) trained on 41,333 introns across 90 species and 14 clades. Holdout F1 = 1.000 vs the v2.3 default's 0.9975, and ~54% lower production-equivalent FPR on U12-absent species.
  • Default classification threshold lowered from 95 → 90, made safe by the v3 model's tighter calibration. Pass --threshold 95 to restore prior behavior.
  • --streaming (default) and --in-memory now produce bit-identical classifications. Mode choice affects only the runtime/memory tradeoff. Reference run on Homo sapiens GRCh38 + Ensembl 104, -p 6, ~227k scored introns: streaming ~16 min / 5.4 GB peak, in-memory ~15 min / 10.1 GB peak.
  • Self-describing model bundles carry config + training metadata alongside the weights; see docs/v3_bundle_schema.md.
  • v2.3 model bundles continue to load unchanged; old runs reproduce by passing --model <v2.3-bundle.pkl>.
  • See CHANGELOG.md for full release history.
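
To see what the threshold change means in practice, here is a minimal sketch that applies both cutoffs to a few made-up scores. The function name, the example scores, and the use of an inclusive `>=` comparison are assumptions for illustration; intronIC's own comparison may differ at the boundary.

```python
def call_intron_type(probability_pct, threshold=90.0):
    """Label an intron from its U12 probability (0-100) and a cutoff."""
    if not 0.0 <= probability_pct <= 100.0:
        raise ValueError("probability must be in the range 0-100")
    # Assumption: scores at or above the cutoff are called U12-type.
    return "U12" if probability_pct >= threshold else "U2"

# Hypothetical scores, not real intronIC output.
scores = {"intron_a": 99.2, "intron_b": 92.5, "intron_c": 40.0}

calls_v24 = {name: call_intron_type(p) for name, p in scores.items()}
calls_v23 = {name: call_intron_type(p, threshold=95.0) for name, p in scores.items()}

for name in scores:
    print(name, calls_v24[name], calls_v23[name])
```

Only introns scoring between the two cutoffs (here `intron_b`) change their call when switching between the v2.4 default and `--threshold 95`.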

What's New in v2.3

  • 42-model RBF SVM ensemble on a streamlined 6D feature set
  • Bayesian score adjustment suppresses false positives in species lacking a distinct U12-type intron population, using a species-level valley prior and per-intron ensemble agreement
  • Species-specific U2-type background correction for cross-species composition bias
  • Default threshold raised to 95% for higher-confidence calls (lowered to 90% in v2.4)

Key Features

  • Probability scores (0-100%) from a 126-model calibrated SVM ensemble (3 seeds × 42 sub-models, isotonic calibration)
  • Pretrained model loaded automatically for cross-species analysis
  • Streaming mode (default) roughly halves peak memory on large genomes (e.g., ~5.4 GB vs ~10.1 GB for full human at -p 6); bit-identical classifications
  • Parallel scoring via -p N for linear speedup
  • Comprehensive metadata: phase, position, parent gene/transcript
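
Isotonic calibration, mentioned in the first feature above, fits a non-decreasing mapping from raw classifier scores to probabilities. The sketch below shows the underlying pool-adjacent-violators idea in plain Python on toy labels; it illustrates the general technique and is not taken from intronIC's code.

```python
def pav_isotonic(values):
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)  # every member of a block gets the block mean
    return fitted

# Toy labels (1 = U12-type), already sorted by increasing raw score.
labels_by_score = [0, 0, 1, 0, 1, 1]
print(pav_isotonic(labels_by_score))
```

The out-of-order pair in the middle is pooled into a shared value of 0.5, so the calibrated probability never decreases as the raw score increases.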

How It Works

Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome; a small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome. U12-type introns carry a conserved TCCTTAAC branch point motif and have either AT-AC (~25%) or GT-AG (~75%) terminal dinucleotides.
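
As a rough illustration of these sequence signals, the sketch below reports an intron's terminal dinucleotides and finds the closest match to the TCCTTAAC motif near the 3' end. The function names, the 40-base search window, and the example sequence are assumptions made for this example; intronIC itself scores these regions with position-weight matrices rather than mismatch counts.

```python
BP_CONSENSUS = "TCCTTAAC"  # branch point motif quoted in the text above

def terminal_dinucleotides(intron_seq):
    """Return the first and last two bases, e.g. 'AT-AC' or 'GT-AG'."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron too short")
    return f"{seq[:2]}-{seq[-2:]}"

def best_motif_match(intron_seq, motif=BP_CONSENSUS, window=40):
    """Find (position, mismatches) of the best motif match in the last `window` bases."""
    seq = intron_seq.upper()
    start = max(0, len(seq) - window)  # window size is an illustrative assumption
    best = None
    for i in range(start, len(seq) - len(motif) + 1):
        mismatches = sum(a != b for a, b in zip(seq[i:i + len(motif)], motif))
        if best is None or mismatches < best[1]:
            best = (i, mismatches)
    return best

# Made-up intron: AT-AC termini with a perfect branch point motif at index 29.
intron = "ATATCCTTT" + "G" * 20 + "TCCTTAACT" + "TTTTCTTTGCAC"
print(terminal_dinucleotides(intron))
print(best_motif_match(intron))
```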

intronIC identifies U12-type introns in five stages:

  1. PWM scoring — score the 5' splice site, branch point, and 3' splice site against position-weight matrices
  2. Background correction — blend species-specific nucleotide frequencies into U2-type PWMs to correct composition bias
  3. Normalization — convert raw log-odds to z-scores via robust scaling
  4. SVM classification — 126-model RBF SVM ensemble (v2.4 default; 3 seeds × 42 sub-models) produces per-intron probabilities and ensemble agreement (sigma)
  5. Score adjustment — adjust probabilities using a species-level valley prior and an ensemble disagreement penalty
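
Stages 1 to 3 can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the two-column PWMs, the species frequencies, the 0.5 blend weight, and the median/IQR form of robust scaling are all placeholders chosen for the example, not values or formulas taken from intronIC.

```python
import math
from statistics import median, quantiles

BASES = "ACGT"

def blend_background(pwm, species_freqs, weight):
    """Stage 2 (sketch): mix species nucleotide frequencies into each PWM column."""
    return [
        {b: (1 - weight) * col[b] + weight * species_freqs[b] for b in BASES}
        for col in pwm
    ]

def log_odds(seq, u12_pwm, u2_pwm):
    """Stage 1 (sketch): sum of per-position log2(P_U12 / P_U2) over a motif window."""
    return sum(math.log2(u12_pwm[i][b] / u2_pwm[i][b]) for i, b in enumerate(seq))

def robust_z(values):
    """Stage 3 (sketch): center on the median and scale by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot scale")
    med = median(values)
    return [(v - med) / iqr for v in values]

# Toy two-position matrices (hypothetical numbers).
u12_pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
           {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7}]
u2_pwm = [{b: 0.25 for b in BASES} for _ in range(2)]
species = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}  # AT-rich genome, made up

u2_corrected = blend_background(u2_pwm, species, weight=0.5)
raw = [log_odds(s, u12_pwm, u2_corrected) for s in ("AT", "AA", "CG", "TT", "GC")]
print(robust_z(raw))
```

Blending an AT-rich background into the U2-type matrix lowers the log-odds of AT-rich windows, which is the kind of composition bias the correction step is meant to address.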

See Technical Details for the full algorithm description.
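
To give a feel for stages 4 and 5, the sketch below summarizes a set of per-model probabilities and then applies a toy adjustment. The functional forms are placeholders: treating the species-level prior as a Bayesian reweighting of the odds, assuming an implicit model prior of 0.5, and using a linear `penalty * sigma` shrinkage are all assumptions for illustration, not intronIC's actual formula.

```python
from statistics import mean, pstdev

def ensemble_summary(member_probs):
    """Mean probability and spread (sigma) across ensemble members."""
    return mean(member_probs), pstdev(member_probs)

def adjust(prob, sigma, prior, penalty=2.0):
    """Hypothetical adjustment: reweight by a species prior, shrink by disagreement."""
    eps = 1e-9
    p = min(max(prob, eps), 1 - eps)
    q = min(max(prior, eps), 1 - eps)
    # Bayes rule on the odds scale, assuming the model's implicit prior is 0.5.
    odds = (p / (1 - p)) * (q / (1 - q))
    posterior = odds / (1 + odds)
    # Placeholder penalty: larger disagreement pulls the score toward zero.
    return posterior * max(0.0, 1 - penalty * sigma)

# Made-up outputs from three ensemble members for one intron.
prob, sigma = ensemble_summary([0.98, 0.97, 0.99])

print(adjust(prob, sigma, prior=0.5))   # species with a clear U12 population
print(adjust(prob, sigma, prior=0.05))  # species where U12 introns look absent
```

With a low species-level prior the same raw probability is pulled down substantially, which matches the stated goal of suppressing false positives in species lacking a distinct U12-type population.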


Documentation

Full documentation lives in the intronIC Wiki.


Citation

If you use intronIC in your research, please cite:

Moyer DC, Larue GE, Hershberger CE, Roy SW, Padgett RA. (2020) Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research 48(13):7066-7078. doi:10.1093/nar/gkaa464


Support


Contributing

See CONTRIBUTING.md for guidelines.

git clone https://github.com/glarue/intronIC.git
cd intronIC
make install    # Set up development environment
make test       # Run tests

License

GNU General Public License v3.0
