Intron classification tool for identifying U2-type and U12-type introns using SVM

intronIC - (intron Interrogator and Classifier)

Version 2.0 - Refactored Edition with Corrected ML Architecture

intronIC is a bioinformatics tool for extracting and classifying intron sequences as U12-type (minor) or U2-type (major) using a support vector machine trained on position-weight matrix scores.


Quick Start

Installation

pip install intronIC

Basic Usage

# Classify introns (default model loaded automatically)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Extract sequences only (no classification)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Train a custom model (optional - most users don't need this)
intronIC train -n my_model -p 8

Test Run

# Download test data or use the included test files
intronIC -g test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz \
         -a test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
         -n homo_sapiens_chr19 -p 4

Expected: ~29,000 introns extracted, ~30 U12-type introns found (runtime: 1-2 minutes with -p 4).


Documentation

For complete documentation, see the intronIC Wiki.


What's New in Version 2.0

This refactored version maintains 100% algorithmic fidelity and CLI compatibility with the original intronIC while providing a modernized, maintainable codebase:

Key Improvements

  • Corrected ML Architecture: Fixed double-scaling issue and train/test mismatch
    • Single scaling step via RobustScaler with centering (removes composition bias)
    • Configurable augmented features (5D standard or custom)
    • Two-stage optimization (C via balanced_accuracy, calibration via log-loss)
  • Modular Architecture: Organized into logical packages instead of a single 6,000+-line file
  • Enhanced Code Quality: Type hints throughout, immutable data structures, better error handling
  • Bug Fixes: Corrected data leakage in z-score normalization, fixed type_id assignment
  • Modern Tooling: Support for pixi and uv package managers
  • Improved Documentation: Comprehensive wiki and inline documentation
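
The single scaling step can be illustrated with a minimal sketch. It assumes "RobustScaler with centering" means median/IQR scaling as in scikit-learn's RobustScaler. The function names and the toy scores are hypothetical, and this pure-Python stand-in is not intronIC's own code. The point it shows is that the center and scale are learned once from training data and then reused unchanged on new data, so nothing is scaled twice and no statistics leak from the data being classified.

```python
from statistics import median, quantiles


def fit_robust_scaler(train_values):
    """Learn (center, scale) from training values only.

    center = median, scale = interquartile range (Q3 - Q1).
    """
    q1, _, q3 = quantiles(train_values, n=4, method="inclusive")
    center = median(train_values)
    scale = (q3 - q1) or 1.0  # fall back to 1.0 if the IQR is zero
    return center, scale


def apply_robust_scaler(values, center, scale):
    """Scale values with parameters learned elsewhere (no refitting)."""
    return [(v - center) / scale for v in values]


# Hypothetical raw scores: one set for training, one unseen set.
train_scores = [-2.0, -1.0, 0.0, 1.0, 2.0, 10.0]
new_scores = [0.5, 3.0]

center, scale = fit_robust_scaler(train_scores)
scaled_train = apply_robust_scaler(train_scores, center, scale)
scaled_new = apply_robust_scaler(new_scores, center, scale)
```

Because the median and IQR are insensitive to extreme values, the outlier (10.0) in the toy training set barely affects how the other scores are scaled.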

Key Features

  • SVM-based classification with probability scores (0-100%)
  • Default pretrained model loaded automatically - works for virtually all species
  • Streaming mode (default) for ~85% memory reduction on large genomes
  • Parallel processing for improved performance (-p 8 recommended)
  • Fast runtimes: ~6-10 minutes for human genome with default settings
  • Comprehensive metadata including phase, position, parent gene/transcript

Scientific Background

Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome, while a small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome. U12-type introns have:

  • Highly conserved TCCTTAAC branch point motif
  • Terminal dinucleotides: AT-AC (~25%) or GT-AG (~75%)
  • Functional importance and evolutionary conservation
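
As a rough illustration of the sequence features listed above, the following sketch reports an intron's terminal dinucleotides and looks for an exact copy of the branch point consensus near the 3' end. The helper names, the 40-base search window, and the example sequence are assumptions made for this sketch. intronIC itself scores these regions with position-weight matrices rather than exact string matching.

```python
BRANCH_POINT_MOTIF = "TCCTTAAC"  # consensus quoted in the list above


def terminal_dinucleotides(intron_seq):
    """Return the first and last two bases, e.g. 'AT-AC' or 'GT-AG'."""
    seq = intron_seq.upper()
    return f"{seq[:2]}-{seq[-2:]}"


def find_branch_point(intron_seq, search_window=40):
    """Return the 0-based offset of an exact motif match that starts
    within the last `search_window` bases of the intron, or -1 if none.

    `search_window` is a hypothetical parameter chosen for illustration.
    """
    seq = intron_seq.upper()
    start = max(0, len(seq) - search_window)
    return seq.find(BRANCH_POINT_MOTIF, start)


# Made-up intron sequence: AT-AC termini with the motif near the 3' end.
example_intron = "ATATCCTTT" + "G" * 30 + "TCCTTAAC" + "TTTTTTTTTTAC"

termini = terminal_dinucleotides(example_intron)
bp_offset = find_branch_point(example_intron)
```

An exact-match search like this misses degenerate branch point sequences, which is one reason PWM scoring is the more robust approach.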

intronIC identifies U12-type introns using:

  1. PWM Scoring: Apply position-weight matrices to 5' splice site, branch point, and 3' splice site
  2. Normalization: Convert raw scores to z-scores (version 2.0 fixes a data-leakage issue in this step)
  3. SVM Classification: Linear SVM with balanced class weights outputs probability scores
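
The first two steps can be sketched in a few lines. This is a toy example under stated assumptions, not intronIC's implementation: the two-position matrices are invented, the function names are hypothetical, and scoring a sequence as a log2 ratio of a U12-type matrix against a U2-type matrix is an assumption about how the PWM scores are combined.

```python
import math
from statistics import mean, pstdev

# Hypothetical two-position PWMs (base frequencies per position).
U12_PWM = [
    {"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125},
    {"A": 0.25, "C": 0.125, "G": 0.125, "T": 0.5},
]
U2_PWM = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]


def pwm_log_ratio(seq, u12_pwm, u2_pwm):
    """Sum over positions of log2(P_U12(base) / P_U2(base))."""
    return sum(
        math.log2(u12[base] / u2[base])
        for base, u12, u2 in zip(seq, u12_pwm, u2_pwm)
    )


def z_scores(raw_scores, ref_mean, ref_sd):
    """Convert raw scores to z-scores using reference statistics."""
    return [(x - ref_mean) / ref_sd for x in raw_scores]


# Step 1: raw PWM scores for three made-up motif instances.
raw = [pwm_log_ratio(s, U12_PWM, U2_PWM) for s in ("AT", "GG", "CA")]

# Step 2: z-scores relative to a reference set (here, the same toy scores).
ref_mean, ref_sd = mean(raw), pstdev(raw)
normalized = z_scores(raw, ref_mean, ref_sd)
```

In the full pipeline, one z-score per region (5' splice site, branch point, 3' splice site) would form the feature vector passed to the classifier in step 3. With scikit-learn that could be something like `SVC(kernel="linear", class_weight="balanced")`, which is not shown here.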

For detailed algorithm description, see the Technical Details wiki page.


Citation

If you use intronIC in your research, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett. Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078. https://doi.org/10.1093/nar/gkaa464


Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

git clone https://github.com/glarue/intronIC.git
cd intronIC
make install    # Set up development environment
make test       # Run tests

License

intronIC is released under the GNU General Public License v3.0.
