
Intron classification tool for identifying U2-type and U12-type introns using SVM

intronIC - (intron Interrogator and Classifier)

Version 2.0.0 - Refactored Edition with Corrected Architecture

intronIC is a bioinformatics tool for extracting and classifying intron sequences as U12-type (minor) or U2-type (major) using a support vector machine (SVM) trained on position-weight matrix (PWM) scores. It can be used with a genome and annotation file, or with pre-extracted intron sequences. Alternatively, intronIC can extract all annotated intron sequences without classification (using the extract subcommand).


About This Refactored Version

This refactored version maintains 100% algorithmic fidelity and CLI compatibility with the original intronIC while providing a modernized, maintainable codebase:

Key Improvements

  • Corrected ML Architecture (v2.0): Fixed double-scaling issue and train/test mismatch
    • Single scaling step via RobustScaler with centering (removes composition bias)
    • Configurable augmented features with 5D standard (absdiff_bp_3, absdiff_5_bp) or custom feature sets
    • Two-stage optimization (C via balanced_accuracy, calibration via log-loss)
    • L1/L2 penalty search with class weight multiplier optimization
  • Modular Architecture: Organized into logical packages (extraction, scoring, classification, output) instead of a single 6,000+-line file
  • Enhanced Code Quality: Type hints throughout, immutable data structures, better error handling
  • Bug Fixes: Corrected data leakage in z-score normalization, fixed type_id assignment logic
  • Better Testing: Structured for unit and integration testing
  • Modern Tooling: Support for pixi and uv package managers
  • Enhanced Logging: Clearer progress reporting with section markers and detailed training logs
  • Improved Documentation: Comprehensive inline documentation and external guides

What's Preserved

  • Same Classification Algorithm: Linear SVM with balanced class weights
  • Same Feature Extraction: PWM scoring of 5' splice site, branch point, and 3' splice site
  • Same Output Formats: All .iic files maintain compatibility (with minor enhancements)
  • Same Performance: Comparable runtime and memory usage to original
  • Validated Accuracy: Identical classification results on test data

Scientific Background

Minor (U12-type) vs Major (U2-type) Introns

Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome and typically have:

  • 5' splice site: GT
  • 3' splice site: AG
  • Branch point: A within a loose consensus

A small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome and typically have:

  • 5' splice site: AT (AT-AC type) or GT (GT-AG type)
  • 3' splice site: AC (AT-AC type) or AG (GT-AG type)
  • Branch point: Highly conserved TCCTTAAC motif

Classification Approach

intronIC uses a three-step scoring and classification pipeline:

  1. PWM Scoring: Apply position-weight matrices to three key regions (5' splice site, branch point, 3' splice site) to calculate raw log-odds scores
  2. Normalization: Convert raw scores to z-scores using parameters fit on reference sequences only (prevents data leakage)
  3. SVM Classification: Train an ensemble of linear SVMs on reference U12/U2 introns, output probability scores (0-100%)

The output probability represents the classifier's confidence that an intron is U12-type. By default, introns with scores >90% are considered high-confidence U12-type predictions.
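
As a rough illustration of step 1, the sketch below scores a short motif against two toy matrices and returns a log-likelihood ratio. The matrix values, the two-position motif, and the function names `pwm_log_prob` and `log_odds` are all invented for this example, and treating the ratio as U12 model versus U2 model is an assumption; intronIC's real matrices and scoring code may differ.

```python
import math

# Toy per-position base probabilities. These numbers are invented for
# illustration and are NOT intronIC's matrices.
U12_PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]
U2_PWM = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]


def pwm_log_prob(seq, pwm):
    """Sum of per-position log-probabilities of seq under one PWM."""
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM lengths differ")
    return sum(math.log(column[base]) for base, column in zip(seq, pwm))


def log_odds(seq, u12_pwm, u2_pwm):
    """Log-likelihood ratio: positive favors the U12 matrix, negative the U2 matrix."""
    return pwm_log_prob(seq, u12_pwm) - pwm_log_prob(seq, u2_pwm)


print(round(log_odds("AT", U12_PWM, U2_PWM), 3))  # positive: AT fits the toy U12 matrix
print(round(log_odds("GT", U12_PWM, U2_PWM), 3))  # negative: GT fits the toy U2 matrix
```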

ML Pipeline Architecture

intronIC uses a single scaling step architecture to prevent double-scaling and ensure train/test consistency:

Raw PWM Scores (LLRs)
         ↓
ScoreNormalizer (EXTERNAL to pipeline)
  - RobustScaler(with_centering=True)
  - Fitted on reference introns only
  - Transforms: raw LLRs → z-scores
  - Removes composition bias via centering
         ↓
Z-Scores [five_z_score, bp_z_score, three_z_score]
         ↓
ML Pipeline (NO scaler inside)
  ├─ BothEndsStrongTransformer
  │  └─ Augments 3D → 5D features (standard config):
  │     • Pass-through: five_z, bp_z, three_z
  │     • absdiff_bp_3 = |bp_z - three_z| (BP/3' imbalance penalty)
  │     • absdiff_5_bp = |five_z - bp_z| (5'/BP imbalance penalty)
  │  └─ Or custom 4D-7D with different features:
  │     • min_all, absdiff_5_3, min_5_bp, max_5_bp, etc.
  ├─ LinearSVC
  │  └─ L1 or L2 penalty (grid-searched), balanced class weights
  └─ CalibratedClassifierCV
     └─ External calibration (sigmoid or isotonic)
         ↓
U12 Probability (0-100%)
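
The first two stages of the diagram can be sketched in plain Python. This is a minimal stand-in, not intronIC's code: it assumes a median/interquartile-range scaler analogous to RobustScaler with centering, and the helper names (`fit_robust_params`, `scale`, `augment`) and all numbers are hypothetical.

```python
from statistics import median, quantiles


def fit_robust_params(reference_scores):
    """Return (center, spread) = (median, interquartile range) of reference scores."""
    q1, _, q3 = quantiles(reference_scores, n=4, method="inclusive")
    return median(reference_scores), (q3 - q1) or 1.0  # guard against a zero IQR


def scale(raw, params):
    """Center and scale one raw score with previously fitted parameters."""
    center, spread = params
    return (raw - center) / spread


def augment(five_z, bp_z, three_z):
    """3D -> 5D: pass-through scores plus the two absolute-difference features."""
    return (
        five_z,
        bp_z,
        three_z,
        abs(bp_z - three_z),  # absdiff_bp_3
        abs(five_z - bp_z),   # absdiff_5_bp
    )


# Hypothetical raw LLRs for reference introns, one list per region.
reference = {
    "five": [-4.0, -2.0, 0.0, 2.0, 4.0],
    "bp": [-6.0, -3.0, 0.0, 3.0, 6.0],
    "three": [-2.0, -1.0, 0.0, 1.0, 2.0],
}
params = {region: fit_robust_params(vals) for region, vals in reference.items()}

# A new intron is scaled once, using the reference-derived parameters only.
raw = {"five": 4.0, "bp": 3.0, "three": -1.0}
z = {region: scale(raw[region], params[region]) for region in raw}
features = augment(z["five"], z["bp"], z["three"])
print(features)
```

Because the parameters are fitted once and reused, training and prediction apply the same transformation, and nothing downstream needs to scale again.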

Key Design Principles:

  1. Single Scaling Step: Scaling happens ONLY in ScoreNormalizer (external to pipeline). The pipeline receives pre-scaled z-scores and does NOT re-scale them. This prevents double-scaling.

  2. Train/Test Consistency: Both training and prediction extract z-scores from introns and pass them to the pipeline, ensuring identical data transformations.

  3. Domain Adaptation: ScoreNormalizer can be refitted per-species (adaptive mode) or reused from training species (human mode) for cross-species classification.

  4. Feature Engineering: BothEndsStrongTransformer adds configurable composite features. The standard 5D configuration adds absdiff_bp_3 and absdiff_5_bp (BP/3' and 5'/BP imbalance penalties) based on L1 regularization analysis. See config/config.yaml for all available features.

  5. Hyperparameter Optimization:

    • Grid search over: C parameter, L1/L2 penalty, class weight multipliers
    • Stage 1: Optimize C using balanced_accuracy (discrimination quality)
    • Stage 2: Select calibration method (sigmoid vs isotonic) using log-loss (probability quality)
  6. YAML Configuration: All optimizer settings are configurable via config/config.yaml including feature selection, penalty options, class weight multipliers, and CV parameters.
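
To see why the two optimization stages use different metrics, the sketch below implements both from their standard definitions and applies them to made-up predictions. The labels and probabilities are invented, and this is not intronIC's optimizer, only an illustration of what each metric measures.

```python
import math


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so the rare class counts as much as the common one."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def log_loss(y_true, p_u12, eps=1e-15):
    """Mean negative log-likelihood of the true labels given predicted P(U12)."""
    total = 0.0
    for y, p in zip(y_true, p_u12):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -math.log(p if y == 1 else 1 - p)
    return total / len(y_true)


# Hypothetical labels (1 = U12, 0 = U2) and two candidate calibrations.
y_true = [1, 1, 0, 0, 0, 0]
calib_a = [0.90, 0.60, 0.20, 0.10, 0.10, 0.40]
calib_b = [0.99, 0.51, 0.49, 0.01, 0.01, 0.49]

# Both calibrations give identical hard calls at the 0.5 boundary...
calls_a = [int(p > 0.5) for p in calib_a]
calls_b = [int(p > 0.5) for p in calib_b]
print(balanced_accuracy(y_true, calls_a), balanced_accuracy(y_true, calls_b))

# ...but their probabilities differ in quality, which log-loss can distinguish.
print(round(log_loss(y_true, calib_a), 4), round(log_loss(y_true, calib_b), 4))
```

Balanced accuracy cannot separate the two candidates here, while log-loss prefers the one whose probabilities sit further from the boundary on the hard cases.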

This architecture was validated on C. elegans, where it produced 1 false positive (1/109,830), compared with 130 under uncentered scaling.


Installation

Quick Install (Recommended)

pip install intronIC

That's it! This installs intronIC and all dependencies from PyPI.

From Source (Development/Latest)

For the latest development version or to contribute:

git clone https://github.com/glarue/intronIC.git
cd intronIC
pip install -e .

Using pixi (Reproducible Environments)

Pixi provides fully reproducible environments with locked dependencies—ideal for HPC clusters or when exact reproducibility is required:

# Install pixi (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash

# Clone and install
git clone https://github.com/glarue/intronIC.git
cd intronIC
pixi install

# Run intronIC through pixi
pixi run intronIC -h

# Or run the included test
pixi run test-small

When to use pixi:

  • HPC/cluster environments with strict reproducibility requirements
  • When you need isolated, self-contained environments
  • If you prefer conda-style environment management

Verify Installation

intronIC --version
intronIC -h

Dependencies

intronIC requires Python 3.10+ and the following packages:

  • numpy >=1.19.0 - Numerical operations
  • scipy >=1.5.0 - Scientific computing
  • scikit-learn >=0.22, <2.0 - SVM classifier
  • matplotlib >=3.3.0 - Plotting
  • networkx >=2.5.1 - Graph operations for annotation parsing
  • rich >=10.0 - Terminal progress bars
  • biogl >=0.1.0 - Bioinformatics utilities

All dependencies are automatically installed by pixi, uv, or pip.

intronIC was developed on Linux and has been tested on macOS and Windows.


Quick Start

Installation (One Command)

pip install intronIC

Basic Commands

# Classify introns (train on-the-fly)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Use pretrained model (faster)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name --model model.pkl -p 8

# Extract sequences only (no classification)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Train a model (no genome needed)
intronIC train -n my_model -p 8

Test Run (Human Chr19)

# With test data included in repository
intronIC -g test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz \
         -a test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
         -n homo_sapiens_chr19 -p 4

Expected results:

  • ~29,000 introns extracted
  • ~30 U12-type introns (score ≥90%)
  • ~8 AT-AC type U12 introns
  • Output files: homo_sapiens_chr19.*.iic

Usage

Commands

intronIC supports three modes of operation: a default classification mode and two subcommands.

Command    Description
(default)  Classify introns from genome + annotation
train      Train a model on reference data (no genome needed)
extract    Extract sequences only (no classification)

Default Mode: Classify Introns

The default mode extracts introns and classifies them as U12 or U2 type:

# Basic usage (trains model on-the-fly)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name

# With pretrained model (faster, recommended)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model homo_sapiens.model.pkl -p 8

# Memory-efficient streaming mode
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model homo_sapiens.model.pkl --streaming -p 8

Train Subcommand

Train a classifier model without needing a genome:

# Basic training with built-in references
intronIC train -n my_model -p 8

# With custom configuration
intronIC --config config/config.yaml train -n my_model -p 12

# With custom reference sequences
intronIC train -n my_model -p 8 \
  --reference_u12s custom_u12.iic \
  --reference_u2s custom_u2.iic

# Quick training (skip nested CV evaluation)
intronIC train -n my_model --eval_mode none -p 8

Output: my_model.model.pkl - use with --model for classification.

Extract Subcommand

Extract intron sequences without classification:

# Extract from annotation (streaming mode by default)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name

# Extract from BED file
intronIC extract -g genome.fa.gz -b introns.bed -n species_name

# With custom flank length
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name --flank-len 20

Output: .introns.iic, .meta.iic, .bed.iic files (no classification scores).

Required Arguments

Argument        Short  Description
--genome        -g     Genome FASTA file (gzip supported)
--annotation    -a     GFF3/GTF annotation file (gzip supported)
--species-name  -n     Species name / output prefix

Alternative inputs (instead of -a):

  • -b FILE - BED file with intron coordinates
  • -q FILE - Pre-extracted sequences file (used instead of both -g and -a)

Common Options

Argument                   Short  Description                        Default
--processes                -p     Number of CPU cores                1
--threshold                -t     U12 probability threshold (0-100)  90
--model                           Pretrained model file              None
--config                          YAML configuration file            Auto-discovered
--streaming                       Memory-efficient mode              False
--feature-type             -f     Feature type: cds, exon, or both   both
--allow-multiple-isoforms  -i     Include all isoforms               False (longest only)
--exclude-overlapping      -v     Exclude overlapping introns        False
--no-nc                           Exclude non-canonical introns      False
--recursive                       Recursive training                 False

Usage Examples

1. Basic classification:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -p 8

2. With pretrained model (recommended for speed):

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species \
  --model homo_sapiens.model.pkl -p 8

3. Streaming mode for large genomes:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species \
  --model homo_sapiens.model.pkl --streaming -p 8

4. Extract sequences only:

intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n my_species -p 8

5. Train custom model:

intronIC train -n my_trained_model -p 8 --config config/config.yaml

6. Stricter threshold (95%):

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -t 95 -p 8

7. Include all isoforms, CDS only:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -i -f cds -p 8

8. Classify from BED coordinates:

intronIC -g genome.fa.gz -b intron_coordinates.bed -n my_species \
  --model homo_sapiens.model.pkl

Output Files

All output files are tab-delimited with the .iic extension and named {species_name}.{type}.iic.

Main Output Files

1. .meta.iic - Comprehensive Metadata

Contains detailed information for each intron:

  • Intron name/label with tags
  • Relative score (distance from threshold)
  • Terminal dinucleotides (e.g., GT-AG, AT-AC)
  • Motif schematic (showing branch point context)
  • Branch point region sequence
  • Intron length
  • Parent transcript ID
  • Grandparent gene ID
  • Intron index and total family size
  • Fractional position in transcript
  • Exon phase
  • Type ID (u2 or u12)
  • Attributes (longest_isoform, corrected, etc.)

2. .bed.iic - BED-format Coordinates

Standard BED format with scores:

  • Chromosome
  • Start (0-based, BED standard)
  • Stop (1-based; equivalent to the half-open end coordinate used by BED)
  • Label (intron_name;probability)
  • SVM score (0-100, integer)
  • Strand
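
A minimal reader for these six columns might look like the following. The example line is invented, the `BedRecord` and `parse_bed_line` names are hypothetical, and the handling of a missing or non-integer score is an assumption about files produced without classification.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int             # 0-based start
    stop: int              # end coordinate as written in the file
    name: str
    probability: str       # kept as text, exactly as it appears in the label
    score: Optional[int]   # None when the column is not an integer
    strand: str


def parse_bed_line(line):
    """Split one tab-delimited .bed.iic line into typed fields."""
    chrom, start, stop, label, score, strand = line.rstrip("\n").split("\t")[:6]
    # The label is described as 'intron_name;probability'; split on the last ';'.
    name, sep, probability = label.rpartition(";")
    if not sep:  # no ';' present: treat the whole label as the name
        name, probability = label, ""
    return BedRecord(
        chrom, int(start), int(stop), name, probability,
        int(score) if score.isdigit() else None, strand,
    )


# Invented example line, not real intronIC output.
example_line = "chr19\t1000\t1250\texample_intron_1;97.3\t97\t+\n"
record = parse_bed_line(example_line)
print(record.name, record.score, record.stop - record.start)
```

With a 0-based start and this stop convention, `stop - start` gives the interval length directly.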

3. .seqs.iic - Sequences

Intron sequences with flanking regions:

  • Intron name
  • 5' flanking sequence (exonic)
  • Intron sequence
  • 3' flanking sequence (exonic)
  • SVM score (if classification performed)

4. .scores.iic (or .score_info.iic) - Detailed Scoring

Per-intron breakdown of all scores:

  • Name and scores (relative, SVM, decision_distance)
  • 5' splice site: sequence, raw score, z-score
  • Branch point: sequences (U12 and U2 versions), raw score, z-score
  • 3' splice site: sequence, raw score, z-score

5. Mapping Files

  • .dupe_map.iic - Maps duplicate introns to their representative
  • .overlap_map.iic - Maps overlapping intron coordinates

6. Visualization Files (.png)

  • *_scatter.png - 2D scatter plot of classified introns with marginal distributions
  • *_training_scatter.png - Scatter plot of training data
  • *_training_hexplot.png - Hexbin density plot of reference introns
  • *_pr_curve.png - Precision-recall curves (with AUC) for model evaluation

7. Log Files

  • .log - Main log file with pipeline progress and summary statistics
  • .training.log - Detailed training log (when models are trained, not with --model)

Identifying U12-type Introns

U12-type introns are identified by their relative score > 0 (equivalent to SVM score > threshold):

# Extract U12-type introns from meta file
awk '($2!="NA" && $2>0)' species_name.meta.iic

# Count U12-type introns
awk '($2!="NA" && $2>0)' species_name.meta.iic | wc -l

# Get U12-type intron names
awk '($2!="NA" && $2>0) {print $1}' species_name.meta.iic

# Filter by higher confidence (relative score > 10)
awk '($2!="NA" && $2>10)' species_name.meta.iic
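
For downstream scripts, the same filter can be written in Python. This is a small sketch that assumes only what the awk commands above assume: a tab-delimited file whose second column is the relative score or `NA`. The function name `u12_rows` and the sample rows are invented.

```python
import csv
import io


def u12_rows(handle, min_relative=0.0):
    """Yield rows whose relative score (column 2) is greater than min_relative."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2 or row[1] == "NA":
            continue  # unscored introns carry 'NA'
        try:
            relative = float(row[1])
        except ValueError:
            continue  # skip any non-numeric line, such as a header if present
        if relative > min_relative:
            yield row


# Invented rows standing in for species_name.meta.iic.
example = io.StringIO(
    "intron_a\t7.5\tAT-AC\n"
    "intron_b\t-88.0\tGT-AG\n"
    "intron_c\tNA\tGT-AG\n"
)
names = [row[0] for row in u12_rows(example)]
print(names)
```

To run it on real output, pass an open file instead of the `StringIO` object, for example `with open("species_name.meta.iic") as fh: ...`, and raise `min_relative` to 10 for the stricter filter.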

Understanding the Scores

SVM Score (0-100):

  • Probability that the intron is U12-type
  • 50 = equal probability of U2 or U12
  • 90 = high confidence U12 (default threshold)
  • <10 = high confidence U2

Relative Score:

  • Distance from the threshold
  • Calculated as: svm_score - threshold
  • Positive values = above threshold (U12-type at chosen confidence)
  • Negative values = below threshold (U2-type)
  • Makes filtering easier: just check if > 0

Type ID (u2 or u12):

  • Binary classification based on raw classifier decision (50% boundary)
  • Independent of the user-chosen threshold
  • Used for organizing output and statistics

Decision Distance:

  • Log-odds of the U12 probability: log(probability / (1 - probability))
  • 0 = equal probability (50%)
  • Positive = favors U12
  • Negative = favors U2
  • Useful for understanding classifier confidence
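
The relationships between these four values can be written out directly from the definitions above. The function names are hypothetical, and how a score of exactly 50 is assigned is an assumption (treated as u2 here); the formulas themselves come from the descriptions in this section.

```python
import math


def relative_score(svm_score, threshold=90.0):
    """Distance from the chosen threshold; > 0 means U12-type at that confidence."""
    return svm_score - threshold


def type_id(svm_score):
    """Binary call at the 50% boundary, independent of the threshold.

    Assumption: a score of exactly 50 is reported as 'u2'.
    """
    return "u12" if svm_score > 50.0 else "u2"


def decision_distance(svm_score):
    """log(p / (1 - p)) with p = svm_score / 100; undefined at exactly 0 or 100."""
    p = svm_score / 100.0
    return math.log(p / (1.0 - p))


# An intron scoring 75 is 'u12' by type ID but falls below the default threshold.
score = 75.0
print(relative_score(score), type_id(score), round(decision_distance(score), 3))
```

This is why filtering on type ID and filtering on relative score can give different counts: the first uses the 50% boundary, the second uses the chosen threshold.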

A Note on the -n (Name) Argument

By default, intronIC expects species names in binomial format (genus, species) separated by a non-alphanumeric character:

  • homo_sapiens
  • homo.sapiens
  • homo-sapiens

intronIC formats the name internally into a tag for intron IDs (e.g., HomSap), using only the first two elements.

Output files are named using the full argument supplied to -n, so:

  • homo_sapiens → files named homo_sapiens.*
  • homo_sapiens.v2 → files named homo_sapiens.v2.*
  • Intron IDs in both would use HomSap tag

To use the name argument exactly as provided without any parsing, add the --na flag:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n "My Custom Name" --na
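
The name-to-tag behavior described above could be approximated as follows. This is a guess at the rule based on the single HomSap example (first three letters of each of the first two elements, capitalized); the function name `species_tag` is hypothetical and intronIC's actual formatting may differ.

```python
import re


def species_tag(name):
    """Build a short tag from the first two alphanumeric elements of a name.

    Assumption: three letters per element, inferred from 'homo_sapiens' -> 'HomSap'.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    return "".join(p[:3].capitalize() for p in parts[:2])


for example in ("homo_sapiens", "homo.sapiens", "homo-sapiens", "homo_sapiens.v2"):
    print(example, "->", species_tag(example))
```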

Resource Usage

Memory

Memory usage scales with the number of annotated introns in the genome:

  • Small genomes (<50,000 introns): <1 GB
  • Typical genomes (50,000-200,000 introns): 1-3 GB
  • Large genomes (>200,000 introns): 3-5 GB
  • Human genome (Ensembl 95, ~1 million introns): ~5 GB
  • Streaming mode (with --model --streaming): ~0.5 GB regardless of genome size

Most modern computers should handle even large genomes without issue. For memory-constrained environments, use streaming mode with a pretrained model.

Runtime

Runtime depends on genome size, annotation density, and whether models are pre-trained:

Genome        Introns   Train Mode  Pretrained (--model -p 5)
Chr19 (test)  ~29,000   5-15 min    <1 min
Small genome  ~50,000   10-30 min   1-2 min
Human (full)  ~200,000  20-40 min   ~3 min

Tips for faster runs:

  • Use --model with a pretrained model to skip training (fastest)
  • Use -p N for parallel processing (recommended: 5-8 cores)
  • Use --streaming with --model for large genomes with memory constraints
  • Use small reference sets for testing (--reference_u12s, --reference_u2s)
  • Extract sequences first with extract subcommand, then classify separately if iterating on parameters

Advanced Usage

Using Pretrained Models

For cross-species classification using a model trained on another species:

# Use a specific trained model file
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model /path/to/trained_species.model.pkl

This is the recommended approach for:

  • Classifying species without curated U12 references
  • Applying a human-trained model to other vertebrates
  • Fast classification when training data is unavailable

The pretrained model contains:

  • Trained SVM ensemble with optimized hyperparameters
  • Frozen scaler from training species (for cross-species normalization)
  • Model metadata (training parameters, feature configuration)

Streaming Mode

For large genomes with memory constraints, streaming mode processes introns per-chromosome:

# Memory-efficient streaming with pretrained model
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model trained.model.pkl --streaming -p 8

Streaming mode provides ~90% memory savings by:

  • Processing one chromosome at a time
  • Writing results immediately (not accumulating in memory)
  • Using the frozen scaler from the pretrained model

Requirements: Streaming mode requires --model (pretrained model with frozen scaler).
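
The general idea behind per-chromosome streaming can be sketched as below. This is not intronIC's implementation: the record layout, the `stream_by_chromosome` function, and the `toy_classify` stand-in for a frozen pretrained model are all hypothetical, and the sketch assumes records arrive already grouped by chromosome.

```python
import io
from itertools import groupby
from operator import itemgetter


def stream_by_chromosome(records, classify, out):
    """Score records one chromosome at a time, writing each result as it is produced.

    `records` is an iterable of (chrom, name, features) tuples that is already
    grouped by chromosome. Returns the number of chromosomes processed.
    """
    n_chroms = 0
    for chrom, group in groupby(records, key=itemgetter(0)):
        batch = list(group)  # only this chromosome's records are held in memory
        for _, name, features in batch:
            out.write(f"{chrom}\t{name}\t{classify(features):.1f}\n")
        n_chroms += 1
    return n_chroms


def toy_classify(features):
    """Stand-in for a frozen model: 100 if every feature is positive, else 0."""
    return 100.0 if min(features) > 0 else 0.0


records = [
    ("chr1", "intron_1", (1.2, 0.8, 0.5)),
    ("chr1", "intron_2", (-2.0, -1.5, -0.3)),
    ("chr2", "intron_3", (-1.0, -2.2, -0.7)),
]
out = io.StringIO()
n = stream_by_chromosome(records, toy_classify, out)
print(n)
print(out.getvalue(), end="")
```

A fixed scaler and model are what make this possible: no statistic has to be computed over the whole genome before the first result can be written.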

Configuration Files

intronIC uses YAML configuration files for advanced parameter tuning:

# Use custom configuration
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --config config/profiles/production.yaml

Configuration files are auto-discovered from (in priority order):

  1. --config PATH (explicit CLI argument)
  2. ./.intronIC.yaml (current directory)
  3. ~/.config/intronIC/config.yaml (XDG config)
  4. Built-in defaults
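
The priority order above amounts to a first-match lookup, sketched here with pathlib. The function name `discover_config` and its parameters are hypothetical; only the three file locations and their order come from the list above.

```python
import tempfile
from pathlib import Path


def discover_config(cli_path=None, cwd=None, home=None):
    """Return the config path chosen by priority order, or None for built-in defaults."""
    if cli_path is not None:
        return Path(cli_path)  # an explicit --config always wins
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    candidates = [
        cwd / ".intronIC.yaml",
        home / ".config" / "intronIC" / "config.yaml",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # caller falls back to built-in defaults


# Demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    work, home = Path(tmp) / "work", Path(tmp) / "home"
    work.mkdir()
    (home / ".config" / "intronIC").mkdir(parents=True)

    found_none = discover_config(cwd=work, home=home)
    (home / ".config" / "intronIC" / "config.yaml").write_text("# user config\n")
    found_user = discover_config(cwd=work, home=home)
    (work / ".intronIC.yaml").write_text("# project config\n")
    found_local = discover_config(cwd=work, home=home)

    print(found_none, found_user.name, found_local.name)
```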

Key configurable parameters include:

  • Feature selection: Choose which augmented features to use (5D standard or custom)
  • Penalty options: L1, L2, or both for regularization search
  • Class weight multipliers: Fine-tune precision/recall tradeoff
  • CV parameters: Number of folds, optimization rounds
  • Ensemble settings: Number of models, subsampling ratio

See config/config.yaml for full documentation of all options.

Recursive Training

For species distant from the training data, recursive training can improve accuracy:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name --recursive

This runs two classification passes in three steps:

  1. Initial classification to identify high-confidence U12-type introns
  2. Build species-specific PWMs and retrain models
  3. Re-classify all introns with the updated models

Custom Reference Sequences

For specialized analyses, you can provide custom reference sequences:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --reference_u12s my_u12_introns.iic \
  --reference_u2s my_u2_introns.iic

Reference files should follow the .iic format (tab-delimited: name, 5'_flank, intron_seq, 3'_flank).
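
A quick sanity check of a custom reference file is to parse each line and look at the terminal dinucleotides. The sketch below assumes only the four tab-delimited columns named above; the function name `parse_reference_line` and the example sequence are invented.

```python
def parse_reference_line(line):
    """Parse one 'name, 5' flank, intron sequence, 3' flank' line into a dict."""
    name, five_flank, intron, three_flank = line.rstrip("\n").split("\t")[:4]
    intron = intron.upper()
    return {
        "name": name,
        "five_flank": five_flank,
        "intron": intron,
        "three_flank": three_flank,
        # First two and last two bases of the intron, e.g. 'AT-AC' or 'GT-AG'.
        "dinucleotides": f"{intron[:2]}-{intron[-2:]}",
    }


# Invented example line, not a curated reference intron.
example = "ref_u12_1\tCAGAAG\tatatcctttTCCTTAACtttttcac\tGTGCTG\n"
entry = parse_reference_line(example)
print(entry["name"], entry["dinucleotides"], len(entry["intron"]))
```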

Two-Stage Workflow

For large genomes or parameter tuning, you can separate extraction from classification:

Stage 1: Extract sequences only

intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name
# Produces: species_name.introns.iic (and .meta.iic, .bed.iic)

Stage 2: Classify extracted sequences

intronIC -q species_name.introns.iic -n species_name -t 95
# Much faster for testing different thresholds or references

Troubleshooting

Common Issues

"No U12-type introns found"

  • Normal for some small genomes or chromosomes
  • Try lowering threshold: -t 80
  • Check that annotation contains sufficient introns
  • Consider using --recursive for distant species

"Out of memory" errors

  • Use --streaming with a pretrained --model to reduce memory use (see Streaming Mode)
  • Use a machine with more RAM for very large genomes
  • Try processing chromosomes separately using BED input
  • Reduce parallelization: -p 1 or -p 2

"No introns extracted"

  • Check that genome and annotation use matching chromosome names
  • Verify annotation format (GFF3 or GTF)
  • Try different feature type: -f cds or -f exon
  • Check annotation file is not corrupted

Slow performance

  • Use parallel processing: -p 4 or -p 8
  • Use --model to skip model training
  • Use smaller reference sets for testing
  • Consider extracting sequences first (extract subcommand), then classify separately

Classification results differ from original intronIC

  • Minor differences can occur due to:
    • Random seed in cross-validation
    • sklearn version differences
    • Floating-point precision
  • Major differences are unexpected; please file an issue

Getting Help

For refactoring-specific questions, see REFACTOR_SUMMARY.md.


Testing the Installation

To verify your installation works correctly, download the test data and run:

# Download test data (if not cloned from repo)
# Or use your own genome + annotation files

# Run on Human Chr19 test data
intronIC -g test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz \
         -a test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
         -n test_run -p 4

# With pixi (from cloned repo)
pixi run test-small

Expected output:

  • Several .iic files named test_run.*
  • A .log file with classification summary
  • PNG plots showing score distributions
  • Console output showing ~30 U12-type introns found

Project Structure

The codebase is organized into logical modules under src/intronIC/:

src/intronIC/
├── cli/                 # Command-line interface and orchestration
│   ├── main.py          # Pipeline entry point
│   ├── args.py          # Argument parsing
│   ├── config.py        # Configuration management
│   └── reporter.py      # Progress reporting
├── core/                # Core data structures
│   ├── intron.py        # Intron class and related types
│   └── reference.py     # Reference sequence management
├── extraction/          # Intron extraction from various sources
│   ├── annotation.py    # GFF3/GTF parsing
│   ├── bed.py           # BED file parsing
│   ├── sequences.py     # Sequence file parsing
│   └── filter.py        # Quality control and filtering
├── scoring/             # PWM scoring and normalization
│   ├── pwm.py           # Position-weight matrix operations
│   ├── scorer.py        # Score calculation
│   └── normalizer.py    # Z-score normalization
├── classification/      # SVM training and prediction
│   ├── trainer.py       # Model training with nested CV
│   ├── predictor.py     # Ensemble prediction
│   ├── nested_cv.py     # Nested cross-validation
│   └── split_eval.py    # Evaluation utilities
├── output/              # Output file generation
│   ├── writers.py       # All output writers
│   └── formatter.py     # Formatting utilities
├── visualization/       # Plotting functions
│   └── plots.py         # All visualization code
├── utils/               # Utility modules
│   ├── genome.py        # Genome file handling
│   ├── logging_utils.py # Enhanced logging
│   └── sequences.py     # Sequence utilities
└── __main__.py          # Module entry point

Citing intronIC

If you use intronIC in your research, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett. Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078. https://doi.org/10.1093/nar/gkaa464


About intronIC

intronIC was created to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from genomic data. U12-type introns are rare (~0.5% of introns) but functionally important, and contain distinct splicing motifs that make them amenable to computational identification.

Why intronIC?

Earlier U12 databases (U12DB, SpliceRack, ERISdb) were valuable resources but:

  • Static by design (not updated with new genome releases)
  • Based on older genome annotations
  • Limited to pre-selected species
  • Used heuristic classification criteria

intronIC addresses these limitations:

  • Works with any genome + annotation
  • Uses the well-established SVM classification approach
  • Produces interpretable probability scores
  • Allows customization of training data and parameters
  • Provides extensive metadata for downstream analysis
  • Regularly updated with algorithm improvements

Classification Method

intronIC's approach combines sequence motif analysis with machine learning:

  1. Position-Weight Matrices (PWMs): Capture sequence preferences at three key regions

    • 5' splice site (donor): Recognizes GT/AT at intron start
    • Branch point: Identifies TCCTTAAC-like motifs in U12-type introns
    • 3' splice site (acceptor): Recognizes AG/AC at intron end
  2. Z-Score Normalization: Converts raw PWM scores to standardized features

    • Fit on reference sequences only (prevents data leakage)
    • Accounts for different score ranges across regions
  3. Linear SVM Classifier: Learns a linear decision boundary over the three z-score features (augmented to 5D in the standard v2.0 configuration)

    • Trained on curated U12-type and U2-type reference sets
    • Balanced class weights handle imbalanced data (~0.5% expected U12-type)
    • Probability calibration provides confidence estimates
  4. Ensemble Averaging: Reduces variance through multiple models

    • Each model trained on different U2 subsamples
    • F1-weighted voting combines predictions
    • Produces robust, reliable probabilities
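
One plausible reading of "F1-weighted voting" is a weighted mean of per-model probabilities, with each model's F1 score as its weight. The sketch below shows that interpretation only; the function name and the numbers are invented, and intronIC's ensemble code may combine predictions differently.

```python
def f1_weighted_probability(probabilities, f1_scores):
    """Weighted mean of per-model U12 probabilities, weighted by each model's F1."""
    if not probabilities or len(probabilities) != len(f1_scores):
        raise ValueError("need one F1 score per model probability")
    total_weight = sum(f1_scores)
    if total_weight <= 0:
        raise ValueError("F1 weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, f1_scores)) / total_weight


# Hypothetical ensemble of three models scoring the same intron.
model_probs = [0.90, 0.80, 0.40]
model_f1 = [1.0, 1.0, 0.5]  # the weakest model gets half the say
combined = f1_weighted_probability(model_probs, model_f1)
print(round(combined, 3))
```

Compared with a plain average, the lower-F1 model pulls the combined probability down less.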

This approach avoids arbitrary score thresholds and provides probabilistic classifications that researchers can interpret based on their specific needs (e.g., high-confidence predictions for experimental validation vs. comprehensive catalogs).

The Refactoring

This refactored version maintains complete algorithmic fidelity to the original while dramatically improving code organization and maintainability. The original 6,093-line monolithic file has been restructured into 15+ focused modules, each with a single responsibility.

Key improvements include:

  • Fixed data leakage bug in z-score normalization
  • Corrected type_id assignment logic
  • Added comprehensive type hints
  • Immutable data structures for thread safety
  • Better logging and error messages
  • Structured for testing and extension

For complete details, see REFACTOR_SUMMARY.md.


Contributing

Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines.

Quick start for contributors:

git clone https://github.com/glarue/intronIC.git
cd intronIC
make install    # Set up development environment
make test       # Run tests
make help       # See all available commands

For major changes, please open an issue first to discuss the proposed changes.


License

intronIC is released under the GNU General Public License v3.0.


Acknowledgments

Developed by Graham E. Larue with contributions from the Roy Lab and Padgett Lab.

Reference database curation: Devlin C. Moyer, Courtney E. Hershberger

Special thanks to the bioinformatics community for tools and libraries that make this work possible.


For more detailed documentation, algorithm descriptions, and examples, visit the intronIC wiki.
