Intron classification tool for identifying U2-type and U12-type introns using SVM

intronIC - (intron Interrogator and Classifier)

Version 2.0.0 - Refactored Edition with Corrected Architecture

intronIC is a bioinformatics tool for extracting and classifying intron sequences as U12-type (minor) or U2-type (major) using a support vector machine (SVM) trained on position-weight matrix (PWM) scores. It can be used with a genome and annotation file, or with pre-extracted intron sequences. Alternatively, intronIC can extract all annotated intron sequences without classification (using the extract subcommand).

About This Refactored Version

This refactored version maintains 100% algorithmic fidelity and CLI compatibility with the original intronIC while providing a modernized, maintainable codebase:

Key Improvements

Corrected ML Architecture (v2.0): Fixed double-scaling issue and train/test mismatch
- Single scaling step via RobustScaler with centering (removes composition bias)
- Configurable augmented features with 5D standard (absdiff_bp_3, absdiff_5_bp) or custom feature sets
- Two-stage optimization (C via balanced_accuracy, calibration via log-loss)
- L1/L2 penalty search with class weight multiplier optimization
Modular Architecture: Organized into logical packages (extraction, scoring, classification, output) instead of a single 6,000+-line file
Enhanced Code Quality: Type hints throughout, immutable data structures, better error handling
Bug Fixes: Corrected data leakage in z-score normalization, fixed type_id assignment logic
Better Testing: Structured for unit and integration testing
Modern Tooling: Support for pixi and uv package managers
Enhanced Logging: Clearer progress reporting with section markers and detailed training logs
Improved Documentation: Comprehensive inline documentation and external guides

What's Preserved

Same Classification Algorithm: Linear SVM with balanced class weights
Same Feature Extraction: PWM scoring of 5' splice site, branch point, and 3' splice site
Same Output Formats: All .iic files maintain compatibility (with minor enhancements)
Same Performance: Comparable runtime and memory usage to original
Validated Accuracy: Identical classification results on test data

Scientific Background

Minor (U12-type) vs Major (U2-type) Introns

Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome and typically have:

5' splice site: GT
3' splice site: AG
Branch point: A within a loose consensus

A small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome and typically have:

5' splice site: AT (AT-AC type) or GT (GT-AG type)
3' splice site: AC (AT-AC type) or AG (GT-AG type)
Branch point: Highly conserved TCCTTAAC motif
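
The boundary patterns listed above can be read directly off an intron sequence. The following is a minimal sketch, not intronIC's own code; the function name `terminal_dinucleotides` is hypothetical. Note that boundaries alone cannot separate the two classes, because GT-AG introns occur in both, which is why the motif scoring described below is needed.

```python
def terminal_dinucleotides(intron_seq):
    """Return the first and last two bases of an intron as e.g. 'GT-AG'.

    Hypothetical helper for illustration; expects the intron sequence
    only, without exonic flanks.
    """
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron sequence is too short")
    return f"{seq[:2]}-{seq[-2:]}"


# Illustrative sequences (made up, not taken from any reference set)
print(terminal_dinucleotides("atatcctttccttaactttac"))
print(terminal_dinucleotides("GTAAGTTTTCTAACTTTTTTTAG"))
```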

Classification Approach

intronIC uses a three-step scoring and classification pipeline:

PWM Scoring: Apply position-weight matrices to three key regions (5' splice site, branch point, 3' splice site) to calculate raw log-odds scores
Normalization: Convert raw scores to z-scores using parameters fit on reference sequences only (prevents data leakage)
SVM Classification: Train an ensemble of linear SVMs on reference U12/U2 introns, output probability scores (0-100%)

The output probability represents the classifier's confidence that an intron is U12-type. By default, introns with scores >90% are considered high-confidence U12-type predictions.
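
As a rough illustration of the PWM scoring step, the sketch below computes a log-odds score for a short motif. It assumes the score is a log-likelihood ratio between a U12 matrix and a U2 matrix; the two-position matrices, their frequencies, and the function names are invented for the example and are not intronIC's actual matrices or API.

```python
import math

# Hypothetical 2-position PWMs: per-position base frequencies.
# Values are illustrative only.
U12_PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]
U2_PWM = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.10, "C": 0.10, "G": 0.10, "T": 0.70},
]


def pwm_log_likelihood(seq, pwm):
    """Sum of log2 base frequencies along the motif."""
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM lengths differ")
    return sum(math.log2(pwm[i][base]) for i, base in enumerate(seq))


def log_odds_score(seq, u12_pwm, u2_pwm):
    """Log-likelihood ratio; positive values favor the U12 matrix."""
    return pwm_log_likelihood(seq, u12_pwm) - pwm_log_likelihood(seq, u2_pwm)


print(log_odds_score("AT", U12_PWM, U2_PWM))  # positive: closer to U12 matrix
print(log_odds_score("GT", U12_PWM, U2_PWM))  # negative: closer to U2 matrix
```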

ML Pipeline Architecture

intronIC uses a single scaling step architecture to prevent double-scaling and ensure train/test consistency:

Raw PWM Scores (LLRs)
         ↓
ScoreNormalizer (EXTERNAL to pipeline)
  - RobustScaler(with_centering=True)
  - Fitted on reference introns only
  - Transforms: raw LLRs → z-scores
  - Removes composition bias via centering
         ↓
Z-Scores [five_z_score, bp_z_score, three_z_score]
         ↓
ML Pipeline (NO scaler inside)
  ├─ BothEndsStrongTransformer
  │  └─ Augments 3D → 5D features (standard config):
  │     • Pass-through: five_z, bp_z, three_z
  │     • absdiff_bp_3 = |bp_z - three_z| (BP/3' imbalance penalty)
  │     • absdiff_5_bp = |five_z - bp_z| (5'/BP imbalance penalty)
  │  └─ Or custom 4D-7D with different features:
  │     • min_all, absdiff_5_3, min_5_bp, max_5_bp, etc.
  ├─ LinearSVC
  │  └─ L1 or L2 penalty (grid-searched), balanced class weights
  └─ CalibratedClassifierCV
     └─ External calibration (sigmoid or isotonic)
         ↓
U12 Probability (0-100%)
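
The standard 5D augmentation in the diagram above amounts to appending two absolute differences to the three z-scores. A minimal sketch follows; the function name `augment_features` is hypothetical, and the real BothEndsStrongTransformer may be structured differently.

```python
def augment_features(five_z, bp_z, three_z):
    """Expand three z-scores into the standard 5D feature vector."""
    absdiff_bp_3 = abs(bp_z - three_z)  # BP/3' imbalance
    absdiff_5_bp = abs(five_z - bp_z)   # 5'/BP imbalance
    return [five_z, bp_z, three_z, absdiff_bp_3, absdiff_5_bp]


print(augment_features(2.0, 1.5, -0.5))
```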

Key Design Principles:

Single Scaling Step: Scaling happens ONLY in ScoreNormalizer (external to pipeline). The pipeline receives pre-scaled z-scores and does NOT re-scale them. This prevents double-scaling.
Train/Test Consistency: Both training and prediction extract z-scores from introns and pass them to the pipeline, ensuring identical data transformations.
Domain Adaptation: ScoreNormalizer can be refitted per-species (adaptive mode) or reused from training species (human mode) for cross-species classification.
Feature Engineering: BothEndsStrongTransformer adds configurable composite features. The standard 5D configuration adds absdiff_bp_3 and absdiff_5_bp (BP/3' and 5'/BP imbalance penalties) based on L1 regularization analysis. See config/config.yaml for all available features.
Hyperparameter Optimization:
- Grid search over: C parameter, L1/L2 penalty, class weight multipliers
- Stage 1: Optimize C using balanced_accuracy (discrimination quality)
- Stage 2: Select calibration method (sigmoid vs isotonic) using log-loss (probability quality)
YAML Configuration: All optimizer settings are configurable via config/config.yaml including feature selection, penalty options, class weight multipliers, and CV parameters.
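
To make the single-scaling principle concrete, here is a small standard-library sketch of robust scaling with centering: the median and interquartile range are fit on reference scores only and then applied unchanged to any other data. This mirrors what scikit-learn's RobustScaler does with its default 25th-75th percentile range, but the function names are hypothetical and ScoreNormalizer's internals are assumed, not confirmed.

```python
import statistics


def fit_robust_scaler(reference_scores):
    """Fit center (median) and scale (IQR) on reference scores only."""
    center = statistics.median(reference_scores)
    q1, _, q3 = statistics.quantiles(reference_scores, n=4, method="inclusive")
    scale = (q3 - q1) or 1.0  # guard against a zero IQR
    return center, scale


def transform(scores, center, scale):
    """Apply previously fitted parameters; nothing is refit here."""
    return [(s - center) / scale for s in scores]


reference = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up raw scores
center, scale = fit_robust_scaler(reference)
print(transform([3.0, 7.0], center, scale))
```

Because `transform` never looks at the statistics of the data it is given, experimental introns cannot influence the scaling parameters, which is the leakage the design principles above are meant to prevent.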

This architecture was validated on C. elegans, producing 1 false positive (1/109,830) vs. 130 with uncentered scaling.

Installation

Quick Install (Recommended)

pip install intronIC

That's it! This installs intronIC and all dependencies from PyPI.

From Source (Development/Latest)

For the latest development version or to contribute:

git clone https://github.com/glarue/intronIC.git
cd intronIC
pip install -e .

Using `pixi` (Reproducible Environments)

Pixi provides fully reproducible environments with locked dependencies—ideal for HPC clusters or when exact reproducibility is required:

# Install pixi (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash

# Clone and install
git clone https://github.com/glarue/intronIC.git
cd intronIC
pixi install

# Run intronIC through pixi
pixi run intronIC -h

# Or run the included test
pixi run test-small

When to use pixi:

HPC/cluster environments with strict reproducibility requirements
When you need isolated, self-contained environments
If you prefer conda-style environment management

Verify Installation

intronIC --version
intronIC -h

Dependencies

intronIC requires Python 3.10+ and the following packages:

numpy >=1.19.0 - Numerical operations
scipy >=1.5.0 - Scientific computing
scikit-learn >=0.22, <2.0 - SVM classifier
matplotlib >=3.3.0 - Plotting
networkx >=2.5.1 - Graph operations for annotation parsing
rich >=10.0 - Terminal progress bars
biogl >=0.1.0 - Bioinformatics utilities

All dependencies are automatically installed by pixi, uv, or pip.

intronIC was developed on Linux and has been tested on macOS and Windows.

Quick Start

Installation (One Command)

pip install intronIC

Basic Commands

# Classify introns (train on-the-fly)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Use pretrained model (faster)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name --model model.pkl -p 8

# Extract sequences only (no classification)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8

# Train a model (no genome needed)
intronIC train -n my_model -p 8

Test Run (Human Chr19)

# With test data included in repository
intronIC -g test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz \
         -a test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
         -n homo_sapiens_chr19 -p 4

Expected results:

~29,000 introns extracted
~30 U12-type introns (score ≥90%)
~8 AT-AC type U12 introns
Output files: homo_sapiens_chr19.*.iic

Usage

Commands

intronIC supports three modes: the default classification mode and two subcommands:

Command	Description
(default)	Classify introns from genome + annotation
`train`	Train a model on reference data (no genome needed)
`extract`	Extract sequences only (no classification)

Default Mode: Classify Introns

The default mode extracts introns and classifies them as U12 or U2 type:

# Basic usage (trains model on-the-fly)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name

# With pretrained model (faster, recommended)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model homo_sapiens.model.pkl -p 8

# Memory-efficient streaming mode
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model homo_sapiens.model.pkl --streaming -p 8

Train Subcommand

Train a classifier model without needing a genome:

# Basic training with built-in references
intronIC train -n my_model -p 8

# With custom configuration
intronIC --config config/config.yaml train -n my_model -p 12

# With custom reference sequences
intronIC train -n my_model -p 8 \
  --reference_u12s custom_u12.iic \
  --reference_u2s custom_u2.iic

# Quick training (skip nested CV evaluation)
intronIC train -n my_model --eval_mode none -p 8

Output: my_model.model.pkl - use with --model for classification.

Extract Subcommand

Extract intron sequences without classification:

# Extract from annotation (streaming mode by default)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name

# Extract from BED file
intronIC extract -g genome.fa.gz -b introns.bed -n species_name

# With custom flank length
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name --flank-len 20

Output: .introns.iic, .meta.iic, .bed.iic files (no classification scores).

Required Arguments

Argument	Short	Description
`--genome`	`-g`	Genome FASTA file (gzip supported)
`--annotation`	`-a`	GFF3/GTF annotation file (gzip supported)
`--species-name`	`-n`	Species name / output prefix

Alternative inputs (instead of -a):

-b FILE - BED file with intron coordinates
-q FILE - Pre-extracted sequences file

Common Options

Argument	Short	Description	Default
`--processes`	`-p`	Number of CPU cores	1
`--threshold`	`-t`	U12 probability threshold (0-100)	90
`--model`		Pretrained model file	None
`--config`		YAML configuration file	Auto-discovered
`--streaming`		Memory-efficient mode	False
`--feature-type`	`-f`	Feature type: `cds`, `exon`, or `both`	both
`--allow-multiple-isoforms`	`-i`	Include all isoforms	False (longest only)
`--exclude-overlapping`	`-v`	Exclude overlapping introns	False
`--no-nc`		Exclude non-canonical introns	False
`--recursive`		Recursive training	False

Usage Examples

1. Basic classification:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -p 8

2. With pretrained model (recommended for speed):

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species \
  --model homo_sapiens.model.pkl -p 8

3. Streaming mode for large genomes:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species \
  --model homo_sapiens.model.pkl --streaming -p 8

4. Extract sequences only:

intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n my_species -p 8

5. Train custom model:

intronIC train -n my_trained_model -p 8 --config config/config.yaml

6. Stricter threshold (95%):

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -t 95 -p 8

7. Include all isoforms, CDS only:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -i -f cds -p 8

8. Classify from BED coordinates:

intronIC -g genome.fa.gz -b intron_coordinates.bed -n my_species \
  --model homo_sapiens.model.pkl

Output Files

All output files are tab-delimited with the .iic extension and named {species_name}.{type}.iic.

Main Output Files

1. .meta.iic - Comprehensive Metadata

Contains detailed information for each intron:

Intron name/label with tags
Relative score (distance from threshold)
Terminal dinucleotides (e.g., GT-AG, AT-AC)
Motif schematic (showing branch point context)
Branch point region sequence
Intron length
Parent transcript ID
Grandparent gene ID
Intron index and total family size
Fractional position in transcript
Exon phase
Type ID (u2 or u12)
Attributes (longest_isoform, corrected, etc.)

2. .bed.iic - BED-format Coordinates

Standard BED format with scores:

Chromosome
Start (0-based, BED standard)
Stop (end-exclusive per BED standard; numerically equal to the 1-based inclusive end)
Label (intron_name;probability)
SVM score (0-100, integer)
Strand
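
A line in this layout can be unpacked as sketched below. The field order follows the list above; the record type, function name, and example line are hypothetical, and the probability part of the label is kept as a string because its exact formatting is not specified here.

```python
from typing import NamedTuple


class BedIntron(NamedTuple):
    chrom: str
    start: int        # 0-based start
    stop: int         # end coordinate as written in the file
    name: str
    probability: str  # text after the last ';' in the label
    score: int
    strand: str


def parse_bed_line(line):
    """Split one tab-delimited .bed.iic line into typed fields."""
    chrom, start, stop, label, score, strand = line.rstrip("\n").split("\t")[:6]
    name, _, probability = label.rpartition(";")
    return BedIntron(chrom, int(start), int(stop), name, probability, int(score), strand)


# Made-up example line
rec = parse_bed_line("19\t100\t250\texample_intron_1;97.3\t97\t+\n")
print(rec.name, rec.stop - rec.start)  # name and intron length
```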

3. .seqs.iic - Sequences

Intron sequences with flanking regions:

Intron name
5' flanking sequence (exonic)
Intron sequence
3' flanking sequence (exonic)
SVM score (if classification performed)

4. .scores.iic (or .score_info.iic) - Detailed Scoring

Per-intron breakdown of all scores:

Name and scores (relative, SVM, decision_distance)
5' splice site: sequence, raw score, z-score
Branch point: sequences (U12 and U2 versions), raw score, z-score
3' splice site: sequence, raw score, z-score

5. Mapping Files

.dupe_map.iic - Maps duplicate introns to their representative
.overlap_map.iic - Maps overlapping intron coordinates

6. Visualization Files (.png)

*_scatter.png - 2D scatter plot of classified introns with marginal distributions
*_training_scatter.png - Scatter plot of training data
*_training_hexplot.png - Hexbin density plot of reference introns
*_pr_curve.png - Precision-recall curves (with AUC) for model evaluation

7. Log Files

.log - Main log file with pipeline progress and summary statistics
.training.log - Detailed training log (when models are trained, not with --model)

Identifying U12-type Introns

U12-type introns are identified by their relative score > 0 (equivalent to SVM score > threshold):

# Extract U12-type introns from meta file
awk '($2!="NA" && $2>0)' species_name.meta.iic

# Count U12-type introns
awk '($2!="NA" && $2>0)' species_name.meta.iic | wc -l

# Get U12-type intron names
awk '($2!="NA" && $2>0) {print $1}' species_name.meta.iic

# Filter by higher confidence (relative score > 10)
awk '($2!="NA" && $2>10)' species_name.meta.iic
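
For downstream scripts, the same filter can be written in Python. This is a sketch that assumes, as the awk commands do, that the relative score is the second tab-delimited column; the function name and the sample rows are hypothetical. Unlike awk, it skips rows whose second column is not numeric.

```python
import csv


def u12_rows(lines, min_relative=0.0):
    """Yield rows whose relative score (column 2) exceeds min_relative."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2 or row[1] == "NA":
            continue
        try:
            relative = float(row[1])
        except ValueError:
            continue  # non-numeric value, e.g. a header line
        if relative > min_relative:
            yield row


# Made-up rows; with a real file, pass an open file handle instead
sample = [
    "intron_a\t5.2\tAT-AC",
    "intron_b\t-40.0\tGT-AG",
    "intron_c\tNA\tGT-AG",
]
for row in u12_rows(sample):
    print(row[0])
```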

Understanding the Scores

SVM Score (0-100):

Probability that the intron is U12-type
50 = equal probability of U2 or U12
90 = high confidence U12 (default threshold)
<10 = high confidence U2

Relative Score:

Distance from the threshold
Calculated as: svm_score - threshold
Positive values = above threshold (U12-type at chosen confidence)
Negative values = below threshold (U2-type)
Makes filtering easier: just check if > 0

Type ID (u2 or u12):

Binary classification based on raw classifier decision (50% boundary)
Independent of the user-chosen threshold
Used for organizing output and statistics

Decision Distance:

Log-odds ratio: log(probability / (1 - probability))
0 = equal probability (50%)
Positive = favors U12
Negative = favors U2
Useful for understanding classifier confidence
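
The relationships between these four values can be written out directly. The sketch below assumes a natural logarithm for the decision distance and treats a score of exactly 50 as u2; both choices, and the function names, are assumptions made for illustration.

```python
import math


def relative_score(svm_score, threshold=90.0):
    """Distance from the user-chosen threshold."""
    return svm_score - threshold


def type_id(svm_score):
    """Binary call at the 50% boundary, independent of the threshold."""
    return "u12" if svm_score > 50.0 else "u2"


def decision_distance(svm_score):
    """Log-odds of the U12 probability (undefined at exactly 0 or 100)."""
    p = svm_score / 100.0
    return math.log(p / (1.0 - p))


score = 60.0
print(relative_score(score), type_id(score), decision_distance(score))
```

An intron scoring 60 illustrates why Type ID and the threshold are separate: its Type ID is u12, yet its relative score is negative at the default threshold of 90.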

A Note on the `-n` (Name) Argument

By default, intronIC expects species names in binomial format (genus, species) separated by a non-alphanumeric character:

homo_sapiens ✅
homo.sapiens ✅
homo-sapiens ✅

intronIC formats the name internally into a tag for intron IDs (e.g., HomSap), using only the first two elements.

Output files are named using the full argument supplied to -n, so:

homo_sapiens → files named homo_sapiens.*
homo_sapiens.v2 → files named homo_sapiens.v2.*
Intron IDs in both would use HomSap tag
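
One way such a tag could be derived is sketched below. The rule used here (first three letters of each of the first two name elements, capitalized) is inferred from the single HomSap example and is an assumption, not intronIC's documented behavior; `species_tag` is a hypothetical name.

```python
import re


def species_tag(name):
    """Build a short tag from the first two elements of a species name."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", name) if p]
    return "".join(p[:3].capitalize() for p in parts[:2])


for name in ("homo_sapiens", "homo.sapiens", "homo-sapiens", "homo_sapiens.v2"):
    print(name, species_tag(name))
```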

To use the name argument exactly as provided without any parsing, add the --na flag:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n "My Custom Name" --na

Resource Usage

Memory

Memory usage scales with the number of annotated introns in the genome:

Small genomes (<50,000 introns): <1 GB
Typical genomes (50,000-200,000 introns): 1-3 GB
Large genomes (>200,000 introns): 3-5 GB
Human genome (Ensembl 95, ~1 million introns): ~5 GB
Streaming mode (with --model --streaming): ~0.5 GB regardless of genome size

Most modern computers should handle even large genomes without issue. For memory-constrained environments, use streaming mode with a pretrained model.

Runtime

Runtime depends on genome size, annotation density, and whether models are pre-trained:

Genome	Introns	Train Mode	Pretrained (`--model -p 5`)
Chr19 (test)	~29,000	5-15 min	<1 min
Small genome	~50,000	10-30 min	1-2 min
Human (full)	~200,000	20-40 min	~3 min

Tips for faster runs:

Use --model with a pretrained model to skip training (fastest)
Use -p N for parallel processing (recommended: 5-8 cores)
Use --streaming with --model for large genomes with memory constraints
Use small reference sets for testing (--reference_u12s, --reference_u2s)
Extract sequences first with extract subcommand, then classify separately if iterating on parameters

Advanced Usage

Using Pretrained Models

For cross-species classification using a model trained on another species:

# Use a specific trained model file
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model /path/to/trained_species.model.pkl

This is the recommended approach for:

Classifying species without curated U12 references
Applying a human-trained model to other vertebrates
Fast classification when training data is unavailable

The pretrained model contains:

Trained SVM ensemble with optimized hyperparameters
Frozen scaler from training species (for cross-species normalization)
Model metadata (training parameters, feature configuration)

Streaming Mode

For large genomes with memory constraints, streaming mode processes introns per-chromosome:

# Memory-efficient streaming with pretrained model
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --model trained.model.pkl --streaming -p 8

Streaming mode provides ~90% memory savings by:

Processing one chromosome at a time
Writing results immediately (not accumulating in memory)
Using the frozen scaler from the pretrained model

Requirements: Streaming mode requires --model (pretrained model with frozen scaler).
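
The per-chromosome pattern can be sketched with a simple grouping loop. This is a conceptual illustration only: `classify` and `write` are placeholder callables, the record layout is invented, and the input is assumed to be ordered by chromosome because `itertools.groupby` groups consecutive items.

```python
from itertools import groupby


def stream_by_chromosome(introns, classify, write):
    """Classify and write one chromosome's introns at a time.

    introns: iterable of records whose first element is the chromosome.
    classify: callable taking a list of records, returning one score each.
    write: callable taking (record, score); called as soon as a batch is done.
    """
    for _chrom, group in groupby(introns, key=lambda rec: rec[0]):
        batch = list(group)  # only one chromosome is held in memory
        for rec, score in zip(batch, classify(batch)):
            write(rec, score)


# Placeholder data and callables
records = [("chr1", "i1"), ("chr1", "i2"), ("chr2", "i3")]
stream_by_chromosome(
    records,
    classify=lambda batch: [0.0] * len(batch),
    write=lambda rec, score: print(rec[1], score),
)
```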

Configuration Files

intronIC uses YAML configuration files for advanced parameter tuning:

# Use custom configuration
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --config config/profiles/production.yaml

Configuration files are auto-discovered from (in priority order):

--config PATH (explicit CLI argument)
./.intronIC.yaml (current directory)
~/.config/intronIC/config.yaml (XDG config)
Built-in defaults
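
The priority order above can be expressed as a short lookup. This sketch is an assumption about how such discovery might work, not intronIC's implementation; `resolve_config` and its `cwd`/`home` parameters are hypothetical and exist mainly to make the function easy to try out.

```python
from pathlib import Path


def resolve_config(cli_path=None, cwd=None, home=None):
    """Return the first config path by priority, or None for built-in defaults."""
    if cli_path is not None:
        return Path(cli_path)  # explicit --config always wins
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    candidates = (
        cwd / ".intronIC.yaml",
        home / ".config" / "intronIC" / "config.yaml",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # caller falls back to built-in defaults


print(resolve_config(cli_path="config/profiles/production.yaml"))
```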

Key configurable parameters include:

Feature selection: Choose which augmented features to use (5D standard or custom)
Penalty options: L1, L2, or both for regularization search
Class weight multipliers: Fine-tune precision/recall tradeoff
CV parameters: Number of folds, optimization rounds
Ensemble settings: Number of models, subsampling ratio

See config/config.yaml for full documentation of all options.

Recursive Training

For species distant from the training data, recursive training can improve accuracy:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name --recursive

This performs two classification passes with a retraining step in between:

Initial classification to identify high-confidence U12-type introns
Build species-specific PWMs and retrain models
Re-classify all introns with the updated models

Custom Reference Sequences

For specialized analyses, you can provide custom reference sequences:

intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
  --reference_u12s my_u12_introns.iic \
  --reference_u2s my_u2_introns.iic

Reference files should follow the .iic format (tab-delimited: name, 5'_flank, intron_seq, 3'_flank).
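
If you are assembling a reference file by hand, the four-column layout can be written and read back as sketched below. The helper names and the example record are hypothetical; only the column order comes from the description above.

```python
def format_reference_record(name, five_flank, intron_seq, three_flank):
    """Join the four fields into one tab-delimited line (no trailing newline)."""
    fields = (name, five_flank, intron_seq, three_flank)
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


def parse_reference_record(line):
    """Split one line back into its four named fields."""
    name, five_flank, intron_seq, three_flank = line.rstrip("\n").split("\t")[:4]
    return {
        "name": name,
        "five_flank": five_flank,
        "intron_seq": intron_seq,
        "three_flank": three_flank,
    }


# Made-up record
line = format_reference_record("my_u12_1", "ACGTG", "ATATCCTTTCCTTAACTTTAC", "GGCTA")
print(parse_reference_record(line)["intron_seq"])
```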

Two-Stage Workflow

For large genomes or parameter tuning, you can separate extraction from classification:

Stage 1: Extract sequences only

intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name
# Produces: species_name.introns.iic (and .meta.iic, .bed.iic)

Stage 2: Classify extracted sequences

intronIC -q species_name.introns.iic -n species_name -t 95
# Much faster for testing different thresholds or references

Troubleshooting

Common Issues

"No U12-type introns found"

Normal for some small genomes or chromosomes
Try lowering threshold: -t 80
Check that annotation contains sufficient introns
Consider using --recursive for distant species

"Out of memory" errors

Use a machine with more RAM for very large genomes
Try processing chromosomes separately using BED input
Reduce parallelization: -p 1 or -p 2

"No introns extracted"

Check that genome and annotation use matching chromosome names
Verify annotation format (GFF3 or GTF)
Try different feature type: -f cds or -f exon
Check annotation file is not corrupted

Slow performance

Use parallel processing: -p 4 or -p 8
Use --model to skip model training
Use smaller reference sets for testing
Consider extracting sequences first (extract subcommand), then classify separately

Classification results differ from original intronIC

Minor differences can occur due to:
- Random seed in cross-validation
- sklearn version differences
- Floating-point precision
Major differences are unexpected; please file an issue

Getting Help

Documentation: See the original wiki for detailed guides
Issues: Report bugs at GitHub Issues
Questions: Open a discussion at GitHub Discussions

For refactoring-specific questions, see REFACTOR_SUMMARY.md.

Testing the Installation

To verify your installation works correctly, download the test data and run:

# Download test data (if not cloned from repo)
# Or use your own genome + annotation files

# Run on Human Chr19 test data
intronIC -g test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz \
         -a test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
         -n test_run -p 4

# With pixi (from cloned repo)
pixi run test-small

Expected output:

Several .iic files named test_run.*
A .log file with classification summary
PNG plots showing score distributions
Console output showing ~30 U12-type introns found

Project Structure

The codebase is organized into logical modules under src/intronIC/:

src/intronIC/
├── cli/                 # Command-line interface and orchestration
│   ├── main.py          # Pipeline entry point
│   ├── args.py          # Argument parsing
│   ├── config.py        # Configuration management
│   └── reporter.py      # Progress reporting
├── core/                # Core data structures
│   ├── intron.py        # Intron class and related types
│   └── reference.py     # Reference sequence management
├── extraction/          # Intron extraction from various sources
│   ├── annotation.py    # GFF3/GTF parsing
│   ├── bed.py           # BED file parsing
│   ├── sequences.py     # Sequence file parsing
│   └── filter.py        # Quality control and filtering
├── scoring/             # PWM scoring and normalization
│   ├── pwm.py           # Position-weight matrix operations
│   ├── scorer.py        # Score calculation
│   └── normalizer.py    # Z-score normalization
├── classification/      # SVM training and prediction
│   ├── trainer.py       # Model training with nested CV
│   ├── predictor.py     # Ensemble prediction
│   ├── nested_cv.py     # Nested cross-validation
│   └── split_eval.py    # Evaluation utilities
├── output/              # Output file generation
│   ├── writers.py       # All output writers
│   └── formatter.py     # Formatting utilities
├── visualization/       # Plotting functions
│   └── plots.py         # All visualization code
├── utils/               # Utility modules
│   ├── genome.py        # Genome file handling
│   ├── logging_utils.py # Enhanced logging
│   └── sequences.py     # Sequence utilities
└── __main__.py          # Module entry point

Citing intronIC

If you use intronIC in your research, please cite:

Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett. Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078. https://doi.org/10.1093/nar/gkaa464

About intronIC

intronIC was created to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from genomic data. U12-type introns are rare (~0.5% of introns) but functionally important, and contain distinct splicing motifs that make them amenable to computational identification.

Why intronIC?

Earlier U12 databases (U12DB, SpliceRack, ERISdb) were valuable resources but:

Static by design (not updated with new genome releases)
Based on older genome annotations
Limited to pre-selected species
Used heuristic classification criteria

intronIC addresses these limitations:

Works with any genome + annotation
Uses the well-established SVM classification approach
Produces interpretable probability scores
Allows customization of training data and parameters
Provides extensive metadata for downstream analysis
Regularly updated with algorithm improvements

Classification Method

intronIC's approach combines sequence motif analysis with machine learning:

Position-Weight Matrices (PWMs): Capture sequence preferences at three key regions
- 5' splice site (donor): Recognizes GT/AT at intron start
- Branch point: Identifies TCCTTAAC-like motifs in U12-type introns
- 3' splice site (acceptor): Recognizes AG/AC at intron end
Z-Score Normalization: Converts raw PWM scores to standardized features
- Fit on reference sequences only (prevents data leakage)
- Accounts for different score ranges across regions
Linear SVM Classifier: Learns a decision boundary over the three z-score features (augmented to 5D in the standard configuration)
- Trained on curated U12-type and U2-type reference sets
- Balanced class weights handle imbalanced data (~0.5% expected U12-type)
- Probability calibration provides confidence estimates
Ensemble Averaging: Reduces variance through multiple models
- Each model trained on different U2 subsamples
- F1-weighted voting combines predictions
- Produces robust, reliable probabilities
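
F1-weighted voting can be read as a weighted average of the per-model probabilities, with each model's F1 score as its weight. The sketch below shows that interpretation; it is an assumption about the combination rule, and the function name and numbers are illustrative.

```python
def f1_weighted_probability(probabilities, f1_scores):
    """Average per-model U12 probabilities, weighted by each model's F1."""
    if len(probabilities) != len(f1_scores):
        raise ValueError("need one F1 score per model")
    total = sum(f1_scores)
    if total <= 0:
        raise ValueError("F1 weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, f1_scores)) / total


# Three hypothetical ensemble members
print(f1_weighted_probability([0.95, 0.90, 0.60], [0.98, 0.97, 0.80]))
```

A model with a lower F1 on its training folds pulls the combined probability less than its better-performing peers, which is how the ensemble dampens the effect of an unlucky U2 subsample.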

This approach avoids arbitrary score thresholds and provides probabilistic classifications that researchers can interpret based on their specific needs (e.g., high-confidence predictions for experimental validation vs. comprehensive catalogs).

The Refactoring

This refactored version maintains complete algorithmic fidelity to the original while dramatically improving code organization and maintainability. The original 6,093-line monolithic file has been restructured into 15+ focused modules, each with a single responsibility.

Key improvements include:

Fixed data leakage bug in z-score normalization
Corrected type_id assignment logic
Added comprehensive type hints
Immutable data structures for thread safety
Better logging and error messages
Structured for testing and extension

For complete details, see REFACTOR_SUMMARY.md.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines.

Quick start for contributors:

git clone https://github.com/glarue/intronIC.git
cd intronIC
make install    # Set up development environment
make test       # Run tests
make help       # See all available commands

For major changes, please open an issue first to discuss the proposed changes.

License

intronIC is released under the GNU General Public License v3.0.

Acknowledgments

Developed by Graham E. Larue with contributions from the Roy Lab and Padgett Lab.

Reference database curation: Devlin C. Moyer, Courtney E. Hershberger

Special thanks to the bioinformatics community for tools and libraries that make this work possible.

For more detailed documentation, algorithm descriptions, and examples, visit the intronIC wiki.

