Intron classification tool for identifying U2-type and U12-type introns using SVM
intronIC - (intron Interrogator and Classifier)
Version 2.0.0 - Refactored Edition with Corrected Architecture
intronIC is a bioinformatics tool for extracting and classifying intron sequences as U12-type (minor) or U2-type (major) using a support vector machine (SVM) trained on position-weight matrix (PWM) scores. It can be used with a genome and annotation file, or with pre-extracted intron sequences. Alternatively, intronIC can extract all annotated intron sequences without classification (using the extract subcommand).
About This Refactored Version
This refactored version maintains 100% algorithmic fidelity and CLI compatibility with the original intronIC while providing a modernized, maintainable codebase:
Key Improvements
- Corrected ML Architecture (v2.0): Fixed double-scaling issue and train/test mismatch
- Single scaling step via RobustScaler with centering (removes composition bias)
- Configurable augmented features with 5D standard (absdiff_bp_3, absdiff_5_bp) or custom feature sets
- Two-stage optimization (C via balanced_accuracy, calibration via log-loss)
- L1/L2 penalty search with class weight multiplier optimization
- Modular Architecture: Organized into logical packages (extraction, scoring, classification, output) instead of a single 6,000+-line file
- Enhanced Code Quality: Type hints throughout, immutable data structures, better error handling
- Bug Fixes: Corrected data leakage in z-score normalization, fixed type_id assignment logic
- Better Testing: Structured for unit and integration testing
- Modern Tooling: Support for pixi and uv package managers
- Enhanced Logging: Clearer progress reporting with section markers and detailed training logs
- Improved Documentation: Comprehensive inline documentation and external guides
What's Preserved
- Same Classification Algorithm: Linear SVM with balanced class weights
- Same Feature Extraction: PWM scoring of 5' splice site, branch point, and 3' splice site
- Same Output Formats: All .iic files maintain compatibility (with minor enhancements)
- Same Performance: Comparable runtime and memory usage to original
- Validated Accuracy: Identical classification results on test data
Scientific Background
Minor (U12-type) vs Major (U2-type) Introns
Most eukaryotic introns (~99.5%) are spliced by the major (U2-type) spliceosome and typically have:
- 5' splice site: GT
- 3' splice site: AG
- Branch point: A within a loose consensus
A small fraction (~0.5%) are spliced by the minor (U12-type) spliceosome and typically have:
- 5' splice site: AT (AT-AC type) or GT (GT-AG type)
- 3' splice site: AC (AT-AC type) or AG (GT-AG type)
- Branch point: Highly conserved TCCTTAAC motif
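As a small illustration of the boundary patterns listed above, the following sketch reads the terminal dinucleotides off an intron sequence. The function names are hypothetical and are not part of intronIC.

```python
def terminal_dinucleotides(intron_seq: str) -> str:
    """Return the intron boundaries in the form 'GT-AG'."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        raise ValueError("intron sequence is too short to have two boundaries")
    return f"{seq[:2]}-{seq[-2:]}"


# Boundary pairs named in the lists above.
LISTED_BOUNDARIES = {"GT-AG", "AT-AC"}


def has_listed_boundaries(intron_seq: str) -> bool:
    """True if the intron starts and ends with one of the listed pairs."""
    return terminal_dinucleotides(intron_seq) in LISTED_BOUNDARIES
```

Because GT-AG boundaries occur in both classes, the dinucleotides alone cannot separate U12-type from U2-type introns, which is why the splice site and branch point motifs are scored as well.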
Classification Approach
intronIC uses a three-step scoring and classification pipeline:
- PWM Scoring: Apply position-weight matrices to three key regions (5' splice site, branch point, 3' splice site) to calculate raw log-odds scores
- Normalization: Convert raw scores to z-scores using parameters fit on reference sequences only (prevents data leakage)
- SVM Classification: Train an ensemble of linear SVMs on reference U12/U2 introns, output probability scores (0-100%)
The output probability represents the classifier's confidence that an intron is U12-type. By default, introns with scores >90% are considered high-confidence U12-type predictions.
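The PWM scoring step can be sketched as a sum of per-position log-odds values. This is a toy illustration only: the matrix values, the uniform background, and the function name are assumptions, not intronIC's actual matrices or code.

```python
import math

# Hypothetical 3-position PWM: base frequencies per position (each row sums to 1).
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
BACKGROUND = 0.25  # assumed uniform background frequency


def log_odds_score(seq, pwm=TOY_PWM, background=BACKGROUND):
    """Sum log2(frequency / background) over all positions of seq."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the PWM length")
    return sum(
        math.log2(pwm[i][base] / background)
        for i, base in enumerate(seq.upper())
    )
```

A sequence matching the preferred bases scores above zero, a sequence of disfavored bases scores below zero, and an uninformative position (all frequencies equal to the background) contributes nothing.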
ML Pipeline Architecture
intronIC uses a single scaling step architecture to prevent double-scaling and ensure train/test consistency:
Raw PWM Scores (LLRs)
↓
ScoreNormalizer (EXTERNAL to pipeline)
- RobustScaler(with_centering=True)
- Fitted on reference introns only
- Transforms: raw LLRs → z-scores
- Removes composition bias via centering
↓
Z-Scores [five_z_score, bp_z_score, three_z_score]
↓
ML Pipeline (NO scaler inside)
├─ BothEndsStrongTransformer
│ └─ Augments 3D → 5D features (standard config):
│ • Pass-through: five_z, bp_z, three_z
│ • absdiff_bp_3 = |bp_z - three_z| (BP/3' imbalance penalty)
│ • absdiff_5_bp = |five_z - bp_z| (5'/BP imbalance penalty)
│ └─ Or custom 4D-7D with different features:
│ • min_all, absdiff_5_3, min_5_bp, max_5_bp, etc.
├─ LinearSVC
│ └─ L1 or L2 penalty (grid-searched), balanced class weights
└─ CalibratedClassifierCV
└─ External calibration (sigmoid or isotonic)
↓
U12 Probability (0-100%)
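The standard 5D augmentation shown in the diagram can be written out directly from its two formulas. The sketch below is a plain-Python illustration, not the BothEndsStrongTransformer implementation; the function name is hypothetical.

```python
def augment_features(five_z, bp_z, three_z):
    """Expand three z-scores into the standard 5D feature vector."""
    absdiff_bp_3 = abs(bp_z - three_z)  # BP/3' imbalance penalty
    absdiff_5_bp = abs(five_z - bp_z)   # 5'/BP imbalance penalty
    return [five_z, bp_z, three_z, absdiff_bp_3, absdiff_5_bp]
```

For example, an intron with a strong 5' splice site and branch point but a weak 3' splice site gets a large absdiff_bp_3 value, which a linear model can then weigh against it.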
Key Design Principles:
- Single Scaling Step: Scaling happens ONLY in ScoreNormalizer (external to pipeline). The pipeline receives pre-scaled z-scores and does NOT re-scale them. This prevents double-scaling.
- Train/Test Consistency: Both training and prediction extract z-scores from introns and pass them to the pipeline, ensuring identical data transformations.
- Domain Adaptation: ScoreNormalizer can be refitted per-species (adaptive mode) or reused from training species (human mode) for cross-species classification.
- Feature Engineering: BothEndsStrongTransformer adds configurable composite features. The standard 5D configuration adds absdiff_bp_3 and absdiff_5_bp (BP/3' and 5'/BP imbalance penalties) based on L1 regularization analysis. See config/config.yaml for all available features.
- Hyperparameter Optimization:
  - Grid search over: C parameter, L1/L2 penalty, class weight multipliers
  - Stage 1: Optimize C using balanced_accuracy (discrimination quality)
  - Stage 2: Select calibration method (sigmoid vs isotonic) using log-loss (probability quality)
- YAML Configuration: All optimizer settings are configurable via config/config.yaml, including feature selection, penalty options, class weight multipliers, and CV parameters.
This architecture was validated on C. elegans, achieving 1 false positive (1/109,830) vs. 130 with uncentered scaling.
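The single scaling step can be illustrated with a centered median/IQR scaler that is fitted on reference scores only and then reused unchanged on new data. This is a standard-library sketch in the spirit of RobustScaler with centering; the class name is hypothetical, and it is not intronIC's ScoreNormalizer.

```python
import statistics


class MedianIQRScaler:
    """Center on the median and scale by the interquartile range."""

    def fit(self, reference_scores):
        # Parameters come from the reference set only, never from the
        # introns that are about to be classified.
        q1, _, q3 = statistics.quantiles(reference_scores, n=4, method="inclusive")
        self.center_ = statistics.median(reference_scores)
        self.scale_ = (q3 - q1) or 1.0  # guard against a zero IQR
        return self

    def transform(self, scores):
        return [(s - self.center_) / self.scale_ for s in scores]


scaler = MedianIQRScaler().fit([1.0, 2.0, 3.0, 4.0, 5.0])
scaled = scaler.transform([3.0, 5.0, 7.0])
```

Fitting once and applying the same frozen parameters at both training and prediction time is what keeps the two code paths consistent and avoids scaling the data twice.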
Installation
Quick Install (Recommended)
pip install intronIC
That's it! This installs intronIC and all dependencies from PyPI.
From Source (Development/Latest)
For the latest development version or to contribute:
git clone https://github.com/glarue/intronIC.git
cd intronIC
pip install -e .
Using pixi (Reproducible Environments)
Pixi provides fully reproducible environments with locked dependencies—ideal for HPC clusters or when exact reproducibility is required:
# Install pixi (if not already installed)
curl -fsSL https://pixi.sh/install.sh | bash
# Clone and install
git clone https://github.com/glarue/intronIC.git
cd intronIC
pixi install
# Run intronIC through pixi
pixi run intronIC -h
# Or run the included test
pixi run test-small
When to use pixi:
- HPC/cluster environments with strict reproducibility requirements
- When you need isolated, self-contained environments
- If you prefer conda-style environment management
Verify Installation
intronIC --version
intronIC -h
Dependencies
intronIC requires Python 3.10+ and the following packages:
- numpy >=1.19.0 - Numerical operations
- scipy >=1.5.0 - Scientific computing
- scikit-learn >=0.22, <2.0 - SVM classifier
- matplotlib >=3.3.0 - Plotting
- networkx >=2.5.1 - Graph operations for annotation parsing
- rich >=10.0 - Terminal progress bars
- biogl >=0.1.0 - Bioinformatics utilities
All dependencies are automatically installed by pixi, uv, or pip.
intronIC was developed on Linux and has been tested on macOS and Windows.
Quick Start
Installation (One Command)
pip install intronIC
Basic Commands
# Classify introns (train on-the-fly)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Use pretrained model (faster)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name --model model.pkl -p 8
# Extract sequences only (no classification)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name -p 8
# Train a model (no genome needed)
intronIC train -n my_model -p 8
Test Run (Human Chr19)
# With test data included in repository
intronIC -g test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz \
-a test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
-n homo_sapiens_chr19 -p 4
Expected results:
- ~29,000 introns extracted
- ~30 U12-type introns (score ≥90%)
- ~8 AT-AC type U12 introns
- Output files:
homo_sapiens_chr19.*.iic
Usage
Commands
intronIC has a default classification mode and two subcommands:
| Command | Description |
|---|---|
| (default) | Classify introns from genome + annotation |
| train | Train a model on reference data (no genome needed) |
| extract | Extract sequences only (no classification) |
Default Mode: Classify Introns
The default mode extracts introns and classifies them as U12 or U2 type:
# Basic usage (trains model on-the-fly)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name
# With pretrained model (faster, recommended)
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
--model homo_sapiens.model.pkl -p 8
# Memory-efficient streaming mode
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
--model homo_sapiens.model.pkl --streaming -p 8
Train Subcommand
Train a classifier model without needing a genome:
# Basic training with built-in references
intronIC train -n my_model -p 8
# With custom configuration
intronIC --config config/config.yaml train -n my_model -p 12
# With custom reference sequences
intronIC train -n my_model -p 8 \
--reference_u12s custom_u12.iic \
--reference_u2s custom_u2.iic
# Quick training (skip nested CV evaluation)
intronIC train -n my_model --eval_mode none -p 8
Output: my_model.model.pkl - use with --model for classification.
Extract Subcommand
Extract intron sequences without classification:
# Extract from annotation (streaming mode by default)
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name
# Extract from BED file
intronIC extract -g genome.fa.gz -b introns.bed -n species_name
# With custom flank length
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name --flank-len 20
Output: .introns.iic, .meta.iic, .bed.iic files (no classification scores).
Required Arguments
| Argument | Short | Description |
|---|---|---|
| --genome | -g | Genome FASTA file (gzip supported) |
| --annotation | -a | GFF3/GTF annotation file (gzip supported) |
| --species-name | -n | Species name / output prefix |
Alternative inputs (instead of -a):
- -b FILE - BED file with intron coordinates
- -q FILE - Pre-extracted sequences file
Common Options
| Argument | Short | Description | Default |
|---|---|---|---|
| --processes | -p | Number of CPU cores | 1 |
| --threshold | -t | U12 probability threshold (0-100) | 90 |
| --model | | Pretrained model file | None |
| --config | | YAML configuration file | Auto-discovered |
| --streaming | | Memory-efficient mode | False |
| --feature-type | -f | Feature type: cds, exon, or both | both |
| --allow-multiple-isoforms | -i | Include all isoforms | False (longest only) |
| --exclude-overlapping | -v | Exclude overlapping introns | False |
| --no-nc | | Exclude non-canonical introns | False |
| --recursive | | Recursive training | False |
Usage Examples
1. Basic classification:
intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -p 8
2. With pretrained model (recommended for speed):
intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species \
--model homo_sapiens.model.pkl -p 8
3. Streaming mode for large genomes:
intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species \
--model homo_sapiens.model.pkl --streaming -p 8
4. Extract sequences only:
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n my_species -p 8
5. Train custom model:
intronIC train -n my_trained_model -p 8 --config config/config.yaml
6. Stricter threshold (95%):
intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -t 95 -p 8
7. Include all isoforms, CDS only:
intronIC -g genome.fa.gz -a annotation.gff3.gz -n my_species -i -f cds -p 8
8. Classify from BED coordinates:
intronIC -g genome.fa.gz -b intron_coordinates.bed -n my_species \
--model homo_sapiens.model.pkl
Output Files
All output files are tab-delimited with the .iic extension and named {species_name}.{type}.iic.
Main Output Files
1. .meta.iic - Comprehensive Metadata
Contains detailed information for each intron:
- Intron name/label with tags
- Relative score (distance from threshold)
- Terminal dinucleotides (e.g., GT-AG, AT-AC)
- Motif schematic (showing branch point context)
- Branch point region sequence
- Intron length
- Parent transcript ID
- Grandparent gene ID
- Intron index and total family size
- Fractional position in transcript
- Exon phase
- Type ID (u2 or u12)
- Attributes (longest_isoform, corrected, etc.)
2. .bed.iic - BED-format Coordinates
Standard BED format with scores:
- Chromosome
- Start (0-based, BED standard)
- Stop (1-based)
- Label (intron_name;probability)
- SVM score (0-100, integer)
- Strand
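A line of the .bed.iic file can be parsed following the six columns listed above. The sketch below assumes a classified intron (numeric probability and score); the class name, function name, and the example intron name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BedIntron:
    chrom: str
    start: int          # 0-based start, as in BED
    stop: int
    name: str
    probability: float  # taken from the label field
    score: int          # integer SVM score, 0-100
    strand: str


def parse_bed_iic_line(line):
    """Split one tab-delimited .bed.iic line into its six fields."""
    chrom, start, stop, label, score, strand = line.rstrip("\n").split("\t")[:6]
    # The label is 'intron_name;probability'; split on the last semicolon.
    name, _, probability = label.rpartition(";")
    return BedIntron(chrom, int(start), int(stop), name,
                     float(probability), int(score), strand)


example = parse_bed_iic_line("19\t1000\t1200\tHomSap-example_intron;97.3\t97\t+")
```

With a 0-based start and the stop coordinate as listed, the intron length is simply stop minus start (200 in the example).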
3. .seqs.iic - Sequences
Intron sequences with flanking regions:
- Intron name
- 5' flanking sequence (exonic)
- Intron sequence
- 3' flanking sequence (exonic)
- SVM score (if classification performed)
4. .scores.iic (or .score_info.iic) - Detailed Scoring
Per-intron breakdown of all scores:
- Name and scores (relative, SVM, decision_distance)
- 5' splice site: sequence, raw score, z-score
- Branch point: sequences (U12 and U2 versions), raw score, z-score
- 3' splice site: sequence, raw score, z-score
5. Mapping Files
- .dupe_map.iic - Maps duplicate introns to their representative
- .overlap_map.iic - Maps overlapping intron coordinates
6. Visualization Files (.png)
- *_scatter.png - 2D scatter plot of classified introns with marginal distributions
- *_training_scatter.png - Scatter plot of training data
- *_training_hexplot.png - Hexbin density plot of reference introns
- *_pr_curve.png - Precision-Recall AUC curves for model evaluation
7. Log Files
- .log - Main log file with pipeline progress and summary statistics
- .training.log - Detailed training log (when models are trained, not with --model)
Identifying U12-type Introns
U12-type introns are identified by their relative score > 0 (equivalent to SVM score > threshold):
# Extract U12-type introns from meta file
awk '($2!="NA" && $2>0)' species_name.meta.iic
# Count U12-type introns
awk '($2!="NA" && $2>0)' species_name.meta.iic | wc -l
# Get U12-type intron names
awk '($2!="NA" && $2>0) {print $1}' species_name.meta.iic
# Filter by higher confidence (relative score > 10)
awk '($2!="NA" && $2>10)' species_name.meta.iic
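For downstream analysis in Python, the same filter can be expressed as a small function. This sketch assumes, like the awk commands above, that column 1 is the intron name and column 2 is the relative score; the function name is hypothetical.

```python
def u12_names(lines, min_relative_score=0.0):
    """Return names of introns whose relative score exceeds the cutoff."""
    names = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[1] == "NA":
            continue  # unscored intron
        try:
            relative = float(fields[1])
        except ValueError:
            continue  # skip any non-numeric line, such as a header
        if relative > min_relative_score:
            names.append(fields[0])
    return names


# Usage with a real file (not executed here):
# with open("species_name.meta.iic") as handle:
#     print(len(u12_names(handle)))
```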
Understanding the Scores
SVM Score (0-100):
- Probability that the intron is U12-type
- 50 = equal probability of U2 or U12
- >90 = high confidence U12 (default threshold)
- <10 = high confidence U2
Relative Score:
- Distance from the threshold
- Calculated as: svm_score - threshold
- Positive values = above threshold (U12-type at chosen confidence)
- Negative values = below threshold (U2-type)
- Makes filtering easier: just check if > 0
Type ID (u2 or u12):
- Binary classification based on raw classifier decision (50% boundary)
- Independent of the user-chosen threshold
- Used for organizing output and statistics
Decision Distance:
- Log-odds ratio: log(probability / (1 - probability))
- 0 = equal probability (50%)
- Positive = favors U12
- Negative = favors U2
- Useful for understanding classifier confidence
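The relationships between the four values can be written out as follows. This is a sketch based on the definitions above, assuming the natural logarithm and an SVM score strictly between 0 and 100; the function names are hypothetical.

```python
import math


def relative_score(svm_score, threshold=90.0):
    """Distance from the chosen threshold; positive means U12-type."""
    return svm_score - threshold


def decision_distance(svm_score):
    """Log-odds of the U12 probability (undefined at exactly 0 or 100)."""
    p = svm_score / 100.0
    return math.log(p / (1.0 - p))


def type_id(svm_score):
    """Binary call at the 50% boundary, independent of the threshold."""
    return "u12" if svm_score > 50.0 else "u2"
```

An intron with an SVM score of 70 shows why the type ID and the relative score can disagree: it is on the U12 side of the 50% boundary but still has a negative relative score at the default threshold of 90.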
A Note on the -n (Name) Argument
By default, intronIC expects species names in binomial format (genus, species) separated by a non-alphanumeric character:
- homo_sapiens ✅
- homo.sapiens ✅
- homo-sapiens ✅
intronIC formats the name internally into a tag for intron IDs (e.g., HomSap), using only the first two elements.
Output files are named using the full argument supplied to -n, so:
- homo_sapiens → files named homo_sapiens.*
- homo_sapiens.v2 → files named homo_sapiens.v2.*
- Intron IDs in both would use the HomSap tag
To use the name argument exactly as provided without any parsing, add the --na flag:
intronIC -g genome.fa.gz -a annotation.gff3.gz -n "My Custom Name" --na
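The name-to-tag behavior described above can be approximated as shown below. The exact rule intronIC applies is not spelled out here, so this sketch assumes the tag is built from the first three letters of each of the first two name elements, which reproduces the HomSap example; the function name is hypothetical.

```python
import re


def name_tag(species_name):
    """Build a short tag such as 'HomSap' from a binomial species name."""
    # Split on any run of non-alphanumeric characters and drop empty pieces.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", species_name) if p]
    # Assumption: only the first two elements are used, three letters each.
    return "".join(p[:3].capitalize() for p in parts[:2])
```

Under this assumption, homo_sapiens and homo_sapiens.v2 yield the same tag while still producing differently named output files.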
Resource Usage
Memory
Memory usage scales with the number of annotated introns in the genome:
- Small genomes (<50,000 introns): <1 GB
- Typical genomes (50,000-200,000 introns): 1-3 GB
- Large genomes (>200,000 introns): 3-5 GB
- Human genome (Ensembl 95, ~1 million introns): ~5 GB
- Streaming mode (with --model --streaming): ~0.5 GB regardless of genome size
Most modern computers should handle even large genomes without issue. For memory-constrained environments, use streaming mode with a pretrained model.
Runtime
Runtime depends on genome size, annotation density, and whether models are pre-trained:
| Genome | Introns | Train Mode | Pretrained (--model -p 5) |
|---|---|---|---|
| Chr19 (test) | ~29,000 | 5-15 min | <1 min |
| Small genome | ~50,000 | 10-30 min | 1-2 min |
| Human (full) | ~200,000 | 20-40 min | ~3 min |
Tips for faster runs:
- Use --model with a pretrained model to skip training (fastest)
- Use -p N for parallel processing (recommended: 5-8 cores)
- Use --streaming with --model for large genomes with memory constraints
- Use small reference sets for testing (--reference_u12s, --reference_u2s)
- Extract sequences first with the extract subcommand, then classify separately if iterating on parameters
Advanced Usage
Using Pretrained Models
For cross-species classification using a model trained on another species:
# Use a specific trained model file
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
--model /path/to/trained_species.model.pkl
This is the recommended approach for:
- Classifying species without curated U12 references
- Applying a human-trained model to other vertebrates
- Fast classification when training data is unavailable
The pretrained model contains:
- Trained SVM ensemble with optimized hyperparameters
- Frozen scaler from training species (for cross-species normalization)
- Model metadata (training parameters, feature configuration)
Streaming Mode
For large genomes with memory constraints, streaming mode processes introns per-chromosome:
# Memory-efficient streaming with pretrained model
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
--model trained.model.pkl --streaming -p 8
Streaming mode provides ~90% memory savings by:
- Processing one chromosome at a time
- Writing results immediately (not accumulating in memory)
- Using the frozen scaler from the pretrained model
Requirements: Streaming mode requires --model (pretrained model with frozen scaler).
Configuration Files
intronIC uses YAML configuration files for advanced parameter tuning:
# Use custom configuration
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
--config config/profiles/production.yaml
Configuration files are auto-discovered from (in priority order):
1. --config PATH (explicit CLI argument)
2. ./.intronIC.yaml (current directory)
3. ~/.config/intronIC/config.yaml (XDG config)
4. Built-in defaults
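The priority order above amounts to a first-match search. The following sketch illustrates that lookup; it is not intronIC's own discovery code, and the function name and its cwd/home parameters are hypothetical conveniences.

```python
from pathlib import Path
from typing import Optional


def discover_config(explicit=None, cwd=None, home=None) -> Optional[Path]:
    """Return the highest-priority config path, or None for built-in defaults."""
    if explicit is not None:
        return Path(explicit)  # an explicit --config path always wins
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    home = Path(home) if home is not None else Path.home()
    candidates = (
        cwd / ".intronIC.yaml",
        home / ".config" / "intronIC" / "config.yaml",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None  # caller falls back to built-in defaults
```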
Key configurable parameters include:
- Feature selection: Choose which augmented features to use (5D standard or custom)
- Penalty options: L1, L2, or both for regularization search
- Class weight multipliers: Fine-tune precision/recall tradeoff
- CV parameters: Number of folds, optimization rounds
- Ensemble settings: Number of models, subsampling ratio
See config/config.yaml for full documentation of all options.
Recursive Training
For species distant from the training data, recursive training can improve accuracy:
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name --recursive
This performs two classification passes with a retraining step in between:
- Initial classification to identify high-confidence U12-type introns
- Build species-specific PWMs and retrain models
- Re-classify all introns with the updated models
Custom Reference Sequences
For specialized analyses, you can provide custom reference sequences:
intronIC -g genome.fa.gz -a annotation.gff3.gz -n species_name \
--reference_u12s my_u12_introns.iic \
--reference_u2s my_u2_introns.iic
Reference files should follow the .iic format (tab-delimited: name, 5'_flank, intron_seq, 3'_flank).
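A reference line in this four-column layout can be written and read back with a few lines of Python. The helper names and the example sequences are hypothetical; only the column order comes from the description above.

```python
def format_reference_line(name, five_flank, intron_seq, three_flank):
    """Join the four reference fields with tabs."""
    return "\t".join([name, five_flank, intron_seq, three_flank])


def parse_reference_line(line):
    """Split one reference line back into its four named fields."""
    name, five_flank, intron_seq, three_flank = line.rstrip("\n").split("\t")[:4]
    return {
        "name": name,
        "five_flank": five_flank,
        "intron_seq": intron_seq,
        "three_flank": three_flank,
    }


line = format_reference_line("my_u12_1", "CAG", "ATATCCTTTTCCTTAACAC", "GTT")
record = parse_reference_line(line)
```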
Two-Stage Workflow
For large genomes or parameter tuning, you can separate extraction from classification:
Stage 1: Extract sequences only
intronIC extract -g genome.fa.gz -a annotation.gff3.gz -n species_name
# Produces: species_name.introns.iic (and .meta.iic, .bed.iic)
Stage 2: Classify extracted sequences
intronIC -q species_name.introns.iic -n species_name -t 95
# Much faster for testing different thresholds or references
Troubleshooting
Common Issues
"No U12-type introns found"
- Normal for some small genomes or chromosomes
- Try lowering threshold: -t 80
- Check that annotation contains sufficient introns
- Consider using --recursive for distant species
"Out of memory" errors
- Use a machine with more RAM for very large genomes
- Try processing chromosomes separately using BED input
- Reduce parallelization: -p 1 or -p 2
"No introns extracted"
- Check that genome and annotation use matching chromosome names
- Verify annotation format (GFF3 or GTF)
- Try different feature type: -f cds or -f exon
- Check annotation file is not corrupted
Slow performance
- Use parallel processing: -p 4 or -p 8
- Use --model to skip model training
- Use smaller reference sets for testing
- Consider extracting sequences first (extract subcommand), then classify separately
Classification results differ from original intronIC
- Minor differences can occur due to:
  - Random seed in cross-validation
  - sklearn version differences
  - Floating-point precision
- Major differences are unexpected; please file an issue
Getting Help
- Documentation: See the original wiki for detailed guides
- Issues: Report bugs at GitHub Issues
- Questions: Open a discussion at GitHub Discussions
For refactoring-specific questions, see REFACTOR_SUMMARY.md.
Testing the Installation
To verify your installation works correctly, download the test data and run:
# Download test data (if not cloned from repo)
# Or use your own genome + annotation files
# Run on Human Chr19 test data
intronIC -g test_data/Homo_sapiens.Chr19.Ensembl_91.fa.gz \
-a test_data/Homo_sapiens.Chr19.Ensembl_91.gff3.gz \
-n test_run -p 4
# With pixi (from cloned repo)
pixi run test-small
Expected output:
- Several .iic files named test_run.*
- A .log file with classification summary
- PNG plots showing score distributions
- Console output showing ~30 U12-type introns found
Project Structure
The codebase is organized into logical modules under src/intronIC/:
src/intronIC/
├── cli/ # Command-line interface and orchestration
│ ├── main.py # Pipeline entry point
│ ├── args.py # Argument parsing
│ ├── config.py # Configuration management
│ └── reporter.py # Progress reporting
├── core/ # Core data structures
│ ├── intron.py # Intron class and related types
│ └── reference.py # Reference sequence management
├── extraction/ # Intron extraction from various sources
│ ├── annotation.py # GFF3/GTF parsing
│ ├── bed.py # BED file parsing
│ ├── sequences.py # Sequence file parsing
│ └── filter.py # Quality control and filtering
├── scoring/ # PWM scoring and normalization
│ ├── pwm.py # Position-weight matrix operations
│ ├── scorer.py # Score calculation
│ └── normalizer.py # Z-score normalization
├── classification/ # SVM training and prediction
│ ├── trainer.py # Model training with nested CV
│ ├── predictor.py # Ensemble prediction
│ ├── nested_cv.py # Nested cross-validation
│ └── split_eval.py # Evaluation utilities
├── output/ # Output file generation
│ ├── writers.py # All output writers
│ └── formatter.py # Formatting utilities
├── visualization/ # Plotting functions
│ └── plots.py # All visualization code
├── utils/ # Utility modules
│ ├── genome.py # Genome file handling
│ ├── logging_utils.py # Enhanced logging
│ └── sequences.py # Sequence utilities
└── __main__.py # Module entry point
Citing intronIC
If you use intronIC in your research, please cite:
Devlin C Moyer, Graham E Larue, Courtney E Hershberger, Scott W Roy, Richard A Padgett. Comprehensive database and evolutionary dynamics of U12-type introns. Nucleic Acids Research, Volume 48, Issue 13, 27 July 2020, Pages 7066–7078. https://doi.org/10.1093/nar/gkaa464
About intronIC
intronIC was created to provide a customizable, open-source method for identifying minor (U12-type) spliceosomal introns from genomic data. U12-type introns are rare (~0.5% of introns) but functionally important, and contain distinct splicing motifs that make them amenable to computational identification.
Why intronIC?
Earlier U12 databases (U12DB, SpliceRack, ERISdb) were valuable resources but:
- Static by design (not updated with new genome releases)
- Based on older genome annotations
- Limited to pre-selected species
- Used heuristic classification criteria
intronIC addresses these limitations:
- Works with any genome + annotation
- Uses the well-established SVM classification approach
- Produces interpretable probability scores
- Allows customization of training data and parameters
- Provides extensive metadata for downstream analysis
- Regularly updated with algorithm improvements
Classification Method
intronIC's approach combines sequence motif analysis with machine learning:
- Position-Weight Matrices (PWMs): Capture sequence preferences at three key regions
  - 5' splice site (donor): Recognizes GT/AT at intron start
  - Branch point: Identifies TCCTTAAC-like motifs in U12-type introns
  - 3' splice site (acceptor): Recognizes AG/AC at intron end
- Z-Score Normalization: Converts raw PWM scores to standardized features
  - Fit on reference sequences only (prevents data leakage)
  - Accounts for different score ranges across regions
- Linear SVM Classifier: Learns decision boundary in 3D feature space
  - Trained on curated U12-type and U2-type reference sets
  - Balanced class weights handle imbalanced data (~0.5% expected U12-type)
  - Probability calibration provides confidence estimates
- Ensemble Averaging: Reduces variance through multiple models
  - Each model trained on different U2 subsamples
  - F1-weighted voting combines predictions
  - Produces robust, reliable probabilities
This approach avoids arbitrary score thresholds and provides probabilistic classifications that researchers can interpret based on their specific needs (e.g., high-confidence predictions for experimental validation vs. comprehensive catalogs).
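One plausible reading of F1-weighted voting is a weighted average of the per-model probabilities, with each model's F1 score as its weight. The sketch below illustrates that reading only; how intronIC actually combines its ensemble members may differ, and the function name is hypothetical.

```python
def f1_weighted_probability(probabilities, f1_scores):
    """Average per-model U12 probabilities, weighting each by its F1 score."""
    if len(probabilities) != len(f1_scores):
        raise ValueError("need exactly one F1 score per model")
    total_weight = sum(f1_scores)
    if total_weight <= 0:
        raise ValueError("F1 weights must sum to a positive value")
    weighted = sum(p * w for p, w in zip(probabilities, f1_scores))
    return weighted / total_weight
```

With equal weights this reduces to a plain mean, and a model with a higher F1 score pulls the combined probability toward its own prediction.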
The Refactoring
This refactored version maintains complete algorithmic fidelity to the original while dramatically improving code organization and maintainability. The original 6,093-line monolithic file has been restructured into 15+ focused modules, each with a single responsibility.
Key improvements include:
- Fixed data leakage bug in z-score normalization
- Corrected type_id assignment logic
- Added comprehensive type hints
- Immutable data structures for thread safety
- Better logging and error messages
- Structured for testing and extension
For complete details, see REFACTOR_SUMMARY.md.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines.
Quick start for contributors:
git clone https://github.com/glarue/intronIC.git
cd intronIC
make install # Set up development environment
make test # Run tests
make help # See all available commands
For major changes, please open an issue first to discuss the proposed changes.
License
intronIC is released under the GNU General Public License v3.0.
Acknowledgments
Developed by Graham E. Larue with contributions from the Roy Lab and Padgett Lab.
Reference database curation: Devlin C. Moyer, Courtney E. Hershberger
Special thanks to the bioinformatics community for tools and libraries that make this work possible.
For more detailed documentation, algorithm descriptions, and examples, visit the intronIC wiki.