Comprehensive bioinformatics utilities for sequence analysis, alignment, annotation, and molecular biology workflows

🧬 Bioutils Collection

PyPI version Python 3.10+ License: MIT

Production-ready bioinformatics toolkit - 77+ optimized functions for sequence analysis, alignment, annotation, and molecular biology workflows.

✨ Highlights

  • 🚀 77+ specialized functions across 13 bioinformatics domains
  • 🔒 Fully typed with complete type hints (mypy strict)
  • 📊 Research-grade algorithms - Needleman-Wunsch, Smith-Waterman, and more
  • Performance optimized for large-scale genomic data
  • Extensively tested with comprehensive test coverage
  • 📝 Self-documenting - NumPy-style docstrings with examples

📦 Installation

pip install bioutils-collection

Requirements: Python 3.10+ with numpy, scipy, and scikit-learn

🎯 Quick Start

from bioutils_collection import (
    reverse_complement,
    gc_content,
    needleman_wunsch,
    parse_fasta,
    translate_dna_to_protein,
)

# Sequence manipulation
seq = "ATCGATCG"
rev_comp = reverse_complement(seq)  # "CGATCGAT"

# Calculate GC content
gc = gc_content("ATCGATCG")  # 0.5

# Global sequence alignment
seq1, seq2 = "GATTACA", "GCATGCU"
aligned1, aligned2, score = needleman_wunsch(seq1, seq2)

# Parse FASTA files
for header, sequence in parse_fasta("genome.fasta"):
    print(f"{header}: {len(sequence)} bp")

# Translate DNA to protein
protein = translate_dna_to_protein("ATGGCCTAA")  # "MA*"
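
To make the first two calls concrete, here is a minimal pure-Python sketch of what a reverse complement and a GC fraction compute. The helper names `reverse_complement_sketch` and `gc_fraction_sketch` are hypothetical and are not the package's implementation, which may treat ambiguity codes, case, or empty input differently.

```python
# Illustrative only: hypothetical helpers, not bioutils_collection code.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def gc_fraction_sketch(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(seq)


print(reverse_complement_sketch("ATCGATCG"))  # CGATCGAT
print(gc_fraction_sketch("ATCGATCG"))         # 0.5
```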

🧬 Modules

Core Sequence Operations

  • alignment_functions - Pairwise & multiple sequence alignment (Needleman-Wunsch, Smith-Waterman, BLAST score ratio)
  • sequence_operations - Reverse complement, ORF finding, CpG islands, low-complexity filtering
  • translation_functions - DNA↔RNA transcription, translation with custom codon tables
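
As a rough picture of what global alignment involves, the sketch below is a textbook Needleman-Wunsch with a linear gap cost. The function name `nw_sketch` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; they are not the package's code or its defaults.

```python
# Illustrative only: textbook Needleman-Wunsch, not bioutils_collection code.
def nw_sketch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[len(a)][len(b)]


aligned_a, aligned_b, total = nw_sketch("ACGT", "AGT")
print(aligned_a, aligned_b, total)
```

The table costs O(len(a) × len(b)) time and memory. Smith-Waterman local alignment uses the same recurrence but clamps each cell at zero and starts the traceback from the highest-scoring cell.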

Sequence Analysis & Statistics

  • gc_functions - GC content, GC skew, windowed GC profiling
  • sequence_statistics - Codon usage (CAI, ENC, RSCU), melting temp, isoelectric point, amino acid composition
  • data_validation - DNA/RNA/protein sequence validation
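
For orientation, GC skew is conventionally defined as (G - C) / (G + C), and a windowed profile simply repeats the GC calculation over successive slices. The sketch below assumes that definition and uses hypothetical names (`gc_skew_sketch`, `windowed_gc_sketch`); the package's own functions may differ in signature and edge-case handling.

```python
# Illustrative only: hypothetical helpers, not bioutils_collection code.
def gc_skew_sketch(seq: str) -> float:
    """(G - C) / (G + C); 0.0 when the sequence contains no G or C."""
    upper = seq.upper()
    g, c = upper.count("G"), upper.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windowed_gc_sketch(seq: str, window: int, step: int) -> list[float]:
    """GC fraction of each full window, advancing `step` bases at a time."""
    upper = seq.upper()
    return [
        (upper[i:i + window].count("G") + upper[i:i + window].count("C")) / window
        for i in range(0, len(upper) - window + 1, step)
    ]


print(gc_skew_sketch("GGGC"))                 # 0.5
print(windowed_gc_sketch("GGCCAATT", 4, 2))   # [1.0, 0.5, 0.0]
```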

File I/O & Parsing

  • fasta_misc - FASTA parsing, writing, filtering, splitting, concatenation, primer generation
  • annotation_functions - BED/GFF/GTF/VCF parsing and conversion, annotation statistics
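
The FASTA format itself is simple: a line starting with `>` opens a record, and the following lines up to the next `>` hold its sequence. A minimal sketch of a streaming parser follows. It reads from any iterable of lines rather than a file path, and `iter_fasta_sketch` is a hypothetical name, not the package's `parse_fasta`.

```python
# Illustrative only: a minimal FASTA reader, not bioutils_collection code.
import io
from typing import Iterable, Iterator


def iter_fasta_sketch(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 demo\nATCG\nATCG\n\n>seq2\nGGCC\n")
records = list(iter_fasta_sketch(example))
print(records)  # [('seq1 demo', 'ATCGATCG'), ('seq2', 'GGCC')]
```

Because it is a generator, only one record is held in memory at a time, which matters for genome-sized files.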

Pattern & Motif Discovery

  • motif_functions - Motif search, consensus generation, pattern matching
  • repeat_functions - Tandem repeat finder, palindrome detection
  • restriction_functions - Restriction enzyme site identification
  • clustering_functions - Motif clustering and grouping
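
Two of the ideas above can be sketched in a few lines: a restriction site such as EcoRI's GAATTC is a fixed pattern to locate, and a DNA "palindrome" is a sequence equal to its own reverse complement. The helper names below are hypothetical and only handle unambiguous A/C/G/T; the package's functions may support ambiguity codes and enzyme databases.

```python
# Illustrative only: hypothetical helpers, not bioutils_collection code.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_revcomp_palindrome(site: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    site = site.upper()
    return site == site.translate(_COMPLEMENT)[::-1]


def find_sites_sketch(seq: str, site: str) -> list[int]:
    """0-based start positions of every (possibly overlapping) occurrence."""
    seq, site = seq.upper(), site.upper()
    hits = []
    start = seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


print(is_revcomp_palindrome("GAATTC"))                   # True
print(find_sites_sketch("AAGAATTCTTGAATTC", "GAATTC"))   # [2, 10]
```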

🔬 Use Cases

Genomic Analysis

from bioutils_collection import find_orfs, gc_content_windows, find_cpg_islands

# Find all ORFs in a sequence
orfs = find_orfs(dna_sequence, min_length=300)

# Sliding window GC analysis
gc_pro

Development

```bash
# Clone repository
git clone https://github.com/MForofontov/bioutils-collection.git
cd bioutils-collection

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run specific test categories
pytest -m alignment
pytest -m fasta
pytest -m translation

# Type checking
mypy bioutils_collection

# Linting
ruff check .

# Coverage report
pytest --cov=bioutils_collection --cov-report=html
```

📚 API Documentation

All functions include:

  • Complete type hints for static analysis
  • 📖 NumPy-style docstrings with parameter descriptions
  • 💡 Usage examples in docstrings
  • ⚠️ Complexity notes for performance-critical code
  • 📎 Algorithm references where applicable

Example:

from bioutils_collection import needleman_wunsch
help(needleman_wunsch)  # Comprehensive documentation
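
For readers unfamiliar with the NumPy docstring convention, the sketch below shows the general shape such documentation takes: typed signature, `Parameters`, `Returns`, `Raises`, `Notes`, and `Examples` sections. The function `hamming_distance_sketch` is hypothetical and is not claimed to be part of the package.

```python
# Illustrative only: a hypothetical function showing NumPy-style docstrings.
def hamming_distance_sketch(seq1: str, seq2: str) -> int:
    """Count positions at which two equal-length sequences differ.

    Parameters
    ----------
    seq1 : str
        First sequence.
    seq2 : str
        Second sequence; must be the same length as `seq1`.

    Returns
    -------
    int
        Number of mismatching positions.

    Raises
    ------
    ValueError
        If the sequences differ in length.

    Notes
    -----
    Runs in O(n) time for sequences of length n.

    Examples
    --------
    >>> hamming_distance_sketch("GATTACA", "GACTATA")
    2
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(seq1, seq2))


help(hamming_distance_sketch)  # prints the docstring above
```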

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Add comprehensive tests
  4. Ensure all tests pass (pytest)
  5. Add type hints and docstrings
  6. Submit a pull request

Development Guidelines:

  • Follow existing code style (ruff formatting)
  • Add tests for all new functions
  • Update documentation
  • Keep functions focused and single-purpose

🔬 Use Cases (continued)

# Protein properties
pi = calculate_isoelectric_point(protein_seq)
composition = amino_acid_composition(protein_seq)

# Primer design
tm = melting_temperature("ATCGATCGATCG")
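
The README does not say which formula `melting_temperature` uses. As one common possibility for short oligos, the Wallace rule estimates Tm as 2 °C per A/T plus 4 °C per G/C. The sketch below assumes that rule under the hypothetical name `wallace_tm_sketch`; the package may instead use a nearest-neighbor or salt-adjusted model and return a different value.

```python
# Illustrative only: Wallace-rule estimate, not necessarily the package's formula.
def wallace_tm_sketch(seq: str) -> int:
    """Estimate Tm in degrees Celsius as 2*(A+T) + 4*(G+C)."""
    upper = seq.upper()
    at = upper.count("A") + upper.count("T")
    gc = upper.count("G") + upper.count("C")
    return 2 * at + 4 * gc


print(wallace_tm_sketch("ATCGATCGATCG"))  # 36
```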


🧪 Testing

```bash
# Run all tests
pytest

# Run specific module tests
pytest -m alignment
pytest -m fasta
pytest -m translation

# Run with coverage
pytest --cov=bioutils_collection --cov-report=html
```

📄 License

MIT License - see LICENSE for details.

📊 Project Stats

  • 77+ Functions across 13 specialized modules
  • 670+ Tests with comprehensive coverage
  • Type-safe with mypy strict mode
  • Python 3.10+ with modern type hints

⭐ Star this repo if you find it useful!

🔗 Related Projects

  • BioPython - Comprehensive bioinformatics toolkit
  • scikit-bio - Scientific Python library for bioinformatics

📮 Contact

Author: Mykyta Forofontov
Repository: https://github.com/MForofontov/bioutils-collection
Issues: https://github.com/MForofontov/bioutils-collection/issues

Download files

Source Distribution

bioutils_collection-0.1.1.tar.gz (46.2 kB view details)

Uploaded Source

Built Distribution

bioutils_collection-0.1.1-py3-none-any.whl (86.6 kB view details)

Uploaded Python 3

File details

Details for the file bioutils_collection-0.1.1.tar.gz.

File metadata

  • Download URL: bioutils_collection-0.1.1.tar.gz
  • Size: 46.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bioutils_collection-0.1.1.tar.gz
Algorithm Hash digest
SHA256 ec8a2e62a2737bdd56115a272024ef092b034c671693669a7514799fb6707e5a
MD5 023f400bcda133e3384e11e6ffd55289
BLAKE2b-256 2131ba9cbcac9d5c334049beba941e7e114bb00e78fdacdfa2b9f8dbd1278762
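
To check a downloaded file against the SHA256 digest listed above, a short script using the standard library's `hashlib` is enough. This is a sketch: the helper name `sha256_of_file` is hypothetical, and the file path in the commented usage assumes the archive sits in the current directory.

```python
# Illustrative only: verify a local download against a published SHA256 digest.
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (assumes the archive was downloaded to the working directory):
# expected = "ec8a2e62a2737bdd56115a272024ef092b034c671693669a7514799fb6707e5a"
# assert sha256_of_file("bioutils_collection-0.1.1.tar.gz") == expected
```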

Provenance

The following attestation bundles were made for bioutils_collection-0.1.1.tar.gz:

Publisher: publish-pypi.yml on MForofontov/bioutils-collection

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bioutils_collection-0.1.1-py3-none-any.whl.

File hashes

Hashes for bioutils_collection-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6655ea01e0d5cd6de703f824464e22c9e0bd639f267796847c776cba441cad9d
MD5 93b01f3c45e1e9ca1017ca24d0814d30
BLAKE2b-256 105718b0ead44c5c48bd41a3c24e2b2dabb79a2cb543b3c41f0c51577769c498

Provenance

The following attestation bundles were made for bioutils_collection-0.1.1-py3-none-any.whl:

Publisher: publish-pypi.yml on MForofontov/bioutils-collection

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
