Comprehensive bioinformatics utilities for sequence analysis, alignment, annotation, and molecular biology workflows

🧬 Bioutils Collection

Production-ready bioinformatics toolkit - 77+ optimized functions for sequence analysis, alignment, annotation, and molecular biology workflows.

✨ Highlights

  • 🚀 77+ specialized functions across 13 bioinformatics domains
  • 🔒 Fully typed with complete type hints (mypy strict)
  • 📊 Research-grade algorithms - Needleman-Wunsch, Smith-Waterman, and more
  • Performance optimized for large-scale genomic data
  • Extensively tested - 670+ tests with comprehensive coverage
  • 📝 Self-documenting - NumPy-style docstrings with examples

📦 Installation

pip install bioutils-collection

Requirements: Python 3.10+ with numpy, scipy, and scikit-learn

🎯 Quick Start

from bioutils_collection import (
    reverse_complement,
    gc_content,
    needleman_wunsch,
    parse_fasta,
    translate_dna_to_protein,
)

# Sequence manipulation
seq = "ATCGATCG"
rev_comp = reverse_complement(seq)  # "CGATCGAT"

# Calculate GC content
gc = gc_content("ATCGATCG")  # 0.5

# Global sequence alignment
seq1, seq2 = "GATTACA", "GCATGCU"
aligned1, aligned2, score = needleman_wunsch(seq1, seq2)

# Parse FASTA files
for header, sequence in parse_fasta("genome.fasta"):
    print(f"{header}: {len(sequence)} bp")

# Translate DNA to protein
protein = translate_dna_to_protein("ATGGCCTAA")  # "MA*"
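
The global alignment call above returns two gapped strings and a score. As a rough illustration of what a Needleman-Wunsch routine computes, here is a minimal pure-Python sketch with a linear gap penalty. The name `global_align_sketch` and the scoring defaults (match +1, mismatch -1, gap -1) are assumptions made for this example; this is not the library's implementation, and the real `needleman_wunsch` signature and defaults may differ.

```python
def global_align_sketch(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1
) -> tuple[str, str, int]:
    """Global alignment with a linear gap penalty (illustrative only)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: list[str] = []
    out_b: list[str] = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


aligned1, aligned2, total = global_align_sketch("GATTACA", "GCATGCU")
print(aligned1)
print(aligned2)
print(total)
```

The table costs O(len(a) × len(b)) time and memory, which is why quadratic alignment is usually reserved for pairs of moderate length.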

🧬 Modules

Core Sequence Operations

  • alignment_functions - Pairwise & multiple sequence alignment (Needleman-Wunsch, Smith-Waterman, BLAST score ratio)
  • sequence_operations - Reverse complement, ORF finding, CpG islands, low-complexity filtering
  • translation_functions - DNA↔RNA transcription, translation with custom codon tables

Sequence Analysis & Statistics

  • gc_functions - GC content, GC skew, windowed GC profiling
  • sequence_statistics - Codon usage (CAI, ENC, RSCU), melting temp, isoelectric point, amino acid composition
  • data_validation - DNA/RNA/protein sequence validation
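
The GC metrics listed above are simple ratios. This sketch shows one common way to define them: GC content as (G + C) / length, and GC skew as (G - C) / (G + C). The function names and the `window`/`step` parameters are hypothetical and chosen for this example; the package's own functions may use different names, defaults, or return types.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew_sketch(seq: str) -> float:
    """(G - C) / (G + C); 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed_gc(seq: str, window: int, step: int) -> list[float]:
    """GC fraction of each full-length window, moving `step` bases at a time."""
    return [
        gc_fraction(seq[start : start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


print(gc_fraction("ATCGATCG"))        # 0.5
print(gc_skew_sketch("GGGC"))         # 0.5
print(windowed_gc("GGCCAATT", 4, 2))  # [1.0, 0.5, 0.0]
```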

File I/O & Parsing

  • fasta_misc - FASTA parsing, writing, filtering, splitting, concatenation, primer generation
  • annotation_functions - BED/GFF/GTF/VCF parsing and conversion, annotation statistics
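
For readers unfamiliar with the FASTA layout, the sketch below shows the core parsing idea: a line starting with `>` opens a record, and every following line until the next `>` is sequence. It reads from any iterable of lines instead of a file path, and the name `iter_fasta_records` is an assumption for this example; it does not describe how `parse_fasta` is implemented.

```python
import io
from collections.abc import Iterable, Iterator


def iter_fasta_records(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: str | None = None
    chunks: list[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


example = io.StringIO(">seq1 demo\nATGC\nGGTA\n\n>seq2\nTTAA\n")
for name, sequence in iter_fasta_records(example):
    print(f"{name}: {len(sequence)} bp")
```

Because it is a generator, only one record is held in memory at a time, which matters for multi-gigabyte genome files.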

Pattern & Motif Discovery

  • motif_functions - Motif search, consensus generation, pattern matching
  • repeat_functions - Tandem repeat finder, palindrome detection
  • restriction_functions - Restriction enzyme site identification
  • clustering_functions - Motif clustering and grouping
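
Restriction-site identification and palindrome detection are closely related, since most restriction sites equal their own reverse complement. A minimal sketch of both ideas follows, using the EcoRI recognition sequence `GAATTC` as the example site. The function names are hypothetical and the approach (plain substring search on an uppercase ACGT string) is an assumption; the package may handle ambiguity codes or enzyme databases differently.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp_sketch(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_sites(seq: str, site: str) -> list[int]:
    """0-based start positions of `site` in `seq`, overlaps included."""
    seq, site = seq.upper(), site.upper()
    positions: list[int] = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # advance by one to allow overlaps
    return positions


def is_reverse_palindrome(seq: str) -> bool:
    """True when the sequence equals its own reverse complement."""
    return seq.upper() == revcomp_sketch(seq)


print(find_sites("AAGAATTCGGGAATTCTT", "GAATTC"))  # [2, 10]
print(is_reverse_palindrome("GAATTC"))             # True
```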

🔬 Use Cases

Genomic Analysis

from bioutils_collection import find_orfs, gc_content_windows, find_cpg_islands

# Find all ORFs in a sequence
orfs = find_orfs(dna_sequence, min_length=300)

# Sliding window GC analysis
gc_pro

Development

```bash
# Clone repository
git clone https://github.com/MForofontov/bioutils-collection.git
cd bioutils-collection

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run specific test categories
pytest -m alignment
pytest -m fasta
pytest -m translation

# Type checking
mypy bioutils_collection

# Linting
ruff check .

# Coverage report
pytest --cov=bioutils_collection --cov-report=html
```

📚 API Documentation

All functions include:

  • Complete type hints for static analysis
  • 📖 NumPy-style docstrings with parameter descriptions
  • 💡 Usage examples in docstrings
  • ⚠️ Complexity notes for performance-critical code
  • 📎 Algorithm references where applicable

Example:

from bioutils_collection import needleman_wunsch
help(needleman_wunsch)  # Comprehensive documentation
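
As an illustration of the documentation conventions listed above (type hints, NumPy-style sections, a usage example, and a complexity note), here is what such a function might look like. `hamming_distance` is a hypothetical function written for this example and is not claimed to be part of the package.

```python
def hamming_distance(seq1: str, seq2: str) -> int:
    """
    Count positions at which two equal-length sequences differ.

    Parameters
    ----------
    seq1 : str
        First sequence.
    seq2 : str
        Second sequence; must have the same length as `seq1`.

    Returns
    -------
    int
        Number of mismatching positions.

    Raises
    ------
    ValueError
        If the sequences differ in length.

    Notes
    -----
    Time complexity is O(n) for sequences of length n.

    Examples
    --------
    >>> hamming_distance("GATTACA", "GACTATA")
    2
    """
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


help(hamming_distance)  # prints the docstring above
```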

🤝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Add comprehensive tests
  4. Ensure all tests pass (pytest)
  5. Add type hints and docstrings
  6. Submit a pull request

Development Guidelines:

  • Follow existing code style (ruff formatting)
  • Add tests for all new functions
  • Update documentation
  • Keep functions focused and single-purpose

Protein properties

pi = calculate_isoelectric_point(protein_seq)
composition = amino_acid_composition(protein_seq)

Primer design

tm = melting_temperature("ATCGATCGATCG")


🧪 Testing

```bash
# Run all tests
pytest

# Run specific module tests
pytest -m alignment
pytest -m fasta
pytest -m translation

# Run with coverage
pytest --cov=bioutils_collection --cov-report=html
```

📄 License

MIT License - see LICENSE for details.

📊 Project Stats

  • 77+ Functions across 13 specialized modules
  • 670+ Tests with comprehensive coverage
  • Type-safe with mypy strict mode
  • Python 3.10+ with modern type hints

Star this repo if you find it useful!

🔗 Related Projects

  • BioPython - Comprehensive bioinformatics toolkit
  • scikit-bio - Scientific Python library for bioinformatics

📮 Contact

Author: Mykyta Forofontov
Repository: https://github.com/MForofontov/bioutils-collection
Issues: https://github.com/MForofontov/bioutils-collection/issues
