Skip to main content

A comprehensive package of biological constants, serving as a foundational resource for biology and bioinformatics, complemented by functions to streamline related tasks.

Project description

Biobase

Static Badge Python Version from PEP 621 TOML PyPI version License: MIT GitHub branch status

A Python package providing standardized biological constants and scoring matrices for bioinformatics pipelines. Biobase aims to eliminate the need to repeatedly recreate common biological data structures and scoring systems in your code.

Table of Contents

Quick Start

Access amino acid properties

from biobase.constants import ONE_LETTER_CODES, MONO_MASS
print(ONE_LETTER_CODES)  # 'ACDEFGHIKLMNPQRSTVWY'
print(MONO_MASS['A'])    # 71.037113805

Use scoring matrices

from biobase.matrix import Blosum
blosum62 = Blosum(62)
print(blosum62['A']['A'])  # 4
print(blosum62['W']['C'])  # -2

Analyse DNA sequences

from biobase.analysis import Dna
sequence = "ATCGTAGC"
print(Dna.transcribe(sequence))               # 'AUCGUAGC'
print(Dna.complement(sequence))               # 'TAGCATCG'
print(Dna.complement(sequence, reverse=True)) # 'GCTACGAT'
print(Dna.calculate_gc_content(sequence))     # 50.0
print(Dna.calculate_at_content(sequence))     # 50.0
print(Dna.entropy(sequence))                  # 2.0

seq = "ccatgccctaaatggggtag"
for start, end, orf in Dna.find_orfs(seq, include_seq=True)
    print(start, end, orf)
# 2, 11, "ATGCCCTAA"
# 11, 20, "ATGGGGTAG"

Analyse Nucleotides

from biobase.analysis import Nucleotides

print(Nucleotides.molecular_weight("A"))               # 135.13
print(Nucleotides.cumulative_molecular_weight("ATCG")) # 523.48

Find protein motifs

from biobase.analysis import find_motifs
sequence = "ACDEFGHIKLMNPQRSTVWY"
print(find_motifs(sequence, "DEF"))
# [(1, 4)]

test_dict = {
    ">SP001": "ACDEFCDEFCDEFGHIKLMN",  # has matches for "CDE" that span indexes [(1, 4), (5, 8), (9, 12)]
    ">SP002": "MNPQRSTVWYACDEFGHIKL",  # has match for "CDE" that span indexes [(11, 14)]
    ">SP003": "AAAAAAAAAAAAAAAAAA12",  # invalid: contains "1", "2"
    ">SP004": "GGGGGGGGGGGGGGGGGGGG",  # no match
    ">SP005": "HHHHHHHHHHHHHHHHH@#$",  # invalid: contains "@", "#", "$"
    ">SP006": "DDDDDDDDDDDDDDDDDDDD",  # no match
    ">SP007": "CDEFGHCDEFKLCDEFPQRS",  # has matches for "CDE" that span indexes [(0, 3), (6, 9), (12, 15)]
    ">SP008": "LLLLLLLLLLLLLLLLLLLL",  # no match
    ">SP009": "KKKKKKKKKKKK123KKKKK",  # invalid: contains "1", "2", "3"
    ">SP010": "CDEACDEDCDEFAAAAAAAA",  # has matches for "CDE" that span indexes [(0, 3), (4, 7), (8, 11)]
}
matched, invalid, non_match = find_motifs(test_dict, "CDE")
print("Matches:")
for seq, matches in matched.items():
    print(f"{seq}")
    print(f"{"".join([f"{match[0]} to {match[1]}\n" for match in matches])}")
print(f"Invalid sequences:\n{"".join([f"{seq}: {invs}\n" for seq, invs in invalid.items()])}")
print(f"Sequences without matches:\n{"".join([f"- {nm}\n" for nm in non_match])}")

# Matches:
# >SP001
# 1 to 4
# 5 to 8
# 9 to 12

# >SP002
# 11 to 14

# >SP007
# 0 to 3
# 6 to 9
# 12 to 15

# >SP010
# 0 to 3
# 4 to 7
# 8 to 11

# Invalid sequences:
# >SP003: {'2', '1'}
# >SP005: {'$', '@', '#'}
# >SP009: {'2', '1', '3'}

# Sequences without matches:
# - >SP004
# - >SP006
# - >SP008

Parse FASTA

from biobase.parser import FastaParser, fasta_parser
fasta = """>CAA39742.1 cytochrome b (mitochondrion) [Sus scrofa]
MTNIRKSHPLMKIINNAFIDLPAPSNISSWWNFGSLLGICLILQILTGLFLAMHYTSDTTTAFSSVTHIC"""

# Class that yields generator
records = list(FastaParser(fasta))
r: FastaRecord = records[0]
print(r.id) # CAA39742.1
print(r.seq) # MTNIRKSHPLMKIINNAFIDLPAPSNISSWWNFGSLLGICLILQILTGLFLAMHYTSDTTTAFSSVTHIC

# Function that returns list
records = fasta_parser(fasta)
for r in records:
    print(r.id) # CAA39742.1
    print(r.seq) # MTNIRKSHPLMKIINNAFIDLPAPSNISSWWNFGSLLGICLILQILTGLFLAMHYTSDTTTAFSSVTHIC

Parse FASTQ

from biobase.parser import FastqParser, fastq_parser
fastq = """@2fa9ee19-5c51-4281-abdd-eac86
CGGTAGCCAGCTGCGTTCAGTATG
+
%%%+++'''@@@???<<<??????"""

# Class that yields generator
records = list(FastqParser(fastq))
r: FastqRecord = records[0]
print(r.id) # 2fa9ee19-5c51-4281-abdd-eac86
print(r.seq) # CGGTAGCCAGCTGCGTTCAGTATG

# Function that returns list
records = fastq_parser(fastq)
for r in records:
    print(r.id) # 2fa9ee19-5c51-4281-abdd-eac86
    print(r.seq) # CGGTAGCCAGCTGCGTTCAGTATG

Requirements

  • Python 3.10+
  • pip (for installation)

Installation

Regular Installation

pip install biobase

Development Installation

Clone the repository and install in editable mode:

git clone https://github.com/lignum-vitae/biobase.git
cd biobase
pip install -e .

Running Files

To ensure relative imports work correctly, always run files using the module path from the project root:

Run a specific file

python -m src.biobase.matrix

Data Files

  • src/biobase/matrices/: Scoring matrix data stored in JSON file format

Project Goals

Biobase aims to provide Python-friendly versions of common biological constants and tools for bioinformatics pipelines. Key objectives:

  1. Standardize biological data structures
  2. Provide efficient implementations of common scoring systems
  3. Ensure type safety and validation
  4. Maintain comprehensive documentation
  5. Support modern Python practices

Contributing

We welcome contributions! Please read our:

Stability

This project is in the beta stage. APIs may change without warning until version 1.0.0.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

biobase-0.8.0.tar.gz (47.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

biobase-0.8.0-py3-none-any.whl (73.4 kB view details)

Uploaded Python 3

File details

Details for the file biobase-0.8.0.tar.gz.

File metadata

  • Download URL: biobase-0.8.0.tar.gz
  • Upload date:
  • Size: 47.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for biobase-0.8.0.tar.gz
Algorithm Hash digest
SHA256 8cd70e99fe1673b50f6cb25a0f38b2da78f31bb5c6039a24e21a4da8d74823a3
MD5 7219abe63eb19476c0bd58e87234fee3
BLAKE2b-256 bc7b5d5638b3225e31ffabb3fdc94be2bef261eeee9f2afc2f2d6e82cb03dd0b

See more details on using hashes here.

File details

Details for the file biobase-0.8.0-py3-none-any.whl.

File metadata

  • Download URL: biobase-0.8.0-py3-none-any.whl
  • Upload date:
  • Size: 73.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.23

File hashes

Hashes for biobase-0.8.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7e25afc64f32c7bf45955b62fff594c397a4df49b4541019a676e2e3b5adfe19
MD5 66c9bcd69c61c6b05cf35f2744957d58
BLAKE2b-256 7a5597af03abe7dc0b492fe56b4428f434874b36c8bea32e2ca79f3612b7426e

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page