An object-oriented Python library for working with genomic data.

GenomeUtils

A Python library for working with annotated genomes. We developed GenomeUtils as a replacement for pyensembl, which is no longer maintained.

GenomeUtils provides an object-oriented model for representing genomic features: genomes, chromosomes, genes, transcripts, and exons.

Features

  • Object model: Genome > Chromosome > Gene > Transcript > Exon (+ Locus)
  • Builder workflow: GenomeBuilder assembles a Genome from FASTA (DNA, cDNA) and GTF
  • Indexed lookups, optional scaffold separation, streaming/gzip handling
  • Downloader utilities: Fetch Ensembl DNA, cDNA, and GTF assets with EnsemblGenomeDownloader
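
To illustrate what "streaming/gzip handling" usually means for FASTA and GTF inputs, here is a minimal standard-library sketch. It is not GenomeUtils' own implementation; the helper names `open_text` and `iter_gtf_records` are hypothetical. It opens a file as text whether or not it is gzip-compressed and yields GTF records one line at a time, so the whole file never has to sit in memory.

```python
import gzip
from pathlib import Path
from typing import Iterator, TextIO


def open_text(path: Path) -> TextIO:
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    # Detect gzip by its two magic bytes rather than trusting the file extension.
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def iter_gtf_records(path: Path) -> Iterator[list[str]]:
    """Yield the tab-separated fields of each GTF record, one line at a time."""
    with open_text(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header comments and blank lines
            yield line.rstrip("\n").split("\t")
```

Because `iter_gtf_records` is a generator, a caller can stop early (for example after the first record of a chromosome of interest) without reading the rest of the file.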

Installation

Install GenomeUtils with pip:

pip install GenomeUtils

Requires Python >= 3.10. pip installs the dependencies automatically: biopython, gffutils, requests, gget.
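
If an import fails after installation, a quick check of the requirements above can narrow down the cause. The following is a small sketch using only the standard library; `check_environment` is a hypothetical helper, not part of GenomeUtils. Note that the biopython distribution is imported as `Bio`.

```python
import sys
from importlib.util import find_spec

# Distribution name -> top-level import name for the dependencies listed above.
DEPENDENCY_MODULES = {
    "biopython": "Bio",
    "gffutils": "gffutils",
    "requests": "requests",
    "gget": "gget",
}


def check_environment() -> dict[str, bool]:
    """Report whether the Python version and each dependency look usable."""
    status = {"python>=3.10": sys.version_info >= (3, 10)}
    for distribution, module in DEPENDENCY_MODULES.items():
        # find_spec returns None when the top-level module cannot be located.
        status[distribution] = find_spec(module) is not None
    return status


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```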

Quickstart

1) Download and build a genome (complete workflow)

from pathlib import Path
from GenomeUtils.Downloaders import EnsemblGenomeDownloader
from GenomeUtils.Genome import GenomeBuilder

# Download Ensembl assets
downloader = EnsemblGenomeDownloader(
    assembly_id="GRCh38",
    ensembl_release=109,
    species="homo_sapiens",
    genomes_root_dir=Path("./data/genomes"),
)

files = downloader.download()
print(files)  # { 'dna': Path(...), 'cdna': Path(...), 'annotation': Path(...) }


# Build genome from downloaded files
# The builder automatically uses species-appropriate chromosomes:
# Human: 1-22,X,Y,M,MT | Mouse: 1-19,X,Y,M,MT
genome, scaffold_genome = (
    GenomeBuilder(id="GRCh38", species="Homo sapiens", name="Human")
      .with_dna_fasta(files['dna'])
      .with_cdna_fasta(files['cdna'])
      .with_gtf_file(files['annotation'])
      .build()
)

# For other species:
# mouse_genome = GenomeBuilder(id="GRCm39", species="Mus musculus", name="Mouse")...

# Access features
chromosome = genome.chromosome_by_id("1")
first_gene = chromosome.genes[0]
print(first_gene.id, first_gene.name)

# Fast lookups (after build() the genome is indexed)
print(genome.gene_by_id(first_gene.id))
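
For orientation, the three assets correspond to files on the public Ensembl FTP site. The sketch below builds their URLs from the same parameters passed to the downloader, following Ensembl's published directory layout. It is an illustration only: `ensembl_asset_urls` is a hypothetical helper, and which DNA file EnsemblGenomeDownloader actually requests (for example `primary_assembly` versus `toplevel`, the former not being available for every species) is an assumption here, not something the project states.

```python
ENSEMBL_FTP = "https://ftp.ensembl.org/pub"


def ensembl_asset_urls(
    species: str,
    assembly_id: str,
    ensembl_release: int,
    dna_kind: str = "primary_assembly",  # assumption; some species only offer "toplevel"
) -> dict[str, str]:
    """Build Ensembl FTP URLs for the DNA FASTA, cDNA FASTA, and GTF files."""
    species_dir = species.lower()                # e.g. "homo_sapiens"
    file_prefix = f"{species_dir.capitalize()}.{assembly_id}"  # e.g. "Homo_sapiens.GRCh38"
    base = f"{ENSEMBL_FTP}/release-{ensembl_release}"
    return {
        "dna": f"{base}/fasta/{species_dir}/dna/{file_prefix}.dna.{dna_kind}.fa.gz",
        "cdna": f"{base}/fasta/{species_dir}/cdna/{file_prefix}.cdna.all.fa.gz",
        "annotation": f"{base}/gtf/{species_dir}/{file_prefix}.{ensembl_release}.gtf.gz",
    }


urls = ensembl_asset_urls("homo_sapiens", "GRCh38", 109)
for kind, url in urls.items():
    print(kind, url)
```

The dictionary keys mirror those printed by `downloader.download()` above, so the output can be compared against the files that were actually fetched.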

2) Build a genome from existing files

from pathlib import Path
from GenomeUtils.Genome import GenomeBuilder

# Prepare input files (can be .gz):
dna_fasta = Path("/path/to/genome.dna.fa.gz")
cdna_fasta = Path("/path/to/genome.cdna.fa.gz")
gtf_file  = Path("/path/to/annotations.gtf.gz")

builder = GenomeBuilder(
    id="hg38",
    species="Homo sapiens",
    name="Human Reference Genome",
    separate_scaffolds=False,  # set True to split non-main scaffolds
)

# Optional: limit to specific chromosomes (must be called before with_dna_fasta)
builder.set_chromosome_filter(["chr1", "chr2", "chrX"])  # or ["1","2","X"]

genome, _ = (
    builder
      .with_dna_fasta(dna_fasta)
      .with_cdna_fasta(cdna_fasta)
      .with_gtf_file(gtf_file)
      .build()
)

# Access features
chromosome = genome.chromosome_by_id("chr1")
first_gene = chromosome.genes[0]
print(first_gene.id, first_gene.name)

# Fast lookups (after build() the genome is indexed)
print(genome.gene_by_id(first_gene.id))
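
The chromosome filter above accepts either "chr1" or "1" style names. Ensembl files use bare names ("1", "MT") while UCSC-style files use a "chr" prefix ("chr1", "chrM"), so it can help to convert a list to the convention of your input files before filtering. The helpers below are a hypothetical sketch, independent of GenomeUtils; whether the library normalizes names itself is not stated here.

```python
def to_ensembl_name(name: str) -> str:
    """Convert a chromosome name to Ensembl style, e.g. 'chr1' -> '1', 'chrM' -> 'MT'."""
    bare = name[3:] if name.lower().startswith("chr") else name
    return "MT" if bare == "M" else bare


def to_ucsc_name(name: str) -> str:
    """Convert a chromosome name to UCSC style, e.g. '1' -> 'chr1', 'MT' -> 'chrM'."""
    bare = to_ensembl_name(name)
    return "chrM" if bare == "MT" else f"chr{bare}"


wanted = ["chr1", "2", "chrX", "chrM"]
print([to_ensembl_name(n) for n in wanted])  # ['1', '2', 'X', 'MT']
print([to_ucsc_name(n) for n in wanted])     # ['chr1', 'chr2', 'chrX', 'chrM']
```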

3) Minimal toy example (no files)

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from GenomeUtils.Genome import Genome, Chromosome, Gene, Transcript, Exon

# Create a tiny in-memory genome
genome = Genome(id="toy", species="Test species", name="Toy Genome")
chr1_seq = SeqRecord(Seq("AGCATGATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC"), id="chr1")
chromosome = Chromosome("chr1", seq_index={"chr1": chr1_seq}, genome=genome, length=len(chr1_seq.seq))

genome.add_chromosome(chromosome)

gene = Gene(id="GENE001", chr=chromosome.id, name="MYGENE", start=5, end=35, strand='+', genome=genome, chromosome=chromosome)
chromosome.add_gene(gene)

transcript = Transcript(
    id="TRANSCRIPT001",
    chr=chromosome.id,
    start=5,
    end=35,
    strand='+',
    sequence=Seq("CATGATGCATGCATGCATGCATGCATGC"),
    gene=gene,
    genome=genome,
)

gene.add_transcript(transcript)

Exon(id="EXON001", chr=chromosome.id, start=5, end=15, strand='+', gene=gene, genome=genome).add_to_transcript(transcript)
Exon(id="EXON002", chr=chromosome.id, start=25, end=35, strand='+', gene=gene, genome=genome).add_to_transcript(transcript)


genome.index()
print(genome.gene_by_id("GENE001").name)
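
When assembling features by hand, it is easy to give an exon coordinates that fall outside its transcript or overlap a neighbor. The sketch below checks the toy coordinates with plain tuples, without touching the GenomeUtils classes. It assumes 1-based inclusive coordinates as in GTF; the toy example does not state its convention, so treat that as an assumption. `Interval`, `exonic_length`, and `validate_exons` are hypothetical names.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int
    end: int  # assumed inclusive, as in GTF


def exonic_length(exons: list[Interval]) -> int:
    """Total number of bases covered by the exons (inclusive coordinates)."""
    return sum(exon.end - exon.start + 1 for exon in exons)


def validate_exons(transcript: Interval, exons: list[Interval]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    ordered = sorted(exons)
    for exon in ordered:
        if exon.start > exon.end:
            problems.append(f"{exon}: start is after end")
        if exon.start < transcript.start or exon.end > transcript.end:
            problems.append(f"{exon}: outside transcript {transcript}")
    # Compare each exon with its successor in coordinate order.
    for left, right in zip(ordered, ordered[1:]):
        if right.start <= left.end:
            problems.append(f"{left} overlaps {right}")
    return problems


transcript = Interval(5, 35)
exons = [Interval(5, 15), Interval(25, 35)]
print(validate_exons(transcript, exons))  # []
print(exonic_length(exons))               # 22
```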

4) Species-specific examples

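from pathlib import Path
from GenomeUtils.Genome import GenomeBuilder

# Placeholder input paths; replace with your own files (can be .gz)
human_dna = Path("/path/to/human.dna.fa.gz")
human_gtf = Path("/path/to/human.gtf.gz")
mouse_dna = Path("/path/to/mouse.dna.fa.gz")
mouse_gtf = Path("/path/to/mouse.gtf.gz")
custom_dna = Path("/path/to/custom.dna.fa.gz")
custom_gtf = Path("/path/to/custom.gtf.gz")
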
# Human genome (uses chromosomes 1-22, X, Y, M, MT)
human_genome, _ = GenomeBuilder(
    id="GRCh38", 
    species="Homo sapiens", 
    name="Human Reference Genome"
).with_dna_fasta(human_dna).with_gtf_file(human_gtf).build()

# Mouse genome (uses chromosomes 1-19, X, Y, M, MT)  
mouse_genome, _ = GenomeBuilder(
    id="GRCm39", 
    species="Mus musculus", 
    name="Mouse Reference Genome"
).with_dna_fasta(mouse_dna).with_gtf_file(mouse_gtf).build()


# Override default chromosomes if needed
custom_genome, _ = GenomeBuilder(
    id="custom", 
    species="Custom species", 
    name="Custom Genome",
    main_chromosomes=["chr1", "chr2", "chrX"]  # Only these chromosomes
).with_dna_fasta(custom_dna).with_gtf_file(custom_gtf).build()
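
For a species without built-in defaults, a list in the same shape as the human and mouse defaults described above (autosomes, then X, Y, M, MT) can be generated instead of typed out. This is a sketch; `main_chromosome_names` is a hypothetical helper, and the assumption that `main_chromosomes` expects names matching those in your input files is not confirmed by the project.

```python
def main_chromosome_names(autosome_count: int, prefix: str = "") -> list[str]:
    """Build names for autosomes 1..autosome_count followed by X, Y, M, MT."""
    autosomes = [str(number) for number in range(1, autosome_count + 1)]
    return [f"{prefix}{name}" for name in autosomes + ["X", "Y", "M", "MT"]]


human = main_chromosome_names(22)               # 1-22, X, Y, M, MT
mouse = main_chromosome_names(19)               # 1-19, X, Y, M, MT
rat_ucsc = main_chromosome_names(20, "chr")     # chr1-chr20, chrX, chrY, chrM, chrMT

print(len(human), human[:3], human[-4:])
```

The resulting list could then be passed as `main_chromosomes=...` in the same way as the hand-written list in the custom genome example.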

Technical Documentation

Find the technical documentation here. APIs may evolve.

Contributing

Issues and PRs are welcome.

Copyright 2025, Alexander Schliep
