An object-oriented Python library for working with genomic data.

GenomeUtils

A Python library for working with annotated genomes. We developed GenomeUtils as an alternative to, and replacement for, pyensembl, which is no longer maintained.

It provides an object-oriented model for representing genomic features: genomes, chromosomes, genes, transcripts, and exons.

Features

  • Object model: Genome > Chromosome > Gene > Transcript > Exon (+ Locus)
  • Builder workflow: GenomeBuilder assembles a Genome from FASTA (DNA, cDNA) and GTF
  • Indexed lookups, optional scaffold separation, streaming/gzip handling
  • Downloader utilities: Fetch Ensembl DNA, cDNA, and GTF assets with EnsemblGenomeDownloader
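
The streaming/gzip handling listed above can be pictured with a small standard-library sketch. This is not GenomeUtils' actual implementation; `open_maybe_gzip` and `iter_fasta_headers` are hypothetical names used only for illustration. The idea is to detect compression from the gzip magic bytes and to iterate line by line, so a large FASTA file never has to be held in memory at once.

```python
import gzip
from pathlib import Path
from typing import Iterator, TextIO


def open_maybe_gzip(path: Path) -> TextIO:
    """Open a text file, transparently decompressing it if it is gzipped.

    Hypothetical helper, not part of GenomeUtils.
    """
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic number
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "rt")


def iter_fasta_headers(path: Path) -> Iterator[str]:
    """Yield FASTA record IDs one at a time without loading the whole file."""
    with open_maybe_gzip(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The record ID is the first whitespace-delimited token.
                fields = line[1:].split()
                if fields:
                    yield fields[0]
```

Checking the magic bytes rather than the `.gz` suffix is a design choice in this sketch: it keeps working when a compressed file has been renamed.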

Installation

Install GenomeUtils with pip:

pip install GenomeUtils

Requires Python >= 3.10. pip installs the dependencies automatically: biopython, gffutils, requests, and gget.
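
If an import fails after installation, a quick environment check can help narrow down the cause. The sketch below uses only the standard library; the helper names are hypothetical, and the distribution names are taken from the dependency list above.

```python
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 10)
# Distribution names as given in the dependency list above.
DEPENDENCIES = ["biopython", "gffutils", "requests", "gget"]


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= REQUIRED_PYTHON


def missing_distributions(names: list[str]) -> list[str]:
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    print("Missing dependencies:", missing_distributions(DEPENDENCIES))
```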

Quickstart

1) Download and build a genome (complete workflow)

from pathlib import Path
from GenomeUtils.Downloaders import EnsemblGenomeDownloader
from GenomeUtils.Genome import GenomeBuilder

# Download Ensembl assets
downloader = EnsemblGenomeDownloader(
    assembly_id="GRCh38",
    ensembl_release=109,
    species="homo_sapiens",
    genomes_root_dir=Path("./data/genomes"),
)

files = downloader.download()
print(files)  # { 'dna': Path(...), 'cdna': Path(...), 'annotation': Path(...) }


# Build genome from downloaded files
# The builder automatically uses species-appropriate chromosomes:
# Human: 1-22,X,Y,M,MT | Mouse: 1-19,X,Y,M,MT
genome, scaffold_genome = (
    GenomeBuilder(id="GRCh38", species="Homo sapiens", name="Human")
      .with_dna_fasta(files['dna'])
      .with_cdna_fasta(files['cdna'])
      .with_gtf_file(files['annotation'])
      .build()
)

# For other species:
# mouse_genome = GenomeBuilder(id="GRCm39", species="Mus musculus", name="Mouse")...

# Access features
chromosome = genome.chromosome_by_id("1")
first_gene = chromosome.genes[0]
print(first_gene.id, first_gene.name)

# Fast lookups (after build() the genome is indexed)
print(genome.gene_by_id(first_gene.id))
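
Because building a genome from large files can take a while, it may be worth validating the download result before handing it to the builder. The following is a minimal sketch, assuming `downloader.download()` returns a dict with the keys 'dna', 'cdna', and 'annotation' as shown in the comment above; `check_downloaded_files` is a hypothetical helper and not part of GenomeUtils.

```python
from pathlib import Path

# Keys shown in the Quickstart comment for the download result.
EXPECTED_KEYS = ("dna", "cdna", "annotation")


def check_downloaded_files(files: dict[str, Path]) -> None:
    """Raise an error if an expected asset is missing from disk or empty."""
    for key in EXPECTED_KEYS:
        if key not in files:
            raise KeyError(f"download result has no {key!r} entry")
        path = Path(files[key])
        if not path.is_file():
            raise FileNotFoundError(f"{key} file not found: {path}")
        if path.stat().st_size == 0:
            raise ValueError(f"{key} file is empty: {path}")
```

Calling `check_downloaded_files(files)` right after `downloader.download()` would surface an interrupted or partial download before the build starts.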

2) Build a genome from existing files

from pathlib import Path
from GenomeUtils.Genome import GenomeBuilder

# Prepare input files (can be .gz):
dna_fasta = Path("/path/to/genome.dna.fa.gz")
cdna_fasta = Path("/path/to/genome.cdna.fa.gz")
gtf_file  = Path("/path/to/annotations.gtf.gz")

builder = GenomeBuilder(
    id="hg38",
    species="Homo sapiens",
    name="Human Reference Genome",
    separate_scaffolds=False,  # set True to split non-main scaffolds
)

# Optional: limit to specific chromosomes (must be called before with_dna_fasta)
builder.set_chromosome_filter(["chr1", "chr2", "chrX"])  # or ["1","2","X"]

genome, _ = (
    builder
      .with_dna_fasta(dna_fasta)
      .with_cdna_fasta(cdna_fasta)
      .with_gtf_file(gtf_file)
      .build()
)

# Access features
chromosome = genome.chromosome_by_id("chr1")
first_gene = chromosome.genes[0]
print(first_gene.id, first_gene.name)

# Fast lookups (after build() the genome is indexed)
print(genome.gene_by_id(first_gene.id))
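
The filter above accepts either "chr1" or "1" style names, and which one applies depends on how the input files name their sequences: Ensembl files use bare names ("1", "X", "MT"), while UCSC-style files use a "chr" prefix ("chr1", "chrX", "chrM"). The helpers below are a hypothetical sketch, not part of GenomeUtils, for converting main chromosome names between the two styles so that filters and lookups match the files in use. Scaffold and patch names are not handled.

```python
def to_ensembl_style(name: str) -> str:
    """Convert a UCSC-style name such as "chr1" to Ensembl style ("1")."""
    if name.startswith("chr"):
        name = name[3:]
    # UCSC calls the mitochondrial genome "chrM"; Ensembl calls it "MT".
    return "MT" if name == "M" else name


def to_ucsc_style(name: str) -> str:
    """Convert an Ensembl-style name such as "1" to UCSC style ("chr1")."""
    name = to_ensembl_style(name)  # normalize first so the call is idempotent
    return "chrM" if name == "MT" else f"chr{name}"


ensembl_names = [to_ensembl_style(n) for n in ["chr1", "chr2", "chrX"]]
print(ensembl_names)
```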

3) Minimal toy example (no files)

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from GenomeUtils.Genome import Genome, Chromosome, Gene, Transcript, Exon

# Create a tiny in-memory genome
genome = Genome(id="toy", species="Test species", name="Toy Genome")
chr1_seq = SeqRecord(Seq("AGCATGATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC"), id="chr1")
chromosome = Chromosome("chr1", seq_index={"chr1": chr1_seq}, genome=genome, length=len(chr1_seq.seq))

genome.add_chromosome(chromosome)

gene = Gene(id="GENE001", chr=chromosome, name="MYGENE", start=5, end=35, strand='+', genome=genome)
chromosome.add_gene(gene)

transcript = Transcript(
    id="TRANSCRIPT001",
    chr=chromosome,
    start=5,
    end=35,
    strand='+',
    sequence=Seq("CATGATGCATGCATGCATGCATGCATGC"),
    gene=gene,
    genome=genome,
)

gene.add_transcript(transcript)

transcript.add_exon(Exon(id="EXON001", chr=chromosome, start=5, end=15, strand='+', transcript=transcript, genome=genome))
transcript.add_exon(Exon(id="EXON002", chr=chromosome, start=25, end=35, strand='+', transcript=transcript, genome=genome))

genome.index()
print(genome.gene_by_id("GENE001").name)

4) Species-specific examples

from pathlib import Path
from GenomeUtils.Genome import GenomeBuilder

# Placeholder input paths: point these at your own FASTA and GTF files (can be .gz)
human_dna = Path("/path/to/human.dna.fa.gz")
human_gtf = Path("/path/to/human.gtf.gz")
mouse_dna = Path("/path/to/mouse.dna.fa.gz")
mouse_gtf = Path("/path/to/mouse.gtf.gz")
custom_dna = Path("/path/to/custom.dna.fa.gz")
custom_gtf = Path("/path/to/custom.gtf.gz")

# Human genome (uses chromosomes 1-22, X, Y, M, MT)
human_genome, _ = GenomeBuilder(
    id="GRCh38",
    species="Homo sapiens",
    name="Human Reference Genome",
).with_dna_fasta(human_dna).with_gtf_file(human_gtf).build()

# Mouse genome (uses chromosomes 1-19, X, Y, M, MT)
mouse_genome, _ = GenomeBuilder(
    id="GRCm39",
    species="Mus musculus",
    name="Mouse Reference Genome",
).with_dna_fasta(mouse_dna).with_gtf_file(mouse_gtf).build()

# Override default chromosomes if needed
custom_genome, _ = GenomeBuilder(
    id="custom",
    species="Custom species",
    name="Custom Genome",
    main_chromosomes=["chr1", "chr2", "chrX"],  # Only these chromosomes
).with_dna_fasta(custom_dna).with_gtf_file(custom_gtf).build()
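
When overriding `main_chromosomes` for another species, a list in the same shape as the defaults described in the comments above (numbered autosomes followed by X, Y, M, MT) can be generated rather than typed out. This is a hypothetical helper, not part of GenomeUtils, and it assumes Ensembl-style names without a "chr" prefix.

```python
def main_chromosome_names(autosome_count: int) -> list[str]:
    """Return autosomes 1..autosome_count followed by X, Y, M, and MT."""
    autosomes = [str(i) for i in range(1, autosome_count + 1)]
    return autosomes + ["X", "Y", "M", "MT"]


human_chromosomes = main_chromosome_names(22)  # 1-22, X, Y, M, MT
mouse_chromosomes = main_chromosome_names(19)  # 1-19, X, Y, M, MT
print(human_chromosomes)
print(mouse_chromosomes)
```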

Project status

Early-stage library. APIs may evolve.

Contributing

Issues and PRs are welcome.

Copyright 2025, Alexander Schliep
