Ensembl GenomIO -- pipelines to convert basic genomic data into Ensembl cores and back to flatfile

Project description

Ensembl GenomIO

Pipelines to turn basic genomic data into Ensembl cores and back.

This is a multilanguage (Perl, Python) repo providing eHive pipelines and various scripts (see below) to prepare genomic data and load it as an Ensembl core database, or to dump such core databases as file bundles.

Bundles themselves consist of genomic data in various formats (e.g. fasta, gff3, json) and should follow the corresponding specification.

Installation and configuration

This package is publicly available on PyPI, so it can be installed with your preferred Python package manager, e.g.:

pip install ensembl-genomio

Prerequisites

Pipelines are intended to be run inside the Ensembl production environment. Please make sure you have all the proper credentials, keys, etc. set up.

Get repo and install

Clone:

git clone git@github.com:Ensembl/ensembl-genomio.git

Install the Python part (of the pipelines) and test it:

pip install ./ensembl-genomio
# And test it has been installed correctly
python -c 'import ensembl.io.genomio'
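
If a readable message is preferred over a traceback, the import check above can be sketched in Python as follows. The helper name `is_importable` is an illustrative assumption and not part of ensembl-genomio; only the module path `ensembl.io.genomio` comes from the command above.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located by the import system.

    Note: for a dotted name, parent packages are imported during the lookup.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "ensembl") is itself missing.
        return False


if __name__ == "__main__":
    name = "ensembl.io.genomio"  # module path taken from the install check above
    status = "is installed" if is_importable(name) else "was NOT found"
    print(f"{name} {status}")
```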

Update your Perl environment variables (if you need to):

export PERL5LIB=$(pwd)/ensembl-genomio/src/perl:$PERL5LIB
export PATH=$(pwd)/ensembl-genomio/scripts:$PATH

Optional installation

If you need an "editable" install of the Python package, use the '-e' option:

pip install -e ./ensembl-genomio

To install additional dependencies (e.g. [docs] or [cicd]), append the corresponding [<tag>] string, e.g.:

pip install -e ./ensembl-genomio[cicd]

For the list of tags see [project.optional-dependencies] in pyproject.toml.

Additional steps to use automated generation of the documentation

  • Install the Python part with the [docs] tag
  • Change into the repo dir
  • Run the mkdocs build command

git clone git@github.com:Ensembl/ensembl-genomio.git
cd ./ensembl-genomio
pip install -e .[docs]
mkdocs build

Nextflow installation

Please refer to the "Installation" section of the Nextflow pipelines document.

Pipelines

Initialising and running eHive-based pipelines

Pipelines are derived from Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf, or from Bio::EnsEMBL::Hive::PipeConfig::EnsemblGeneric_conf, or from Bio::EnsEMBL::EGPipeline::PipeConfig::EGGeneric_conf (see documentation).

The same Perl class prefix is used for every pipeline: Bio::EnsEMBL::EGPipeline::PipeConfig:: .

N.B. Don't forget to specify the -reg_file option when running the loop command: beekeeper.pl -url $url -reg_file $REG_FILE -loop

init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf \
    $($CMD details script) \
    -hive_force_init 1 \
    -queue_name $SPECIFIC_QUEUE_NAME \
    -registry $REG_FILE \
    -pipeline_tag "_${PIPELINE_RUN_TAG}" \
    -ensembl_root_dir ${ENSEMBL_ROOT_DIR} \
    -dbsrv_url $($CMD details url) \
    -proddb_url "$($PROD_SERVER details url)""$PROD_DBNAME" \
    -taxonomy_url "$($PROD_SERVER details url)""$TAXONOMY_DBNAME" \
    -release ${RELEASE_VERSION} \
    -data_dir ${INPUT_DIR}/manifests_dir/ \
    -pipeline_dir $OUT_DIR/loader_run \
    ${OTHER_OPTIONS} \
    2> $OUT_DIR/init.stderr \
    1> $OUT_DIR/init.stdout

SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -sync

LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -reg_file $REG_FILE -loop

$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
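
The grep/perl extraction above can also be sketched in Python, which may be easier to reuse in a wrapper script. This is a minimal sketch under assumptions: the function name `extract_beekeeper_commands` is hypothetical, and, unlike the shell version, it strips surrounding whitespace before matching and keeps only the first matching line of each kind.

```python
import re


def extract_beekeeper_commands(init_stdout: str) -> dict[str, str]:
    """Pull the -sync and -loop beekeeper commands out of init_pipeline.pl output."""
    commands: dict[str, str] = {}
    for raw_line in init_stdout.splitlines():
        # Trim whitespace and remove double quotes, as the perl one-liners do.
        line = raw_line.strip().replace('"', "")
        if line.endswith("-sync"):
            commands.setdefault("sync", line)
        elif "-loop" in line:
            # Drop a trailing "# comment" before storing the command.
            commands.setdefault("loop", re.sub(r"\s*#.*$", "", line))
    return commands


if __name__ == "__main__":
    # Assumption: init.stdout was written to the current directory.
    with open("init.stdout", encoding="utf-8") as handle:
        for kind, command in extract_beekeeper_commands(handle.read()).items():
            print(f"{kind}: {command}")
```

The returned strings could then be split with `shlex.split` and passed to `subprocess.run`, redirecting stdout and stderr to files as in the shell commands above.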

List of the pipelines

Pipeline name | Description | Document | Comment | Module
BRC4_genome_loader | Creates an Ensembl core database from a set of flat files, or adds ad-hoc (i.e. organellar) sequences to the existing core | BRC4_genome_loader | | Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf
BRC4_genome_dumper | | | |
BRC4_genome_prepare | | | |
BRC4_addition_prepare | | | |
BRC4_genome_compare | | | |
LoadGFF3 | | | |
LoadGFF3Batch | | | |

Scripts

CI/CD bits

For now, some GitLab CI pipelines have been introduced to keep things in shape, though this part is under constant development. Some documentation can be found in the docs for GitLab CI/CD.

Various docs

See docs

Unit testing

The Python part of the codebase now has unit tests available to test each module. Make sure you have installed this repository's [cicd] dependencies (via pip install ensembl-genomio[cicd]) before continuing.

Running all the tests in one go is as easy as running pytest from the root of the repository. If you also want to measure, collect and report the code coverage, you can do:

coverage run -m pytest
coverage report

You can also run specific tests by supplying the path to the specific test file/subfolder, e.g.:

pytest lib/python/tests/test_schema.py

Acknowledgements

Some of this code and documentation is inherited from the EnsemblGenomes and other Ensembl projects. We appreciate the effort and time spent by developers of the EnsemblGenomes and Ensembl projects.

Thank you!

Download files

Source Distribution

ensembl_genomio-1.5.0.tar.gz (569.0 kB)

Uploaded Source

Built Distribution

ensembl_genomio-1.5.0-py3-none-any.whl (411.4 kB)

Uploaded Python 3

File details

Details for the file ensembl_genomio-1.5.0.tar.gz.

File metadata

  • Download URL: ensembl_genomio-1.5.0.tar.gz
  • Size: 569.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for ensembl_genomio-1.5.0.tar.gz
Algorithm Hash digest
SHA256 582443203d39277ba8aeee3aa218c8181ae75f8d306218759c71b462d72c0e0f
MD5 dfaf503ff68eedd5795cabcc30588c99
BLAKE2b-256 7160bf38a615b50777326300f578682f84223783297d05cc366daadddf8b9752
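
A downloaded archive can be checked against the SHA256 digest listed above with a few lines of Python. This is a minimal sketch: the function name `sha256_of_file` is hypothetical, and it assumes the source distribution was saved in the current directory under its original file name.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # SHA256 value copied from the hash listing above.
    expected = "582443203d39277ba8aeee3aa218c8181ae75f8d306218759c71b462d72c0e0f"
    # Assumption: the sdist was downloaded into the current directory.
    archive = Path("ensembl_genomio-1.5.0.tar.gz")
    if archive.is_file():
        print("OK" if sha256_of_file(archive) == expected else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```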

Provenance

The following attestation bundles were made for ensembl_genomio-1.5.0.tar.gz:

Publisher: publish.yml on Ensembl/ensembl-genomio

File details

Details for the file ensembl_genomio-1.5.0-py3-none-any.whl.

File hashes

Hashes for ensembl_genomio-1.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 ff41e620517e0e792271f475e97ae6b992e005a6cafada0864fecd3b70a6015e
MD5 3535be56809c2f36d176186185029fca
BLAKE2b-256 6961f893130894406c4f316ce54fd00b09d9767acea2c41ad100e9e2777f5dcf

Provenance

The following attestation bundles were made for ensembl_genomio-1.5.0-py3-none-any.whl:

Publisher: publish.yml on Ensembl/ensembl-genomio
