Ensembl GenomIO -- pipelines to convert basic genomic data into Ensembl cores and back to flatfile
Ensembl GenomIO
Pipelines to turn basic genomic data into Ensembl cores and back.
This is a multilanguage (Perl, Python) repo providing eHive pipelines and various scripts (see below) to prepare genomic data and load it as an Ensembl core database, or to dump such core databases as file bundles.
Bundles themselves consist of genomic data in various formats (e.g. fasta, gff3, json) and should follow the corresponding specification.
Installation and configuration
This repository is publicly available on PyPI, so it can be installed with your preferred Python package manager, e.g.:
pip install ensembl-genomio
Prerequisites
Pipelines are intended to be run inside the Ensembl production environment. Please make sure you have all the proper credentials, keys, etc. set up.
Get repo and install
Clone:
git clone git@github.com:Ensembl/ensembl-genomio.git
Install the python part (of the pipelines) and test it:
pip install ./ensembl-genomio
# And test it has been installed correctly
python -c 'import ensembl.io.genomio'
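The same installation check can be done from Python without relying on an import error. The following is only a minimal sketch: the helper names `is_importable` and `installed_version` are made up for this example, and the distribution name `ensembl-genomio` and module path `ensembl.io.genomio` are taken from the commands above.

```python
from importlib import metadata, util
from typing import Optional


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located (parent packages get imported)."""
    try:
        return util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "ensembl") is not installed.
        return False


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("importable:", is_importable("ensembl.io.genomio"))
    print("version:", installed_version("ensembl-genomio"))
```

If the first line prints `False` or the second prints `None`, the package is most likely not installed in the active Python environment.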
Update your Perl environment variables (if you need to):
export PERL5LIB=$(pwd)/ensembl-genomio/src/perl:$PERL5LIB
export PATH=$(pwd)/ensembl-genomio/scripts:$PATH
Optional installation
If you need an "editable" installation of the Python package, use the '-e' option:
pip install -e ./ensembl-genomio
To install additional dependencies (e.g. [docs] or [cicd]), provide the [<tag>] string, e.g.:
pip install -e ./ensembl-genomio[cicd]
For the list of tags, see [project.optional-dependencies] in pyproject.toml.
Additional steps to use automated generation of the documentation
- Install the Python part with the [docs] tag
- Change into the repo dir
- Run the mkdocs build command
git clone git@github.com:Ensembl/ensembl-genomio.git
cd ./ensembl-genomio
pip install -e .[docs]
mkdocs build
Nextflow installation
Please refer to the "Installation" section of the Nextflow pipelines document.
Pipelines
Initialising and running eHive-based pipelines
Pipelines are derived from Bio::EnsEMBL::Hive::PipeConfig::HiveGeneric_conf, or from Bio::EnsEMBL::Hive::PipeConfig::EnsemblGeneric_conf, or from Bio::EnsEMBL::EGPipeline::PipeConfig::EGGeneric_conf (see documentation).
The same Perl class prefix is used for every pipeline: Bio::EnsEMBL::EGPipeline::PipeConfig::
N.B. Don't forget to specify the -reg_file option for the beekeeper.pl -url $url -reg_file $REG_FILE -loop command.
init_pipeline.pl Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf \
$($CMD details script) \
-hive_force_init 1 \
-queue_name $SPECIFIC_QUEUE_NAME \
-registry $REG_FILE \
-pipeline_tag "_${PIPELINE_RUN_TAG}" \
-ensembl_root_dir ${ENSEMBL_ROOT_DIR} \
-dbsrv_url $($CMD details url) \
-proddb_url "$($PROD_SERVER details url)""$PROD_DBNAME" \
-taxonomy_url "$($PROD_SERVER details url)""$TAXONOMY_DBNAME" \
-release ${RELEASE_VERSION} \
-data_dir ${INPUT_DIR}/manifests_dir/ \
-pipeline_dir $OUT_DIR/loader_run \
${OTHER_OPTIONS} \
2> $OUT_DIR/init.stderr \
1> $OUT_DIR/init.stdout
SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
# beekeeper.pl -url $url -sync
LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
# beekeeper.pl -url $url -reg_file $REG_FILE -loop
$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
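The grep/perl extraction of the sync and loop commands above can also be sketched in Python, which may be easier to follow or to reuse in a wrapper script. This is an illustration, not part of the repository: the function name `extract_beekeeper_commands` is hypothetical, and the sample text is a made-up stand-in shaped after the "should get something like" comments above, not real init_pipeline.pl output.

```python
import re


def extract_beekeeper_commands(init_stdout: str) -> dict[str, list[str]]:
    """Collect the sync and loop command lines printed at initialisation."""
    sync_cmds: list[str] = []
    loop_cmds: list[str] = []
    for line in init_stdout.splitlines():
        # Like grep -- -sync'$': keep only lines that end with "-sync".
        if line.endswith("-sync"):
            sync_cmds.append(line.lstrip().replace('"', ""))
        # Like grep -- -loop: keep any line that mentions "-loop".
        if "-loop" in line:
            # Drop leading whitespace, then any trailing "# comment".
            cleaned = re.sub(r"\s*#.*$", "", line.lstrip())
            loop_cmds.append(cleaned.replace('"', ""))
    return {"sync": sync_cmds, "loop": loop_cmds}


# Made-up sample text for illustration only.
SAMPLE_STDOUT = (
    '    beekeeper.pl -url "mysql://user@host/db" -sync\n'
    '    beekeeper.pl -url "mysql://user@host/db" -reg_file reg.pm -loop  # run it\n'
)

for kind, commands in extract_beekeeper_commands(SAMPLE_STDOUT).items():
    print(kind, commands)
```

Like the shell version, this keeps every matching line, so it is worth checking that exactly one command of each kind was found before running it.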
List of the pipelines
Pipeline name | Description | Document | Comment | Module |
---|---|---|---|---|
BRC4_genome_loader | Creates an Ensembl core database from a set of flat files, or adds ad-hoc (i.e. organelle) sequences to an existing core | BRC4_genome_loader | | Bio::EnsEMBL::Pipeline::PipeConfig::BRC4_genome_loader_conf |
BRC4_genome_dumper | | | | |
BRC4_genome_prepare | | | | |
BRC4_addition_prepare | | | | |
BRC4_genome_compare | | | | |
LoadGFF3 | | | | |
LoadGFF3Batch | | | | |
Scripts
- trf_split_run.bash -- a trf wrapper with chunking support to be used with ensembl-production-imported DNAFeatures pipeline (see docs)
CI/CD bits
For now, some GitLab CI pipelines have been introduced to keep things in shape, though this part is under constant development. Some documentation can be found in the docs for GitLab CI/CD.
Various docs
See docs
Unit testing
The Python part of the codebase now has unit tests available for each module. Make sure you have installed this repository's [cicd] dependencies (via pip install ensembl-genomio[cicd]) before continuing.
Running all the tests in one go is as easy as running pytest from the root of the repository. If you also want to measure, collect and report the code coverage, you can do:
coverage run -m pytest
coverage report
You can also run specific tests by supplying the path to the specific test file/subfolder, e.g.:
pytest lib/python/tests/test_schema.py
Acknowledgements
Some of this code and documentation is inherited from the EnsemblGenomes and other Ensembl projects. We appreciate the effort and time spent by developers of the EnsemblGenomes and Ensembl projects.
Thank you!