Ensembl terminal user interface tools

Project description

ensembl-tui

ensembl-tui provides the eti terminal application for obtaining a subset of the data provided by Ensembl which can then be queried locally. You can have multiple such subsets on your machine, each corresponding to a different selection of species and data types.

Warning

We currently only support accessing data from the main ensembl.org site. If you discover errors, please post a bug report.

Installing the software

General user installation instructions
$ pip install ensembl-tui

Developer installation instructions

Fork the repo and clone your fork to your local machine. In the terminal, create either a Python virtual environment or a new conda environment and activate it. In that environment, install flit:

$ pip install flit

Then do the flit version of a "developer install", which essentially creates a symlink to the repo's source directory:

$ flit install -s --python `which python`

Resources required to subset Ensembl data

Ensembl hosts some very large data sets. You need to have a machine with sufficient disk space to store the data you want to download. At present we do not have support for predicting how much storage would be required for a given selection of species and data types. You will need to experiment.
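
Because there is no built-in size prediction, one way to run such an experiment is to check the free space before a download and measure the download directory afterwards. The following is a minimal sketch using only the Python standard library. The helper names (dir_size_bytes, free_bytes) and the example directory name are illustrative and are not part of eti.

```python
import os
import shutil


def dir_size_bytes(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # skip symlinks so linked files are not counted twice
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def free_bytes(path="."):
    """Free space in bytes on the filesystem that holds path."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # "ensembl_download" is only an example directory name
    target = "ensembl_download"
    if os.path.isdir(target):
        print(f"{target}: {dir_size_bytes(target) / 1e9:.2f} GB used")
    print(f"free space here: {free_bytes('.') / 1e9:.2f} GB")
```

Measuring a small selection first (for example, one or two species) gives a rough per-genome figure to extrapolate from, although genome sizes and data types vary.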

Some commands can be run in parallel but have moderate memory requirements. If you have a machine with limited RAM, you may need to reduce the number of parallel processes. Again, run some experiments.

Getting set up

Specifying what data you want to download and where to put it

We use a plain text file to indicate the Ensembl domain, release and types of genomic data to download. Start by using the demo-config subcommand.

Usage: eti demo-config [OPTIONS]

  exports sample config and species table to the nominated path

Options:
  -o, --outpath PATH              Path to directory to export all rc contents.
  --domain [vertebrates|main|metazoa|protists]
                                  Ensembl domain to use for species list.
                                  [default: main]
  -f, --force_overwrite           Overwrite existing data.
  --help                          Show this message and exit.

$ eti demo-config -o ensembl_download

This command creates an ensembl_download directory and writes two plain text files into it:

  1. species.tsv: contains the Latin names, common names, etc. of the species accessible at the ensembl.org website.
  2. sample.cfg: a sample configuration file that you can edit to specify the data you want to download.

The latter file includes comments on how to edit it in order to specify the genomic resources that you want.
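
When editing the config file it can help to look up species names in species.tsv. The sketch below reads a tab-separated file with the standard csv module and searches every field for a piece of text. It assumes the file starts with a header row and deliberately does not assume any particular column names, since those are not documented here. The function names are illustrative.

```python
import csv
import os


def read_species_table(path):
    """Return (header, rows) from a tab-separated file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # assumption: the first line is a header row
        header = next(reader, [])
        rows = [row for row in reader if row]
    return header, rows


def find_rows(rows, text):
    """Rows where any field contains text, ignoring case."""
    needle = text.lower()
    return [row for row in rows if any(needle in field.lower() for field in row)]


if __name__ == "__main__":
    path = "ensembl_download/species.tsv"
    if os.path.exists(path):
        header, rows = read_species_table(path)
        print(header)
        for row in find_rows(rows, "sapiens"):
            print(row)
```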

Downloading the data

The download subcommand downloads the data indicated in the config file to a local directory.

Usage: eti download [OPTIONS]

  download data from Ensembl's ftp site

Options:
  -c, --configpath PATH    Path to config file specifying databases, (only
                           species or compara at present).
  -d, --debug              Maximum verbosity, and reduces number of downloads,
                           etc...
  -sm, --species_map TEXT  Tsv file with species names, abbreviations etc..
                           [default: main]
  -v, --verbose
  --help                   Show this message and exit.

For a config file named config.cfg, the download command would be:

$ cd to/directory/with/config.cfg
$ eti download -c config.cfg

Note

This is the only step that requires an internet connection. Downloads can be interrupted and resumed. The software will delete partially downloaded files.

The download creates a new .cfg file inside the download directory. This file is used by the install command.

Installing the data

The install subcommand converts the downloaded data into formats designed to improve query performance.

Usage: eti install [OPTIONS]

  create the local representations of the data

Options:
  -d, --download PATH       Path to local download directory containing a cfg
                            file.
  -np, --num_procs INTEGER  Number of procs to use.  [default: 1]
  -f, --force_overwrite     Overwrite existing data.
  -v, --verbose
  --help                    Show this message and exit.

This step can be run in parallel, but the memory requirements will scale with the number of genomes. We therefore suggest starting with a small number of CPUs and monitoring performance on your system. The following command uses 2 CPUs and has been safe on systems with only 16 GB of RAM for 10 primate genomes, including homology data and whole genome alignments.

$ cd to/directory/with/downloaded_data
$ eti install -d downloaded_data -np 2
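
As a rough starting point for choosing -np, the sketch below caps the process count by both CPU count and a memory budget per process. The default of 8 GB per process is only an assumption, obtained by dividing the single data point above (16 GB, 2 processes, 10 primate genomes). It is not a measured requirement and will differ for other selections. The function name is illustrative.

```python
import os


def suggest_num_procs(ram_gb, gb_per_proc=8.0, cpus=None):
    """Conservative process count for a given amount of RAM.

    gb_per_proc is an assumed budget, not a measured requirement.
    """
    if cpus is None:
        cpus = os.cpu_count() or 1
    by_memory = int(ram_gb // gb_per_proc)
    # never go below 1 process or above the available CPUs
    return max(1, min(cpus, by_memory))


if __name__ == "__main__":
    print(suggest_num_procs(ram_gb=16))
```

Treat the result as an upper bound to try first, then watch memory use during the run and lower it if needed.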

Checking what has been installed

This gives a summary of the data that has been installed at a given path.

Usage: eti installed [OPTIONS]

  show what is installed

Options:
  -i, --installed TEXT  Path to root directory of an installation.  [required]
  --help                Show this message and exit.
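
The three steps above (download, install, installed) can be chained from a script. The sketch below builds the command lines using only the options shown in the help output above and, by default, just prints them. The paths config.cfg, downloaded_data and install_dir are placeholders. In particular, the location of the installed data depends on your config file, so replace install_dir accordingly.

```python
import subprocess


def build_commands(config, download_dir, install_dir, num_procs=1):
    """argv lists for the download, install and installed steps."""
    return [
        ["eti", "download", "-c", config],
        ["eti", "install", "-d", download_dir, "-np", str(num_procs)],
        ["eti", "installed", "-i", install_dir],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", " ".join(argv))
        if not dry_run:
            # check=True stops at the first step that fails
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # all three paths are placeholders
    cmds = build_commands("config.cfg", "downloaded_data", "install_dir", num_procs=2)
    run_all(cmds, dry_run=True)
```

Passing argv lists rather than a shell string avoids quoting problems with paths that contain spaces.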

Interrogating the data

We provide a conventional command line interface for querying the data with subcommands.

The full list of subcommands

You can get help on individual subcommands by running eti <subcommand> in the terminal.

Usage: eti [OPTIONS] COMMAND [ARGS]...

  Tools for obtaining and interrogating subsets of https://ensembl.org genomic
  data.

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  tui              Open Textual TUI.
  demo-config      exports sample config and species table to the nominated...
  download         download data from Ensembl's ftp site
  install          create the local representations of the data
  installed        show what is installed
  species-summary  genome summary data for a species
  dump-genes       export meta-data table for genes from one species to...
  compara-summary  summary data for compara
  homologs         exports CDS sequence data in fasta format for homology...
  alignments       export multiple alignments in fasta format for named genes

We also provide an experimental terminal user interface (TUI) that allows you to explore the data in a more interactive way. This is invoked with the tui subcommand.

Getting a summary of a genome

A command like the following

$ eti species-summary -i primates10_113/install --species human

displays two tables for the indicated genome. The first is the biotypes and their counts, the second the repeat classes / types and their counts.

Getting a summary of homology data

A command like the following

$ eti compara-summary -i primates10_113/install

displays the homology types and counts. The values under homology_type can be passed to the --homology_type option of the homologs command.

Exporting related sequences

A command like the following

$ eti homologs -i primates10_113/install/ --outdir sampled_100 --ref human --coord_names 1 --limit 100

will sample 100 one-to-one orthologs (the default homology type) of protein-coding genes (the only biotype supported at present) located on human chromosome 1. The canonical CDS sequences will be written in fasta format to the directory sampled_100.
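
To check what was exported, the fasta files can be read with a few lines of Python. This is a minimal sketch with a hand-written parser. The glob pattern "*.fa*" is an assumption about how the output files are named, so adjust it to match the files you actually see in the output directory. The function names are illustrative.

```python
from pathlib import Path


def parse_fasta(text):
    """Return {label: sequence} from FASTA-formatted text.

    If a label occurs twice, the later record replaces the earlier one.
    """
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]
            records[label] = []
        elif label is not None:
            records[label].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def summarise_dir(outdir, pattern="*.fa*"):
    """Yield (file name, number of sequences) for matching files."""
    for path in sorted(Path(outdir).glob(pattern)):
        yield path.name, len(parse_fasta(path.read_text()))


if __name__ == "__main__":
    for name, count in summarise_dir("sampled_100"):
        print(name, count)
```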

Exporting whole genome alignments

A command like the following

$ eti alignments -i primates10_113/install --outdir sampled_aligns_100 --align_name '*primate*' --coord_names 1 --ref human --limit 10

samples 10 alignments that include protein-coding genes on human chromosome 1. These are from the Ensembl whole genome alignment whose name matches the glob pattern *primate*.

Warning

If this pattern matches more than one installed Ensembl alignment, the program will exit.
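
To see how a glob pattern behaves before passing it to --align_name, you can test it against a list of names with Python's fnmatch module. This sketch only illustrates glob matching and the "exactly one match" rule. It is not how eti itself resolves the name, the alignment names used are hypothetical, and treating zero matches as an error is a choice made for this example.

```python
from fnmatch import fnmatchcase


def select_alignment(pattern, installed_names):
    """Return the single name matching pattern, else raise ValueError."""
    matches = [name for name in installed_names if fnmatchcase(name, pattern)]
    if len(matches) != 1:
        raise ValueError(f"{pattern!r} matched {len(matches)} names: {matches}")
    return matches[0]


if __name__ == "__main__":
    # hypothetical alignment names, for illustration only
    names = ["10_primates.epo", "21_mammals.epo"]
    print(select_alignment("*primate*", names))
    try:
        select_alignment("*epo*", names)
    except ValueError as err:
        print("ambiguous:", err)
```

If a pattern is ambiguous, make it more specific until only one installed alignment matches.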
