Automated environmental annotation of marine metagenomic samples

Metacontextify

A Python package for retrieving environmental context and properties for marine sequences from MGnify and ENA.

Overview

Metacontextify provides a comprehensive pipeline for enriching sequence data with environmental metadata from multiple sources:

  • MGnify: Marine metagenomics and genomics sequences and assemblies
  • ENA: European Nucleotide Archive sample metadata

Features

Retrieve environmental properties for:

  • ENA Sample IDs, MGnify Protein, Genome, Assembly, and Sample IDs
  • a JSON file with the hits in a MGnify protein similarity search
  • a CSV file with the columns lat, lon, sample_date and depth

These environmental properties are temperature, salinity, pH and concentrations of nitrate, oxygen, phosphate and phytoplankton.

The tool provides a command-line interface and can also be imported as a Python module.

Installation

From source

git clone https://github.com/MaartenLangen/metacontextify.git
cd metacontextify
pip install -e .

With development dependencies

pip install -e ".[dev]"

Quick Start with CLI

Setting up Copernicus capabilities

Retrieving environmental properties from the Copernicus Marine Service requires credentials. A guide on how to create these for free can be found here. Once you have your credentials, save them for the CLI with the following command:

metacontextify login user123 pswrd123

Processing IDs

A text file with one ID per line can be processed with the following command:

metacontextify id-file input.txt protein output.csv

This is the command for MGnify Protein IDs. The other supported identifier types and the optional parameters can be listed with

metacontextify id-file --help
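
An input file of this shape can be produced and read back with a few lines of standard-library Python. This is a minimal sketch, not Metacontextify's own parser: the helper names write_id_file and read_ids and the placeholder identifiers are hypothetical, and skipping blank lines is an assumption about what a tolerant reader would do.

```python
from pathlib import Path
import tempfile


def write_id_file(path, ids):
    """Write one identifier per line (hypothetical helper)."""
    Path(path).write_text("\n".join(ids) + "\n", encoding="utf-8")


def read_ids(path):
    """Return the non-empty, stripped lines of a one-ID-per-line file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Placeholder identifiers, for illustration only (not real MGnify IDs).
example_ids = ["ID_0001", "ID_0002", "ID_0003"]

with tempfile.TemporaryDirectory() as tmp:
    id_path = Path(tmp) / "input.txt"
    write_id_file(id_path, example_ids)
    loaded_ids = read_ids(id_path)

print(loaded_ids)
```

A file written this way has the one-ID-per-line layout that the id-file command expects.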

Processing MGnify similarity search results

The MGnify Protein website supports HMM-based protein similarity search, and the results can be downloaded as a JSON file. Metacontextify can retrieve environmental properties directly for this JSON file with the following command:

metacontextify simsearch input.json results.csv

An overview of additional optional parameters can be obtained by running

metacontextify simsearch --help

Processing a collection of locations and dates

To keep the tool broadly applicable, Metacontextify can also retrieve environmental properties for a collection of latitudes, longitudes, sample dates and depths. The tool can then be executed as follows:

metacontextify location-file input.csv output.csv

The input CSV must contain at least the columns lat, lon, sample_date and depth (column order does not matter). Additional columns in the input are copied to the output, e.g. to keep an identifier with each entry for subsequent processing steps. Additional optional parameters can be listed by running

metacontextify location-file --help
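
Before running the command, it can help to check that a CSV actually carries the four required columns. The following is a small standard-library sketch of such a check. The helper missing_columns, the sample_id column and the sample values are hypothetical, and the ISO date format shown is an assumption, since the accepted date formats are not specified here.

```python
import csv
import io

REQUIRED_COLUMNS = {"lat", "lon", "sample_date", "depth"}


def missing_columns(csv_text):
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])


# Column order is free, and the extra "sample_id" column is allowed.
# Values are made up for illustration; the date format is an assumption.
good = "sample_id,depth,lat,lon,sample_date\nS1,5.0,43.5,7.2,2015-06-01\n"
bad = "sample_id,lat,lon\nS1,43.5,7.2\n"

print(missing_columns(good))         # set()
print(sorted(missing_columns(bad)))  # ['depth', 'sample_date']
```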

Quick Start with Python module

Setting up Copernicus capabilities

Retrieving environmental properties from the Copernicus Marine Service requires credentials. A guide on how to create these for free can be found here. Once you have your credentials, save them for the Python module with the following code:

from metacontextify.data_retrievers.cmems import login

login('user123', 'pswrd123')

Processing IDs

An iterable with IDs can be parsed with Metacontextify. For example, MGnify Protein identifiers can be parsed as follows:

from metacontextify.pipelines import get_properties_for_mgnify_proteins

results_df = get_properties_for_mgnify_proteins(
  protein_ids
)

This is the call for MGnify Protein IDs, where protein_ids is an iterable of MGnify Protein identifiers. The other supported identifier types have their own functions:

  • MGnify Genome: get_properties_for_mgnify_genomes
  • MGnify Assembly: get_properties_for_mgnify_assemblies
  • MGnify Sample: get_properties_for_mgnify_samples
  • ENA Sample: get_properties_for_ena_samples

Processing MGnify similarity search results

The MGnify Protein website supports HMM-based protein similarity search, and the results can be downloaded as a JSON file. Metacontextify can retrieve environmental properties directly for this JSON file with the following code:

from metacontextify.pipelines import get_properties_for_mgnify_search_results

# Read only the first 1000 hits from the downloaded search results
results_df = get_properties_for_mgnify_search_results(
  'path/to/json.json',
  nb_hits=1000
)

The optional argument nb_hits limits processing to the first nb_hits hits in the file. Omitting it retrieves properties for all hits.
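
The effect of nb_hits can be pictured as a simple truncation of the hit list. The sketch below is not the package's code: the helper first_hits and the JSON layout (a top-level "hits" list of objects with an "id" field) are hypothetical stand-ins, since the structure of the real MGnify export is not described here.

```python
import json
from itertools import islice


def first_hits(json_text, nb_hits=None, key="hits"):
    """Return the first nb_hits entries under `key`, or all if nb_hits is None."""
    hits = json.loads(json_text)[key]
    # islice with a stop of None yields every element.
    return list(islice(hits, nb_hits))


# Hypothetical layout and placeholder IDs, for illustration only.
sample = json.dumps({"hits": [{"id": "A"}, {"id": "B"}, {"id": "C"}]})

print(first_hits(sample, nb_hits=2))  # [{'id': 'A'}, {'id': 'B'}]
print(len(first_hits(sample)))        # 3
```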

Processing a collection of locations and dates

To keep the tool broadly applicable, Metacontextify can also retrieve environmental properties for a collection of latitudes, longitudes, sample dates and depths. This can be done with the following code:

from metacontextify.data_retrievers.cmems import get_properties

results_df = get_properties(
  input_df
)

The input dataframe should have at least the columns lat, lon, sample_date, depth (order is not important). Additional columns in the input will be copied to the output (e.g. to keep the identifier together with each entry for subsequent processing steps).

Module Overview

pipelines.py

High-level functions for complete processing workflows:

  • get_properties_for_mgnify_search_results(): Process MGnify similarity search JSON files
  • get_properties_for_mgnify_proteins(): Map MGnify protein IDs to environmental data
  • get_properties_for_mgnify_genomes(): Map MGnify genome IDs to environmental data
  • get_properties_for_mgnify_assemblies(): Map MGnify assembly IDs to environmental data
  • get_properties_for_mgnify_samples(): Map MGnify sample IDs to environmental data
  • get_properties_for_ena_samples(): Map ENA sample IDs to environmental data
  • get_properties_for_id_file(): Process text files with IDs
  • get_properties_for_locations_file(): Process CSV files with lat/lon/date/depth

utils.parsers

Input file parsing and data transformation:

  • read_mgnify_similarity_search_json(): Parse MGnify similarity search JSON results
  • read_id_file(): Read ID lists from text files
  • parse_dates(): Parse and standardize date strings
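
Date standardization of this kind is often done by trying a list of known formats in order. The following is a minimal sketch of that idea, not the implementation of parse_dates(): the helper standardize_date and the list of accepted formats are assumptions made for illustration.

```python
from datetime import datetime

# Candidate input formats, tried in order (an assumed, non-exhaustive list).
KNOWN_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%d/%m/%Y", "%Y%m%d")


def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(standardize_date("2015-06-01"))           # 2015-06-01
print(standardize_date("01/06/2015"))           # 2015-06-01
print(standardize_date("2015-06-01T12:30:00"))  # 2015-06-01
print(standardize_date("June 2015"))            # None
```

Ambiguous inputs such as 01/06/2015 are resolved by the order of the format list, so that order is a design decision rather than a detail.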

data_retrievers.mgnify

MGnify API interactions:

  • protein_to_assembly_from_file(): Map proteins to assemblies using local file
  • protein_to_assembly_from_website(): Map proteins to assemblies via MGnify website
  • assembly_to_sample(): Map assembly IDs to sample IDs
  • genome_to_sample(): Map genome IDs to sample IDs
  • get_mgnify_sample_metadata(): Retrieve sample metadata from MGnify API

data_retrievers.ena

ENA API interactions:

  • get_ena_sample_metadata(): Retrieve sample metadata from ENA API

data_retrievers.cmems

CMEMS (Copernicus Marine Service) API interactions:

  • login(): Authenticate with CMEMS and save credentials
  • get_properties(): Retrieve all environmental properties for locations/dates
  • get_phys(): Retrieve physical properties (temperature, salinity)
  • get_chem(): Retrieve biochemical properties (pH, nitrate, oxygen, phosphate, phytoplankton)

utils.http

HTTP utilities with retry logic:

  • http_get(): Basic HTTP GET wrapper with error handling
  • retry_request(): HTTP requests with exponential backoff and retry logic
  • validate_json(): Validate and parse JSON responses
  • handle_http_error(): Centralized HTTP error logging and handling
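
Exponential backoff, as mentioned for retry_request(), means waiting twice as long after each failed attempt. The sketch below illustrates the pattern in plain Python and is not the package's implementation: the helper retry_with_backoff, its parameters and the delays are assumptions, and a real HTTP wrapper would catch narrower exception types.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...


# Demo with a function that fails twice, then succeeds.
calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the waits instead of actually sleeping.
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

Passing the sleep function as a parameter keeps the demo instant and makes the waiting schedule easy to inspect.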

utils.logging

Logging configuration:

  • configure_logging(): Set up logging configuration with custom levels and formats
  • get_logger(): Get a configured logger instance for a module
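
A logging setup along these lines can be built on the standard logging module. This is a sketch under stated assumptions, not the package's code: the helpers setup_logging and module_logger are hypothetical names chosen to avoid confusion with the functions listed above, and the message format is an arbitrary example.

```python
import logging


def setup_logging(level=logging.INFO,
                  fmt="%(asctime)s %(name)s %(levelname)s: %(message)s"):
    """Configure the root logger with a level and message format."""
    # force=True replaces any handlers installed by an earlier call.
    logging.basicConfig(level=level, format=fmt, force=True)


def module_logger(name):
    """Return a named logger that inherits the root configuration."""
    return logging.getLogger(name)


setup_logging(level=logging.DEBUG)
logger = module_logger("metacontextify.example")
logger.debug("logging is configured")
```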

Development

Running Tests

pytest

Code Quality

# Format code
black .

# Sort imports
isort .

# Lint code
flake8 .

# Type checking
mypy metacontextify

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Citation

If you use Metacontextify in your research, please cite:

@software{metacontextify2026,
  title={Metacontextify: Automated environmental annotation of marine metagenomic samples},
  author={Maarten Langen and Vera van Noort},
  year={2026},
  url={https://github.com/MaartenLangen/metacontextify.git}
}
