Automated environmental annotation of marine metagenomic samples
Metacontextify
A Python package for retrieving environmental context and properties for marine sequences from MGnify and ENA.
Overview
Metacontextify provides a comprehensive pipeline for enriching sequence data with environmental metadata from multiple sources:
- MGnify: Marine metagenomics and genomics sequences and assemblies
- ENA: European Nucleotide Archive sample metadata
Features
Retrieve environmental properties for:
- ENA Sample IDs, MGnify Protein, Genome, Assembly, and Sample IDs
- a JSON file with the hits of a MGnify protein similarity search
- a CSV file with the columns lat, lon, sample_date, depth
These environmental properties are temperature, salinity, pH and concentrations of nitrate, oxygen, phosphate and phytoplankton.
The tool provides a command-line interface and can also be imported as a Python module.
Installation
From source
git clone https://github.com/MaartenLangen/metacontextify.git
cd metacontextify
pip install -e .
With development dependencies
pip install -e ".[dev]"
Quick Start with CLI
Setting up Copernicus capabilities
Retrieving environmental properties from the Copernicus Marine Service requires credentials. A guide on how to create them for free can be found here. Once you have your credentials, you can save them for the CLI with the following command:
metacontextify login user123 pswrd123
Processing IDs
A text file with one ID per line can be processed with the following command:
metacontextify id-file input.txt protein output.csv
This is the command for MGnify Protein IDs. A list of other supported identifiers and other optional parameters can be listed with
metacontextify id-file --help
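The input format described above (one ID per line) can be produced from Python with a few lines of standard-library code. The sketch below is not part of Metacontextify: the helper name write_id_file is hypothetical, and the identifiers in the usage note are placeholders rather than real accessions.

```python
from pathlib import Path
from typing import Iterable


def write_id_file(ids: Iterable[str], path: str) -> int:
    """Write one identifier per line and return how many were written.

    Surrounding whitespace is stripped; blank entries and duplicates are
    skipped so that each ID appears once, in its original order.
    """
    seen = set()
    cleaned = []
    for raw in ids:
        identifier = raw.strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            cleaned.append(identifier)
    # One ID per line, each line terminated by a newline.
    Path(path).write_text("".join(f"{i}\n" for i in cleaned), encoding="utf-8")
    return len(cleaned)
```

For example, write_id_file(["ID_A", "ID_B"], "input.txt") would create a file that can then be passed to the id-file command shown above.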
Processing MGnify similarity search results
The MGnify Protein website supports HMM-based protein similarity search, and the results can be downloaded as a JSON file. Metacontextify can retrieve environmental properties directly from this JSON file with the following command:
metacontextify simsearch input.json results.csv
An overview of additional optional parameters can be obtained by running
metacontextify simsearch --help
Processing a collection of locations and dates
To keep the tool broadly applicable, Metacontextify can also retrieve environmental properties for an arbitrary collection of latitudes, longitudes, sample dates and depths. The tool can then be executed as follows:
metacontextify location-file input.csv output.csv
The input csv should have at least the columns lat, lon, sample_date, depth (order is not important). Additional columns in the input will be copied to the output (e.g. to keep the identifier together with each entry for subsequent processing steps). Additional optional parameters can be listed by running
metacontextify location-file --help
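As an illustration of the input layout described above, the following sketch writes a CSV with the four required columns first and any additional columns after them. It uses only the standard library. The function name write_location_file is hypothetical, and the ISO 8601 date format in the usage note is an assumption, since the accepted date formats are not stated here.

```python
import csv
from typing import Dict, List, Sequence

# Column names required by the location-file command, as listed above.
REQUIRED_COLUMNS = ("lat", "lon", "sample_date", "depth")


def write_location_file(rows: Sequence[Dict[str, object]], path: str) -> List[str]:
    """Write rows to a CSV and return the header that was written."""
    extra: List[str] = []
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row is missing required columns: {missing}")
        # Collect additional columns in order of first appearance.
        for key in row:
            if key not in REQUIRED_COLUMNS and key not in extra:
                extra.append(key)
    fieldnames = list(REQUIRED_COLUMNS) + extra
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # restval fills in an empty cell when a row lacks an extra column.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

A call such as write_location_file([{"lat": 51.2, "lon": 2.9, "sample_date": "2019-07-04", "depth": 5, "station": "A1"}], "input.csv") keeps the hypothetical station column next to each entry, so it can be carried through to the output.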
Quick Start with Python module
Setting up Copernicus capabilities
Retrieving environmental properties from the Copernicus Marine Service requires credentials. A guide on how to create them for free can be found here. Once you have your credentials, you can save them for the Python module with the following code:
from metacontextify.data_retrievers.cmems import login
login('user123', 'pswrd123')
Processing IDs
An iterable with IDs can be parsed with Metacontextify. For example, MGnify Protein identifiers can be parsed as follows:
from metacontextify.pipelines import get_properties_for_mgnify_proteins
results_df = get_properties_for_mgnify_proteins(
    protein_ids
)
This is the function for MGnify Protein IDs, where protein_ids is an iterable of MGnify Protein identifiers. The other supported identifier types have their own functions:
- MGnify Genome: get_properties_for_mgnify_genomes
- MGnify Assembly: get_properties_for_mgnify_assemblies
- MGnify Sample: get_properties_for_mgnify_samples
- ENA Sample: get_properties_for_ena_samples
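When the identifier type is only known at run time, the functions listed above can be selected through a small lookup table. This is a sketch under assumptions: apart from "protein", which appears in the CLI example, the dictionary keys are hypothetical labels chosen for illustration, and resolve_pipeline is not part of the package. The sketch also assumes that each function takes an iterable of IDs like the protein variant does.

```python
import importlib
from typing import Callable

# Keys are illustrative labels (only "protein" appears in the CLI example);
# values are the function names listed in this README.
PIPELINE_FUNCTIONS = {
    "protein": "get_properties_for_mgnify_proteins",
    "genome": "get_properties_for_mgnify_genomes",
    "assembly": "get_properties_for_mgnify_assemblies",
    "mgnify_sample": "get_properties_for_mgnify_samples",
    "ena_sample": "get_properties_for_ena_samples",
}


def resolve_pipeline(id_type: str) -> Callable:
    """Return the pipeline function registered for the given ID type."""
    try:
        function_name = PIPELINE_FUNCTIONS[id_type]
    except KeyError:
        known = ", ".join(sorted(PIPELINE_FUNCTIONS))
        raise ValueError(f"unknown ID type {id_type!r}; expected one of: {known}") from None
    # Imported lazily so the table can be inspected without the package installed.
    module = importlib.import_module("metacontextify.pipelines")
    return getattr(module, function_name)
```

With Metacontextify installed, resolve_pipeline("genome")(genome_ids) would then call get_properties_for_mgnify_genomes on a hypothetical iterable genome_ids.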
Processing MGnify similarity search results
The MGnify Protein website supports HMM-based protein similarity search, and the results can be downloaded as a JSON file. Metacontextify can retrieve environmental properties directly from this JSON file with the following code:
from metacontextify.pipelines import get_properties_for_mgnify_search_results
results_df = get_properties_for_mgnify_search_results(
    'path/to/json.json',
    nb_hits=1000
)
With the optional argument nb_hits, only the first nb_hits hits are read (1000 in this example). Omitting this argument retrieves properties for all hits.
Processing a collection of locations and dates
To keep the tool broadly applicable, Metacontextify can also retrieve environmental properties for an arbitrary collection of latitudes, longitudes, sample dates and depths. This can be done with the following code:
from metacontextify.data_retrievers.cmems import get_properties
results_df = get_properties(
    input_df
)
The input dataframe should have at least the columns lat, lon, sample_date, depth (order is not important). Additional columns in the input will be copied to the output (e.g. to keep the identifier together with each entry for subsequent processing steps).
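Before calling get_properties, it can be useful to confirm that the required columns are present. The helper below is a hypothetical pre-flight check, not part of Metacontextify. It accepts any iterable of column names, so it works with the columns attribute of a dataframe.

```python
from typing import Iterable, List

# Column names required by get_properties, as listed above.
REQUIRED_COLUMNS = ("lat", "lon", "sample_date", "depth")


def missing_location_columns(columns: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in canonical order."""
    present = {str(name) for name in columns}
    return [name for name in REQUIRED_COLUMNS if name not in present]
```

A call such as missing_location_columns(input_df.columns) returns an empty list when the dataframe is ready to use. Column order and additional columns do not affect the result.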
Module Overview
pipelines.py
High-level functions for complete processing workflows:
- get_properties_for_mgnify_search_results(): Process MGnify similarity search JSON files
- get_properties_for_mgnify_proteins(): Map MGnify protein IDs to environmental data
- get_properties_for_mgnify_genomes(): Map MGnify genome IDs to environmental data
- get_properties_for_mgnify_assemblies(): Map MGnify assembly IDs to environmental data
- get_properties_for_mgnify_samples(): Map MGnify sample IDs to environmental data
- get_properties_for_ena_samples(): Map ENA sample IDs to environmental data
- get_properties_for_id_file(): Process text files with IDs
- get_properties_for_locations_file(): Process CSV files with lat/lon/date/depth
utils.parsers
Input file parsing and data transformation:
- read_mgnify_similarity_search_json(): Parse MGnify similarity search JSON results
- read_id_file(): Read ID lists from text files
- parse_dates(): Parse and standardize date strings
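To give an idea of what standardizing date strings can involve, here is a minimal standard-library sketch. It is not the package's parse_dates implementation: the function name standardize_date, the list of accepted formats, and the day-first reading of slash-separated dates are all assumptions made for illustration.

```python
from datetime import datetime

# Candidate input formats, tried in order. This list is an assumption;
# note that slash-separated dates are read day-first.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%M:%S",
    "%Y%m%d",
    "%d/%m/%Y",
)


def standardize_date(text: str) -> str:
    """Return the date part of text as an ISO 8601 string (YYYY-MM-DD)."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next candidate format.
    raise ValueError(f"unrecognized date string: {text!r}")
```

Inputs that match none of the candidate formats raise a ValueError instead of being guessed at, which keeps ambiguous sample dates from silently entering later steps.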
data_retrievers.mgnify
MGnify API interactions:
- protein_to_assembly_from_file(): Map proteins to assemblies using local file
- protein_to_assembly_from_website(): Map proteins to assemblies via MGnify website
- assembly_to_sample(): Map assembly IDs to sample IDs
- genome_to_sample(): Map genome IDs to sample IDs
- get_mgnify_sample_metadata(): Retrieve sample metadata from MGnify API
data_retrievers.ena
ENA API interactions:
get_ena_sample_metadata(): Retrieve sample metadata from ENA API
data_retrievers.cmems
CMEMS (Copernicus Marine Service) API interactions:
- login(): Authenticate with CMEMS and save credentials
- get_properties(): Retrieve all environmental properties for locations/dates
- get_phys(): Retrieve physical properties (temperature, salinity)
- get_chem(): Retrieve biochemical properties (pH, nitrate, oxygen, phosphate, phytoplankton)
utils.http
HTTP utilities with retry logic:
- http_get(): Basic HTTP GET wrapper with error handling
- retry_request(): HTTP requests with exponential backoff and retry logic
- validate_json(): Validate and parse JSON responses
- handle_http_error(): Centralized HTTP error logging and handling
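For readers unfamiliar with exponential backoff, the general technique can be sketched as follows. This is a generic illustration and not the package's retry_request: the function name retry_with_backoff, its parameters and its default values are all assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    factor: float = 2.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, waiting longer after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            sleep(delay)
            delay *= factor  # Exponential growth: 1 s, 2 s, 4 s, ...
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so the waiting behaviour can be checked without real delays. In an HTTP setting, func would wrap the request and retry_on would typically be narrowed to connection and timeout errors.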
utils.logging
Logging configuration:
- configure_logging(): Set up logging configuration with custom levels and formats
- get_logger(): Get a configured logger instance for a module
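A comparable setup can be built with the standard logging module alone. The sketch below shows the general pattern only. The names setup_logging and module_logger and the message format are hypothetical, and the signatures of the package's own helpers may differ.

```python
import logging


def setup_logging(level: str = "INFO") -> None:
    """Configure the root logger with a level name and a fixed format."""
    numeric_level = getattr(logging, level.upper(), None)
    if not isinstance(numeric_level, int):
        raise ValueError(f"unknown log level: {level!r}")
    logging.basicConfig(
        level=numeric_level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # Replace any handlers installed earlier (Python 3.8+).
    )


def module_logger(name: str) -> logging.Logger:
    """Return a named logger that inherits the root configuration."""
    return logging.getLogger(name)
```

Calling setup_logging("DEBUG") once at start-up and module_logger(__name__) in each module gives log lines tagged with the module that emitted them.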
Development
Running Tests
pytest
Code Quality
# Format code
black .
# Sort imports
isort .
# Lint code
flake8 .
# Type checking
mypy metacontextify
License
MIT License - see LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Citation
If you use Metacontextify in your research, please cite:
@software{metacontextify2026,
title={Metacontextify: Automated environmental annotation of marine metagenomic samples},
author={Maarten Langen and Vera van Noort},
year={2026},
url={https://github.com/MaartenLangen/metacontextify.git}
}