Biosample Enricher

Infer AI-friendly environmental and geographic metadata about biosamples from multiple sources.

Overview

Biosample Enricher provides 8 specialized services that enrich biosample metadata with environmental and geographic information from authoritative data sources. Each service covers one domain (elevation, soil, weather, marine, land cover, forward geocoding, reverse geocoding, or geographic features) and returns structured, type-safe data ready for analysis or AI applications.

Features

  • 8 Specialized Services: Elevation, soil, weather, marine, land cover, forward/reverse geocoding, geographic features
  • Service-Based Architecture: Independent services with focused responsibilities
  • Type Safety: Full type hints with Pydantic validation and mypy checking
  • Smart Caching: HTTP caching with coordinate canonicalization for efficiency
  • Multiple Providers: Automatic fallback between data providers (USGS, Google, OSM, etc.)
  • Click-Based CLIs: User-friendly command-line tools for each service
  • Flexible Installation: Core services only, or add optional mongodb/metrics/schema extras
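
To illustrate the coordinate canonicalization mentioned under Smart Caching, here is a minimal sketch of how nearby coordinates could be mapped to one cache key. The function names and the 4-decimal precision are assumptions for illustration only; the package's actual canonicalization rules are not described in this README.

```python
from decimal import ROUND_HALF_EVEN, Decimal


def canonicalize_coordinate(value: float, places: int = 4) -> str:
    """Round a coordinate to a fixed precision and return a stable string.

    Hypothetical helper: the precision used by the package may differ.
    """
    quantum = Decimal(f"1e-{places}")  # e.g. Decimal('0.0001') for places=4
    rounded = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
    if rounded == 0:
        rounded = abs(rounded)  # fold -0.0000 and 0.0000 into one key
    return format(rounded, "f")


def cache_key(lat: float, lon: float, places: int = 4) -> str:
    """Build a cache key so that requests for almost-identical points collide."""
    return f"{canonicalize_coordinate(lat, places)},{canonicalize_coordinate(lon, places)}"


print(cache_key(40.71280001, -74.00600002))
print(cache_key(40.7128, -74.006))
```

Both calls produce the same key, so a second request for an almost identical point could be served from the cache instead of hitting the provider again.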

Installation

Add to Your Project (Recommended)

# Basic installation - all 8 enrichment services
uv add biosample-enricher

# With optional dependencies
uv add biosample-enricher --extra metrics   # Metrics and visualization
uv add biosample-enricher --extra mongodb   # MongoDB support for NMDC/GOLD
uv add biosample-enricher --extra schema    # Schema analysis tools
uv add biosample-enricher --extra all       # All optional features

From Source (Development)

# Clone and install
git clone https://github.com/contextualizer-ai/biosample-enricher.git
cd biosample-enricher
uv sync

# With optional extras
uv sync --extra mongodb    # MongoDB support
uv sync --extra metrics    # Metrics and visualization
uv sync --extra schema     # Schema analysis tools
uv sync --extra all        # Everything

Quick Start

Python API

The package exports 8 services from the top level:

from datetime import date

from biosample_enricher import (
    ElevationService,
    ElevationRequest,
    SoilService,
    WeatherService,
    MarineService,
    LandService,
    ReverseGeocodingService,
    ForwardGeocodingService,
    OSMFeaturesService,
)

# Get elevation for a location
elevation_service = ElevationService()
request = ElevationRequest(latitude=40.7128, longitude=-74.0060)
observations = elevation_service.get_elevation(request)

for obs in observations:
    if obs.value_numeric is not None:
        print(f"{obs.provider.name}: {obs.value_numeric}m")
# Output:
# usgs_3dep: 13.15m
# google_elevation: 13.26m
# open_topo_data: 25.0m
# osm_elevation: 51.0m

# Get weather data for a location and date
weather_service = WeatherService()
weather_result = weather_service.get_daily_weather(
    lat=37.7749,
    lon=-122.4194,
    target_date=date(2024, 1, 15)
)
print(f"Temperature: {weather_result.temperature.value}°C")
print(f"Precipitation: {weather_result.precipitation.value}mm")

# Get soil properties
soil_service = SoilService()
soil_result = soil_service.enrich_location(
    latitude=40.7128,
    longitude=-74.0060,
    depth_cm="0-5cm"
)
print(f"Provider: {soil_result.provider}")
print(f"Quality score: {soil_result.quality_score}")

# Get marine data (SST, bathymetry, chlorophyll)
marine_service = MarineService()
marine_result = marine_service.get_comprehensive_marine_data(
    latitude=36.6,
    longitude=-121.9,
    target_date=date(2024, 1, 15)
)
if marine_result.sea_surface_temperature:
    print(f"Sea surface temp: {marine_result.sea_surface_temperature.value}°C")
if marine_result.bathymetry:
    print(f"Water depth: {marine_result.bathymetry.value}m")

# Reverse geocoding (coordinates -> place names)
geocoding_service = ReverseGeocodingService()
result = geocoding_service.reverse_geocode(lat=40.7128, lon=-74.0060)
if result:
    print(f"Location: {result.get_formatted_address()}")

# Get nearby geographic features
osm_service = OSMFeaturesService()
features = osm_service.get_features_for_location(
    latitude=37.7749,
    longitude=-122.4194,
    radius_m=1000
)
if features and features.named_features:
    for feature in features.named_features[:5]:
        print(f"{feature.name} ({feature.category}): {feature.distance_km:.2f}km")

CLI Usage

Each service has its own CLI command:

# Elevation lookup
uv run elevation-lookup lookup --lat 40.7128 --lon -74.0060

# Soil data
uv run soil-enricher lookup --lat 40.7128 --lon -74.0060 --depth 10

# Weather data
uv run weather-enricher lookup --lat 37.7749 --lon -122.4194 --date 2024-01-15

# Marine data
uv run marine-enricher lookup --lat 36.6 --lon -121.9 --date 2024-01-15

# Land cover
uv run land-enricher lookup --lat 40.7128 --lon -74.0060

# Batch processing from CSV
uv run elevation-lookup batch --input samples.csv --lat-col latitude --lon-col longitude

# Version info
uv run biosample-version
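
For batch work from Python rather than the CLI, the input CSV can be screened before any service is called. The sketch below only reads and validates coordinates; the helper name `read_coordinates` is hypothetical, and the default column names simply mirror the `--lat-col` and `--lon-col` values in the batch example above.

```python
import csv
import io
from typing import Iterator, TextIO


def read_coordinates(
    handle: TextIO, lat_col: str = "latitude", lon_col: str = "longitude"
) -> Iterator[tuple[float, float]]:
    """Yield (lat, lon) pairs, skipping rows that are missing or out of range."""
    for row in csv.DictReader(handle):
        try:
            lat = float(row[lat_col])
            lon = float(row[lon_col])
        except (KeyError, TypeError, ValueError):
            continue  # missing column, short row, or non-numeric value
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            yield lat, lon


# In-memory stand-in for samples.csv
sample = io.StringIO(
    "sample_id,latitude,longitude\n"
    "A,40.7128,-74.0060\n"
    "B,,\n"
    "C,95.0,10.0\n"
    "D,36.6,-121.9\n"
)
coords = list(read_coordinates(sample))
print(coords)
```

Rows B (empty values) and C (latitude out of range) are dropped, so only valid points would be passed on to a service such as `ElevationService`.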

Services

1. Elevation Service

Get elevation data from multiple providers (USGS, Google, Open Topo Data).

Providers: USGS (US only, free), Google (global, requires API key), Open Topo Data (global, free)

Python:

from biosample_enricher import ElevationService, ElevationRequest

service = ElevationService()
request = ElevationRequest(latitude=40.7128, longitude=-74.0060)
observations = service.get_elevation(request)

CLI:

uv run elevation-lookup lookup --lat 40.7128 --lon -74.0060
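
Because several providers can return different values for the same point (see the Quick Start output), callers may want to reduce the observations to one value. The following is a sketch under assumptions: the `Provider` and `Observation` classes are stand-ins that only mimic the `provider.name` and `value_numeric` attributes used above, and the preference order is an illustrative choice, not the package's behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Provider:  # stand-in for the real provider object
    name: str


@dataclass
class Observation:  # stand-in exposing the attributes used in Quick Start
    provider: Provider
    value_numeric: Optional[float]


# Assumed preference order; adjust to your own trust in each source.
PREFERENCE = ("usgs_3dep", "google_elevation", "open_topo_data", "osm_elevation")


def pick_preferred(
    observations: Sequence[Observation], preference: Sequence[str] = PREFERENCE
) -> Optional[Observation]:
    """Return the usable observation from the most preferred provider."""
    usable = {o.provider.name: o for o in observations if o.value_numeric is not None}
    for name in preference:
        if name in usable:
            return usable[name]
    return None


observations = [
    Observation(Provider("osm_elevation"), 51.0),
    Observation(Provider("usgs_3dep"), None),  # provider returned no value
    Observation(Provider("google_elevation"), 13.26),
]
best = pick_preferred(observations)
print(best.provider.name, best.value_numeric)
```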

2. Soil Service

Get soil properties (texture, pH, organic carbon, etc.).

Providers: SoilGrids (global coverage), USDA NRCS (US only)

Python:

from biosample_enricher import SoilService

service = SoilService()
soil_result = service.enrich_location(
    latitude=40.7128,
    longitude=-74.0060,
    depth_cm="0-5cm"
)

CLI:

uv run soil-enricher lookup --lat 40.7128 --lon -74.0060 --depth 10

3. Weather Service

Get historical weather data (temperature, precipitation, humidity, etc.).

Providers: Open-Meteo (free, global), Meteostat (free, global)

Python:

from biosample_enricher import WeatherService
from datetime import date

service = WeatherService()
weather_result = service.get_daily_weather(
    lat=37.7749,
    lon=-122.4194,
    target_date=date(2024, 1, 15)
)

CLI:

uv run weather-enricher lookup --lat 37.7749 --lon -122.4194 --date 2024-01-15

4. Marine Service

Get marine data (sea surface temperature, bathymetry, chlorophyll).

Providers: NOAA OISST (SST), GEBCO (bathymetry), ESA CCI (chlorophyll)

Python:

from biosample_enricher import MarineService
from datetime import date

service = MarineService()
marine_result = service.get_comprehensive_marine_data(
    latitude=36.6,
    longitude=-121.9,
    target_date=date(2024, 1, 15)
)

CLI:

uv run marine-enricher lookup --lat 36.6 --lon -121.9 --date 2024-01-15

5. Land Service

Get land cover classification.

Providers: ESA WorldCover, MODIS, NLCD (US only)

Python:

from biosample_enricher import LandService

service = LandService()
land_result = service.enrich_location(
    latitude=40.7128,
    longitude=-74.0060
)

CLI:

uv run land-enricher lookup --lat 40.7128 --lon -74.0060

6. Reverse Geocoding Service

Convert coordinates to human-readable addresses.

Providers: OSM Nominatim (free), Google Geocoding (requires API key)

Python:

from biosample_enricher import ReverseGeocodingService

service = ReverseGeocodingService()
result = service.reverse_geocode(lat=40.7128, lon=-74.0060)
if result:
    print(result.get_formatted_address())
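
The automatic fallback between providers listed under Features can be pictured as trying each provider in order until one succeeds. This is a generic sketch, not the package's implementation: `first_successful` and the two toy providers are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional, Sequence

# A lookup takes (lat, lon) and returns an address, or None if nothing was found.
Lookup = Callable[[float, float], Optional[str]]


def first_successful(
    providers: Sequence[tuple[str, Lookup]], lat: float, lon: float
) -> Optional[tuple[str, str]]:
    """Try each provider in order; return (provider name, result) for the first hit."""
    for name, lookup in providers:
        try:
            result = lookup(lat, lon)
        except Exception:
            continue  # treat any provider error as "try the next one"
        if result is not None:
            return name, result
    return None


def failing_provider(lat: float, lon: float) -> Optional[str]:
    raise TimeoutError("simulated outage")


def working_provider(lat: float, lon: float) -> Optional[str]:
    return f"address near {lat:.4f}, {lon:.4f}"


chosen = first_successful(
    [("primary", failing_provider), ("secondary", working_provider)],
    40.7128,
    -74.0060,
)
print(chosen)
```

Catching every exception keeps the sketch short; real code would likely log the failure and distinguish rate limits from permanent errors.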

7. Forward Geocoding Service

Convert addresses/place names to coordinates.

Providers: OSM Nominatim (free), Google Geocoding (requires API key)

Python:

from biosample_enricher import ForwardGeocodingService

service = ForwardGeocodingService()
result = service.geocode("New York City")
if result and result.locations:
    for location in result.locations[:3]:
        print(f"{location.formatted_address}: {location.latitude}, {location.longitude}")

8. OSM Features Service

Get nearby geographic features (parks, water bodies, landmarks).

Providers: OpenStreetMap Overpass API (free), Google Places (requires API key)

Python:

from biosample_enricher import OSMFeaturesService

service = OSMFeaturesService()
features = service.get_features_for_location(
    latitude=37.7749,
    longitude=-122.4194,
    radius_m=1000
)
if features and features.named_features:
    for feature in features.named_features[:5]:
        print(f"{feature.name} ({feature.category})")
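
To make the `radius_m` and `distance_km` values concrete, here is a sketch of a great-circle (haversine) distance check. How the service itself measures distance is not stated in this README, so treat the function names and the mean Earth radius constant as assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(lat1: float, lon1: float, lat2: float, lon2: float, radius_m: float) -> bool:
    """True if the second point lies within radius_m metres of the first."""
    return haversine_km(lat1, lon1, lat2, lon2) * 1000.0 <= radius_m


one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)
print(f"{one_degree:.1f} km per degree of latitude")
print(within_radius(37.7749, -122.4194, 37.7759, -122.4194, radius_m=1000))
```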

API Keys

API keys are needed only for the Google providers, which are optional; free alternatives are available for every service that can use Google:

# Single API key for all Google services
export GOOGLE_MAIN_API_KEY="your-key-here"

All other services are free and require no authentication.
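
A script can check up front whether the Google providers are usable. The sketch below only inspects the `GOOGLE_MAIN_API_KEY` variable named above; the helper name is hypothetical, and how the package itself reads the key is not covered here.

```python
import os
from typing import Mapping


def google_providers_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when a non-empty GOOGLE_MAIN_API_KEY is present."""
    return bool(env.get("GOOGLE_MAIN_API_KEY", "").strip())


if google_providers_enabled():
    print("Google providers can be used.")
else:
    print("No Google API key set; only the free providers are available.")
```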

Development

Setup

# Clone repository
git clone https://github.com/contextualizer-ai/biosample-enricher.git
cd biosample-enricher

# Complete development setup
make dev-setup

Testing

# Run fast tests (excludes network/slow tests)
make test-fast

# Run all tests with coverage
make test-cov

# Run specific test categories
make test-unit          # Unit tests only
make test-integration   # Integration tests

Code Quality

# Format, lint, type-check, test
make dev-check

# Full CI validation
make check-ci

# Individual checks
make format       # Format with ruff
make lint         # Lint with ruff
make type-check   # Type check with mypy
make dep-check    # Check dependencies with deptry

Project Structure

biosample-enricher/
├── biosample_enricher/
│   ├── __init__.py           # Public API exports
│   ├── elevation/            # Elevation service
│   ├── soil/                 # Soil service
│   ├── weather/              # Weather service
│   ├── marine/               # Marine service
│   ├── land/                 # Land cover service
│   ├── reverse_geocoding/    # Reverse geocoding
│   ├── forward_geocoding/    # Forward geocoding
│   ├── osm_features/         # Geographic features
│   ├── models.py             # Core data models
│   ├── http_cache.py         # HTTP caching
│   └── cli*.py               # CLI commands
├── tests/                    # Test suite
├── pyproject.toml           # Project configuration
└── Makefile                 # Development automation

Dependencies

Core Dependencies

  • Always installed: pandas, rasterio, meteostat (required for weather aggregation and global soil coverage)
  • CLI and data validation: click, pydantic, requests, rich, pyyaml

Optional Dependencies

  • mongodb: pymongo for fetching from NMDC/GOLD databases (evaluation/demo only)
  • metrics: matplotlib, seaborn for visualization
  • schema: genson for schema analysis

Install with: uv sync --extra mongodb or uv sync --extra all

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run checks (make dev-check)
  5. Commit (git commit -m 'Add amazing feature')
  6. Push (git push origin feature/amazing-feature)
  7. Open a Pull Request

See CLAUDE.md for detailed development guidelines.

License

MIT License - see LICENSE file for details.
