
A foundation model for soil microbiome understanding

Gaia: Soil Microbiome Foundation Model

Gaia — the Greek goddess of Earth. Decoding the hidden language of soil microbiomes.

"The AlphaFold of Soil Microbiomes, built open-source."

Gaia is a foundation model that understands the "language" of soil microbial communities. Pre-trained on public metagenomic data, it enables soil health diagnosis, yield prediction, and microbial consortium design.


Key Features

  • Pre-trained Foundation Model: Transformer-based model pre-trained on 10,000+ soil microbiome samples from MGnify, NEON, and the Earth Microbiome Project (EMP)
  • Soil Health Diagnosis: Predict soil chemical properties (pH, organic carbon, total nitrogen) from microbial profiles
  • Biome Classification: Identify soil biome types (agricultural, forest, grassland, desert, wetland)
  • Drought Stress Detection: Binary classification of drought stress from microbial signatures
  • Interpretability Tools: Attention-based keystone genera identification
  • Synthetic Data Generation: Generate realistic microbial abundance profiles for target soil conditions

Quick Start

Installation

pip install gaia-soil

Or install from source:

git clone https://github.com/Kimchikilla/ProjectGaia.git
cd ProjectGaia
pip install -e ".[dev]"

Basic Usage

from gaia.inference import GaiaPredictor

# Load pre-trained model
predictor = GaiaPredictor.from_pretrained("gaia-v0.1")

# Predict soil properties from microbial profile
result = predictor.diagnose("path/to/abundance_profile.csv")
print(result.soil_health_report)
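
The CSV layout that diagnose() expects is not documented in this README. As a minimal sketch, assuming a two-column file of genus names and relative abundances (the column names and the helper functions below are hypothetical, not part of the gaia package), a profile could be prepared from raw counts like this:

```python
import csv
import tempfile
from pathlib import Path


def to_relative_abundance(counts):
    """Scale raw genus counts so they sum to 1 (total-sum scaling)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("profile contains no reads")
    return {genus: n / total for genus, n in counts.items()}


def write_profile(profile, path):
    """Write genus,relative_abundance rows, most abundant genus first.

    The two-column layout is an assumption made for illustration.
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["genus", "relative_abundance"])
        for genus, value in sorted(profile.items(), key=lambda kv: -kv[1]):
            writer.writerow([genus, f"{value:.6f}"])


# Toy counts for three genera; real profiles would cover many more.
raw_counts = {"Bradyrhizobium": 420, "Streptomyces": 310, "Bacillus": 270}
profile = to_relative_abundance(raw_counts)

out_path = Path(tempfile.gettempdir()) / "abundance_profile.csv"
write_profile(profile, out_path)
```

Check the data standardization guide (docs/data_standard.md) for the actual schema before passing a file to the predictor.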

Project Structure

gaia/
├── README.md
├── LICENSE                    # Apache 2.0
├── CONTRIBUTING.md
├── docs/
│   ├── roadmap.md
│   ├── data_standard.md       # Data standardization guide
│   └── tutorials/
├── data/
│   ├── scripts/               # Data collection & preprocessing scripts
│   ├── configs/               # Data source configurations
│   └── README.md              # Data catalog
├── gaia/
│   ├── preprocessing/         # Preprocessing modules
│   ├── models/                # Model architectures
│   ├── training/              # Training scripts
│   ├── evaluation/            # Evaluation modules
│   └── inference/             # Inference modules
├── benchmarks/                # Benchmark datasets & evaluation criteria
├── notebooks/                 # Tutorial Jupyter notebooks
└── tests/

Data Sources

Source | Description | Samples
MGnify | Taxonomic abundance tables from soil biomes | 5,000-15,000
NEON | Paired microbiome + environmental data | ~2,000
Earth Microbiome Project | Standardized global soil samples | ~5,000
SMAG | 40,039 soil MAGs from 3,304 metagenomes | Reference DB

Benchmarks

Task | Metric | Description
Biome Classification | ROC-AUC, F1 | Classify soil biome type from microbial profile
Soil Chemistry Prediction | R², RMSE | Predict pH, organic C, total N
Tillage Classification | Accuracy, Kappa | Classify tillage practice
Drought Stress Detection | Accuracy, F1 | Detect drought stress (binary)
Abundance Reconstruction | Cosine Similarity | Reconstruct masked microbial profiles
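
To make the regression and reconstruction metrics concrete, here is a small standard-library sketch of cosine similarity, RMSE, and R². The function names and toy values are illustrative assumptions; the project's own evaluation code under gaia/evaluation/ may compute these differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two abundance vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


# Toy example: a true profile and an imperfect reconstruction of it.
true_profile = [0.5, 0.3, 0.2]
reconstructed = [0.45, 0.35, 0.2]
similarity = cosine_similarity(true_profile, reconstructed)

# Toy example: measured vs. predicted soil pH for three samples.
ph_true = [6.0, 7.0, 8.0]
ph_pred = [6.5, 7.0, 7.5]
ph_rmse = rmse(ph_true, ph_pred)
ph_r2 = r_squared(ph_true, ph_pred)
```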

Model Architecture

  • Base: Multi-layer Transformer Decoder
  • Layers: 6-12 (adjustable)
  • Attention Heads: 8-16
  • Embedding Dim: 256-512
  • Vocabulary: ~5,000 soil-associated genera
  • Pre-training: Continual pre-training from MGM weights
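
The hyperparameter ranges above can be captured in a small configuration object. The sketch below is an assumption for illustration only: the class name GaiaConfigSketch, the range checks, and the 4x feed-forward expansion used in the size estimate are not taken from the gaia package. The estimate counts only attention and feed-forward weight matrices plus the token embedding, ignoring biases and layer norms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaiaConfigSketch:
    """Hypothetical config mirroring the ranges listed in this README."""

    n_layers: int = 6
    n_heads: int = 8
    embed_dim: int = 256
    vocab_size: int = 5000  # roughly the number of soil-associated genera

    def __post_init__(self):
        if not 6 <= self.n_layers <= 12:
            raise ValueError("n_layers must be between 6 and 12")
        if not 8 <= self.n_heads <= 16:
            raise ValueError("n_heads must be between 8 and 16")
        if not 256 <= self.embed_dim <= 512:
            raise ValueError("embed_dim must be between 256 and 512")
        if self.embed_dim % self.n_heads != 0:
            raise ValueError("embed_dim must be divisible by n_heads")

    @property
    def head_dim(self):
        """Width of each attention head."""
        return self.embed_dim // self.n_heads

    def approx_parameters(self):
        """Rough weight count for a decoder-only stack.

        Per layer: 4*d^2 for the attention projections (Q, K, V, output)
        plus 8*d^2 for a feed-forward block with an assumed 4x expansion.
        """
        d = self.embed_dim
        per_layer = 12 * d * d
        embedding = self.vocab_size * d
        return self.n_layers * per_layer + embedding


small = GaiaConfigSketch()
large = GaiaConfigSketch(n_layers=12, n_heads=16, embed_dim=512)
print(f"small: ~{small.approx_parameters():,} parameters")
print(f"large: ~{large.approx_parameters():,} parameters")
```

Under these assumptions the listed ranges span roughly 6 million to 40 million parameters, which gives a sense of the compute needed at either end.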

Tech Stack

Area | Tool
Language | Python 3.10+
Deep Learning | PyTorch 2.x
Transformers | Hugging Face Transformers
Data | Pandas, AnnData, Biom-format
Bioinformatics | QIIME2, Kraken2, MetaPhlAn
Visualization | Matplotlib, Seaborn, UMAP
Experiment Tracking | Weights & Biases
Model Hosting | Hugging Face Hub

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Ways to Contribute

  • Code: Bug fixes, new features, pipeline improvements
  • Data: Standardized soil microbiome datasets
  • Science: New benchmark tasks, ecological validation, domain expertise

Community

  • GitHub Discussions: Technical discussions and Q&A
  • Discord: Real-time community chat
  • Monthly Meetings: Online meetings to set project direction (first Thursday of each month)

Citation

@software{gaia2026,
  title={Gaia: A Foundation Model for Soil Microbiome Understanding},
  year={2026},
  url={https://github.com/Kimchikilla/ProjectGaia}
}

License

This project is licensed under the Apache License 2.0 - see LICENSE for details.


This project is under active development. Star this repo to stay updated!

