A foundation model for soil microbiome understanding
Gaia: Soil Microbiome Foundation Model
Gaia — the Greek goddess of Earth. Decoding the hidden language of soil microbiomes.
"The AlphaFold of Soil Microbiomes, built open-source."
Gaia is a foundation model that learns the "language" of soil microbial communities. Pre-trained on public metagenomic data, it is designed to support soil health diagnosis, yield prediction, and microbial consortium design.
Key Features
- Pre-trained Foundation Model: Transformer-based model pre-trained on 10,000+ soil microbiome samples from MGnify, NEON, and the Earth Microbiome Project (EMP)
- Soil Health Diagnosis: Predict soil chemical properties (pH, organic carbon, total nitrogen) from microbial profiles
- Biome Classification: Identify soil biome types (agricultural, forest, grassland, desert, wetland)
- Drought Stress Detection: Binary classification of drought stress from microbial signatures
- Interpretability Tools: Attention-based identification of keystone genera
- Synthetic Data Generation: Generate realistic microbial abundance profiles for target soil conditions
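As a rough illustration of the attention-based interpretability idea listed above, the sketch below scores each genus by the average attention it receives across heads and query positions. This is one common heuristic, not necessarily what Gaia implements; the function names and the nested-list attention layout (`attn[head][query][key]`) are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def attention_received(attn: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Average attention each key position receives.

    attn[h][i][j] is the weight that query position i puts on key
    position j in head h (hypothetical layout, one layer only).
    """
    n_heads = len(attn)
    n_tokens = len(attn[0])
    received = [0.0] * n_tokens
    for head in attn:
        for row in head:
            for j, weight in enumerate(row):
                received[j] += weight
    # Normalise by the number of (head, query) pairs that contributed.
    return [total / (n_heads * n_tokens) for total in received]


def rank_genera(
    genera: Sequence[str],
    attn: Sequence[Sequence[Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k genera sorted by average attention received."""
    scores = attention_received(attn)
    ranked = sorted(zip(genera, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

With a causal decoder, later positions can only attend to earlier ones, so raw scores of this kind favour tokens near the start of the sequence. Any keystone ranking derived this way should be checked against ecological evidence rather than read as a direct importance measure.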
Quick Start
Installation
pip install gaia-soil
Or install from source:
git clone https://github.com/Kimchikilla/ProjectGaia.git
cd ProjectGaia
pip install -e ".[dev]"
Basic Usage
from gaia.inference import GaiaPredictor
# Load pre-trained model
predictor = GaiaPredictor.from_pretrained("gaia-v0.1")
# Predict soil properties from microbial profile
result = predictor.diagnose("path/to/abundance_profile.csv")
print(result.soil_health_report)
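The layout expected for the abundance profile CSV is not documented here. As a minimal sketch, assuming a two-column file of genus names and relative abundances, raw read counts could be normalised and written out as follows. The helper names and the `genus` / `relative_abundance` headers are hypothetical, so check the data standard in docs/data_standard.md before relying on them.

```python
import csv
from typing import Dict


def to_relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw genus-level read counts to fractions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("profile contains no reads")
    # Drop genera with zero reads so the file lists only observed taxa.
    return {genus: n / total for genus, n in counts.items() if n > 0}


def write_profile(counts: Dict[str, int], path: str) -> None:
    """Write a profile CSV, most abundant genus first (assumed layout)."""
    rel = to_relative_abundance(counts)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["genus", "relative_abundance"])  # assumed header names
        for genus, frac in sorted(rel.items(), key=lambda kv: kv[1], reverse=True):
            writer.writerow([genus, f"{frac:.6f}"])
```

Calling `write_profile({"Streptomyces": 60, "Bacillus": 30, "Nitrospira": 10}, "abundance_profile.csv")` would produce a file that could then be passed to `predictor.diagnose`, provided the assumed layout matches what the package expects.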
Project Structure
gaia/
├── README.md
├── LICENSE # Apache 2.0
├── CONTRIBUTING.md
├── docs/
│ ├── roadmap.md
│ ├── data_standard.md # Data standardization guide
│ └── tutorials/
├── data/
│ ├── scripts/ # Data collection & preprocessing scripts
│ ├── configs/ # Data source configurations
│ └── README.md # Data catalog
├── gaia/
│ ├── preprocessing/ # Preprocessing modules
│ ├── models/ # Model architectures
│ ├── training/ # Training scripts
│ ├── evaluation/ # Evaluation modules
│ └── inference/ # Inference modules
├── benchmarks/ # Benchmark datasets & evaluation criteria
├── notebooks/ # Tutorial Jupyter notebooks
└── tests/
Data Sources
| Source | Description | Samples |
|---|---|---|
| MGnify | Taxonomic abundance tables from soil biomes | 5,000-15,000 |
| NEON | Paired microbiome + environmental data | ~2,000 |
| Earth Microbiome Project | Standardized global soil samples | ~5,000 |
| SMAG | 40,039 soil MAGs from 3,304 metagenomes | Reference DB |
Benchmarks
| Task | Metric | Description |
|---|---|---|
| Biome Classification | ROC-AUC, F1 | Classify soil biome type from microbial profile |
| Soil Chemistry Prediction | R², RMSE | Predict pH, organic C, total N |
| Tillage Classification | Accuracy, Kappa | Classify tillage practice |
| Drought Stress Detection | Accuracy, F1 | Detect drought stress (binary) |
| Abundance Reconstruction | Cosine Similarity | Reconstruct masked microbial profiles |
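For readers unfamiliar with the metrics in the table above, here is a dependency-free sketch of four of them: cosine similarity, RMSE, R², and Cohen's kappa. The function names are illustrative and are not part of the gaia package. In practice a library such as scikit-learn would normally be used instead.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Agreement between two labelings, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelings use a single identical class
    return (observed - expected) / (1.0 - expected)
```

Kappa is reported alongside accuracy for tillage classification because it discounts the agreement expected by chance, which matters when one tillage practice dominates the dataset.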
Model Architecture
- Base: Multi-layer Transformer Decoder
- Layers: 6-12 (adjustable)
- Attention Heads: 8-16
- Embedding Dim: 256-512
- Vocabulary: ~5,000 soil-associated genera
- Pre-training: Continual pre-training from MGM weights
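To make the genus-level vocabulary concrete, the sketch below shows one way an abundance profile could be turned into a token sequence for a Transformer decoder: genera are ordered from most to least abundant and mapped to integer IDs. The rank-ordered encoding, the special tokens, and the function names are assumptions for illustration; the actual Gaia tokenizer may differ.

```python
from typing import Dict, Iterable, List

# Hypothetical special tokens; the real vocabulary may use different ones.
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<unk>"]


def build_vocab(genera: Iterable[str]) -> Dict[str, int]:
    """Map special tokens and genus names to consecutive integer IDs."""
    tokens = SPECIAL_TOKENS + sorted(set(genera))
    return {token: idx for idx, token in enumerate(tokens)}


def encode_profile(
    profile: Dict[str, float],
    vocab: Dict[str, int],
    max_len: int = 16,
) -> List[int]:
    """Encode a profile as <bos>, genera by descending abundance, <eos>, padding."""
    present = [(genus, ab) for genus, ab in profile.items() if ab > 0]
    # Sort by abundance (descending); break ties alphabetically for determinism.
    ranked = sorted(present, key=lambda pair: (-pair[1], pair[0]))
    ids = [vocab["<bos>"]]
    for genus, _ in ranked[: max_len - 2]:  # leave room for <bos> and <eos>
        ids.append(vocab.get(genus, vocab["<unk>"]))
    ids.append(vocab["<eos>"])
    ids += [vocab["<pad>"]] * (max_len - len(ids))
    return ids
```

Under this scheme the abundance values themselves are discarded and only their ordering is kept, which makes the encoding robust to sequencing depth at the cost of losing magnitude information.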
Tech Stack
| Area | Tool |
|---|---|
| Language | Python 3.10+ |
| Deep Learning | PyTorch 2.x |
| Transformers | Hugging Face Transformers |
| Data | Pandas, AnnData, Biom-format |
| Bioinformatics | QIIME2, Kraken2, MetaPhlAn |
| Visualization | Matplotlib, Seaborn, UMAP |
| Experiment Tracking | Weights & Biases |
| Model Hosting | Hugging Face Hub |
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Ways to Contribute
- Code: Bug fixes, new features, pipeline improvements
- Data: Standardized soil microbiome datasets
- Science: New benchmark tasks, ecological validation, domain expertise
Community
- GitHub Discussions: Technical discussions and Q&A
- Discord: Real-time community chat
- Monthly Meetings: Online direction-setting meetings (1st Thursday of each month)
Citation
@software{gaia2026,
title={Gaia: A Foundation Model for Soil Microbiome Understanding},
year={2026},
url={https://github.com/Kimchikilla/ProjectGaia}
}
License
This project is licensed under the Apache License 2.0 - see LICENSE for details.
This project is under active development. Star this repo to stay updated!