Models to visualize and forecast crop conditions and yields
geocif
Generate Climatic Impact-Drivers (CIDs) from Earth Observation (EO) data, build ML yield forecasting models, and produce agmet condition monitoring plots.
Climatic Impact-Drivers for Crop Yield Assessment at NASA Harvest
- Free software: MIT license
- Documentation: https://ritviksahajpal.github.io/yield_forecasting/
Setup
Requirements
- Python 3.11+
- uv
Install
cd geocif # project root (where pyproject.toml lives)
uv sync # creates .venv and installs all dependencies
On Windows, uv automatically pulls pre-built geospatial wheels (GDAL, rasterio, fiona, shapely, pyproj, rtree) from the URLs in [tool.uv.sources]. On Linux/macOS, those entries are skipped (platform marker) and packages are installed from PyPI.
To activate the environment:
# Windows
.venv\Scripts\activate
# Linux/macOS
source .venv/bin/activate
Fresh reinstall
rm -rf .venv && uv sync
Config files
| File | Purpose | Used by |
|---|---|---|
| geobase.txt | Paths, shapefile column mappings | both |
| countries.txt | Per-country config (boundary files, admin levels, seasons, crops) | both |
| crops.txt | Crop masks, calendar categories (EWCM, AMIS) | both |
| geoextract.txt | Extraction-only settings (method, threshold, parallelism) | geoprepare |
| geocif.txt | Indices/ML/agmet settings, country overrides, runtime selections | geocif |
Usage
Order matters: Config files are loaded left-to-right. When the same key appears in multiple files, the last file wins. The tool-specific file (geoextract.txt or geocif.txt) must be last so its [DEFAULT] values (countries, method, etc.) override the shared defaults in countries.txt.
config_dir = "/path/to/config" # full path to your config directory
cfg_geoprepare = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geoextract.txt"]
cfg_geocif = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geocif.txt"]
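The precedence rule can be checked with the standard library alone. The sketch below assumes the files are INI-style and read with Python's configparser, which may not be exactly how geocif loads them. The file contents are trimmed-down stand-ins written to a temporary directory, and load_in_order is a hypothetical helper.

```python
import configparser
import tempfile
from pathlib import Path

# Minimal stand-ins for two of the config files (contents are illustrative).
SHARED = "[DEFAULT]\ncrops = ['maize']\nadmin_level = admin_1\n"
TOOL = "[DEFAULT]\ncountries = [\"kenya\"]\nadmin_level = admin_2\n"


def load_in_order(paths):
    """Read INI files left to right; later files override earlier keys."""
    parser = configparser.ConfigParser()
    parser.read(paths)  # configparser applies the files in list order
    return parser


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp) / "countries.txt"
    tool = Path(tmp) / "geocif.txt"
    shared.write_text(SHARED)
    tool.write_text(TOOL)

    tool_last = load_in_order([shared, tool])   # recommended order
    tool_first = load_in_order([tool, shared])  # wrong order, for comparison

print(tool_last["DEFAULT"]["admin_level"])   # value taken from geocif.txt
print(tool_first["DEFAULT"]["admin_level"])  # value taken from countries.txt
```

Keys that appear in only one file (crops, countries) survive in both orders; only the shared key changes with the order.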
geoprepare (download, extract, merge)
from geoprepare import geodownload
geodownload.run([f"{config_dir}/geobase.txt"])
from geoprepare import geoextract
geoextract.run(cfg_geoprepare)
from geoprepare import geomerge
geomerge.run(cfg_geoprepare)
geocif (indices, ML, agmet, analysis, experiments)
from geocif import indices_runner
indices_runner.run(cfg_geocif)
from geocif import geocif_runner
geocif_runner.run(cfg_geocif)
from geocif.agmet import geoagmet
geoagmet.run(cfg_geocif)
from geocif import analysis
analysis.run(cfg_geocif)
from geocif import experiments
experiments.run(cfg_geocif, n_trials=30)
from geocif import yield_outlook
yield_outlook.run(cfg_geocif) # uses config defaults (10 years, mean)
# yield_outlook.run(cfg_geocif, current_year=2026, n_years=10, aggregation="median")
ML models
geocif supports the following model types (configured via models in [DEFAULT]):
| Model | Key | Type |
|---|---|---|
| CatBoost | catboost | Gradient boosting |
| XGBoost | xgboost | Gradient boosting |
| TabPFN | tabpfn | Prior-fitted network |
| TabICL | tabicl | In-context learning |
| NGBoost | ngboost | Natural gradient boosting |
| YDF | ydf | Yggdrasil decision forests |
| Oblique RF | oblique | Oblique random forest |
| Cubist | cubist | Rule-based regression |
| MERF | merf | Mixed effects random forest |
| Linear | linear | LassoCV / LogisticRegressionCV |
| GAM | gam | Generalized additive model |
| GeoSpaNN | geospaNN | Geospatial neural network |
| Median | median | Median baseline |
| Analog | analog | Analogous year baseline |
Feature selection methods
Configured via feature_selection in [ML]:
none, SelectKBest, BorutaPy, Leshy, gOMP, RFECV, RFE, lasso, mrmr, SHAP, stabl, PowerShap, BorutaShap, Genetic, feature_engine, multi
Cluster analysis
Optional analysis that clusters regions by their CID profiles and identifies which CIDs discriminate each cluster. Works with or without yield data — falls back to a proxy CID (e.g., AUC_NDVI) when yield is unavailable. Enabled via [ML]:
run_cluster_analysis = True
cluster_analysis_proxy = AUC_NDVI ; proxy CID when yield is unavailable
cluster_analysis_max_k = 8 ; maximum clusters for silhouette selection
cluster_analysis_top_n = 20 ; top N CIDs in discrimination heatmap
cluster_analysis_variance = 0.85 ; cumulative PCA variance to retain
Pipeline: PCA dimensionality reduction → Ward's hierarchical clustering (silhouette-selected k) → Kruskal-Wallis + Cohen's d for CID discrimination → mutual information for CID-target association. Outputs: cluster map (choropleth), dendrogram, PCA biplot, discrimination heatmap with significance stars, target boxplot, and per-CID maps for top discriminating indices.
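As a rough illustration of the discrimination step, Cohen's d for one CID compares its values inside a cluster with its values in all other regions. The sketch below uses only the standard library. The function name and the sample numbers are hypothetical, and geocif's own implementation may pool variance or handle small clusters differently.

```python
from statistics import mean, variance


def cohens_d(in_cluster, rest):
    """Effect size between two samples using the pooled standard deviation."""
    n1, n2 = len(in_cluster), len(rest)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two values")
    pooled_var = (
        (n1 - 1) * variance(in_cluster) + (n2 - 1) * variance(rest)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        return float("nan")  # undefined when both groups are constant
    return (mean(in_cluster) - mean(rest)) / pooled_var ** 0.5


# Hypothetical AUC_NDVI values for regions in one cluster vs. all other regions.
cluster_values = [2.0, 4.0, 6.0]
other_values = [1.0, 3.0, 5.0]
d = cohens_d(cluster_values, other_values)
print(f"Cohen's d = {d:.2f}")
```

A positive value means the CID is higher inside the cluster than elsewhere; swapping the two arguments flips the sign.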
Spatial neighbor features
Optional GraphSAGE-style preprocessing that computes yield-correlation-weighted averages of neighboring regions' features. Enabled via [ML]:
use_spatial_neighbors = True
spatial_neighbor_method = knn ; knn or full
spatial_neighbor_k = 5 ; number of nearest neighbors
For each admin region, the neighbor graph is built from training data using haversine distances and Pearson yield correlations as edge weights. Neighbor-aggregated features are added as nbr_* columns and flow through standard feature selection.
Experiments output
The experiments runner writes to a dedicated DB and analysis folder under dir_output:
{dir_output}/
└── ml/
    ├── db/
    │   └── experiments_{MMMM_DD_YYYY_HH}H.db
    │
    └── analysis/
        └── {MMMM_DD_YYYY}/
            ├── experiments/            # Experiment 0 (model comparison)
            │   ├── experiment_metrics.csv
            │   ├── heatmap_models.png
            │   ├── boxplot_models.png
            │   ├── regional_mape_models_{country}.png
            │   ├── error_distribution_models.png
            │   └── metric_comparison.png
            │
            └── optimization/           # Optuna hyperparameter search
                ├── optuna_trials.csv
                ├── best_params.csv
                ├── convergence.png
                ├── optimization_history.png
                ├── param_importances.png
                └── parallel_coordinate.png
Outlook output
The yield outlook runner produces a diverging choropleth map showing current forecast yield as a percentage of the historical mean/median prediction per region, plus a combined CSV.
{dir_output}/
└── ml/
    └── analysis/
        └── {MMMM_DD_YYYY}/
            └── outlook/
                ├── yield_outlook_{country}_{crop}_{model}_{stage}_{year}.png
                └── yield_outlook_{year}.csv
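The quantity shown on the outlook map is a simple ratio. The sketch below illustrates it for a single region; the function name and the numbers are hypothetical, and the real runner works on model predictions read from the experiments database rather than on hand-typed lists.

```python
from statistics import mean, median

AGGREGATORS = {"mean": mean, "median": median}


def outlook_percent(current, history, aggregation="mean"):
    """Current forecast as a percentage of the aggregated historical predictions."""
    baseline = AGGREGATORS[aggregation](history)
    return 100.0 * current / baseline


# Hypothetical predicted yields (t/ha) for one region over four past years.
history = [2.0, 2.5, 3.0, 2.5]
pct_mean = outlook_percent(2.75, history, aggregation="mean")
pct_median = outlook_percent(2.0, history, aggregation="median")
print(f"{pct_mean:.0f}% of mean, {pct_median:.0f}% of median")
```

Values above 100 indicate a forecast above the historical baseline and values below 100 a forecast below it, which is what the diverging color scale encodes.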
Config file documentation
geobase.txt
Shared paths and dataset settings. All directory paths are derived from dir_base.
[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO
dir_inputs = ${dir_base}/inputs
dir_logs = ${dir_base}/logs
dir_download = ${dir_inputs}/download
dir_intermed = ${dir_inputs}/intermed
dir_metadata = ${dir_inputs}/metadata
dir_condition = ${dir_inputs}/crop_condition
dir_crop_inputs = ${dir_condition}/crop_t20
dir_boundary_files = ${dir_metadata}/boundary_files
dir_crop_calendars = ${dir_metadata}/crop_calendars
dir_crop_masks = ${dir_metadata}/crop_masks
dir_images = ${dir_metadata}/images
dir_production_statistics = ${dir_metadata}/production_statistics
dir_output = ${dir_base}/outputs
[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
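The ${...} references follow the syntax of configparser's ExtendedInterpolation, and the list values look like Python literals. Assuming the file is parsed that way (geocif's actual loader may differ), the paths and the dataset list can be resolved as in this sketch, which uses a shortened copy of the file:

```python
import ast
import configparser

# Shortened copy of geobase.txt for illustration.
GEOBASE = """
[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO
dir_inputs = ${dir_base}/inputs
dir_metadata = ${dir_inputs}/metadata
dir_crop_masks = ${dir_metadata}/crop_masks

[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
"""

parser = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
parser.read_string(GEOBASE)

# Nested references are expanded recursively when the value is read.
crop_masks = parser["PATHS"]["dir_crop_masks"]

# Values are stored as strings; literal_eval turns the list text into a list.
datasets = ast.literal_eval(parser["DATASETS"]["datasets"])

print(crop_masks)
print(datasets)
```

Changing dir_base is then enough to relocate every derived directory.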
countries.txt
Single source of truth for per-country config. Shared by both geoprepare and geocif.
[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
seasons = [1]
crops = ['maize']
category = AMIS
use_cropland_mask = False
calendar_file = crop_calendar.csv
; AMIS countries (inherit from DEFAULT, override crops if needed)
[argentina]
crops = ['soybean', 'winter_wheat', 'maize']
; EWCM countries (full per-country config)
[kenya]
category = EWCM
admin_level = admin_1
seasons = [1, 2]
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']
[malawi]
category = EWCM
admin_level = admin_2
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']
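Country sections only need to list what differs from [DEFAULT]. The sketch below shows that inheritance with configparser on a shortened copy of the file; country_settings is a hypothetical helper, and parsing the list values with ast.literal_eval is an assumption about how they are meant to be read.

```python
import ast
import configparser

# Shortened copy of countries.txt for illustration.
COUNTRIES = """
[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
seasons = [1]
crops = ['maize']
category = AMIS

[argentina]
crops = ['soybean', 'winter_wheat', 'maize']

[kenya]
category = EWCM
seasons = [1, 2]
boundary_file = adm_shapefile.gpkg
"""

parser = configparser.ConfigParser()
parser.read_string(COUNTRIES)


def country_settings(name):
    """Collect a country's settings; missing keys fall back to [DEFAULT]."""
    section = parser[name]
    return {
        "category": section["category"],
        "boundary_file": section["boundary_file"],
        "seasons": ast.literal_eval(section["seasons"]),
        "crops": ast.literal_eval(section["crops"]),
    }


argentina = country_settings("argentina")
kenya = country_settings("kenya")
print(argentina)
print(kenya)
```

Argentina overrides only crops and inherits everything else, while Kenya replaces the category, seasons, and boundary file but still inherits crops.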
crops.txt
Crop mask filenames and calendar category definitions.
; Crop masks
[maize]
mask = Percent_Maize.tif
[winter_wheat]
mask = Percent_Winter_Wheat.tif
[sorghum]
mask = cropland_v9.tif
; Calendar categories
[EWCM]
use_cropland_mask = True
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize', 'sorghum', 'millet', 'rice', 'winter_wheat', 'teff']
eo_model = ['aef', 'nsidc_surface', 'nsidc_rootzone', 'ndvi', 'cpc_tmax', 'cpc_tmin', 'chirps', 'chirps_gefs', 'esi_4wk']
[AMIS]
calendar_file = AMISCM_2026-01-05.xlsx
geoextract.txt
Extraction-only settings for geoprepare. Loaded last so its [DEFAULT] overrides shared defaults.
[DEFAULT]
method = JRC
redo = False
threshold = True
floor = 20
ceil = 90
countries = ["malawi"]
forecast_seasons = [2022]
[PROJECT]
parallel_extract = True
parallel_merge = False
geocif.txt
Indices, ML, and agmet settings for geocif. Country overrides go here when geocif needs different values than countries.txt (e.g., a subset of crops).
[AGMET]
eo_plot = ['ndvi', 'chirts_era5_tmax', 'chirts_era5_tmin', 'chirps', 'esi_4wk', 'nsidc_surface', 'nsidc_rootzone']
logo_harvest = harvest.png
logo_geoglam = geoglam.png
; Country overrides (only where geocif differs from countries.txt)
[ethiopia]
crops = ['winter_wheat']
[bangladesh]
crops = ['rice']
admin_level = admin_2
boundary_file = bangladesh.shp
; ML model definitions
[catboost]
ML_model = True
[analog]
ML_model = False
[ML]
model_type = REGRESSION
target = Yield (tn per ha)
feature_selection = gOMP
cluster_strategy = single
check_yield_trend = False
use_spatial_neighbors = True
spatial_neighbor_method = knn
spatial_neighbor_k = 5
lag_yield_as_feature = True
lag_years = 3
median_yield_as_feature = False
median_years = 5
include_lat_lon_as_feature = False
panel_model = True
cat_features = ["Harvest Year", "Region_ID", "Region"]
outlook_n_years = 10 ; Number of historical years for yield outlook comparison
outlook_aggregation = mean ; mean or median
run_time_steps = latest ; latest, current, all, or N (every Nth time period)
run_cluster_analysis = False
cluster_analysis_proxy = AUC_NDVI
cluster_analysis_max_k = 8
cluster_analysis_top_n = 20
cluster_analysis_variance = 0.85
[LOGGING]
log_level = INFO
[DEFAULT]
data_source = harvest
method = monthly_r
project_name = geocif
countries = ["kenya"]
crops = ['maize']
admin_level = admin_1
models = ['catboost']
seasons = [1]
threshold = True
floor = 20
FLDAS forecast overlay
When FLDAS columns are present in the merged data (e.g. fldas_tair_tavg_lead0 through _lead5), agmet plots automatically overlay forecast dots on matching panels:
| FLDAS variable | Target panel |
|---|---|
| fldas_tair_tavg | Temperature |
| fldas_totalprecip_tavg | Daily precipitation |
| fldas_soilmoist_tavg | Soil moisture (surface) |
Each lead time (0–5) appears as a diamond marker with decreasing opacity (lead 0 = most opaque). Dots beyond the harvest date are suppressed. No config changes are needed — detection is automatic.
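The marker rule can be sketched as below. This is an illustration only: the function name, the linear opacity ramp from 1.0 down to 0.3, and the monthly spacing of the example dates are assumptions, and the plotting code in geocif may choose different values.

```python
from datetime import date


def forecast_markers(valid_dates, harvest, alpha_max=1.0, alpha_min=0.3):
    """Return (lead, date, alpha) for leads whose date is on or before harvest."""
    n = len(valid_dates)
    step = (alpha_max - alpha_min) / max(n - 1, 1)
    markers = []
    for lead, when in enumerate(valid_dates):
        if when > harvest:
            continue  # dots beyond the harvest date are suppressed
        markers.append((lead, when, round(alpha_max - lead * step, 2)))
    return markers


# Hypothetical valid dates for leads 0..5 (May to October) and a mid-August harvest.
valid = [date(2025, month, 1) for month in range(5, 11)]
markers = forecast_markers(valid, harvest=date(2025, 8, 15))
for lead, when, alpha in markers:
    print(lead, when.isoformat(), alpha)
```

With these inputs only leads 0 to 3 are kept, each drawn fainter than the one before it.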
Release
To publish a new version to PyPI:
- Bump __version__ in geocif/__init__.py and version in pyproject.toml
- Build and upload:
uv build
uvx twine upload dist/geocif-<version>*
- Commit:
git add geocif/__init__.py pyproject.toml
git commit -m "Bump to <version>"
Credits
This project was supported by NASA Applied Sciences Grant No. 80NSSC17K0625 through the NASA Harvest Consortium, and the NASA Acres Consortium under NASA Grant #80NSSC23M0034.