Models to visualize and forecast crop conditions and yields
geocif
Generate Climatic Impact-Drivers (CIDs) from Earth Observation (EO) data, build ML yield forecasting models, and produce agmet condition monitoring plots.
Climatic Impact-Drivers for Crop Yield Assessment at NASA Harvest
- Free software: MIT license
- Documentation: https://ritviksahajpal.github.io/yield_forecasting/
Setup
Requirements
- Python 3.11+
- uv
Install
cd geocif # project root (where pyproject.toml lives)
uv sync # creates .venv and installs all dependencies
On Windows, uv automatically pulls pre-built geospatial wheels (GDAL, rasterio, fiona, shapely, pyproj, rtree) from the URLs in [tool.uv.sources]. On Linux/macOS, those entries are skipped (platform marker) and packages are installed from PyPI.
To activate the environment:
# Windows
.venv\Scripts\activate
# Linux/macOS
source .venv/bin/activate
Fresh reinstall
rm -rf .venv && uv sync
Config files
| File | Purpose | Used by |
|---|---|---|
| geobase.txt | Paths, shapefile column mappings | both |
| countries.txt | Per-country config (boundary files, admin levels, seasons, crops) | both |
| crops.txt | Crop masks, calendar categories (EWCM, AMIS) | both |
| geoextract.txt | Extraction-only settings (method, threshold, parallelism) | geoprepare |
| geocif.txt | Indices/ML/agmet settings, country overrides, runtime selections | geocif |
Usage
Order matters: Config files are loaded left-to-right. When the same key appears in multiple files, the last file wins. The tool-specific file (geoextract.txt or geocif.txt) must be last so its [DEFAULT] values (countries, method, etc.) override the shared defaults in countries.txt.
config_dir = "/path/to/config" # full path to your config directory
cfg_geoprepare = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geoextract.txt"]
cfg_geocif = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geocif.txt"]
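The last-file-wins rule can be illustrated with Python's standard configparser, which applies the same left-to-right override when several sources are read into one parser. This is a minimal sketch: whether geocif loads its files through configparser is an assumption, and the two inline config strings are made up for the demonstration.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Stand-ins for a shared file and a tool-specific file (hypothetical content).
shared = """
[DEFAULT]
crops = ['maize']
seasons = [1]
"""
tool_specific = """
[DEFAULT]
crops = ['rice']
"""


def load_in_order(texts):
    """Read config sources left to right; later sources override earlier keys."""
    parser = ConfigParser(interpolation=ExtendedInterpolation())
    for text in texts:
        parser.read_string(text)
    return parser


merged = load_in_order([shared, tool_specific])
last_wins = merged["DEFAULT"]["crops"]      # overridden by the later source
untouched = merged["DEFAULT"]["seasons"]    # only defined in the earlier source
```

Swapping the order of the two sources would leave `crops` at the shared value, which is why the tool-specific file goes last.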
geoprepare (download, extract, merge)
from geoprepare import geodownload
geodownload.run([f"{config_dir}/geobase.txt"])
from geoprepare import geoextract
geoextract.run(cfg_geoprepare)
from geoprepare import geomerge
geomerge.run(cfg_geoprepare)
geocif (indices, ML, agmet, analysis, experiments)
from geocif import indices_runner
indices_runner.run(cfg_geocif)
from geocif import geocif_runner
geocif_runner.run(cfg_geocif)
from geocif.agmet import geoagmet
geoagmet.run(cfg_geocif)
from geocif import analysis
analysis.run(cfg_geocif)
from geocif import experiments
experiments.run(cfg_geocif, n_trials=30)
from geocif import yield_outlook
yield_outlook.run(cfg_geocif) # uses config defaults (10 years, mean)
# yield_outlook.run(cfg_geocif, current_year=2026, n_years=10, aggregation="median")
Cropmask optimizers
Two optimizers consume geoprepare extraction outputs and tune the cropland mask used downstream. Run each one after the corresponding geoprepare extractor has written its outputs.
# Uniform threshold T over the region (single absolute or rank-based knob).
# Reads geoprepare.extract_sweep output:
# ${PATHS:dir_output}/threshold_sweep/{country}/{crop}/{country}_{crop}_s{season}_sweep.csv
from geocif import threshold_optimizer
threshold_optimizer.run(cfg_geocif)
# Per-cell binary mask — independent in/out decision per cropland cell.
# Reads geoprepare.extract_cells output:
# ${PATHS:dir_output}/cell_optimizer/{country}/{crop}/{country}_{crop}_s{season}_cells.parquet
# Writes a production-mask parquet at the same location that geoextract picks up.
from geocif import cell_optimizer
cell_optimizer.run(cfg_geocif)
Configure under [THRESHOLD_OPTIMIZER] and [CELL_OPTIMIZER] in geocif.txt. Outputs land under ${PATHS:dir_output}/ml/analysis/{date}/{threshold_sweep_summary|cell_optimizer}/.
Using the optimized cell mask in production extraction
geoprepare 0.6.273+ can apply the per-cell mask produced by cell_optimizer during EO extraction. Opt in per country (or in [DEFAULT]) in geoextract.txt:
[DEFAULT]
use_optimized_mask = True
When the flag is on, geoprepare.extract_EO reads
${PATHS:dir_output}/cell_optimizer/{country}/{crop}/{country}_{crop}_s{season}_optimized_mask.parquet
for every configured (country, crop, season) and combines it with the existing floor/ceiling AFI mask using a logical AND. Cells the optimizer marked included=False are dropped from the per-region aggregate even if they pass the floor/ceiling rule. Multi-season countries get the union across seasons: a cell is kept if any season's optimizer selected it.
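The combination rule can be sketched with plain boolean lists, one entry per cell. This is only an illustration of the AND/union logic described above; the function name and the sample flags are hypothetical and do not reflect how geoprepare stores its masks.

```python
def combine_masks(passes_floor_ceiling, included_by_season):
    """Return one keep/drop flag per cell.

    passes_floor_ceiling: list of bools from the floor/ceiling rule.
    included_by_season:   one list of bools per season from the optimizer.
    """
    keep = []
    for i, passes in enumerate(passes_floor_ceiling):
        # Union across seasons: selected if any season's optimizer kept the cell.
        selected_any_season = any(season[i] for season in included_by_season)
        # AND with the existing floor/ceiling mask.
        keep.append(passes and selected_any_season)
    return keep


afi_ok = [True, True, False, True]       # made-up floor/ceiling results
season_1 = [True, False, True, False]    # made-up optimizer output, season 1
season_2 = [False, False, True, True]    # made-up optimizer output, season 2
kept = combine_masks(afi_ok, [season_1, season_2])
```

The second cell is dropped because no season selected it, and the third because it fails the floor/ceiling rule even though both seasons selected it.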
Pipeline order with the optimized mask:
geoprepare.extract_cells.run(cfg_geoprepare) # writes per-cell parquets
geocif.cell_optimizer.run(cfg_geocif) # writes optimized_mask.parquet
geoprepare.geoextract.run(cfg_geoprepare) # reads optimized_mask.parquet
extract_EO aborts at startup with a list of the missing parquet files if use_optimized_mask = True for any country whose mask has not been produced yet. This is deliberate: silently falling back to the floor/ceiling rule when the operator asked for the overlay would be confusing.
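A fail-fast check of this kind might look like the sketch below. The path layout follows the pattern documented above; the function names and the exception type are assumptions for illustration, not geoprepare's actual implementation.

```python
import tempfile
from pathlib import Path


def expected_mask_path(dir_output, country, crop, season):
    """Build the mask path using the naming pattern documented above."""
    name = f"{country}_{crop}_s{season}_optimized_mask.parquet"
    return Path(dir_output) / "cell_optimizer" / country / crop / name


def find_missing_masks(dir_output, combos):
    """Return the expected mask paths that do not exist on disk."""
    paths = [expected_mask_path(dir_output, *combo) for combo in combos]
    return [p for p in paths if not p.exists()]


def check_masks_or_abort(dir_output, combos):
    """Raise before any extraction work starts if a mask is missing."""
    missing = find_missing_masks(dir_output, combos)
    if missing:
        listing = "\n".join(f"  {p}" for p in missing)
        raise FileNotFoundError(f"Optimized masks not found:\n{listing}")


# Demo in a throwaway directory: only the kenya mask exists.
with tempfile.TemporaryDirectory() as tmp:
    present = expected_mask_path(tmp, "kenya", "maize", 1)
    present.parent.mkdir(parents=True)
    present.touch()
    combos = [("kenya", "maize", 1), ("malawi", "maize", 1)]
    missing_names = [p.name for p in find_missing_masks(tmp, combos)]
    try:
        check_masks_or_abort(tmp, combos)
        aborted = False
    except FileNotFoundError:
        aborted = True
```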
Currently wired in process_aef, process_fldas, process_chirps_mfc, process_soilgrids (the static + monthly-forecast EO paths). The daily-EO path through geom_extract (NDVI, daily CHIRPS, ESI, etc.) is not yet wired — track via a future change in geoprepare.
Annual (leave-one-out) masks
Enable annual_mask = True under [CELL_OPTIMIZER] in geocif.txt to produce one mask per historical year instead of a single pooled mask. For each year Y, the GA trains on every other year, so the cell selection never sees year Y's yield, and that mask is written to a _y{year}_optimized_mask.parquet file alongside the pooled one. geoprepare.extract_EO prefers the year-specific file when extracting year Y (FLDAS / CHIRPS-MFC, which are per-year datasets) and falls back to the pooled file for forecast / current years. AEF and SoilGrids (static) always use the pooled mask.
This closes the overfitting failure mode where the pooled mask was selected with year Y's yield as part of the training data — visible in pre-0.4.747 runs as regions whose Pearson r between yield and NDVI flipped sign after selection (the GA found anti-correlated cells because R² is sign-blind).
Cost. Roughly (n_years + 1) × the pooled-only default per region. For a country with 25 yield years that is about 26× as many GA runs; expect runtime to scale accordingly. Opt in only when the data span justifies it.
Off by default. Existing configs without annual_mask continue to write the single pooled parquet.
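The leave-one-out scheme and its run count can be sketched as follows. The helper names are hypothetical; only the split logic and the (n_years + 1) count come from the description above.

```python
def leave_one_out_splits(years):
    """Yield (held_out_year, training_years) pairs, one per historical year."""
    for held_out in years:
        yield held_out, [y for y in years if y != held_out]


def total_optimizer_runs(n_years):
    """One pooled run plus one run per held-out year."""
    return n_years + 1


years = [2019, 2020, 2021, 2022]
splits = dict(leave_one_out_splits(years))
# The mask for 2021 is selected from 2019, 2020, and 2022 only.
training_for_2021 = splits[2021]
```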
ML models
geocif supports the following model types (configured via models in [DEFAULT]):
| Model | Key | Type |
|---|---|---|
| CatBoost | catboost | Gradient boosting |
| XGBoost | xgboost | Gradient boosting |
| TabPFN | tabpfn | Prior-fitted network |
| TabICL | tabicl | In-context learning |
| NGBoost | ngboost | Natural gradient boosting |
| YDF | ydf | Yggdrasil decision forests |
| Oblique RF | oblique | Oblique random forest |
| Cubist | cubist | Rule-based regression |
| MERF | merf | Mixed effects random forest |
| Linear | linear | LassoCV / LogisticRegressionCV |
| GAM | gam | Generalized additive model |
| GeoSpaNN | geospaNN | Geospatial neural network |
| Median | median | Median baseline |
| Analog | analog | Analogous year baseline |
Feature selection methods
Configured via feature_selection in [ML]:
none, SelectKBest, BorutaPy, Leshy, gOMP, RFECV, RFE, lasso, mrmr, SHAP, stabl, PowerShap, BorutaShap, Genetic, feature_engine, multi
Cluster analysis
Optional analysis that clusters regions by their CID profiles and identifies which CIDs discriminate each cluster. Works with or without yield data — falls back to a proxy CID (e.g., AUC_NDVI) when yield is unavailable. Enabled via [ML]:
run_cluster_analysis = True
cluster_analysis_proxy = AUC_NDVI ; proxy CID when yield is unavailable
cluster_analysis_max_k = 8 ; maximum clusters for silhouette selection
cluster_analysis_top_n = 20 ; top N CIDs in discrimination heatmap
cluster_analysis_variance = 0.85 ; cumulative PCA variance to retain
Pipeline: PCA dimensionality reduction → Ward's hierarchical clustering (silhouette-selected k) → Kruskal-Wallis + Cohen's d for CID discrimination → mutual information for CID-target association. Outputs: cluster map (choropleth), dendrogram, PCA biplot, discrimination heatmap with significance stars, target boxplot, and per-CID maps for top discriminating indices.
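As one piece of that pipeline, Cohen's d for a single CID can be computed with the standard library. This sketch uses the common pooled-standard-deviation form and compares one cluster against the remaining regions; which variant and which grouping geocif uses is an assumption, and the values are made up.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


in_cluster = [2.0, 4.0, 6.0]       # made-up CID values for regions in one cluster
other_regions = [1.0, 3.0, 5.0]    # made-up CID values for all other regions
effect = cohens_d(in_cluster, other_regions)
```

A larger absolute value means the CID separates the cluster from the rest more strongly, independent of the CID's units.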
Spatial neighbor features
Optional GraphSAGE-style preprocessing that computes yield-correlation-weighted averages of neighboring regions' features. Enabled via [ML]:
use_spatial_neighbors = True
spatial_neighbor_method = knn ; knn or full
spatial_neighbor_k = 5 ; number of nearest neighbors
For each admin region, the neighbor graph is built from training data using haversine distances and Pearson yield correlations as edge weights. Neighbor-aggregated features are added as nbr_* columns and flow through standard feature selection.
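The steps above can be sketched with the standard library: haversine distances pick the k nearest regions, and a weighted mean aggregates their feature values. The function names, centroid coordinates, and weights are hypothetical, and the sketch assumes non-negative weights; how geocif treats negative correlations is not stated here.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def k_nearest(target, centroids, k):
    """Names of the k regions closest to the target region's centroid."""
    lat, lon = centroids[target]
    others = [
        (haversine_km(lat, lon, *coords), name)
        for name, coords in centroids.items()
        if name != target
    ]
    return [name for _, name in sorted(others)[:k]]


def weighted_mean(values, weights):
    """Weighted average; returns None when all weights are zero."""
    total = sum(weights)
    if total == 0:
        return None
    return sum(v * w for v, w in zip(values, weights)) / total


centroids = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 5.0), "D": (3.0, 0.0)}
neighbors = k_nearest("A", centroids, k=2)

ndvi = {"B": 0.4, "D": 0.8}           # made-up feature values
yield_corr = {"B": 0.9, "D": 0.3}     # made-up yield correlations with region A
nbr_ndvi = weighted_mean(
    [ndvi[n] for n in neighbors], [yield_corr[n] for n in neighbors]
)
```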
Experiments
The experiments runner (geocif.experiments) provides 6 experiments for model selection, feature importance, and hyperparameter tuning:
| # | Config name | Internal name | What it does |
|---|---|---|---|
| 0 | model_comparison | models | Runs each model in comparison_models head-to-head. Produces Bradley-Terry ranking, scatter plots, MAPE bars. Identifies best model per country (required by experiments 1 & 2). |
| 1 | cid_ablation | cids | Runs the best model once per CID Type in isolation (Cold alone, FLDAS alone, etc.). Shows which climate driver category contributes most. Produces MAPE-by-CID bar chart, region×CID heatmap, year×CID chart, CID rank over time. |
| 2 | region_filter | region_filter | Drops low-production regions and re-runs the best model to test if excluding noisy regions improves national accuracy. |
| 3 | optuna | optuna | Bayesian (TPE) search over ML hyperparameters (learning rate, depth, regularization, etc.). Produces convergence, parameter importance, and parallel coordinate plots. |
| 4 | optuna_cid_types | optuna_cid_types | Bayesian search for the best combination of CID Type categories (e.g. Rain+VI+ESI may beat using all 8 types). |
| 5 | optuna_cid_indices | optuna_cid_indices | Bayesian search for the best subset of individual CID indices (e.g. PRCPTOT + AUC_NDVI + TG90p). Capped at max_cid_indices per trial. |
Dependencies: Experiments 1 and 2 require experiment 0 first. Experiments 3–5 are independent.
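A run_experiments list can be checked against these dependencies before launching anything. The sketch below is a hypothetical helper, not part of geocif, and it interprets "first" as "listed earlier in the same run_experiments value"; results from a previous run may also satisfy the requirement.

```python
# Prerequisites taken from the dependency note above.
PREREQUISITES = {
    "cid_ablation": ["model_comparison"],
    "region_filter": ["model_comparison"],
}


def missing_prerequisites(run_experiments):
    """Map each requested experiment to prerequisites not listed before it."""
    seen = set()
    problems = {}
    for name in run_experiments:
        unmet = [p for p in PREREQUISITES.get(name, []) if p not in seen]
        if unmet:
            problems[name] = unmet
        seen.add(name)
    return problems


ok = missing_prerequisites(["model_comparison", "cid_ablation"])
bad = missing_prerequisites(["region_filter", "optuna"])
```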
Configure in geocif.txt:
[experiments]
run_experiments = ["model_comparison", "cid_ablation"]
comparison_models = ["catboost", "tabpfn", "tabicl"]
n_trials = 30
n_trials_cid_types = 30
n_trials_cid_indices = 60
max_cid_indices = 25
Run:
from geocif import experiments
experiments.run(cfg_geocif)
Experiments output
The experiments runner writes to a dedicated DB and analysis folder under dir_output:
{dir_output}/
└── ml/
    ├── db/
    │   └── experiments_{MMMM_DD_YYYY_HH}H.db
    │
    └── analysis/
        └── {MMMM_DD_YYYY}/
            ├── experiments/              # Experiment 0 (model comparison)
            │   ├── experiment_metrics.csv
            │   ├── heatmap_models.png
            │   ├── boxplot_models.png
            │   ├── regional_mape_models_{country}.png
            │   ├── error_distribution_models.png
            │   └── metric_comparison.png
            │
            └── optimization/             # Optuna hyperparameter search
                ├── optuna_trials.csv
                ├── best_params.csv
                ├── convergence.png
                ├── optimization_history.png
                ├── param_importances.png
                └── parallel_coordinate.png
Outlook output
The yield outlook runner produces a diverging choropleth map showing current forecast yield as a percentage of the historical mean/median prediction per region, plus a combined CSV.
{dir_output}/
└── ml/
    └── analysis/
        └── {MMMM_DD_YYYY}/
            └── outlook/
                ├── yield_outlook_{country}_{crop}_{model}_{stage}_{year}.png
                └── yield_outlook_{year}.csv
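The quantity mapped by the outlook, forecast yield as a percentage of the historical baseline, can be sketched as below. The function name and the sample values are hypothetical; the n_years and aggregation parameters mirror the outlook_n_years and outlook_aggregation settings.

```python
from statistics import mean, median


def percent_of_baseline(current, historical, n_years=10, aggregation="mean"):
    """Express a forecast as a percentage of the recent historical baseline."""
    recent = historical[-n_years:]  # keep at most the last n_years values
    baseline = mean(recent) if aggregation == "mean" else median(recent)
    return 100.0 * current / baseline


history = [2.0, 2.4, 1.6, 2.0]   # made-up historical predictions, one per year
above = percent_of_baseline(2.2, history)
below = percent_of_baseline(1.8, history, aggregation="median")
```

Values above 100 indicate a forecast higher than the baseline and values below 100 a lower one, which is what a diverging color scale is suited to show.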
Config file documentation
geobase.txt
Shared paths and dataset settings. All directory paths are derived from dir_base.
[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO
dir_inputs = ${dir_base}/inputs
dir_logs = ${dir_base}/logs
dir_download = ${dir_inputs}/download
dir_intermed = ${dir_inputs}/intermed
dir_metadata = ${dir_inputs}/metadata
dir_condition = ${dir_inputs}/crop_condition
dir_crop_inputs = ${dir_condition}/crop_t20
dir_boundary_files = ${dir_metadata}/boundary_files
dir_crop_calendars = ${dir_metadata}/crop_calendars
dir_crop_masks = ${dir_metadata}/crop_masks
dir_images = ${dir_metadata}/images
dir_production_statistics = ${dir_metadata}/production_statistics
dir_output = ${dir_base}/outputs
[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
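The ${dir_base} and ${PATHS:dir_output} references match the syntax of ExtendedInterpolation in Python's configparser. The sketch below shows how such references resolve; that geocif reads its files this way is an assumption, and /data/GEO and the [EXAMPLE] section are placeholders.

```python
from configparser import ConfigParser, ExtendedInterpolation

text = """
[PATHS]
dir_base = /data/GEO
dir_inputs = ${dir_base}/inputs
dir_output = ${dir_base}/outputs

[EXAMPLE]
sweep_dir = ${PATHS:dir_output}/threshold_sweep
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(text)

# Same-section reference: ${dir_base}
dir_inputs = parser["PATHS"]["dir_inputs"]
# Cross-section reference: ${PATHS:dir_output}
sweep_dir = parser["EXAMPLE"]["sweep_dir"]
```

Changing dir_base alone is enough to relocate every derived directory.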
countries.txt
Single source of truth for per-country config. Shared by both geoprepare and geocif.
[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
seasons = [1]
crops = ['maize']
category = AMIS
use_cropland_mask = False
calendar_file = crop_calendar.csv
; AMIS countries (inherit from DEFAULT, override crops if needed)
[argentina]
crops = ['soybean', 'winter_wheat', 'maize']
; EWCM countries (full per-country config)
[kenya]
category = EWCM
admin_level = admin_1
seasons = [1, 2]
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']
[malawi]
category = EWCM
admin_level = admin_2
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']
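Inheritance from [DEFAULT] works the way configparser handles its default section: a country section only needs the keys it overrides. The sketch below also parses list-valued entries with ast.literal_eval, which is an assumption about how such values could be read, not a statement about geocif's parser.

```python
import ast
from configparser import ConfigParser

text = """
[DEFAULT]
admin_level = admin_1
seasons = [1]
crops = ['maize']

[argentina]
crops = ['soybean', 'winter_wheat', 'maize']

[kenya]
seasons = [1, 2]
"""

parser = ConfigParser()
parser.read_string(text)


def get_list(section, key):
    """Parse a Python-literal list value such as ['maize'] or [1, 2]."""
    return ast.literal_eval(parser[section][key])


argentina_crops = get_list("argentina", "crops")      # overridden
argentina_seasons = get_list("argentina", "seasons")  # inherited from DEFAULT
kenya_crops = get_list("kenya", "crops")              # inherited from DEFAULT
kenya_admin = parser["kenya"]["admin_level"]          # inherited from DEFAULT
```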
crops.txt
Crop mask filenames and calendar category definitions.
; Crop masks
[maize]
mask = Percent_Maize.tif
[winter_wheat]
mask = Percent_Winter_Wheat.tif
[sorghum]
mask = cropland_v9.tif
; Calendar categories
[EWCM]
use_cropland_mask = True
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize', 'sorghum', 'millet', 'rice', 'winter_wheat', 'teff']
eo_model = ['aef', 'nsidc_surface', 'nsidc_rootzone', 'ndvi', 'cpc_tmax', 'cpc_tmin', 'chirps', 'chirps_gefs', 'esi_4wk']
[AMIS]
calendar_file = AMISCM_2026-01-05.xlsx
geoextract.txt
Extraction-only settings for geoprepare. Loaded last so its [DEFAULT] overrides shared defaults.
[DEFAULT]
method = JRC
redo = False
threshold = True
floor = 20
ceil = 90
countries = ["malawi"]
forecast_seasons = [2022]
[PROJECT]
parallel_extract = True
parallel_merge = False
geocif.txt
Indices, ML, and agmet settings for geocif. Country overrides go here when geocif needs different values than countries.txt (e.g., a subset of crops).
[AGMET]
eo_plot = ['ndvi', 'chirts_era5_tmax', 'chirts_era5_tmin', 'chirps', 'esi_4wk', 'nsidc_surface', 'nsidc_rootzone']
logo_harvest = harvest.png
logo_geoglam = geoglam.png
; Country overrides (only where geocif differs from countries.txt)
[ethiopia]
crops = ['winter_wheat']
[bangladesh]
crops = ['rice']
admin_level = admin_2
boundary_file = bangladesh.shp
; ML model definitions
[catboost]
ML_model = True
[analog]
ML_model = False
[ML]
model_type = REGRESSION
target = Yield (tn per ha)
feature_selection = gOMP
cluster_strategy = single
check_yield_trend = False
use_spatial_neighbors = True
spatial_neighbor_method = knn
spatial_neighbor_k = 5
lag_yield_as_feature = True
lag_years = 3
median_yield_as_feature = False
median_years = 5
include_lat_lon_as_feature = False
panel_model = True
cat_features = ["Harvest Year", "Region_ID", "Region"]
outlook_n_years = 10 ; Number of historical years for yield outlook comparison
outlook_aggregation = mean ; mean or median
run_time_steps = latest ; latest, current, all, or N (every Nth time period)
run_cluster_analysis = False
cluster_analysis_proxy = AUC_NDVI
cluster_analysis_max_k = 8
cluster_analysis_top_n = 20
cluster_analysis_variance = 0.85
[LOGGING]
log_level = INFO
[DEFAULT]
data_source = harvest
method = monthly_r
project_name = geocif
countries = ["kenya"]
crops = ['maize']
admin_level = admin_1
models = ['catboost']
seasons = [1]
threshold = True
floor = 20
FLDAS forecast overlay
When FLDAS columns are present in the merged data (e.g. fldas_tair_tavg_lead0 through _lead5), agmet plots automatically overlay forecast dots on matching panels:
| FLDAS variable | Target panel |
|---|---|
| fldas_tair_tavg | Temperature |
| fldas_totalprecip_tavg | Daily precipitation |
| fldas_soilmoist_tavg | Soil moisture (surface) |
Each lead time (0–5) appears as a diamond marker with decreasing opacity (lead 0 = most opaque). Dots beyond the harvest date are suppressed. No config changes are needed — detection is automatic.
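Automatic detection of this kind can be sketched as a scan over column names. The variable-to-panel mapping comes from the table above; the regular expression, function names, and the linear opacity fade are hypothetical stand-ins for whatever geocif does internally.

```python
import re

PANEL_FOR_VARIABLE = {
    "fldas_tair_tavg": "Temperature",
    "fldas_totalprecip_tavg": "Daily precipitation",
    "fldas_soilmoist_tavg": "Soil moisture (surface)",
}

# Matches names such as fldas_tair_tavg_lead0 and captures variable and lead.
LEAD_COLUMN = re.compile(r"^(fldas_[a-z_]+)_lead(\d+)$")


def detect_forecast_leads(columns):
    """Map each target panel to the sorted lead times found in the columns."""
    found = {}
    for column in columns:
        match = LEAD_COLUMN.match(column)
        if match and match.group(1) in PANEL_FOR_VARIABLE:
            panel = PANEL_FOR_VARIABLE[match.group(1)]
            found.setdefault(panel, []).append(int(match.group(2)))
    return {panel: sorted(leads) for panel, leads in found.items()}


def marker_alpha(lead, max_lead=5, min_alpha=0.3):
    """Hypothetical linear fade: lead 0 is fully opaque, max_lead is faintest."""
    return 1.0 - (1.0 - min_alpha) * lead / max_lead


columns = [
    "ndvi",
    "fldas_tair_tavg_lead1",
    "fldas_tair_tavg_lead0",
    "fldas_soilmoist_tavg_lead0",
    "chirps",
]
panels = detect_forecast_leads(columns)
```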
Release
To publish a new version to PyPI:
- Bump __version__ in geocif/__init__.py and version in pyproject.toml
- Build and upload:
uv build
uvx twine upload dist/geocif-<version>*
- Commit:
git add geocif/__init__.py pyproject.toml
git commit -m "Bump to <version>"
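Because the version lives in two files, a quick consistency check before building can catch a missed bump. The sketch below is a hypothetical helper that works on file contents as strings; it uses simple regular expressions instead of a TOML parser, and the sample version 1.2.3 is a placeholder.

```python
import re


def version_from_init(text):
    """Extract the value assigned to __version__ in a Python source string."""
    match = re.search(r'^__version__\s*=\s*["\']([^"\']+)["\']', text, re.MULTILINE)
    return match.group(1) if match else None


def version_from_pyproject(text):
    """Extract the first top-level 'version = "..."' line from pyproject text."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', text, re.MULTILINE)
    return match.group(1) if match else None


# Placeholder contents; in practice read geocif/__init__.py and pyproject.toml.
init_text = '__version__ = "1.2.3"\n'
pyproject_text = '[project]\nname = "geocif"\nversion = "1.2.3"\n'

versions_match = version_from_init(init_text) == version_from_pyproject(pyproject_text)
```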
Credits
This project was supported by NASA Applied Sciences Grant No. 80NSSC17K0625 through the NASA Harvest Consortium, and the NASA Acres Consortium under NASA Grant #80NSSC23M0034.