Models to visualize and forecast crop conditions and yields

geocif

Generate Climatic Impact-Drivers (CIDs) from Earth Observation (EO) data, build ML yield forecasting models, and produce agmet condition monitoring plots.

Climatic Impact-Drivers for Crop Yield Assessment at NASA Harvest

Setup

Requirements

  • Python 3.11+
  • uv

Install

cd geocif                   # project root (where pyproject.toml lives)
uv sync                     # creates .venv and installs all dependencies

On Windows, uv automatically pulls pre-built geospatial wheels (GDAL, rasterio, fiona, shapely, pyproj, rtree) from the URLs in [tool.uv.sources]. On Linux/macOS, those entries are skipped (platform marker) and packages are installed from PyPI.

To activate the environment:

# Windows
.venv\Scripts\activate

# Linux/macOS
source .venv/bin/activate

Fresh reinstall

rm -rf .venv && uv sync

Config files

File | Purpose | Used by
geobase.txt | Paths, shapefile column mappings | both
countries.txt | Per-country config (boundary files, admin levels, seasons, crops) | both
crops.txt | Crop masks, calendar categories (EWCM, AMIS) | both
geoextract.txt | Extraction-only settings (method, threshold, parallelism) | geoprepare
geocif.txt | Indices/ML/agmet settings, country overrides, runtime selections | geocif

Usage

Order matters: Config files are loaded left-to-right. When the same key appears in multiple files, the last file wins. The tool-specific file (geoextract.txt or geocif.txt) must be last so its [DEFAULT] values (countries, method, etc.) override the shared defaults in countries.txt.

config_dir = "/path/to/config"  # full path to your config directory

cfg_geoprepare = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geoextract.txt"]
cfg_geocif = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geocif.txt"]
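
The last-file-wins rule can be illustrated with Python's standard configparser. This is a minimal sketch that assumes the files are parsed configparser-style; the in-memory strings stand in for countries.txt and geocif.txt, and load_in_order is a hypothetical helper, not part of geocif or geoprepare.

```python
import ast
from configparser import ConfigParser

# Stand-ins for two config files; the contents are illustrative only.
COUNTRIES_TXT = """
[DEFAULT]
crops = ['maize']
seasons = [1]

[kenya]
seasons = [1, 2]
"""

GEOCIF_TXT = """
[DEFAULT]
countries = ["kenya"]
crops = ['maize', 'sorghum']
"""


def load_in_order(*sources):
    """Read sources left to right; a later value replaces an earlier one."""
    parser = ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


cfg = load_in_order(COUNTRIES_TXT, GEOCIF_TXT)

# [DEFAULT] crops comes from the last source that defines it.
default_crops = ast.literal_eval(cfg["DEFAULT"]["crops"])
# [kenya] keeps its own seasons and inherits crops from [DEFAULT].
kenya_seasons = ast.literal_eval(cfg["kenya"]["seasons"])
kenya_crops = ast.literal_eval(cfg["kenya"]["crops"])
```

Swapping the two arguments to load_in_order would leave crops at ['maize'], which is why the tool-specific file goes last.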

geoprepare (download, extract, merge)

from geoprepare import geodownload
geodownload.run([f"{config_dir}/geobase.txt"])

from geoprepare import geoextract
geoextract.run(cfg_geoprepare)

from geoprepare import geomerge
geomerge.run(cfg_geoprepare)

geocif (indices, ML, agmet, analysis, experiments)

from geocif import indices_runner
indices_runner.run(cfg_geocif)

from geocif import geocif_runner
geocif_runner.run(cfg_geocif)

from geocif.agmet import geoagmet
geoagmet.run(cfg_geocif)

from geocif import analysis
analysis.run(cfg_geocif)

from geocif import experiments
experiments.run(cfg_geocif, n_trials=30)

from geocif import yield_outlook
yield_outlook.run(cfg_geocif)  # uses config defaults (10 years, mean)
# yield_outlook.run(cfg_geocif, current_year=2026, n_years=10, aggregation="median")

Cropmask optimizers

Two optimizers consume geoprepare extraction outputs and tune the cropland mask used downstream. Run each one after the corresponding geoprepare extractor has written its outputs.

# Uniform threshold T over the region (single absolute or rank-based knob).
# Reads geoprepare.extract_sweep output:
#   ${PATHS:dir_output}/threshold_sweep/{country}/{crop}/{country}_{crop}_s{season}_sweep.csv
from geocif import threshold_optimizer
threshold_optimizer.run(cfg_geocif)

# Per-cell binary mask — independent in/out decision per cropland cell.
# Reads geoprepare.extract_cells output:
#   ${PATHS:dir_output}/cell_optimizer/{country}/{crop}/{country}_{crop}_s{season}_cells.parquet
# Writes a production-mask parquet at the same location that geoextract picks up.
from geocif import cell_optimizer
cell_optimizer.run(cfg_geocif)

Configure under [THRESHOLD_OPTIMIZER] and [CELL_OPTIMIZER] in geocif.txt. Outputs land under ${PATHS:dir_output}/ml/analysis/{date}/{threshold_sweep_summary|cell_optimizer}/.

Using the optimized cell mask in production extraction

geoprepare 0.6.273+ can apply the per-cell mask produced by cell_optimizer during EO extraction. Opt in per country (or in [DEFAULT]) in geoextract.txt:

[DEFAULT]
use_optimized_mask = True

When the flag is on, geoprepare.extract_EO reads ${PATHS:dir_output}/cell_optimizer/{country}/{crop}/{country}_{crop}_s{season}_optimized_mask.parquet for every configured (country, crop, season) and AND-s it with the existing floor/ceiling AFI mask. Cells the optimizer marked included=False are dropped from the per-region aggregate even if they pass the floor/ceiling rule. Multi-season countries get the union across seasons — a cell is kept if any season's optimizer selected it.
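
The mask logic described above can be sketched with plain Python lists. This is an illustration only: the function names are hypothetical, and the floor/ceiling rule is assumed here to be an inclusive range test on a per-cell crop percentage, which may differ from what geoprepare actually does.

```python
def floor_ceiling_mask(crop_pct, floor=20, ceil=90):
    """Assumed base rule: keep cells whose crop percentage is in [floor, ceil]."""
    return [floor <= value <= ceil for value in crop_pct]


def union_across_seasons(season_masks):
    """A cell is kept if any season's optimizer selected it."""
    return [any(flags) for flags in zip(*season_masks)]


def apply_overlay(base_mask, optimized_mask):
    """AND the optimizer's selection with the base floor/ceiling mask."""
    return [b and o for b, o in zip(base_mask, optimized_mask)]


# Five cells with made-up crop percentages and made-up optimizer decisions.
crop_pct = [10, 25, 50, 95, 60]
season_1 = [True, True, False, True, False]
season_2 = [False, False, False, True, True]

base = floor_ceiling_mask(crop_pct)
optimized = union_across_seasons([season_1, season_2])
final = apply_overlay(base, optimized)
```

The third cell passes the floor/ceiling rule but is dropped because no season selected it; the fourth is selected by both seasons but still fails the base rule.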

Pipeline order with the optimized mask:

from geoprepare import extract_cells, geoextract
from geocif import cell_optimizer

extract_cells.run(cfg_geoprepare)   # writes per-cell parquets
cell_optimizer.run(cfg_geocif)      # writes optimized_mask.parquet
geoextract.run(cfg_geoprepare)      # reads optimized_mask.parquet

extract_EO aborts at startup and lists the missing parquet files if use_optimized_mask = True for any country whose mask has not been produced yet. It deliberately does not fall back to the floor/ceiling rule, because a silent fallback when the operator asked for the overlay would be confusing.
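
A startup check of this kind could look roughly like the sketch below. The path template is the one quoted above; mask_path and missing_masks are hypothetical helpers written for illustration, and the real check inside geoprepare may be structured differently.

```python
from itertools import product
from pathlib import Path


def mask_path(dir_output, country, crop, season):
    """Build the pooled optimized-mask path from the documented template."""
    name = f"{country}_{crop}_s{season}_optimized_mask.parquet"
    return Path(dir_output) / "cell_optimizer" / country / crop / name


def missing_masks(dir_output, countries, crops, seasons):
    """Return every expected mask file that does not exist on disk."""
    missing = []
    for country, crop, season in product(countries, crops, seasons):
        path = mask_path(dir_output, country, crop, season)
        if not path.exists():
            missing.append(path)
    return missing


# With a directory that does not exist, every expected mask is reported.
example_missing = missing_masks("/nonexistent-geocif-output", ["kenya"], ["maize"], [1, 2])
```

Running such a check before launching a long extraction makes it clear which cell_optimizer runs are still outstanding.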

The mask is currently wired into process_aef, process_fldas, process_chirps_mfc, and process_soilgrids (the static and monthly-forecast EO paths). The daily-EO path through geom_extract (NDVI, daily CHIRPS, ESI, etc.) is not yet wired; support is expected in a future geoprepare change.

Annual (leave-one-out) masks

Set annual_mask = True under [CELL_OPTIMIZER] in geocif.txt to produce one mask per historical year in addition to the single pooled mask. For each year Y, the GA trains on every other year, so the cell selection never sees year Y's yield, and that mask is written to a _y{year}_optimized_mask.parquet file alongside the pooled one. geoprepare.extract_EO prefers the year-specific file when extracting year Y (FLDAS / CHIRPS-MFC, which are per-year datasets) and falls back to the pooled file for forecast / current years. AEF and SoilGrids (static) always use the pooled mask.

This closes the overfitting failure mode where the pooled mask was selected with year Y's yield as part of the training data — visible in pre-0.4.747 runs as regions whose Pearson r between yield and NDVI flipped sign after selection (the GA found anti-correlated cells because R² is sign-blind).

Cost. Roughly (n_years + 1) × the pooled-only default per region. For a country with 25 yield years that is about 26 GA runs per region instead of 1; expect runtime to scale accordingly. Opt in only when the data span justifies it.
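
The leave-one-out scheme and its cost can be sketched as follows. Both functions are hypothetical illustrations of the description above, not geocif code.

```python
def leave_one_out_splits(years):
    """Yield (held_out_year, training_years) pairs, one per historical year."""
    for held_out in years:
        training = [year for year in years if year != held_out]
        yield held_out, training


def ga_runs_per_region(n_years, annual_mask=False):
    """One pooled run, plus one run per held-out year when annual masks are on."""
    return n_years + 1 if annual_mask else 1


years = list(range(2000, 2025))          # 25 illustrative yield years
splits = dict(leave_one_out_splits(years))
runs = ga_runs_per_region(len(years), annual_mask=True)
```

Each entry in splits excludes its own key from the training years, which is what keeps year Y's yield out of the mask used to extract year Y.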

Off by default. Existing configs without annual_mask continue to write the single pooled parquet.

ML models

geocif supports the following model types (configured via models in [DEFAULT]):

Model | Key | Type
CatBoost | catboost | Gradient boosting
XGBoost | xgboost | Gradient boosting
TabPFN | tabpfn | Prior-fitted network
TabICL | tabicl | In-context learning
NGBoost | ngboost | Natural gradient boosting
YDF | ydf | Yggdrasil decision forests
Oblique RF | oblique | Oblique random forest
Cubist | cubist | Rule-based regression
MERF | merf | Mixed effects random forest
Linear | linear | LassoCV / LogisticRegressionCV
GAM | gam | Generalized additive model
GeoSpaNN | geospaNN | Geospatial neural network
Median | median | Median baseline
Analog | analog | Analogous year baseline

Feature selection methods

Configured via feature_selection in [ML]:

none, SelectKBest, BorutaPy, Leshy, gOMP, RFECV, RFE, lasso, mrmr, SHAP, stabl, PowerShap, BorutaShap, Genetic, feature_engine, multi

Cluster analysis

Optional analysis that clusters regions by their CID profiles and identifies which CIDs discriminate each cluster. Works with or without yield data — falls back to a proxy CID (e.g., AUC_NDVI) when yield is unavailable. Enabled via [ML]:

run_cluster_analysis = True
cluster_analysis_proxy = AUC_NDVI   ; proxy CID when yield is unavailable
cluster_analysis_max_k = 8          ; maximum clusters for silhouette selection
cluster_analysis_top_n = 20         ; top N CIDs in discrimination heatmap
cluster_analysis_variance = 0.85    ; cumulative PCA variance to retain

Pipeline: PCA dimensionality reduction → Ward's hierarchical clustering (silhouette-selected k) → Kruskal-Wallis + Cohen's d for CID discrimination → mutual information for CID-target association. Outputs: cluster map (choropleth), dendrogram, PCA biplot, discrimination heatmap with significance stars, target boxplot, and per-CID maps for top discriminating indices.
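
As an illustration of the discrimination step, Cohen's d between one cluster and another for a single CID can be computed with the standard library. This is a sketch of the textbook pooled-standard-deviation form; whether geocif uses this exact variant is an assumption, and the values are made up.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up values of one CID for regions in two clusters.
cluster_1 = [2.0, 4.0, 6.0]
cluster_2 = [1.0, 3.0, 5.0]
effect = cohens_d(cluster_1, cluster_2)
```

A larger absolute d means the CID separates the two clusters more strongly relative to its within-cluster spread.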

Spatial neighbor features

Optional GraphSAGE-style preprocessing that computes yield-correlation-weighted averages of neighboring regions' features. Enabled via [ML]:

use_spatial_neighbors = True
spatial_neighbor_method = knn   ; knn or full
spatial_neighbor_k = 5          ; number of nearest neighbors

For each admin region, the neighbor graph is built from training data using haversine distances and Pearson yield correlations as edge weights. Neighbor-aggregated features are added as nbr_* columns and flow through standard feature selection.
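
A minimal sketch of the knn variant, using only the standard library. The helper names, the centroid coordinates, and the choice to clip negative correlations to zero are all assumptions made for illustration; geocif's own handling may differ.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def k_nearest(target, centroids, k):
    """Return the k regions closest to target by haversine distance."""
    lat, lon = centroids[target]
    ranked = sorted(
        (haversine_km(lat, lon, *centroids[region]), region)
        for region in centroids
        if region != target
    )
    return [region for _, region in ranked[:k]]


def neighbor_feature(target, centroids, feature, yield_corr, k=5):
    """Correlation-weighted mean of a feature over the k nearest neighbors."""
    neighbors = k_nearest(target, centroids, k)
    # Assumption: negative yield correlations contribute no weight.
    weights = [max(yield_corr[(target, region)], 0.0) for region in neighbors]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * feature[r] for w, r in zip(weights, neighbors)) / total


centroids = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (0.0, 2.0), "D": (0.0, 10.0)}
feature = {"B": 10.0, "C": 20.0, "D": 99.0}
yield_corr = {("A", "B"): 0.75, ("A", "C"): 0.25, ("A", "D"): 0.9}
nbr_value = neighbor_feature("A", centroids, feature, yield_corr, k=2)
```

Region D has the highest yield correlation with A but is not among the two nearest neighbors, so it does not contribute when k=2.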

Experiments

The experiments runner (geocif.experiments) provides 6 experiments for model selection, feature importance, and hyperparameter tuning:

# | Config name | Internal name | What it does
0 | model_comparison | models | Runs each model in comparison_models head-to-head. Produces Bradley-Terry ranking, scatter plots, MAPE bars. Identifies best model per country (required by experiments 1 & 2).
1 | cid_ablation | cids | Runs the best model once per CID Type in isolation (Cold alone, FLDAS alone, etc.). Shows which climate driver category contributes most. Produces MAPE-by-CID bar chart, region×CID heatmap, year×CID chart, CID rank over time.
2 | region_filter | region_filter | Drops low-production regions and re-runs the best model to test if excluding noisy regions improves national accuracy.
3 | optuna | optuna | Bayesian (TPE) search over ML hyperparameters (learning rate, depth, regularization, etc.). Produces convergence, parameter importance, and parallel coordinate plots.
4 | optuna_cid_types | optuna_cid_types | Bayesian search for the best combination of CID Type categories (e.g. Rain+VI+ESI may beat using all 8 types).
5 | optuna_cid_indices | optuna_cid_indices | Bayesian search for the best subset of individual CID indices (e.g. PRCPTOT + AUC_NDVI + TG90p). Capped at max_cid_indices per trial.

Dependencies: Experiments 1 and 2 require experiment 0 first. Experiments 3–5 are independent.
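
A quick pre-flight check for these dependencies might look like the sketch below. The PREREQUISITES table encodes only the rule stated above; missing_prerequisites is a hypothetical helper, and it assumes the prerequisite must appear earlier in the same run_experiments list, which may be stricter than what geocif requires if results from an earlier run are reused.

```python
# Config names of experiments that need the best model from model_comparison.
PREREQUISITES = {
    "cid_ablation": ("model_comparison",),
    "region_filter": ("model_comparison",),
}


def missing_prerequisites(run_experiments):
    """Return (experiment, prerequisite) pairs not satisfied by list order."""
    problems = []
    seen = set()
    for name in run_experiments:
        for required in PREREQUISITES.get(name, ()):
            if required not in seen:
                problems.append((name, required))
        seen.add(name)
    return problems


ok = missing_prerequisites(["model_comparison", "cid_ablation"])
bad = missing_prerequisites(["cid_ablation", "optuna"])
```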

Configure in geocif.txt:

[experiments]
run_experiments = ["model_comparison", "cid_ablation"]
comparison_models = ["catboost", "tabpfn", "tabicl"]
n_trials = 30
n_trials_cid_types = 30
n_trials_cid_indices = 60
max_cid_indices = 25

Run:

from geocif import experiments
experiments.run(cfg_geocif)

Experiments output

The experiments runner writes to a dedicated DB and analysis folder under dir_output:

{dir_output}/
└── ml/
    ├── db/
    │   └── experiments_{MMMM_DD_YYYY_HH}H.db
    │
    └── analysis/
        └── {MMMM_DD_YYYY}/
            ├── experiments/                            # Experiment 0 (model comparison)
            │   ├── experiment_metrics.csv
            │   ├── heatmap_models.png
            │   ├── boxplot_models.png
            │   ├── regional_mape_models_{country}.png
            │   ├── error_distribution_models.png
            │   └── metric_comparison.png
            │
            └── optimization/                           # Optuna hyperparameter search
                ├── optuna_trials.csv
                ├── best_params.csv
                ├── convergence.png
                ├── optimization_history.png
                ├── param_importances.png
                └── parallel_coordinate.png

Outlook output

The yield outlook runner produces a diverging choropleth map showing current forecast yield as a percentage of the historical mean/median prediction per region, plus a combined CSV.

{dir_output}/
└── ml/
    └── analysis/
        └── {MMMM_DD_YYYY}/
            └── outlook/
                ├── yield_outlook_{country}_{crop}_{model}_{stage}_{year}.png
                └── yield_outlook_{year}.csv
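
The quantity shown on the outlook map can be sketched as a one-line ratio. percent_of_historical is a hypothetical helper written for illustration; the yields are made up, and the aggregation argument mirrors the mean/median choice described above.

```python
import statistics


def percent_of_historical(current, historical, aggregation="mean"):
    """Current forecast yield as a percentage of the historical baseline."""
    if not historical:
        raise ValueError("historical predictions must not be empty")
    if aggregation == "mean":
        baseline = statistics.mean(historical)
    elif aggregation == "median":
        baseline = statistics.median(historical)
    else:
        raise ValueError(f"unknown aggregation: {aggregation!r}")
    return 100.0 * current / baseline


# Made-up yields in t/ha for one region.
pct_mean = percent_of_historical(4.5, [2.0, 3.0, 4.0])
pct_median = percent_of_historical(4.5, [1.0, 3.0, 8.0], aggregation="median")
```

Values above 100 indicate a forecast above the historical baseline, and values below 100 indicate a forecast below it, which is what the diverging color scale encodes.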

Config file documentation

geobase.txt

Shared paths and dataset settings. All directory paths are derived from dir_base.

[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO

dir_inputs = ${dir_base}/inputs
dir_logs = ${dir_base}/logs
dir_download = ${dir_inputs}/download
dir_intermed = ${dir_inputs}/intermed
dir_metadata = ${dir_inputs}/metadata
dir_condition = ${dir_inputs}/crop_condition
dir_crop_inputs = ${dir_condition}/crop_t20

dir_boundary_files = ${dir_metadata}/boundary_files
dir_crop_calendars = ${dir_metadata}/crop_calendars
dir_crop_masks = ${dir_metadata}/crop_masks
dir_images = ${dir_metadata}/images
dir_production_statistics = ${dir_metadata}/production_statistics

dir_output = ${dir_base}/outputs

[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
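
The ${...} references resolve the way Python's configparser does with ExtendedInterpolation. The sketch below assumes that parser; the base path and the [EXAMPLE] section are made up to show a same-section reference and a cross-section ${PATHS:...} reference.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Illustrative excerpt; /data/GEO and [EXAMPLE] are placeholders.
GEOBASE_TXT = """
[PATHS]
dir_base = /data/GEO
dir_inputs = ${dir_base}/inputs
dir_download = ${dir_inputs}/download
dir_output = ${dir_base}/outputs

[EXAMPLE]
sweep_dir = ${PATHS:dir_output}/threshold_sweep
"""

parser = ConfigParser(interpolation=ExtendedInterpolation())
parser.read_string(GEOBASE_TXT)

# Values are interpolated when they are read, not when the file is parsed.
paths = dict(parser["PATHS"])
sweep_dir = parser["EXAMPLE"]["sweep_dir"]
```

Changing dir_base alone is therefore enough to relocate every derived directory.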

countries.txt

Single source of truth for per-country config. Shared by both geoprepare and geocif.

[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
seasons = [1]
crops = ['maize']
category = AMIS
use_cropland_mask = False
calendar_file = crop_calendar.csv

; AMIS countries (inherit from DEFAULT, override crops if needed)
[argentina]
crops = ['soybean', 'winter_wheat', 'maize']

; EWCM countries (full per-country config)
[kenya]
category = EWCM
admin_level = admin_1
seasons = [1, 2]
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']

[malawi]
category = EWCM
admin_level = admin_2
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']

crops.txt

Crop mask filenames and calendar category definitions.

; Crop masks
[maize]
mask = Percent_Maize.tif

[winter_wheat]
mask = Percent_Winter_Wheat.tif

[sorghum]
mask = cropland_v9.tif

; Calendar categories
[EWCM]
use_cropland_mask = True
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize', 'sorghum', 'millet', 'rice', 'winter_wheat', 'teff']
eo_model = ['aef', 'nsidc_surface', 'nsidc_rootzone', 'ndvi', 'cpc_tmax', 'cpc_tmin', 'chirps', 'chirps_gefs', 'esi_4wk']

[AMIS]
calendar_file = AMISCM_2026-01-05.xlsx

geoextract.txt

Extraction-only settings for geoprepare. Loaded last so its [DEFAULT] overrides shared defaults.

[DEFAULT]
method = JRC
redo = False
threshold = True
floor = 20
ceil = 90
countries = ["malawi"]
forecast_seasons = [2022]

[PROJECT]
parallel_extract = True
parallel_merge = False

geocif.txt

Indices, ML, and agmet settings for geocif. Country overrides go here when geocif needs different values than countries.txt (e.g., a subset of crops).

[AGMET]
eo_plot = ['ndvi', 'chirts_era5_tmax', 'chirts_era5_tmin', 'chirps', 'esi_4wk', 'nsidc_surface', 'nsidc_rootzone']
logo_harvest = harvest.png
logo_geoglam = geoglam.png

; Country overrides (only where geocif differs from countries.txt)
[ethiopia]
crops = ['winter_wheat']

[bangladesh]
crops = ['rice']
admin_level = admin_2
boundary_file = bangladesh.shp

; ML model definitions
[catboost]
ML_model = True

[analog]
ML_model = False

[ML]
model_type = REGRESSION
target = Yield (tn per ha)
feature_selection = gOMP
cluster_strategy = single
check_yield_trend = False
use_spatial_neighbors = True
spatial_neighbor_method = knn
spatial_neighbor_k = 5
lag_yield_as_feature = True
lag_years = 3
median_yield_as_feature = False
median_years = 5
include_lat_lon_as_feature = False
panel_model = True
cat_features = ["Harvest Year", "Region_ID", "Region"]
outlook_n_years = 10        ; Number of historical years for yield outlook comparison
outlook_aggregation = mean  ; mean or median
run_time_steps = latest         ; latest, current, all, or N (every Nth time period)
run_cluster_analysis = False
cluster_analysis_proxy = AUC_NDVI
cluster_analysis_max_k = 8
cluster_analysis_top_n = 20
cluster_analysis_variance = 0.85

[LOGGING]
log_level = INFO

[DEFAULT]
data_source = harvest
method = monthly_r
project_name = geocif
countries = ["kenya"]
crops = ['maize']
admin_level = admin_1
models = ['catboost']
seasons = [1]
threshold = True
floor = 20

FLDAS forecast overlay

When FLDAS columns are present in the merged data (e.g. fldas_tair_tavg_lead0 through fldas_tair_tavg_lead5), agmet plots automatically overlay forecast markers on the matching panels:

FLDAS variable | Target panel
fldas_tair_tavg | Temperature
fldas_totalprecip_tavg | Daily precipitation
fldas_soilmoist_tavg | Soil moisture (surface)

Each lead time (0–5) appears as a diamond marker with decreasing opacity (lead 0 is the most opaque). Markers beyond the harvest date are suppressed. No config changes are needed; detection is automatic.
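
Automatic detection of this kind can be sketched by matching column names against the documented pattern. The regular expression, the helper names, and the linear opacity ramp (including the 0.3 minimum) are assumptions for illustration; the actual opacity values used by geocif are not stated here.

```python
import re

# FLDAS variable -> target panel, as listed in the table above.
PANELS = {
    "fldas_tair_tavg": "Temperature",
    "fldas_totalprecip_tavg": "Daily precipitation",
    "fldas_soilmoist_tavg": "Soil moisture (surface)",
}

LEAD_COLUMN = re.compile(r"^(fldas_\w+?)_lead(\d+)$")


def detect_fldas_leads(columns):
    """Map each known FLDAS variable to the sorted lead times found."""
    found = {}
    for column in columns:
        match = LEAD_COLUMN.match(column)
        if match and match.group(1) in PANELS:
            found.setdefault(match.group(1), []).append(int(match.group(2)))
    return {variable: sorted(leads) for variable, leads in found.items()}


def lead_alpha(lead, n_leads=6, min_alpha=0.3):
    """Assumed linear ramp: lead 0 is fully opaque, the last lead is faintest."""
    return 1.0 - (1.0 - min_alpha) * lead / (n_leads - 1)


columns = ["ndvi", "fldas_tair_tavg_lead1", "fldas_tair_tavg_lead0", "fldas_other_lead0"]
detected = detect_fldas_leads(columns)
alphas = [lead_alpha(lead) for lead in range(6)]
```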

Release

To publish a new version to PyPI:

  1. Bump __version__ in geocif/__init__.py and version in pyproject.toml.
  2. Build and upload:
    uv build
    uvx twine upload dist/geocif-<version>*
  3. Commit:
    git add geocif/__init__.py pyproject.toml
    git commit -m "Bump to <version>"

Credits

This project was supported by NASA Applied Sciences Grant No. 80NSSC17K0625 through the NASA Harvest Consortium, and the NASA Acres Consortium under NASA Grant #80NSSC23M0034.
