Earth Engine + Apache Beam

Project description

GeeBeam

Google Earth Engine + Apache Beam for building geospatial training datasets

Purpose:

GeeBeam is a lightweight library for building and executing Apache Beam pipelines that download data "chips" from Google Earth Engine and write them to TensorFlow records for model training.

The user defines the Earth Engine images they want to download chips from using the Python earthengine-api. geebeam then serializes the graph definition of each image so that it can be passed to the Beam workers.
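To illustrate what serializing the graph definition means, here is a minimal sketch. It is not geebeam's internal code: the function names serialize_image and restore_image are hypothetical, and the sketch only assumes the earthengine-api calls ee.Image.serialize() and ee.deserializer.fromJSON(). The point is that what travels to a worker is a small JSON description of the computation, not pixel data.

```python
def serialize_image(image) -> str:
    """Return the JSON graph definition of an ee.Image (no pixels are fetched)."""
    return image.serialize()


def restore_image(graph_json: str, project: str):
    """Rebuild an ee.Image from its JSON graph, e.g. inside a worker process."""
    # Imported here (not at module level) so that each worker process
    # creates and initializes its own Earth Engine client.
    import ee

    ee.Initialize(project=project)
    return ee.Image(ee.deserializer.fromJSON(graph_json))
```

With an initialized client, serialize_image(ee.Image(...)) yields a plain string that can be pickled and shipped by Beam, and restore_image(...) turns it back into an image object on the other side.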

The pipelines can be run locally or on Google Cloud Dataflow. (Note: local jobs are currently limited to short-running tasks because of a gRPC "Deadline Exceeded" error.)

Install:

pip install geebeam

Examples:

Running locally:

Here we'll create a burned area mask for 2024 using the MCD64A1 product. For example, this could be the target variable for a burn risk model.

import ee
import geebeam
import google.auth

# Get default project id from environment (or specify PROJECT_ID manually)
PROJECT_ID = google.auth.default()[1]

# Initialize ee client, replace with your GCP project ID
ee.Initialize(project=PROJECT_ID)

# Build image for download
burned_2024 = (ee.ImageCollection('MODIS/061/MCD64A1')
            .select('BurnDate')
            .filter(ee.Filter.calendarRange(2024, 2024, 'year'))
            .min()
            .gt(0)
            .rename(['Burn'])
            )

# Building and triggering the pipeline is done with a single command:
geebeam.run_pipeline(
    image_list = [burned_2024],
    project=PROJECT_ID,
    patch_size=128, # Pixel dimensions in each direction
    scale=500, # Final export resolution in meters
    n_sample=10, # Number of tiles to sample
    validation_ratio=0.2, # Fraction to select as validation data
    output_path='./test_tf_data/',
    sampling_region=ee.Geometry.Rectangle(-63.0, -9.0, -56.0, -4.0)
)
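As a quick sanity check on the parameters above, the short sketch below works out how much ground one chip covers and how many samples the validation ratio implies. The helper names are hypothetical and not part of geebeam. The split calculation assumes the ratio is applied to the total sample count; the source does not describe how geebeam assigns samples, so the counts in an actual run may differ.

```python
def chip_footprint_km(patch_size: int, scale_m: float) -> float:
    """Side length of one square chip on the ground, in kilometres."""
    return patch_size * scale_m / 1000.0


def expected_split(n_sample: int, validation_ratio: float) -> tuple[int, int]:
    """Expected (training, validation) counts for a given validation ratio."""
    n_val = round(n_sample * validation_ratio)
    return n_sample - n_val, n_val


side_km = chip_footprint_km(patch_size=128, scale_m=500)
n_train, n_val = expected_split(n_sample=10, validation_ratio=0.2)
print(f"Each chip covers about {side_km:.0f} km x {side_km:.0f} km")
print(f"Expected split: {n_train} training, {n_val} validation")
```

With patch_size=128 and scale=500, each chip spans 64 km per side, which is worth comparing against the size of the sampling region before increasing n_sample.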

Now let's add another dataset: MapBiomas Amazonia forest fraction

# MB Land-use/land-cover forest fraction
# Note that LULC codes less than 10 are forest in MapBiomas Amazon Collection 6
mb_amz_lulc = (
    ee.Image('projects/mapbiomas-public/assets/amazon/lulc/collection6/mapbiomas_collection60_integration_v1')
    .lt(10)
    .reduceResolution('mean', maxPixels=500)
)

# Exporting both together is as simple as this:
geebeam.run_pipeline(
    image_list = [burned_2024, mb_amz_lulc],
    project=PROJECT_ID,
    patch_size=128,
    scale=500,
    n_sample=10,
    validation_ratio=0.2,
    output_path='./test_tf_data/',
    sampling_region=ee.Geometry.Rectangle(-63.0, -9.0, -56.0, -4.0),
    num_workers=1
)

Scaling up with DataFlow:

The export process can be scaled to many workers via Google Cloud DataFlow. First write a script containing your geebeam.run_pipeline() command. Then execute using the Beam DataFlow runner:

python examples/geebeam_run.py \
    --region=us-east1 \
    --worker_zone=us-east1-b \
    --runner=DataflowRunner \
    --max_num_workers=8 \
    --experiments=use_runner_v2 \
    --temp_location=gs://[your-bucket]/[path_to_temp_dir] \
    --machine_type=n2-highmem-2 \
    --sdk_container_image=us-docker.pkg.dev/mmacedo-reservoirid/geebeam-public/geebeam:latest

Note in this case your output_path in run_pipeline() should be a Google Cloud Storage path. If you're running an older version of geebeam, replace "latest" in the sdk_container_image URI with the version number (e.g. v0.1.2). You can also build your own Docker image to run on. More info in the DataFlow docs.

See the Apache Beam and Google Cloud DataFlow docs for full documentation, e.g. of the pipeline command-line options.

Common DataFlow gotchas

Before running, you must enable the DataFlow API on Google Cloud Console.
If you get an error stating "Subnetwork ''... does not have Private Google Access...", you may need to activate it for your subnetwork (replace us-east1 with your region):

gcloud compute networks subnets update default \
    --region=us-east1 \
    --enable-private-ip-google-access

You can test your pipeline script (e.g. geebeam_run.py) and Beam options using the DirectRunner before submitting to DataFlow:

python examples/geebeam_run.py \
    --runner=DirectRunner

See DataFlow documentation on specifying network and subnetwork for DataFlow jobs.

For more common errors, see the Google Cloud DataFlow troubleshooting guide.

Alternatives:

GeeFlow: Google DeepMind's GeeFlow fulfills a similar purpose. It is more flexible, allowing for more user control of data processing, reprojection, and writing, but it is slower and no longer actively maintained. With the goal of meeting most users' needs, GeeBeam is designed to be easier and quicker to use, but it allows for more limited data transformations.
Export training data to Google Cloud Storage then download chips from there: This works, but if you need to get data from many different datasets it's slow to export all that data to Cloud Storage and can be expensive to store it there if you don't delete it quickly. This also uses a lot of Earth Engine compute hours, which are now subject to stricter monthly limits.
Xee: Xee allows for accessing Earth Engine objects as xarray.Datasets. You could use this to define an xarray.Dataset and download "chips" from it, but geebeam interfaces with Beam to automatically parallelize this task and export to TensorFlow records.

Project details

Release history

0.3.2: Apr 9, 2026
0.3.1: Apr 9, 2026
0.2.2: Mar 27, 2026
0.2.1: Mar 27, 2026 (this version)
0.2.0: Mar 26, 2026
0.1.2: Mar 20, 2026
0.1.0: Mar 17, 2026

Download files


Source Distribution

geebeam-0.2.1.tar.gz (14.5 kB view details)

Uploaded Mar 27, 2026 Source

Built Distribution


geebeam-0.2.1-py3-none-any.whl (12.6 kB view details)

Uploaded Mar 27, 2026 Python 3

File details

Details for the file geebeam-0.2.1.tar.gz.

File metadata

Download URL: geebeam-0.2.1.tar.gz
Upload date: Mar 27, 2026
Size: 14.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for geebeam-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`d6d7dbec588d2d9b89519f85eb5a91f902e756baab7a5a9c0c6626b6f5d1bcbc`
MD5	`9819310bec707ebbfc87004f096c75b4`
BLAKE2b-256	`1a61b8fdf9e2c49a87693a40987c6a7d4522ad287fb55baa4bce8741626ef66d`
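For readers who want to check a downloaded archive against the SHA256 digest listed above, a minimal sketch using only the Python standard library follows. The helper names are hypothetical, and the file path passed in is whatever location the archive was saved to.

```python
import hashlib

# SHA256 digest published above for geebeam-0.2.1.tar.gz
EXPECTED_SHA256 = "d6d7dbec588d2d9b89519f85eb5a91f902e756baab7a5a9c0c6626b6f5d1bcbc"


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of(path) == expected.lower()
```

Calling matches_published_hash("geebeam-0.2.1.tar.gz") on the downloaded source distribution should return True if the file is intact.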


Provenance

The following attestation bundles were made for geebeam-0.2.1.tar.gz:

Publisher: release.yml on kysolvik/geebeam

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: geebeam-0.2.1.tar.gz
- Subject digest: d6d7dbec588d2d9b89519f85eb5a91f902e756baab7a5a9c0c6626b6f5d1bcbc
- Sigstore transparency entry: 1188991049
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: kysolvik/geebeam@86fa6a853247ace728cb7e944fbdba3b48cf36a4
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/kysolvik
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@86fa6a853247ace728cb7e944fbdba3b48cf36a4
- Trigger Event: release

File details

Details for the file geebeam-0.2.1-py3-none-any.whl.

File metadata

Download URL: geebeam-0.2.1-py3-none-any.whl
Upload date: Mar 27, 2026
Size: 12.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for geebeam-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7724488d5a8d208992d788ec4e95cb0cd7c0187787a80414f99951a19696a629`
MD5	`2d449b52c98f6782e4b15667093bdfb1`
BLAKE2b-256	`4dda0fc8184e599da073360dd2c72c348ff9a866a87dc3503c1e786af20604bf`


Provenance

The following attestation bundles were made for geebeam-0.2.1-py3-none-any.whl:

Publisher: release.yml on kysolvik/geebeam

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: geebeam-0.2.1-py3-none-any.whl
- Subject digest: 7724488d5a8d208992d788ec4e95cb0cd7c0187787a80414f99951a19696a629
- Sigstore transparency entry: 1188991054
- Sigstore integration time: Mar 27, 2026
Source repository:
- Permalink: kysolvik/geebeam@86fa6a853247ace728cb7e944fbdba3b48cf36a4
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/kysolvik
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@86fa6a853247ace728cb7e944fbdba3b48cf36a4
- Trigger Event: release
