
Tool to download Satelligence raster data from Google Cloud Storage.

Project description

Get raster data from Google Cloud Storage.

Prerequisites

  • the GDAL library
  • the Google Cloud SDK (gcloud), activated
  • gcsfs, tqdm, pebble, numpy, and a few other packages; pip install -r requirements.txt will install everything for you.

NOTE ON GDAL: install the GDAL Python bindings with pip install gdal. If the installation fails, the most likely cause is a mismatch between the version of the GDAL binaries you have and the version of the Python package. Check the binary version by running gdalinfo --version, then pin that version in the pip install: pip install gdal==<version>
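As a sketch, the version-matching step can be automated by parsing the output of gdalinfo --version; the sample string below is an assumed example, substitute your own output:

```python
def gdal_pip_pin(version_line: str) -> str:
    """Turn a `gdalinfo --version` line into a pip requirement pin."""
    # Example input: "GDAL 3.6.2, released 2023/01/02"
    version = version_line.split()[1].rstrip(",")
    return f"gdal=={version}"


# In practice, feed in the real output, e.g. via
# subprocess.run(["gdalinfo", "--version"], capture_output=True, text=True).stdout
print(gdal_pip_pin("GDAL 3.6.2, released 2023/01/02"))  # gdal==3.6.2
```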

TL;DR: how to run

You can use this tool to download any gdal-readable file (vrt, tif, etc.) from Google Cloud Storage. You can also use it to download zarrs (the ones we use in dprof), and there is a shortcut to easily download dprof results.

In general, usage is python get_result.py <url>.

<url> should start with the bucket name, without the "gs://" prefix.
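If the url you copied does carry the prefix, a tiny normalization step is enough; this is a sketch, not part of the tool itself:

```python
def strip_gs_prefix(url: str) -> str:
    # The tool expects bucket-first paths; drop a leading "gs://" if present.
    return url[len("gs://"):] if url.startswith("gs://") else url


print(strip_gs_prefix("gs://s11-bucket/some/file.vrt"))  # s11-bucket/some/file.vrt
print(strip_gs_prefix("s11-bucket/some/file.vrt"))       # s11-bucket/some/file.vrt
```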

Examples:

.vrt (or other gdal-readable formats)

Get any kind of gdal-readable file, e.g. the 2015 V6.1 FBL from the vrt:

python get_result.py s11-base-data/landcover/forest_baselines/FBL_V6.1/2015/FBL_V6.1_2015.vrt 

.zarr

Get the deforestation zarr:

python get_result.py s11-production-dprof-cache/Deforestation/deforestation/deforestation.zarr

Note that you don't need to append "/result" to the zarr url if you want to read from the /result group. To read from a different group, append it to the zarr url, e.g. "s11-bucket/some.zarr/some_group".
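The group resolution described above could look roughly like this; a hypothetical sketch, not the actual implementation:

```python
def zarr_read_path(url: str, default_group: str = "result") -> str:
    """Return the group path to read: default to /result when the url ends at the store."""
    url = url.rstrip("/")
    if url.endswith(".zarr"):
        return f"{url}/{default_group}"
    return url  # an explicit group was appended


print(zarr_read_path("s11-bucket/some.zarr"))             # s11-bucket/some.zarr/result
print(zarr_read_path("s11-bucket/some.zarr/some_group"))  # s11-bucket/some.zarr/some_group
```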

dprof results

The shortcut to download results (it will download every vrt for that result) is python get_result.py <bucket>/<resultnumber>. For example:

python get_result.py s11-production-dprof-result/13883

For more options, see below.

Usage

get_result.py [-h] [--threads THREADS] [--bounds ULX ULY LRX LRY] [--resolution RESOLUTION] [--dtype DTYPE] [--resampling RESAMPLING] [--nodata-per-band] [--list] [--outname OUTNAME] [--align-to-blocks | --no-align-to-blocks] [--chunk-timeout TIMEOUT] [--blocks-per-job BLOCKS_PER_JOB] [--debug] source_url

positional arguments:
  source_url: url to the dataset, see above for examples.

options:
  -h, --help: show this help message and exit.
  --threads THREADS: number of simultaneous download threads to use (default: 3).
  --bounds ULX ULY LRX LRY: output bounds (minx miny maxx maxy).
  --resolution RESOLUTION: output resolution, in decimal degrees. (NOT APPLICABLE TO ZARR)
  --resampling RESAMPLING: resampling algorithm (any of: nearest (default), bilinear, cubic, cubicspline, lanczos, average, mode). (NOT APPLICABLE TO ZARR)
  --nodata-per-band: propagate a separate nodata value for each band, instead of applying the first band's nodata value to all bands on the assumption that it is the same for the whole file. Slightly slower for result vrts with many bands. (NOT APPLICABLE TO ZARR)
  --list: print a list of the vrts of the result, then exit. (NOT APPLICABLE TO ZARR)
  --dtype DTYPE: override the datatype of the output tif (defaults to the input datatype).
  --outname OUTNAME: output file name (if not given, the source name is used).
  --no-align-to-blocks: don't extend the output bounds to align them to the internal blocks of the source dataset. The default is to align to input blocks, which slightly extends the output bounds but is better for performance; use --no-align-to-blocks if you need exactly the bounds you specified.
  --chunk-timeout TIMEOUT: timeout for handling a single (output) chunk job, in seconds (default: 300, i.e. 5 minutes). You might need to increase this when resampling to a much coarser resolution.
  --blocks-per-job BLOCKS_PER_JOB: the number of chunks to read in a single job. For zarr it makes sense to set this higher; for vrt usually not.
  --debug: enable debug logging.

Note that the options related to resampling do not apply to zarr sources: because zarrs are not read by gdal, gdal's resampling implementation cannot be used. Zarr inputs are always read at their full input resolution.
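The block-alignment behaviour (--align-to-blocks) amounts to snapping the requested bounds outward to the source's internal block grid. A simplified sketch, assuming a grid anchored at origin with square pixels; the actual implementation may differ:

```python
import math


def align_bounds_to_blocks(bounds, origin, res, block_px):
    """Expand (minx, miny, maxx, maxy) outward to the block grid of the source."""
    minx, miny, maxx, maxy = bounds
    ox, oy = origin
    step = res * block_px  # size of one block in map units

    def snap_down(v, o):
        return o + math.floor((v - o) / step) * step

    def snap_up(v, o):
        return o + math.ceil((v - o) / step) * step

    return (snap_down(minx, ox), snap_down(miny, oy),
            snap_up(maxx, ox), snap_up(maxy, oy))


# A source with 0.1-degree pixels and 10-pixel blocks: bounds grow to whole blocks.
print(align_bounds_to_blocks((1.1, 2.2, 3.3, 4.4), (0.0, 0.0), 0.1, 10))
# (1.0, 2.0, 4.0, 5.0)
```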

Example with options

To download result 11605 in the production bucket, but only an area in South Borneo, resampled to 0.0001x0.0001 DD per pixel using average resampling:

python get_result.py s11-production-dprof-results/11605 --bounds 113 -1 114 -2 --resolution 0.0001 --resampling average

NB: this example will run quite slowly due to the much larger output pixel size (0.0001 output vs 0.00006 input). You might need to pass --chunk-timeout with a value larger than the default of 300 to allow for longer chunk job processing times.
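Some rough arithmetic on why this is slow: each output pixel covers several input pixels, so every output chunk triggers a correspondingly larger read (numbers taken from the example above):

```python
res_in, res_out = 0.00006, 0.0001  # input vs requested output resolution (DD per pixel)
factor = res_out / res_in          # input pixels per output pixel, per axis
pixels_per_output_pixel = factor ** 2
print(round(factor, 2), round(pixels_per_output_pixel, 2))  # 1.67 2.78
```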

Caveats

  • the output file will be named the same as the input vrt, but with the ".vrt" extension replaced by ".tif"
  • the script tries to use an optimal chunk size for the output tif, within the 16-1048 range, so that a single output chunk maps approximately to a single input chunk
  • the output is written to the folder where the script is invoked
  • the script downloads each vrt first, because opening a vrt remotely is extremely slow for bigger vrts. The vrt is removed again when the download has finished.
  • specifying a significantly lower output resolution than the input resolution makes each chunk take a lot of time (for every output chunk, a lot of input data needs to be read). If you get errors like Exception: [Errno Task timeout], increase the --chunk-timeout value (the default is 300). Chunks that error are retried at most 10 times; a few errors is quite normal, but if you get many, there is probably an issue with gcs on Google's side.
  • for some reason, performance for zarrs (e.g. the deforestation zarr; maybe because of the sparseness) is much lower than for vrts. It often helps to specify a higher value for --blocks-per-job, e.g. 16 or even 64. YMMV.
  • you should be able to kill the script with ctrl-c. However, this does not always correctly cancel chunks that are already downloading, so it might take some time before it really stops. If you need it to stop immediately after hitting ctrl-c, use pkill -f get_result.py (if you have pkill installed).
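The chunk-size heuristic from the caveats above could be sketched as follows; this is hypothetical, the real script may differ: pick an output chunk size so that one output chunk covers roughly one input block, clamped to the 16-1048 range.

```python
def output_chunk_size(input_block_px, res_in, res_out, lo=16, hi=1048):
    """Output chunk size (pixels) covering about one input block, clamped to [lo, hi]."""
    target = round(input_block_px * res_in / res_out)
    return max(lo, min(hi, target))


# Downsampling 512-pixel input blocks from 0.00006 to 0.0001 DD per pixel:
print(output_chunk_size(512, 0.00006, 0.0001))  # 307
```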

Download files

Source Distribution

s11_get_result-0.0.34.tar.gz (15.7 kB)

Built Distribution

s11_get_result-0.0.34-py3-none-any.whl (15.4 kB)

File details

Details for the file s11_get_result-0.0.34.tar.gz.

File metadata

  • Size: 15.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

  Algorithm    Hash digest
  SHA256       ae1506df9d878a48cfb30c30991a191ab9e89f5b18715e98fcc69e82aee17193
  MD5          bc3b705d05ba572e1cde078864686bb7
  BLAKE2b-256  d12104e660459cb25a785e8e68cf41dbb5c784c820da8d936ece054463c6f60a

File details

Details for the file s11_get_result-0.0.34-py3-none-any.whl.

File hashes

  Algorithm    Hash digest
  SHA256       ea6c3660b3e73f4aa9460e391fe7c585171e648f6a2012faffa93511618cf7e0
  MD5          5e6e2df68ef28dbf8649ec4d21bf609b
  BLAKE2b-256  f922369681488755938c9f38cbe7224d7aeb68c44b95430ec83b2a9b2c68242f
