Tool to download Satelligence raster data from Google Cloud Storage.
Get raster data from google cloud storage
Prerequisites
- GDAL library
- GCLOUD SDK activated
- gcsfs, tqdm, pebble, numpy, and a few more packages.
pip install -r requirements.txt will install everything for you.
NOTE ON GDAL
You need to install the GDAL Python bindings using pip install gdal. If the
installation fails, that is most likely due to a mismatch between the version of
the GDAL binaries on your system and the version of the Python package. Check the
version of the binaries by running gdalinfo --version. Then fill in that version in the pip install:
pip install gdal==<version>
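The version lookup described above can be scripted. The following is a minimal sketch, not part of this tool: the helper names are hypothetical, and it assumes gdalinfo prints its version in the usual "GDAL X.Y.Z, released ..." form.

```python
import re
import subprocess


def parse_gdal_version(version_output: str) -> str:
    """Extract 'X.Y.Z' from text such as 'GDAL 3.6.2, released 2023/01/02'."""
    match = re.search(r"GDAL\s+(\d+\.\d+\.\d+)", version_output)
    if match is None:
        raise ValueError(f"no GDAL version found in: {version_output!r}")
    return match.group(1)


def pinned_gdal_requirement() -> str:
    """Ask the gdalinfo binary for its version and build a pinned requirement."""
    result = subprocess.run(
        ["gdalinfo", "--version"], capture_output=True, text=True, check=True
    )
    return f"gdal=={parse_gdal_version(result.stdout)}"


# Parsing a sample version string (the sample text is illustrative):
print(parse_gdal_version("GDAL 3.6.2, released 2023/01/02"))
```

Calling pinned_gdal_requirement() on a machine with the GDAL binaries installed returns a string you can pass straight to pip install.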
TL;DR: how to run
You can use this tool to download any GDAL-readable file (vrt, tif, etc.) from Google Cloud Storage. It can also download zarrs (such as those we use in dprof), and there is a shortcut to easily download dprof results.
In general, usage is python get_result.py <url>.
<url> should start with the bucket name, without the "gs://" prefix.
Examples:
.vrt (or other gdal-readable formats)
Get any kind of gdal-readable file, e.g. the 2015 V6.1 FBL from the vrt:
python get_result.py s11-base-data/landcover/forest_baselines/FBL_V6.1/2015/FBL_V6.1_2015.vrt
.zarr
Get the deforestation zarr:
python get_result.py s11-production-dprof-cache/Deforestation/deforestation/deforestation.zarr
Note that you don't need to append the "/result" part to the zarr url if you want to read from the /result group. If you want to read from a different group, just append that to the zarr url, e.g. "s11-bucket/some.zarr/some_group".
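To make the url convention concrete, here is a small sketch of how a source url could be split into a zarr store and a group, with "result" as the default group. The function name is hypothetical and this is not the tool's own parsing code. Stripping a pasted "gs://" prefix is an added convenience in this sketch only; the tool itself expects urls without it.

```python
def split_source_url(source_url, default_group="result"):
    """Return (path, zarr_group); zarr_group is None for non-zarr sources."""
    path = source_url
    # Assumption for this sketch: tolerate a pasted "gs://" prefix by dropping it.
    if path.startswith("gs://"):
        path = path[len("gs://"):]
    parts = path.rstrip("/").split("/")
    for index, part in enumerate(parts):
        if part.endswith(".zarr"):
            store = "/".join(parts[: index + 1])
            # Anything after the store is the group; fall back to the default.
            group = "/".join(parts[index + 1:]) or default_group
            return store, group
    return "/".join(parts), None


print(split_source_url("s11-bucket/some.zarr/some_group"))
# ('s11-bucket/some.zarr', 'some_group')
```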
dprof results
The shortcut to download results (will download every vrt for that result) is: python get_result.py <bucket>/<resultnumber>
E.g.:
python get_result.py s11-production-dprof-result/13883
For more options, see below.
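The three kinds of source url shown above can be told apart by their shape alone. The sketch below illustrates one possible heuristic. It is an assumption about how such a dispatch could work, not a description of what get_result.py does internally, and the function name and category labels are made up for the example.

```python
def classify_source(source_url):
    """Guess the source type from the url shape (illustrative heuristic only)."""
    parts = source_url.rstrip("/").split("/")
    # A path segment ending in ".zarr" marks a zarr store, optionally followed by a group.
    if any(part.endswith(".zarr") for part in parts):
        return "zarr"
    # "<bucket>/<resultnumber>": the last segment consists only of digits.
    if len(parts) == 2 and parts[-1].isdigit():
        return "dprof-result"
    # Everything else is handed to GDAL (vrt, tif, ...).
    return "gdal"


for url in (
    "s11-base-data/landcover/forest_baselines/FBL_V6.1/2015/FBL_V6.1_2015.vrt",
    "s11-production-dprof-cache/Deforestation/deforestation/deforestation.zarr",
    "s11-production-dprof-result/13883",
):
    print(classify_source(url))
```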
Usage
get_result.py [-h] [--threads THREADS] [--bounds ULX ULY LRX LRY] [--resolution RESOLUTION] [--dtype DTYPE] [--resampling RESAMPLING] [--nodata-per-band] [--list] [--outname OUTNAME] [--align-to-blocks | --no-align-to-blocks] [--chunk-timeout TIMEOUT] [--debug] source_url
positional arguments:
- source_url: url to the dataset, see above for examples.
options:
- -h, --help: show this help message and exit.
- --threads THREADS: number of simultaneous download threads to use (default=3).
- --bounds ULX ULY LRX LRY: output bounds, in the order upper-left x, upper-left y, lower-right x, lower-right y.
- --resolution RESOLUTION: output resolution (in decimal degrees). (NOT APPLICABLE TO ZARR)
- --resampling RESAMPLING: resampling algorithm (any of: nearest (default), bilinear, cubic, cubicspline, lanczos, average, mode). (NOT APPLICABLE TO ZARR)
- --nodata-per-band: propagate a separate nodata value for each band, instead of using the nodata value from the first band for all bands (which assumes that it is the same for the whole file). This is slightly slower for result vrts with many bands. (NOT APPLICABLE TO ZARR)
- --list: print a list of vrts of the Result, then exit. (NOT APPLICABLE TO ZARR)
- --dtype: override datatype for output tif (defaults to the input datatype).
- --outname: output file name (if not given, uses the source name).
- --no-align-to-blocks: don't optimize reading by extending the output bounds so that they align to the internal blocks of the source dataset. The default is to align to input blocks, because it is better for performance, at the cost of slightly extended output bounds. If you want the exact bounds as specified, use --no-align-to-blocks.
- --chunk-timeout: timeout for handling a single (output) chunk job, in seconds. Default = 300 (5 minutes). You might need to increase this when resampling to a much coarser resolution.
- --blocks-per-job: the number of chunks to read in a single job. For zarr, it makes sense to set this higher; for vrt usually not.
- --debug: enable debug logging.
Note that the options related to sub/resampling do not apply to zarr sources! Because the zarr's are not read by gdal, we cannot use gdal's resampling implementation. Zarr inputs will always be read at their full input resolution.
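To illustrate what aligning bounds to the source blocks means (the default behaviour that --no-align-to-blocks switches off), here is a minimal sketch. It assumes a north-up raster with square pixels and square blocks; the function and parameter names are hypothetical and the tool's actual implementation may differ.

```python
import math


def align_bounds_to_blocks(ulx, uly, lrx, lry, origin_x, origin_y, pixel_size, block_size):
    """Extend (ulx, uly, lrx, lry) outward to the enclosing block edges.

    origin_x/origin_y is the upper-left corner of the source raster;
    x grows to the right of it and y shrinks below it.
    """
    block_span = pixel_size * block_size  # block edge length in map units
    # Block indices counted from the raster origin; floor/ceil push the edges outward.
    col_start = math.floor((ulx - origin_x) / block_span)
    col_stop = math.ceil((lrx - origin_x) / block_span)
    row_start = math.floor((origin_y - uly) / block_span)
    row_stop = math.ceil((origin_y - lry) / block_span)
    return (
        origin_x + col_start * block_span,
        origin_y - row_start * block_span,
        origin_x + col_stop * block_span,
        origin_y - row_stop * block_span,
    )


# Blocks of 4 pixels at 0.5 units per pixel span 2.0 units each.
print(align_bounds_to_blocks(3, -1, 5, -3, 0, 0, 0.5, 4))
# (2.0, 0.0, 6.0, -4.0)
```

The requested 2 x 2 unit window grows to 4 x 4 units here, which is why the aligned output can be slightly larger than the bounds you asked for.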
Example with options
To download result 011605 in the Production bucket, but only an area in South Borneo, resampled to 0.0001x0.0001 DD per pixel using average resampling:
python get_result.py s11-production-dprof-results/11605 --bounds 113 -1 114 -2 --resolution 0.0001 --resampling average
NB. This example will run quite slowly because the output pixel size is larger than the input pixel size (0.0001 output vs 0.00006 input).
You might need to pass --chunk-timeout with a value larger than the default (300) to allow for longer chunk job processing times.
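As a rough back-of-the-envelope check for the example above, the sketch below computes the output size in pixels and how many input pixels feed each output pixel. The helper names are hypothetical, and the 0.00006 input resolution is simply the figure quoted in the note.

```python
def output_shape(ulx, uly, lrx, lry, resolution):
    """Output (width, height) in pixels for the given bounds and resolution."""
    width = round((lrx - ulx) / resolution)
    height = round((uly - lry) / resolution)
    return width, height


def input_pixels_per_output_pixel(input_resolution, output_resolution):
    """Area ratio between one output pixel and one input pixel."""
    return (output_resolution / input_resolution) ** 2


print(output_shape(113, -1, 114, -2, 0.0001))
# (10000, 10000)
print(round(input_pixels_per_output_pixel(0.00006, 0.0001), 2))
# 2.78
```

Each output pixel therefore aggregates close to three input pixels, and the ratio grows with the square of the resolution factor, which is why much coarser outputs need a longer chunk timeout.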
Caveats
- the output file will be named the same as the input VRT, but the ".vrt" extension replaced with ".tif"
- the script tries to use an optimal chunk size for the output tif, within the 16-1048 range, so that a single output chunk maps approximately to a single input chunk
- the output will be written to the folder where the script is invoked
- the script will download each vrt first, because opening the vrt remotely is extremely slow for bigger vrts. The vrt is removed again when the download is finished.
- specifying a significantly lower output resolution than the input resolution will make each chunk take a lot of time (because for each output chunk, a lot of input data needs to be read). If you get errors like Exception: [Errno Task timeout], you should increase the --chunk-timeout value (the default is 300).
- chunks that error will be retried at most 10 times. Getting a few errors is quite normal, but if you get many errors, there probably is an issue with GCS on Google's side.
- for some reason, the performance for zarrs (e.g. the deforestation zarr; maybe it is because of the sparseness) is much lower than for vrts. Often it helps to specify a higher value for --blocks-per-job, e.g. 16 or even 64. YMMV.
- you should be able to kill the script with ctrl-c. However, this does not always correctly cancel chunks that were already downloading, so it might take some time before it really stops. If you really need to stop it immediately, after hitting ctrl-c, use pkill -f get_result.py, which will kill it right away (if you have pkill installed).
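The retry behaviour mentioned in the caveats can be pictured with a small sketch. This is not the tool's code: the function name is made up, and it assumes "retried max 10 times" means up to 10 retries after the first failed attempt.

```python
def run_with_retries(job, max_retries=10):
    """Call job() until it succeeds, re-raising after max_retries failed retries."""
    retries = 0
    while True:
        try:
            return job()
        except Exception:
            if retries >= max_retries:
                raise  # give up: the chunk kept failing
            retries += 1


# A stand-in chunk job that times out twice before succeeding.
state = {"calls": 0}


def flaky_chunk_job():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("Task timeout")
    return "chunk-ok"


print(run_with_retries(flaky_chunk_job))
# chunk-ok
```

A job that never succeeds is attempted 11 times in total under this assumption before the last error is propagated.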