Tools for fetching data, and providing ready-to-use https://prefect.io flows


Fetching data

Tools for fetching data, and providing ready-to-use Prefect flows.

Features:

Fetch from various protocols (Amazon S3, Copernicus Climate Data Store, HTTP)
Keep track of previously downloaded files using a SQLite database
Temporary renaming of files while they download (e.g. a .tmp extension)
Full-featured workflow using Prefect
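
The tracking and temporary-renaming features listed above can be illustrated with the standard library alone. This is a minimal sketch, not datafetch's actual implementation: the table layout and the names `open_tracking_db`, `fetch_once` and `write_content` are assumptions made for the example.

```python
import os
import sqlite3
from pathlib import Path


def open_tracking_db(db_path):
    """Open (or create) the SQLite database recording finished downloads."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS downloads ("
        " key TEXT PRIMARY KEY,"
        " filepath TEXT NOT NULL)"
    )
    return conn


def fetch_once(conn, key, destination_dir, write_content):
    """Download `key` unless it is already recorded, and return its path.

    `write_content` is a hypothetical callable receiving a binary file
    object; it stands in for the protocol-specific transfer (S3, HTTP, ...).
    """
    row = conn.execute(
        "SELECT filepath FROM downloads WHERE key = ?", (key,)
    ).fetchone()
    if row is not None and Path(row[0]).exists():
        return Path(row[0])  # already downloaded, nothing to do

    final_path = Path(destination_dir) / key
    final_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = final_path.with_name(final_path.name + ".tmp")

    # Write under a temporary name so readers never see a partial file.
    with open(tmp_path, "wb") as fd:
        write_content(fd)
    os.replace(tmp_path, final_path)  # atomic rename on the same filesystem

    conn.execute(
        "INSERT OR REPLACE INTO downloads (key, filepath) VALUES (?, ?)",
        (key, str(final_path)),
    )
    conn.commit()
    return final_path
```

Calling `fetch_once` twice with the same key performs the transfer only once; the second call returns the recorded path.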

Currently available protocols:

datafetch.protocol.s3.S3ApiBucket for fetching AWS buckets, in particular AWS Opendata
datafetch.protocol.cds.ClimateDataStoreApi for fetching from Copernicus Climate Data Store
datafetch.protocol.http.SimpleHttpFetch

Currently available weather-related fetchers:

datafetch.weather.noaa.nwp.NoaaGfsS3 for fetching NOAA GFS from AWS S3
datafetch.weather.meteofrance.obs.MeteoFranceObservationFetch
datafetch.weather.ecmwf.EcmwfEra5CDS
datafetch.weather.ecmwf.EcmwfEra5S3

Quickstart

Installation

pip install git+https://github.com/steph-ben/datafetch.git

Download a full GFS run using a Prefect flow

>>> from datafetch.s3.flows import create_flow_download
>>> flow = create_flow_download()
>>> flow.run()

Download single GFS file

>>> from datafetch.s3 import NoaaGfsS3
>>> s3api = NoaaGfsS3()
>>> s3api
NoaaGfsS3(bucket_name='noaa-gfs-bdp-pds')

# Check availability
>>> s3api.check_timestep_availability("20210201", "00", "003")
{'date_day': '20210201', 'run': '00', 'timestep': '003'}

# Launch download
>>> s3api.download_timestep("20210201", "00", "003", download_dir="/tmp/")
{'fp': '/tmp/gfs.20210201/00/gfs.t00z.pgrb2.0p25.f003'}

# Check file
$ ls -lh /tmp/gfs.20210201/00/gfs.t00z.pgrb2.0p25.f003
-rw-rw-r-- 1 steph steph 312M Feb  5 15:45 /tmp/gfs.20210201/00/gfs.t00z.pgrb2.0p25.f003
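
The file path returned above follows a regular pattern based on the day, run and timestep. As a rough illustration, the object key can be rebuilt as below. The layout is inferred only from the paths shown in this README, and `gfs_object_key` is a hypothetical helper, not part of datafetch.

```python
from datetime import datetime


def gfs_object_key(date_day, run, timestep):
    """Build the object key of a 0.25 degree GFS file.

    The layout is inferred from the example paths, e.g.
    gfs.20210201/00/gfs.t00z.pgrb2.0p25.f003
    """
    datetime.strptime(date_day, "%Y%m%d")  # raises ValueError on a bad date
    run = f"{int(run):02d}"            # two-digit run hour, e.g. "00"
    timestep = f"{int(timestep):03d}"  # three-digit forecast hour, e.g. "003"
    return f"gfs.{date_day}/{run}/gfs.t{run}z.pgrb2.0p25.f{timestep}"
```

For example, `gfs_object_key("20210201", "00", "003")` gives the key of the file downloaded above, relative to the download directory.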

Low-level API usage

>>> from datafetch.s3 import NoaaGfsS3
>>> s3api = NoaaGfsS3()

# Check data availability
>>> r = s3api.filter(Prefix=s3api.get_daterun_prefix("20210202", "00"))
>>> list(r)[:3]
[s3.ObjectSummary(bucket_name='noaa-gfs-bdp-pds', key='gfs.20210202/00/gfs.t00z.pgrb2.0p25.anl'), 
 s3.ObjectSummary(bucket_name='noaa-gfs-bdp-pds', key='gfs.20210202/00/gfs.t00z.pgrb2.0p25.anl.idx'), 
 s3.ObjectSummary(bucket_name='noaa-gfs-bdp-pds', key='gfs.20210202/00/gfs.t00z.pgrb2.0p25.f000')]

# Download
>>> s3api.download('gfs.20210202/00/gfs.t00z.pgrb2.0p25.anl', destination_dir="/tmp/")
PosixPath('/tmp/gfs.20210202/00/gfs.t00z.pgrb2.0p25.anl')
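
For comparison, a similar listing can be obtained with boto3 directly. The sketch below assumes the bucket is public and accepts unsigned (anonymous) requests, as AWS Open Data buckets generally do; `list_keys` and `without_index_files` are hypothetical helpers, and this is not necessarily how datafetch wraps boto3 internally.

```python
def list_keys(bucket_name, prefix, limit=10):
    """List up to `limit` object keys under `prefix` in a public bucket."""
    # Imported lazily so the module loads even without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests: public Open Data buckets need no AWS credentials.
    s3 = boto3.resource("s3", config=Config(signature_version=UNSIGNED))
    objects = s3.Bucket(bucket_name).objects.filter(Prefix=prefix).limit(limit)
    return [obj.key for obj in objects]


def without_index_files(keys):
    """Drop the `.idx` sidecar entries, keeping only the data files."""
    return [key for key in keys if not key.endswith(".idx")]
```

A call such as `without_index_files(list_keys("noaa-gfs-bdp-pds", "gfs.20210202/00/"))` would list the first data files of a run, provided that date is still available in the bucket.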

Fetching from AWS

TODO

Fetching from Copernicus Climate Data Store (CDS)

Copernicus CDS describes itself as a place to "Dive into this wealth of information about the Earth's past, present and future climate."

You can browse and download all data from the official website. A Python API, https://github.com/ecmwf/cdsapi, is also available for downloading data from scripts.

The datafetch.protocol.cds package enhances cdsapi with the following features:

Make asynchronous requests and check their status later on, using a SQLite database
Keep track of previously downloaded files, using a SQLite database
Temporary renaming of files while they download (e.g. a .tmp extension)
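
To make the asynchronous tracking idea concrete, here is a minimal sketch of how a request could be mapped to its queue id in SQLite, so that the same request is never submitted twice. The schema, the fingerprinting scheme and the names `request_fingerprint`, `open_queue_db` and `get_or_submit` are assumptions for illustration, not the package's actual internals.

```python
import hashlib
import json
import sqlite3


def request_fingerprint(resource_name, resource_param):
    """Stable identifier for a (resource, parameters) pair."""
    # sort_keys makes the fingerprint independent of dict ordering.
    payload = json.dumps([resource_name, resource_param], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def open_queue_db(db_path=":memory:"):
    """Open (or create) the database mapping requests to queue ids."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        " fingerprint TEXT PRIMARY KEY,"
        " queue_id TEXT NOT NULL)"
    )
    return conn


def get_or_submit(conn, resource_name, resource_param, submit):
    """Return (queue_id, created); call `submit` only for unseen requests.

    `submit` is a hypothetical callable standing in for the remote call
    that queues the request and returns its identifier.
    """
    fingerprint = request_fingerprint(resource_name, resource_param)
    row = conn.execute(
        "SELECT queue_id FROM requests WHERE fingerprint = ?", (fingerprint,)
    ).fetchone()
    if row is not None:
        return row[0], False

    queue_id = submit(resource_name, resource_param)
    conn.execute(
        "INSERT INTO requests (fingerprint, queue_id) VALUES (?, ?)",
        (fingerprint, queue_id),
    )
    conn.commit()
    return queue_id, True
```

Because the queue id is stored locally, a later process can look it up from the request parameters alone and check the status without resubmitting.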

Pre-requisites

To access this public data, you must:

Register for a free account at https://cds.climate.copernicus.eu/user/register
Configure your user key, as described at https://github.com/ecmwf/cdsapi#configure

Then you can:

Browse all online resources at https://cds.climate.copernicus.eu/cdsapp#!/search?type=dataset
Get the parameters needed to download a resource from Download data > Show API request, for example:

cds_resource_name = 'reanalysis-era5-pressure-levels'
cds_resource_param = {
    'product_type': 'reanalysis',
    'format': 'grib',
    'variable': 'temperature',
    'pressure_level': '850',
    'year': '2021',
    'month': '02',
    'day': '18',
    'time': [
        '00:00', '06:00', '12:00',
        '18:00',
    ],
}
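
When many days have to be requested, it can be convenient to generate this dictionary rather than copy it from the website. A small sketch, assuming the same keys as the example above; `era5_pressure_level_request` is a hypothetical helper, not part of datafetch or cdsapi.

```python
from datetime import date


def era5_pressure_level_request(day, hours, variable="temperature",
                                pressure_level="850"):
    """Build a request dict shaped like the example above for one day."""
    return {
        "product_type": "reanalysis",
        "format": "grib",
        "variable": variable,
        "pressure_level": pressure_level,
        "year": f"{day.year:04d}",
        "month": f"{day.month:02d}",
        "day": f"{day.day:02d}",
        # CDS expects times as zero-padded "HH:00" strings.
        "time": [f"{hour:02d}:00" for hour in hours],
    }
```

For instance, `era5_pressure_level_request(date(2021, 2, 18), [0, 6, 12, 18])` reproduces the parameters shown above.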

Usage

Downloading a small resource

from datafetch.protocol.cds import ClimateDataStoreApi

cds = ClimateDataStoreApi()
fp = cds.fetch(
    cds_resource_name='reanalysis-era5-pressure-levels',
    cds_resource_param={
        'product_type': 'reanalysis',
        'format': 'grib',
        'variable': 'temperature',
        'pressure_level': '850',
        'year': '2021',
        'month': '02',
        'day': '18',
        'time': ['00:00'],
    },
    destination_dir='/tmp/',
    wait_until_complete=True
)

Downloading a larger resource

Define the large resource to download:

cds_resource_name = 'reanalysis-era5-pressure-levels'
cds_resource_param = {
    'product_type': 'reanalysis',
    'format': 'grib',
    'variable': 'temperature',
    'pressure_level': '850',
    'year': '2021',
    'month': '02',
    'day': '18',
    'time': ['00:00'],
}

Submit the request to CDS, tracked in a local SQLite database

from datafetch.protocol.cds import ClimateDataStoreApi
cds = ClimateDataStoreApi()

db_record, created = cds.submit_to_queue(cds_resource_name, cds_resource_param)
print(db_record.queue_id)

Check request status

# Using initial request data (request id is retrieved from sqlite)
db_record = cds.check_queue(cds_resource_name, cds_resource_param)
print(db_record)

# Or directly using queue id
state, reply = cds.check_queue_by_id(queue_id="xxx-xxx")
print(state, reply)

Download result

# Using initial request data
fp = cds.download_result(
    cds_resource_name, cds_resource_param,
    destination_dir="/tmp/"
)
print(fp)

# Or directly using queue id
fp = cds.download_result_by_id(queue_id="xxx-xxx")
print(fp)
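
The submit, check and download steps above can be tied together with a simple polling loop. The sketch below is generic and only illustrative: the state names ("completed", "failed") are assumptions, and `wait_for_state` is a hypothetical helper rather than part of datafetch.

```python
import time


def wait_for_state(check, done_states=("completed",),
                   failed_states=("failed",), interval=30.0, timeout=3600.0,
                   sleep=time.sleep):
    """Call `check()` until it reports a final state, then return that state.

    `check` returns the current state as a string. The state names are
    assumptions; adapt them to what the server actually reports.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = check()
        if state in done_states:
            return state
        if state in failed_states:
            raise RuntimeError(f"request ended in state {state!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state!r} after {timeout} seconds")
        sleep(interval)  # wait before asking the server again
```

Assuming `check_queue_by_id` returns a `(state, reply)` pair as shown above, the loop could be driven with `wait_for_state(lambda: cds.check_queue_by_id(queue_id="xxx-xxx")[0])` before calling `download_result_by_id`.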

Project details

Release history

0.0.2 (this version): Mar 2, 2021
0.0.1: Mar 2, 2021

Download files


Source Distribution

datafetch-0.0.2.tar.gz (20.3 kB)

Uploaded Mar 2, 2021 Source

Built Distribution

datafetch-0.0.2-py3-none-any.whl (26.6 kB)

Uploaded Mar 2, 2021 Python 3

Hashes for datafetch-0.0.2.tar.gz

Algorithm	Hash digest
SHA256	`af3f3116d1ff912b708028a19797de633ef4efbf243d7e7ce87e986b5f47b12e`
MD5	`346ff127aa2df6db528e3a88c3dba353`
BLAKE2b-256	`bdd6496085bf5409ad0987ac5c7a2a30950ce2d7a562456137dda17ec31450be`

Hashes for datafetch-0.0.2-py3-none-any.whl

Algorithm	Hash digest
SHA256	`735d04a7058bebb27cbd31743373eedff544476e415b74279a31b4ab195ba713`
MD5	`24118ec5b200ec3a7f76b5c9b6a025ee`
BLAKE2b-256	`866a97400e6c5c524d5fcd0bdc6e7d349a8557859a5fa55d623dafc3a0eefa08`