Keeps a local data repository up to date with different remote data sources.

databird

Periodically retrieve data from different sources.

The databird package only provides a framework to plan and run the tasks needed to keep a local data file store up to date with various remote sources. The remote sources can be anything (e.g. FTP server, ECMWF, HTTP API, SQL database, ...), as long as a databird driver is available for the specific source.

Usage

databird is configured through configuration files and invoked with:

$ databird retrieve -c /etc/databird/databird.conf

# or (as the above is the default)
$ databird retrieve

You can store the configuration files anywhere and, for example, run the above command periodically as a cron job.

Configuration

The following example configuration defines a repository, which is populated with daily GNSS data from ftp://cddis.nasa.gov/gnss/data/daily/.

The main configuration file (usually databird.conf) could look like this:

general:
  root: /data/repos # root path for data repositories
  num-workers: 16   # max number of async workers
  include: "databird.conf.d/*.conf"  # include config files

Generally, you can configure anything in any file, because all configuration files are merged into a single configuration tree. The include option is the exception: it can only be declared in the top-level config file.
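
To illustrate what merging several files into one tree can mean, here is a minimal Python sketch of a recursive merge of nested dictionaries. It is not databird's actual implementation: the function name merge_trees and the rule that a later file overrides an earlier one on conflict are assumptions made for this example.

```python
def merge_trees(base, extra):
    """Return a new dict with `extra` merged recursively into `base`."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are sub-trees: merge them key by key
            merged[key] = merge_trees(merged[key], value)
        else:
            # Assumption: on conflict, the later file wins
            merged[key] = value
    return merged


# Parsed content of databird.conf and databird.conf.d/cddis.conf (abridged)
main_conf = {"general": {"root": "/data/repos", "num-workers": 16}}
cddis_conf = {
    "profiles": {"nasa_cddis": {"driver": "standard.FtpDriver"}},
    "repositories": {"nasa_gnss": {"profile": "nasa_cddis"}},
}

config = merge_trees(main_conf, cddis_conf)
print(sorted(config))  # ['general', 'profiles', 'repositories']
```

With a merge of this kind, it does not matter which file a profile or repository is declared in, as long as the file is matched by the include pattern.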

Then in databird.conf.d/cddis.conf you can configure a profile and a repository:

profiles:
  nasa_cddis:
    driver: standard.FtpDriver
    configuration:
      host: cddis.nasa.gov
      user: anonymous
      password: ""
      tls: False

repositories:
  nasa_gnss:
    description: Data from NASAs Archive of Space Geodesy Data
    profile: nasa_cddis
    period: 1 day
    delay: 2 days
    start: 2019-01-01
    targets:
      status: "{time:%Y}/cddis_gnss_{iso_date}.status"
    configuration:
      user: anonymous  # this could override 'user' from profile
      root: "/gnss/data/daily"
      patterns:
        status: "{time:%Y}/{time:%j}/{time:%y%j}.status"

When calling databird with this configuration the following is achieved:

  • A repository in the folder /data/repos/nasa_gnss/ is created
  • For every day, a file like 2019/cddis_gnss_2019-01-20.status is expected
  • If that file is missing, retrieve it from ftp://cddis.nasa.gov/gnss/data/daily/2019/020/19020.status
  • If there are many files missing, the data is retrieved asynchronously

This example uses the standard.FtpDriver.
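
The path expansion described in the list above can be reproduced with plain Python string formatting. The following is a sketch under stated assumptions, not databird's own code: it assumes that {time:...} accepts strftime codes, that {iso_date} is the date in ISO format, and that delay means a day is only expected once it is at least that old. The helper names expand and expected_days are hypothetical.

```python
from datetime import date, timedelta

TARGET = "{time:%Y}/cddis_gnss_{iso_date}.status"
PATTERN = "{time:%Y}/{time:%j}/{time:%y%j}.status"
REMOTE_ROOT = "ftp://cddis.nasa.gov/gnss/data/daily"


def expand(template, day):
    """Fill a template for one day (assumed placeholder semantics)."""
    return template.format(time=day, iso_date=day.isoformat())


def expected_days(start, today, period=timedelta(days=1), delay=timedelta(days=2)):
    """Yield every period start from `start` that is at least `delay` old."""
    day = start
    while day + delay <= today:
        yield day
        day += period


day = date(2019, 1, 20)
print(expand(TARGET, day))                       # local target, relative to the repository
print(REMOTE_ROOT + "/" + expand(PATTERN, day))  # remote file to retrieve

# Days expected on 2019-01-22 with start 2019-01-01 and a two-day delay
days = list(expected_days(date(2019, 1, 1), today=date(2019, 1, 22)))
print(days[0], days[-1], len(days))
```

For 20 January 2019 (day 020 of the year), this yields the local target 2019/cddis_gnss_2019-01-20.status and the remote file 2019/020/19020.status below the configured root.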

Drivers

Anyone can write drivers (see below). Currently, the following drivers are available:

  • standard.FilesystemDriver: Retrieve data from the local filesystem (included)
  • standard.FtpDriver: Retrieve data from an FTP server (included)
  • ecmwf.EcmwfDriver: Retrieve data from the European Centre for Medium-Range Weather Forecasts (ECMWF) via their API (to be announced)
  • c3s.C3SDriver: Retrieve data from the Copernicus Climate Change Service (C3S) via their API (to be announced)

Development

  1. Create a Python environment and activate it
    $ python3 -m venv . && source bin/activate
    
  2. Install the development environment:
    (databird) $ pip install -r requirements-dev.txt
    

Writing a new driver

Drivers are published in a namespace package databird-drivers. Everyone can develop drivers and share them.

Install databird and run mr.bob to create a new driver package:

(databird) $ cd $HOME/projects
(databird) $ python -m mrbob.cli databird.blueprints:driver

After answering some questions, a new directory databird-driver-<chosen_name> is created. Let's assume <chosen_name> = foo; your driver is then usually implemented in databird/drivers/foo/foo.py in a class named FooDriver. Until more documentation is available, you have to read the code to figure out how to write a driver.

Other people will be able to use it with driver: foo.FooDriver.
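
To make the naming scheme concrete, here is a hedged Python sketch of how a string such as foo.FooDriver could be resolved to a class. It is an assumption that the part before the last dot names a package below databird.drivers and that this package exposes the class; the function load_driver is hypothetical and not part of databird's documented API.

```python
import importlib


def load_driver(spec, namespace="databird.drivers"):
    """Resolve a spec like 'foo.FooDriver' to the object it names."""
    module_name, _, attr = spec.rpartition(".")
    if not module_name:
        raise ValueError("expected '<package>.<ClassName>', got %r" % spec)
    # Import e.g. 'databird.drivers.foo' and pick 'FooDriver' from it
    module = importlib.import_module(namespace + "." + module_name)
    return getattr(module, attr)


# Demonstration with a standard-library package standing in for the
# namespace, so the snippet runs without any driver installed.
decoder_cls = load_driver("decoder.JSONDecoder", namespace="json")
print(decoder_cls.__name__)  # JSONDecoder
```

With a driver package installed, the same call with the default namespace and the spec foo.FooDriver would look for FooDriver in databird.drivers.foo.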

Tell me if you wrote a new driver, so I can include it in the list.

