Keeps a local data repository up to date with different remote data sources.

databird

Periodically retrieve data from different sources.

The databird package only provides a framework to plan and run the tasks needed to keep a local data file store up to date with various remote sources. The remote sources can be anything (e.g. FTP server, ECMWF, HTTP API, SQL database, ...), as long as a databird driver is available for the specific source.

Usage

Databird is configured through configuration files and invoked with:

$ databird retrieve -c /etc/databird/databird.conf

# or (as the above is the default)
$ databird retrieve

You can store the configuration files anywhere and, for example, run the above command periodically as a cron job.

Also, some rq workers are required:

$ rq worker databird

This will start one worker. You should use a supervisor to start multiple workers.

Configuration

The following example configuration defines a repository, which is populated with daily GNSS data from ftp://cddis.nasa.gov/gnss/data/daily/.

The main configuration file (usually databird.conf) could look like this:

general:
  root: /data/repos # root path for data repositories
  num-workers: 16   # max number of async workers
  include: "databird.conf.d/*.conf"  # include config files

Generally, you can configure anything in any file, because all configuration files are merged into one configuration tree. The include option is an exception: it can only be declared in the top-level config file.
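To illustrate what merging into one configuration tree can mean, here is a minimal Python sketch. It is not databird's actual implementation: the function name merge_trees is hypothetical, the two dictionaries stand in for already-parsed configuration files, and the rule that a later file wins on conflicting keys is an assumption.

```python
from copy import deepcopy


def merge_trees(base, extra):
    """Return a new dict with `extra` merged into `base`.

    Nested dicts are merged recursively; for any other value the entry
    from `extra` replaces the one in `base` (assumed precedence rule).
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_trees(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for the parsed contents of two configuration files.
main_conf = {"general": {"root": "/data/repos", "num-workers": 16}}
included_conf = {
    "general": {"num-workers": 4},
    "profiles": {"nasa_cddis": {"driver": "standard.FtpDriver"}},
}

config = merge_trees(main_conf, included_conf)
```

Under these assumptions, config contains both the general and the profiles sections, and main_conf itself is left unmodified.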

Then in databird.conf.d/cddis.conf you can configure a profile and a repository:

profiles:
  nasa_cddis:
    driver: standard.FtpDriver
    configuration:
      host: cddis.nasa.gov
      user: anonymous
      password: ""
      tls: False

repositories:
  nasa_gnss:
    description: Data from NASAs Archive of Space Geodesy Data
    profile: nasa_cddis
    period: 1 day
    delay: 2 days
    start: 2019-01-01
    targets:
      status: "{time:%Y}/cddis_gnss_{iso_date}.status"
    configuration:
      user: anonymous  # this could override 'user' from profile
      root: "/gnss/data/daily"
      patterns:
        status: "{time:%Y}/{time:%j}/{time:%y%j}.status"

When calling databird with this configuration the following is achieved:

  • A repository in the folder /data/repos/nasa_gnss/ is created
  • For every day, a file like 2019/cddis_gnss_2019-01-20.status is expected
  • If that file is missing, retrieve it from ftp://cddis.nasa.gov/gnss/data/daily/2019/020/19020.status
  • If there are many files missing, the data is retrieved asynchronously
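The name expansion in the steps above can be reproduced with plain Python string formatting. This is only a sketch, not databird's code: the helper names expand and expected_days are hypothetical, iso_date is assumed to be the ISO 8601 date string, and delay is assumed to mean that a day is only expected once it lies at least that far in the past.

```python
from datetime import date, timedelta

# Templates and root copied from the example configuration.
TARGET = "{time:%Y}/cddis_gnss_{iso_date}.status"
PATTERN = "{time:%Y}/{time:%j}/{time:%y%j}.status"
REMOTE_ROOT = "ftp://cddis.nasa.gov/gnss/data/daily"


def expand(template, day):
    """Fill a template; date objects accept strftime codes as format specs."""
    return template.format(time=day, iso_date=day.isoformat())


def expected_days(start, period, delay, today):
    """Yield every period step from `start` up to `today - delay` (assumed semantics)."""
    day = start
    while day <= today - delay:
        yield day
        day += period


day = date(2019, 1, 20)
local_name = expand(TARGET, day)
remote_url = f"{REMOTE_ROOT}/{expand(PATTERN, day)}"

days = list(
    expected_days(
        start=date(2019, 1, 1),
        period=timedelta(days=1),
        delay=timedelta(days=2),
        today=date(2019, 1, 22),
    )
)
```

For 2019-01-20, local_name is 2019/cddis_gnss_2019-01-20.status and remote_url is ftp://cddis.nasa.gov/gnss/data/daily/2019/020/19020.status, because %j is the zero-padded day of the year and %y the two-digit year.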

This example uses the standard.FtpDriver.

Monitoring

Use databird webmonitor [PORT] to start the web interface.

Since databird uses RQ for managing jobs, you can also check the monitoring options described at RQ/docs/monitoring.

Drivers

Anyone can write drivers (see below). Currently, the following drivers are available:

Included:

  • standard.FilesystemDriver: Retrieve data from the local filesystem
  • standard.CommandDriver: Run an arbitrary shell command
  • standard.FtpDriver: Retrieve data from an FTP server

Climate:

  • climate.EcmwfDriver: Retrieve data from the European Centre for Medium-Range Weather Forecasts (ECMWF) via their API
  • climate.C3SDriver: Retrieve data from the Copernicus Climate Change Service (C3S) via their API
  • climate.GesDiscDriver: Retrieve data from the NASA EarthData GES DISC service

Development

  1. Create a Python environment and activate it:
    $ python3 -m venv . && source bin/activate
    
  2. Install the development environment:
    (databird) $ pip install -r requirements-dev.txt
    

Writing a new driver

Drivers are published in a namespace package databird-drivers. Everyone can develop drivers and share them.

Install databird and run mr.bob to create a new driver package:

(databird) $ cd $HOME/projects
(databird) $ python -m mrbob.cli databird.blueprints:driver

After answering some questions, a new directory databird-driver-<chosen_name> is created. Let's assume <chosen_name> = foo; your driver is then usually implemented in databird/drivers/foo/foo.py in a class named FooDriver. Until more documentation is available, you have to look at the code to figure out how to write a driver.

Other people will be able to use it with driver: foo.FooDriver.

Tell me if you wrote a new driver, so I can include it in the list.
