Skip to main content

A lightweight orchestrator for spatial databases

Project description

MakeGIS

A lightweight orchestrator for spatial databases.

MakeGIS organizes workflows in a DAG whose nodes can be of three types:

  • source nodes: load a dataset into a target database
  • transform nodes: perform transforms within a target database
  • custom nodes: run arbitrary commands

It comes with a command line tool, mkgs, that operates on the resulting DAG.

Key features/choices:

  • Local and standalone: mkgs runs locally, no other service involved
  • Easy data loading: describe where the data is, MakeGIS handles the rest
  • Works for both ETL and ELT workflows
  • Automatic dependency discovery for SQL transforms
  • Support arbitrary code
  • Event journal to keep track of database state
  • Build DAG through code or from YAML files.

[!Note] MakeGIS is a young project and still exploring different approaches.

In particular, the Configuration docs in this readme reflect a somewhat opinionated way of organizing and declaring a DAG through makegis.yaml files. Alternative DAG-building paradigms are being explored.

Installation

pip install makegis

MakeGIS relies on external tools, such as ogr2ogr, to be available.

Concept

A quick overview of the main components underpinning MakeGIS

DAG

The DAG organizes tasks. A DAG node owns one or more database objects (i.e. tables, views, functions, ...). A database object cannot be owned by more than one node. DAG nodes can depend on other DAG nodes.

DAG nodes come in three types. Source nodes own a single database table and describe the data source of that table. Tranfrom nodes represent SQL to be run against a target database. The SQL statement are parsed to detect any dependencies (database object owned by other nodes). Finally, custom nodes wrap arbitrary commands.

Targets

Targets handle all interecation with a database instance. This includes running nodes as well as writing to and reading from the journal (see below).

Journal

MakeGIS keeps an event journal on each target database. This journal logs which nodes have been run, when, and with what version of MakeGIS. If the MakegGIS project is in a version control system (only git supported at this stage), then the version of the project is logged for each run too.

The role of the journal is to detect stale or modified nodes that need to be rerun. See the mkgs outdated command.

Usage

Makegis provides the mkgs CLI utility to operate on the DAG.

usage: mkgs [-h] [-v] [--debug] {init,ls,outdated,run} ...

positional arguments:
  {init,ls,outdated,run}
                        commands
    init                initialize journal on target
    ls                  list nodes
    outdated            report outdated nodes
    run                 run nodes

options:
  -h, --help            show this help message and exit
  -v, --verbose         verbose messages
  --debug               debug messages

mkgs init

The init command prepares a target database to work with MakeGIS. It creates a _makegis_log journal table that is used to track which nodes have been run, when and at what version. It will also create any missing schemas expeced by the DAG.

usage: mkgs init [-h] [-t TARGET]

options:
  -h, --help           show this help message and exit
  -t, --target TARGET  db instance to target

mkgs ls

The ls command shows DAG nodes matching a selection pattern. At this stage only * wildcards are supported but additional operators are planned (e.g. +<pattern> or <pattern>+ for upstream/downstream propagation).

usage: mkgs ls [-h] pattern

positional arguments:
  pattern     DAG selection pattern

options:
  -h, --help  show this help message and exit

mkgs outdated

The outdated command reports outdated nodes for the given target.

usage: mkgs outdated [-h] [-t TARGET]

options:
  -h, --help           show this help message and exit
  -t, --target TARGET  db instance to target

mkgs run

The run command will run the nodes matching a selection pattern (same as mkgs ls). Nodes that are fresh (i.e. not outdated) will be skipped. This can be overridden by using the --force flag.

usage: mkgs run [-h] [-t TARGET] [-d] [-f] pattern

positional arguments:
  pattern              DAG selection pattern

options:
  -h, --help           show this help message and exit
  -t, --target TARGET  db instance to target
  -d, --dry-run        process nodes without actually running them
  -f, --force          also run fresh nodes

Configuration

Makegis is configured through YAML configuration files and environment variables.

A makegis.root.yml file defines the root of a MakeGIS project, along with project-wide settings. MakeGIS will traverse the directory tree and look for any makegis.yml files.

An example project may look like this:

project/
├─ src/
|  ├─ raw/
|  │  ├─ provider/
|  │  │  └─ makegis.yml
|  |  └─ makegis.yml
|  └─ core/
|     ├─ transform_1.sql
|     ├─ transform_2.sql
|     ├─ transform_3.sql
|     └─ makegis.yml
├─ .env
├─ .gitignore
└─ makegis.root.yml

[!Note]
Environment variables can be used by enclosing them in double curly brackets: {{ EXAMPLE }}. MakeGIS will consider any .env files in the project tree.

makegis.root.yml

A makegis.root.yml file defines the root of a MakeGIS project along with project wide settings. Here's an annotated example:

# The project's root directory.
src_dir: ./src

# Global defaults
defaults:
  # Global defaults for `load` nodes
  load:
    epsg: 4326
    geom_index: false
  # Optional default target (to use we running mkgs without a `--target` option)
  target: pg_dev

# Databases to target
targets:
  pg_prod:
    host: prod.example.com
    port: 5432
    user: mkgs
    db: postgres
  pg_dev:
    host: 127.0.0.1
    port: 5432
    user: mkgs
    db: postgres

makegis.yml

The path of a makegis.yml determines the database relations they manage, whith top-level directories mapping to schemas.

A makegis.yml contains one of the following configuration blocks:

  • load: defines sources to be loaded to a target
  • transform: defines transforms to be applied to a target
  • node: custom node to run bespoke commands

Load block

Maps tables to external data sources. Each table becomes a DAG node and can be invoked individually

load:
  <table-name>:
    <loader>: <loader-arg>
    <loader-option>: <option-value>
    <loader-option>: <option-value>
    ...
load:
  countries:
    wfs: https://wfs.example.com/countries?token={{API_KEY}}
    epsg: 4326
    geom_index: true

TODO: Document loaders and their options.

EPSG option
Single value

epsg: 4326:

Target SRID. If source declares a different EPSG, a tranformation is applied. If source has no SRID, no transformation is applied and srid is set to given value.

Mapping

epsg: 4326:2193

Convert from source to dest. Warn or abort if source exposes a different SRID

Transform block

Declares sql scripts to be enrolled. Each script becomes a DAG node. Dependencies with other DAG nodes are resolved automatically. The order in which sql scripts are listed does not matter. There are no constraints on what is in the sql scripts, as long as MakeGIS is aware of all dependencies.

transform:
  - create_view_of_awesome_table.sql
  - create_awesome_table.sql

Node block

A node block defines a custom DAG node, for when more flexibility is needed than offered by a load or tranform block.

The price to pay for more flexibility is that dependencies need to be documented manually. This goes for upstream dependecies as well as objects created on the target db.

node:
  # List any relations needed by this node.
  deps:
    - schema.upstream_table
  # Commands that do not change the target db but need to be run before we proceed.
  # Commands are run sequentially, in listing order.
  prep:
    - before.py
  # Main section
  do:
    # List of commands along with any objects they will create on the target.
    run:
      - cmd: script1.py
        # Declare objects owned by this command
        creates:
          - table: new_table
          - function: helper
    # Can also use a load block here, but it won't spawn new DAG nodes
    <load-block>
  # Like prep, but runs after `do`, and only if `do` runs fine.
  post:
    - after.py
  # Like post but always runs, even if something failed prior.
  finally:
    - teardown.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

makegis-0.2.0.tar.gz (25.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

makegis-0.2.0-py3-none-any.whl (28.5 kB view details)

Uploaded Python 3

File details

Details for the file makegis-0.2.0.tar.gz.

File metadata

  • Download URL: makegis-0.2.0.tar.gz
  • Upload date:
  • Size: 25.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for makegis-0.2.0.tar.gz
Algorithm Hash digest
SHA256 c5477e43700caee94c3022ae86c0161ff017eb9b8babe61b7980af035a67ab80
MD5 157eddfacec020abfa12bdadfd9d8525
BLAKE2b-256 62ab4206d13c35e853b4f766cca8fa730df3f478afc6069f498df550b0fc7969

See more details on using hashes here.

File details

Details for the file makegis-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: makegis-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 28.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for makegis-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f7d6236dde083236622a5bdcebc4896147c56461fffb765a5247399ac84ffff7
MD5 7b9203609ab641f201d671b88a8b7064
BLAKE2b-256 aab6342d891e1cea7d4e2ed72c6dc7055f2ba8c448537ab0600f58974611486c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page