
Datapunt generic ETL command line scripts and functions for shell scripting in Docker.


Data-processing

Python 3.6 | License: MPLv2.0

At the City of Amsterdam we deal with many different types of structured and unstructured data. Much of this data is of low quality and lacks the semantics needed for proper analytics.

This repository combines generic command line functions for building extract, transform and load (ETL) steps, which we then use to create reproducible data for analytics and for use in dashboards and maps.

For more information about how we use these functions in our workflow, read the data-pipeline guide.

How to use

Full documentation: amsterdam.github.io/data-processing

To use a function in Python, import it:

from datapunt_processing.extract import download_from_catalog

or

from datapunt_processing.helpers.connections import objectstore_connection

To use the functions directly from the command line in your virtual environment or Docker shell, call them by name:

download_from_data_amsterdam -h

For the list of command line functions, see the modules below or look directly in setup.py.
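As a rough alternative to reading setup.py, the installed command line functions can also be listed from Python. This is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`) and that the package is installed under the distribution name `datapunt-processing` used in the pip command below; the helper name `list_console_scripts` is made up for this example.

```python
from importlib import metadata


def list_console_scripts(dist_name):
    """Return {command_name: "module:function"} for an installed distribution."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment.
        return {}
    # Keep only the entry points that setuptools registers as shell commands.
    return {
        ep.name: ep.value
        for ep in dist.entry_points
        if ep.group == "console_scripts"
    }


if __name__ == "__main__":
    scripts = list_console_scripts("datapunt-processing")
    for name, target in sorted(scripts.items()):
        print(f"{name} -> {target}")
```

If the package is not installed, the helper returns an empty dict instead of raising, so it is safe to run in any environment.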

Getting Started

To get the functions up and running:

pip install datapunt-processing

To develop the functions locally use these steps:

  1. Clone the repository:

git clone https://github.com/Amsterdam/data-processing.git
cd data-processing
  2. Create a virtual environment. On Windows:

# Create and activate a virtual environment, for example with:
python -m venv --copies --prompt data-processing .venv
.venv\Scripts\activate

     On OSX:

virtualenv --python=$(which python3) venv
source venv/bin/activate
  3. Install the data-processing modules in editable mode:

pip install -e .

  4. A database is required for the transform and load functions. You can set up your Postgres database credentials in the config.ini file; the functions read them from there.

If you want to use Docker, you can start a database server for your project in a new terminal. The name, port and login of the database can be changed in docker-compose.yml. Make the same changes in the config.ini file, which the functions use to connect to that database.

docker-compose up -d database
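To illustrate how credentials in an INI file can be turned into a database connection string, here is a minimal sketch using only the standard library. The section name `dev`, the key names and all values are assumptions made for this example; check the config.ini in the repository for the actual layout, and note that `postgres_url` is a hypothetical helper, not part of datapunt_processing.

```python
import configparser

# Hypothetical config.ini contents; the section and key names are assumptions.
EXAMPLE_CONFIG = """
[dev]
host = localhost
port = 5432
dbname = postgres
user = postgres
password = insecure
"""


def postgres_url(config_text, section):
    """Build a PostgreSQL connection URL from one section of an INI document."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    db = parser[section]
    return "postgresql://{user}:{password}@{host}:{port}/{dbname}".format(
        user=db["user"],
        password=db["password"],
        host=db["host"],
        port=db.getint("port"),  # fails early if the port is not a number
        dbname=db["dbname"],
    )


print(postgres_url(EXAMPLE_CONFIG, "dev"))
```

Keeping the values in one file means the Docker database settings and the settings the functions use only need to be changed in two known places: docker-compose.yml and config.ini.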

Notebooks

Some of the examples are runnable Jupyter notebooks. Copies of these, with all the images and output included, are hosted at Anaconda Cloud. To run the notebooks on your own system, first install Jupyter:

pip install -e .\[dev\]

Then start a Jupyter notebook server:

jupyter notebook --NotebookApp.iopub_data_rate_limit=100000000

How to Contribute

If you want to contribute, please follow the contribution guidelines.

Prerequisites

Fork this repository to your own GitHub account.

To add new documentation and test new functions, install the docs, test and dev extras using this command:

pip install -e .[docs,test,dev]
or, when using zsh (which requires the brackets to be escaped):
pip install -e .\[docs,test,dev\]

Steps to add code

This package is built with setuptools so that it can later be deployed on PyPI with version control. It follows some of these guidelines for setting up a Python package.

  1. Convert your function into a python-package command line script using the boilerplate_function.py

Side note: not all functions are suitable for the command line. Machine learning preprocessing steps or general API calls, for instance, often require parameters in the form of dicts or lists as input. These are not suitable and can be used as stand-alone scripts instead.

  2. Add a test to the test folder and run:

python setup.py test

     to check that no other functions are breaking. Correct those issues if needed.

  3. Add your command line name and entry point location to the console_scripts in setup.py.

  4. Add an awesome_module.rst file with Sphinx Argparse extension fields to generate the description and argument fields, reusing an existing rst file as a template. Documentation for helpers is generated automatically, so you can skip this step if it is only a helper function.

  5. Add the rst file to modules.rst so that it can be found on the main page.

  6. Regenerate the documentation to test the docs output using:

sphinx/make docs
  7. Make a PR to add your awesome function to our processing code, so that it can be reused by many other developers and data analysts.
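The first steps above (turning a function into a command line script that can be tested and registered) might look roughly like the sketch below. It does not reproduce boilerplate_function.py from the repository; the function `count_lines`, its arguments and the module layout are hypothetical and only show the general argparse pattern.

```python
import argparse


def count_lines(text):
    """Hypothetical processing function: count the non-empty lines in a string."""
    return sum(1 for line in text.splitlines() if line.strip())


def build_parser():
    """Return the argument parser, kept separate so docs and tests can reuse it."""
    parser = argparse.ArgumentParser(
        description="Count the non-empty lines in a text file."
    )
    parser.add_argument("input_file", help="Path to the text file to read.")
    parser.add_argument(
        "--encoding", default="utf-8", help="File encoding (default: utf-8)."
    )
    return parser


def main(argv=None):
    """Entry point; argv=None makes argparse read sys.argv, as a shell call would."""
    args = build_parser().parse_args(argv)
    with open(args.input_file, encoding=args.encoding) as handle:
        result = count_lines(handle.read())
    print(result)
    return result


# A console_scripts line in setup.py would then point at main, for example
# "count_lines = some_package.count_lines:main" (hypothetical names).
```

Separating the pure function, the parser and `main` keeps the logic testable without touching the command line, and a function that returns the parser is a convenient target for argparse-based documentation tools.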
