Facilitate data engineering on the Ingenii Data Platform

Ingenii Data Engineering Package

Details

  • Current Version: 0.3.3

Overview

This package provides utilities for data engineering on Ingenii's Azure Data Platform. It can be used for local development, and it is also used in the Ingenii Databricks Runtime.

Usage

Import the package to use the functions within.

import ingenii_data_engineering

dbt

Part of this package validates dbt schemas to ensure they are compatible with Databricks and the larger Ingenii Data Platform. Validation happens whenever a data pipeline to ingest a file is run, to make sure the file is ingested correctly. Full details of how to set up the dbt schema files in your Data Engineering repository can be found in the Ingenii Data Engineering Example repository.
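To make the idea of schema validation concrete, here is a minimal sketch of the kind of check a validator might perform on a parsed dbt schema. It is not this package's API: the function name find_schema_problems, the rules it applies, and the sample schema are all assumptions made for illustration. The actual requirements are described in the Ingenii Data Engineering Example repository.

```python
# Hypothetical sketch: NOT the ingenii_data_engineering API.
# The rules below are illustrative assumptions about what a check on
# a dbt sources schema might look for.


def find_schema_problems(schema: dict) -> list:
    """Return a list of human-readable problems found in a parsed dbt schema."""
    problems = []
    for source in schema.get("sources", []):
        source_name = source.get("name", "<unnamed source>")
        for table in source.get("tables", []):
            table_name = table.get("name", "<unnamed table>")
            label = f"{source_name}.{table_name}"
            columns = table.get("columns", [])
            if not columns:
                problems.append(f"{label}: no columns defined")
            seen = set()
            for column in columns:
                name = column.get("name")
                if not name:
                    problems.append(f"{label}: column without a name")
                    continue
                if name in seen:
                    problems.append(f"{label}: duplicate column '{name}'")
                seen.add(name)
                if not column.get("data_type"):
                    problems.append(f"{label}: column '{name}' has no data_type")
    return problems


# A parsed schema, as it might look after loading a dbt YAML file into a dict
example_schema = {
    "version": 2,
    "sources": [
        {
            "name": "example_source",
            "tables": [
                {
                    "name": "example_table",
                    "columns": [
                        {"name": "id", "data_type": "int"},
                        {"name": "created_at"},
                    ],
                }
            ],
        }
    ],
}

problems = find_schema_problems(example_schema)
for problem in problems:
    print(problem)
```

Returning a list of problems, rather than raising on the first one, lets a pipeline report every issue in a schema file in a single run.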

Pre-processing

This package contains code to facilitate the pre-processing of files before they are ingested by the data platform. This allows users to transform any data into a form that is compatible with the platform. See details of working with pre-processing functions in the Ingenii Data Engineering Example repository.
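As an illustration of what a pre-processing step can look like, the sketch below converts a JSON array of objects into a flat .csv file using only the Python standard library. It does not use this package's classes or functions: the name json_to_csv, the file names, and the sample records are assumptions chosen for the example.

```python
# Hypothetical sketch: NOT the ingenii_data_engineering API.
# It only illustrates reshaping a raw file into a flat, tabular form
# before ingestion.
import csv
import json
import os
import tempfile


def json_to_csv(json_path: str, csv_path: str) -> list:
    """Convert a JSON array of objects into a .csv file; return the header."""
    # "utf-8-sig" strips a UTF-8 byte order mark if one is present
    with open(json_path, encoding="utf-8-sig") as json_file:
        records = json.load(json_file)

    # Collect column names in first-seen order across all records
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        # restval fills in an empty string for keys a record does not have
        writer = csv.DictWriter(csv_file, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(records)
    return columns


# Usage with a throwaway file that starts with a UTF-8 BOM
work_dir = tempfile.mkdtemp()
raw_path = os.path.join(work_dir, "raw.json")
clean_path = os.path.join(work_dir, "clean.csv")
with open(raw_path, "w", encoding="utf-8-sig") as raw_file:
    json.dump([{"id": 1, "name": "a"}, {"id": 2, "colour": "red"}], raw_file)

header = json_to_csv(raw_path, clean_path)
print(header)
```

Opening the input with the utf-8-sig codec means the same function handles files with and without a byte order mark.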

This package also contains the code to turn the pre-processing scripts into a package, ready to be uploaded and used by the Data Platform. Once this package is installed, run a command of the form

python -m <package name> <command> <folder with pre-processing code>

for example

python -m ingenii_data_engineering pre_processing_package pre_process

This will generate a .whl file in a folder called dist/. For more details, see the Ingenii Data Engineering Example repository.
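If the packaging step needs to be driven from a Python script, for example as part of a build script, the command above can be wrapped with subprocess. This is a sketch under stated assumptions: the helper names packaging_command and build_pre_processing_package are hypothetical, and the call only succeeds when ingenii_data_engineering is installed in the current interpreter and the script is run from the folder that contains the pre-processing code folder.

```python
# Sketch: wrap the packaging command shown above so it can be run from Python.
# The helper names are hypothetical; only the command itself comes from
# this README.
import subprocess
import sys


def packaging_command(folder: str = "pre_process") -> list:
    """Return the argument list for the packaging command."""
    return [
        sys.executable, "-m", "ingenii_data_engineering",
        "pre_processing_package", folder,
    ]


def build_pre_processing_package(folder: str = "pre_process") -> None:
    """Run the command; requires ingenii_data_engineering to be installed."""
    # check=True raises CalledProcessError if the command exits with an error
    subprocess.run(packaging_command(folder), check=True)


# Show the arguments without running anything
print(" ".join(packaging_command()[1:]))
```

Using sys.executable rather than a bare python keeps the call inside the active virtual environment.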

Development

Prerequisites

  1. A working knowledge of git SCM
  2. Installation of Python 3.7.3

Set up

  1. Complete the 'Getting Started > Prerequisites' section
  2. For Windows only:
  3. Run make setup to copy the .env file into place (.env-dist > .env)

Getting started

  1. Complete the 'Getting Started > Set up' section

  2. From the root of the repository, in a terminal (preferably in your IDE) run the following commands to set up a virtual environment:

    python -m venv venv
    . venv/bin/activate
    pip install -r requirements-dev.txt
    pre-commit install
    

    or for Windows:

    python -m venv venv
    . venv/Scripts/activate
    pip install -r requirements-dev.txt
    pre-commit install
    
  3. Note: if you get a permission denied error when executing the pre-commit install command, run chmod -R 775 venv/bin/ to recursively update permissions in the venv/bin/ directory

  4. The following checks are run as part of the pre-commit hooks: flake8 (note: unit tests are not run as a hook)

Building

  1. Complete the 'Getting Started > Set up' section
  2. Run make build to create the package in ./dist
  3. Run make clean to remove dist files

Testing

  1. Complete the 'Getting Started > Set up' and 'Development' sections
  2. Run make test to run the unit tests using pytest
  3. Run flake8 to run lint checks using flake8
  4. Run make qa to run the unit tests and linting in a single command
  5. Run make qa to remove pytest files

Version History

  • 0.3.3: Deprecated path for dbt
  • 0.3.2: Further bugfix for JSON UTF-8 BOM
  • 0.3.1: Remove unnecessary functions specific to Databricks
  • 0.3.0: Create pre-processing package using the module
  • 0.2.1: Handle JSON read UTF-8 BOM
  • 0.2.0: Pre-processing happens all in the 'archive' container
  • 0.1.5: Better functionality for column names in .csv files
  • 0.1.4: Handle JSON files
  • 0.1.3: Adding pre-processing utilities
  • 0.1.2: Rearrangement and better split of work with the Databricks Runtime. Better validation
  • 0.1.1: Minor bug fixes
  • 0.1.0: dbt schema validation, pre-processing class
