DataYoga for Python

DataYoga Core

Introduction

datayoga-core is the transformation engine used in DataYoga, a framework for building and generating data pipelines.

Installation

pip install datayoga-core

Quick Start

This demonstrates how to transform data using a DataYoga job.

Create a Job

Use this example.yaml:

steps:
    - uses: add_field
      with:
        fields:
          - field: full_name
            language: jmespath
            expression: concat([fname, ' ' , lname])
          - field: country
            language: sql
            expression: country_code || ' - ' || UPPER(country_name)
    - uses: rename_field
      with:
        fields:
          - from_field: fname
            to_field: first_name
          - from_field: lname
            to_field: last_name
    - uses: remove_field
      with:
        fields:
          - field: credit_card
          - field: country_name
          - field: country_code
    - uses: map
      with:
        expression:
          {
            first_name: first_name,
            last_name: last_name,
            greeting: "'Hello ' || CASE WHEN gender = 'F' THEN 'Ms.' WHEN gender = 'M' THEN 'Mr.' ELSE 'N/A' END || ' ' || full_name",
            country: country,
            full_name: full_name
          }
        language: sql

Transform Data Using datayoga-core

Use this code snippet to transform a data record using the job defined above:

import datayoga_core as dy
from datayoga_core.job import Job
from datayoga_core.utils import read_yaml

# Load the job definition and compile it into a runnable job.
job_settings = read_yaml("example.yaml")
job = dy.compile(job_settings)

# Input record and the record expected after all steps have run.
record = {"fname": "jane", "lname": "smith", "country_code": 1, "country_name": "usa", "credit_card": "1234-5678-0000-9999", "gender": "F"}
expected = {"first_name": "jane", "last_name": "smith", "country": "1 - USA", "full_name": "jane smith", "greeting": "Hello Ms. jane smith"}

assert job.transform(record) == expected

As can be seen, the record has been transformed based on the job:

  • fname field renamed to first_name.
  • lname field renamed to last_name.
  • country field added based on an SQL expression.
  • full_name field added based on a JMESPath expression.
  • greeting field added based on an SQL expression.
  • credit_card, country_name, and country_code fields removed.
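To make the effect of each step concrete, the following plain-Python sketch applies the same four steps by hand. It does not use datayoga-core: the function name apply_example_job is hypothetical, and the SQL and JMESPath expressions are hand-translated, so treat it as an illustration of the expected result rather than a description of how the engine works internally.

```python
def apply_example_job(record: dict) -> dict:
    """Hand-written equivalent of the steps in example.yaml (illustration only)."""
    rec = dict(record)  # work on a copy so the input is left untouched

    # add_field: full_name (JMESPath concat) and country (SQL || and UPPER)
    rec["full_name"] = f"{rec['fname']} {rec['lname']}"
    rec["country"] = f"{rec['country_code']} - {rec['country_name'].upper()}"

    # rename_field: fname -> first_name, lname -> last_name
    rec["first_name"] = rec.pop("fname")
    rec["last_name"] = rec.pop("lname")

    # remove_field: drop the fields that must not appear in the output
    for name in ("credit_card", "country_name", "country_code"):
        rec.pop(name, None)

    # map: build the final record, including the CASE-based greeting
    title = {"F": "Ms.", "M": "Mr."}.get(rec.get("gender"), "N/A")
    return {
        "first_name": rec["first_name"],
        "last_name": rec["last_name"],
        "greeting": f"Hello {title} {rec['full_name']}",
        "country": rec["country"],
        "full_name": rec["full_name"],
    }


result = apply_example_job({
    "fname": "jane",
    "lname": "smith",
    "country_code": 1,
    "country_name": "usa",
    "credit_card": "1234-5678-0000-9999",
    "gender": "F",
})
print(result)
```

The result matches the record expected by the assertion above; the map step is what ultimately decides which fields survive, which is why gender does not appear in the output.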

Examples

  • Add a new field country from an SQL expression that concatenates the country_code and country_name fields after converting the latter to upper case:

    uses: add_field
    with:
      field: country
      language: sql
      expression: country_code || ' - ' || UPPER(country_name)
    
  • Rename fname field to first_name and lname field to last_name:

    uses: rename_field
    with:
      fields:
        - from_field: fname
          to_field: first_name
        - from_field: lname
          to_field: last_name
    
  • Remove credit_card field:

    uses: remove_field
    with:
      field: credit_card
    

For a full list of supported block types see reference.

Expression Language

DataYoga supports both SQL and JMESPath expressions. JMESPath expressions are especially useful for handling nested JSON data, while SQL is better suited to flat, row-like structures.

Notes

  • Dot notation in an expression refers to nested fields in the object; for example, name.first_name refers to { "name": { "first_name": "John" } }.
  • To refer to a field whose name contains a dot, escape the dot with a backslash; for example, name\.first_name refers to { "name.first_name": "John" }.
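The two notes above can be illustrated with a small, self-contained sketch. The helpers split_path and resolve are hypothetical names written for this example; they only mimic the described dot and escaped-dot behavior and are not DataYoga's implementation.

```python
import re


def split_path(path: str) -> list:
    """Split on dots that are not preceded by a backslash, then unescape."""
    parts = re.split(r"(?<!\\)\.", path)
    return [part.replace("\\.", ".") for part in parts]


def resolve(record: dict, path: str):
    """Walk the record one key at a time; return None if a key is missing."""
    current = record
    for key in split_path(path):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


nested = {"name": {"first_name": "John"}}
flat = {"name.first_name": "John"}

print(resolve(nested, "name.first_name"))   # John
print(resolve(flat, r"name\.first_name"))   # John
print(resolve(flat, "name.first_name"))     # None
```

The unescaped path fails against the flat record because it is read as two nested keys, which is exactly the ambiguity the escape resolves.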

JMESPath Custom Functions

DataYoga adds the following custom functions to the standard JMESPath library:

| Function | Description | Input | Expression | Output | Comments |
| --- | --- | --- | --- | --- | --- |
| capitalize | Capitalizes all the words in the string | {"name": "john doe"} | capitalize(name) | John Doe | |
| concat | Concatenates an array of variables or literals | {"fname": "john", "lname": "doe"} | concat([fname, ' ', lname]) | john doe | Equivalent to the built-in expression: join(' ', [fname, lname]) |
| hash | Calculates a hash using the hash_name hash function and returns its hexadecimal representation | {"some_str": "some_value"} | hash(some_str, `sha1`) | 8c818171573b03feeae08b0b4ffeb6999e3afc05 | Supported algorithms: sha1 (default), sha256, md5, sha384, sha3_384, blake2b, sha512, sha3_224, sha224, sha3_256, sha3_512, blake2s |
| in | Checks if an element matches any value in a list of values | {"el": "b"} | in(el, `["a", "b", "c"]`) | True | |
| left | Returns a specified number of characters from the start of a given text string | {"greeting": "hello world!"} | left(greeting, 5) | hello | |
| lower | Converts all uppercase characters in a string into lowercase characters | {"fname": "John"} | lower(fname) | john | |
| mid | Returns a specified number of characters from the middle of a given text string | {"greeting": "hello world!"} | mid(greeting, 4, 3) | o w | |
| regex_replace | Replaces a string that matches a regular expression | {"text": "Banana Bannnana"} | regex_replace(text, 'Ban\w+', 'Apple') | Apple Apple | |
| replace | Replaces all the occurrences of a substring with a new one | {"sentence": "one four three four!"} | replace(sentence, 'four', 'two') | one two three two! | |
| right | Returns a specified number of characters from the end of a given text string | {"greeting": "hello world!"} | right(greeting, 6) | world! | |
| split | Splits a string into a list of strings using the specified delimiter | {"departments": "finance,hr,r&d"} | split(departments) | ['finance', 'hr', 'r&d'] | The default delimiter is a comma; a different delimiter can be passed as the second argument, for example: split(departments, ';') |
| time_delta_days | Returns the number of days between a given dt and now (positive) or the number of days that have passed from now (negative) | {"dt": "2021-10-06T18:56:16.701670+00:00"} | time_delta_days(dt) | 365 | If dt is a string, ISO datetime (2011-11-04T00:05:23+04:00, for example) is assumed. If dt is a number, Unix timestamp (1320365123, for example) is assumed. |
| time_delta_seconds | Returns the number of seconds between a given dt and now (positive) or the number of seconds that have passed from now (negative) | {"dt": "2021-10-06T18:56:16.701670+00:00"} | time_delta_seconds(dt) | 31557600 | If dt is a string, ISO datetime (2011-11-04T00:05:23+04:00, for example) is assumed. If dt is a number, Unix timestamp (1320365123, for example) is assumed. |
| upper | Converts all lowercase characters in a string into uppercase characters | {"fname": "john"} | upper(fname) | JOHN | |
| uuid | Generates a random UUID4 and returns it as a string in standard format | None | uuid() | 3264b35c-ff5d-44a8-8bc7-9be409dac2b7 | |
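For readers who want to reason about what a few of these functions return, here is a rough plain-Python sketch of their behavior as described in the table. These are not DataYoga's implementations. The zero-based offset in mid and the sign convention of time_delta_days (past dates give positive values) are inferred from the table's examples, the name hash_hex is hypothetical, and the now parameter is a hypothetical addition that makes the result reproducible.

```python
import hashlib
import string
from datetime import datetime, timezone


def capitalize(text):
    # Capitalize every whitespace-separated word.
    return string.capwords(text)


def left(text, count):
    return text[:count]


def right(text, count):
    # Guard against count == 0, because text[-0:] would return the whole string.
    return text[-count:] if count > 0 else ""


def mid(text, offset, count):
    # Zero-based offset, inferred from the table's example ("o w").
    return text[offset:offset + count]


def split(text, delimiter=","):
    return text.split(delimiter)


def hash_hex(value, hash_name="sha1"):
    # hashlib.new accepts the algorithm names listed in the table.
    return hashlib.new(hash_name, value.encode("utf-8")).hexdigest()


def time_delta_days(dt, now=None):
    # `now` is a hypothetical parameter; by default the current UTC time is used.
    now = now or datetime.now(timezone.utc)
    if isinstance(dt, str):
        parsed = datetime.fromisoformat(dt)  # ISO datetime string
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    else:
        parsed = datetime.fromtimestamp(dt, tz=timezone.utc)  # Unix timestamp
    return (now - parsed).days


print(capitalize("john doe"))        # John Doe
print(left("hello world!", 5))       # hello
print(mid("hello world!", 4, 3))     # o w
print(right("hello world!", 6))      # world!
print(split("finance,hr,r&d"))       # ['finance', 'hr', 'r&d']

one_year_later = datetime(2022, 10, 6, 18, 56, 16, 701670, tzinfo=timezone.utc)
print(time_delta_days("2021-10-06T18:56:16.701670+00:00", now=one_year_later))  # 365
```

Pinning now in the last call shows why the table's time_delta outputs are only representative: without it, the result depends on when the expression is evaluated.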
