DataYoga for Python
DataYoga Core
Introduction
datayoga-core is the transformation engine used in DataYoga, a framework for building and generating data pipelines.
Installation
pip install datayoga-core
Quick Start
This quick start shows how to transform a data record using a DataYoga job.
Create a Job
Use this example.yaml:

- steps:
    - uses: add_field
      with:
        fields:
          - field: full_name
            language: jmespath
            expression: concat([fname, ' ' , lname])
          - field: country
            language: sql
            expression: country_code || ' - ' || UPPER(country_name)
    - uses: rename_field
      with:
        fields:
          - from_field: fname
            to_field: first_name
          - from_field: lname
            to_field: last_name
    - uses: remove_field
      with:
        fields:
          - field: credit_card
          - field: country_name
          - field: country_code
    - uses: map
      with:
        expression:
          {
            first_name: first_name,
            last_name: last_name,
            greeting: "'Hello ' || CASE WHEN gender = 'F' THEN 'Ms.' WHEN gender = 'M' THEN 'Mr.' ELSE 'N/A' END || ' ' || full_name",
            country: country,
            full_name: full_name,
          }
        language: sql
Transform Data Using datayoga-core
Use this code snippet to transform a data record using the job defined above:
import datayoga_core as dy
from datayoga_core.job import Job
from datayoga_core.utils import read_yaml

# Load the job definition and compile it into a runnable job
job_settings = read_yaml("example.yaml")
job = dy.compile(job_settings)

# Transform a single record and compare it with the expected result
record = {"fname": "jane", "lname": "smith", "country_code": 1, "country_name": "usa", "credit_card": "1234-5678-0000-9999", "gender": "F"}
expected = {"first_name": "jane", "last_name": "smith", "country": "1 - USA", "full_name": "jane smith", "greeting": "Hello Ms. jane smith"}
assert job.transform(record) == expected
As can be seen, the record has been transformed based on the job:
- fname field renamed to first_name.
- lname field renamed to last_name.
- country field added based on an SQL expression.
- full_name field added based on a JMESPath expression.
- greeting field added based on an SQL expression.
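The two SQL expressions in the job can be checked outside DataYoga. The sketch below is an illustration only: it uses Python's built-in sqlite3 module as a stand-in SQL engine (an assumption, since this document does not say which engine DataYoga uses), and eval_sql_expression is a hypothetical helper, not part of datayoga-core.

```python
import sqlite3
from contextlib import closing


def eval_sql_expression(expression, record):
    """Evaluate an SQL expression with the record's fields exposed as columns.

    Hypothetical helper for illustration; not part of datayoga-core.
    """
    # Bind every field value as a parameter and alias it to the field name
    columns = ", ".join(f'? AS "{name}"' for name in record)
    query = f"SELECT {expression} FROM (SELECT {columns})"
    with closing(sqlite3.connect(":memory:")) as conn:
        return conn.execute(query, list(record.values())).fetchone()[0]


record = {
    "country_code": 1,
    "country_name": "usa",
    "gender": "F",
    "full_name": "jane smith",  # produced earlier in the job by the JMESPath add_field step
}

country = eval_sql_expression("country_code || ' - ' || UPPER(country_name)", record)
greeting = eval_sql_expression(
    "'Hello ' || CASE WHEN gender = 'F' THEN 'Ms.' WHEN gender = 'M' THEN 'Mr.' "
    "ELSE 'N/A' END || ' ' || full_name",
    record,
)
print(country)   # 1 - USA
print(greeting)  # Hello Ms. jane smith
```

Both values match the country and greeting fields in the expected output of the quick start.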
Examples
- Add a new field, country, from an SQL expression that concatenates the country_code and country_name fields after upper-casing the latter:

      uses: add_field
      with:
        field: country
        language: sql
        expression: country_code || ' - ' || UPPER(country_name)

- Rename the fname field to first_name and the lname field to last_name:

      uses: rename_field
      with:
        fields:
          - from_field: fname
            to_field: first_name
          - from_field: lname
            to_field: last_name

- Remove the credit_card field:

      uses: remove_field
      with:
        field: credit_card
For a full list of supported block types, see the reference.
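To make the effect of the rename_field and remove_field examples concrete, here is a plain-Python sketch of what they do to a record. The functions borrow the block names for readability, but they are hypothetical stand-ins and not DataYoga code: they only handle top-level fields and silently skip fields that are missing, which may differ from the actual blocks.

```python
def rename_field(record, fields):
    """Return a copy of record with each from_field renamed to to_field."""
    result = dict(record)
    for spec in fields:
        if spec["from_field"] in result:
            result[spec["to_field"]] = result.pop(spec["from_field"])
    return result


def remove_field(record, fields):
    """Return a copy of record without the listed fields."""
    removed = {spec["field"] for spec in fields}
    return {key: value for key, value in record.items() if key not in removed}


record = {"fname": "jane", "lname": "smith", "credit_card": "1234-5678-0000-9999"}

renamed = rename_field(
    record,
    [
        {"from_field": "fname", "to_field": "first_name"},
        {"from_field": "lname", "to_field": "last_name"},
    ],
)
cleaned = remove_field(renamed, [{"field": "credit_card"}])
print(cleaned)
```

The argument shapes mirror the fields lists under with: in the job definition, so the same configuration can be read as data for these helpers.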
Expression Language
DataYoga supports both SQL and JMESPath expressions. JMESPath expressions are especially useful for handling nested JSON data, while SQL is better suited to flat, row-like structures.
Notes
- Dot notation in an expression represents nested fields in the object. For example, name.first_name refers to { "name": { "first_name": "John" } }.
- To refer to a field that contains a dot in its name, escape the dot. For example, name\.first_name refers to { "name.first_name": "John" }.
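The two notes above can be illustrated with a small path resolver. This is a minimal sketch of the notation only, assuming a backslash escapes exactly one dot; split_path and get_field are hypothetical names, and DataYoga's own expression parsing may work differently.

```python
import re


def split_path(path):
    """Split a field path on dots that are not escaped with a backslash."""
    parts = re.split(r"(?<!\\)\.", path)
    # Turn each escaped dot back into a literal dot within the field name
    return [part.replace("\\.", ".") for part in parts]


def get_field(record, path):
    """Walk nested dictionaries following the keys in the path."""
    value = record
    for key in split_path(path):
        value = value[key]
    return value


nested = {"name": {"first_name": "John"}}
flat = {"name.first_name": "John"}

print(get_field(nested, "name.first_name"))   # John
print(get_field(flat, r"name\.first_name"))   # John
```

An unescaped dot descends one level into the object, while an escaped dot stays part of the key.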
JMESPath Custom Functions
DataYoga adds the following custom functions to the standard JMESPath library:
Function | Description | Example | Comments |
---|---|---|---|
capitalize | Capitalizes all the words in the string | Input: {"name": "john doe"} Expression: capitalize(name) Output: John Doe | |
concat | Concatenates an array of variables or literals | Input: {"fname": "john", "lname": "doe"} Expression: concat([fname, ' ', lname]) Output: john doe | This is equivalent to the more verbose built-in expression: join(' ', [fname, lname]) |
hash | Calculates a hash using the hash_name hash function and returns its hexadecimal representation | Input: {"some_str": "some_value"} Expression: hash(some_str, `sha1`) Output: 8c818171573b03feeae08b0b4ffeb6999e3afc05 | Supported algorithms: sha1 (default), sha256, md5, sha384, sha3_384, blake2b, sha512, sha3_224, sha224, sha3_256, sha3_512, blake2s |
left | Returns a specified number of characters from the start of a given text string | Input: {"greeting": "hello world!"} Expression: left(greeting, `5`) Output: hello | |
lower | Converts all uppercase characters in a string into lowercase characters | Input: {"fname": "John"} Expression: lower(fname) Output: john | |
mid | Returns a specified number of characters from the middle of a given text string | Input: {"greeting": "hello world!"} Expression: mid(greeting, `4`, `3`) Output: o w | |
replace | Replaces all the occurrences of a substring with a new one | Input: {"sentence": "one four three four!"} Expression: replace(sentence, 'four', 'two') Output: one two three two! | |
right | Returns a specified number of characters from the end of a given text string | Input: {"greeting": "hello world!"} Expression: right(greeting, `6`) Output: world! | |
split | Splits a string into a list of strings using the specified delimiter (comma by default) | Input: {"departments": "finance,hr,r&d"} Expression: split(departments) Output: ['finance', 'hr', 'r&d'] | A different delimiter can be passed as the second argument, for example: split(departments, ';') |
time_delta_days | Returns the number of days between a given dt and now (positive) or the number of days that have passed from now (negative) | Input: {"dt": "2021-10-06T18:56:16.701670+00:00"} Expression: time_delta_days(dt) Output: 365 | If dt is a string, ISO datetime (2011-11-04T00:05:23+04:00, for example) is assumed. If dt is a number, Unix timestamp (1320365123, for example) is assumed. |
time_delta_seconds | Returns the number of seconds between a given dt and now (positive) or the number of seconds that have passed from now (negative) | Input: {"dt": "2021-10-06T18:56:16.701670+00:00"} Expression: time_delta_seconds(dt) Output: 31557600 | If dt is a string, ISO datetime (2011-11-04T00:05:23+04:00, for example) is assumed. If dt is a number, Unix timestamp (1320365123, for example) is assumed. |
upper | Converts all lowercase characters in a string into uppercase characters | Input: {"fname": "john"} Expression: upper(fname) Output: JOHN | |
uuid | Generates a random UUID4 and returns it as a string in standard format | Input: None Expression: uuid() Output: 3264b35c-ff5d-44a8-8bc7-9be409dac2b7 | |
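For readers more familiar with Python than JMESPath, the sketch below shows rough plain-Python equivalents of several functions from the table. These are not DataYoga's implementations. The zero-based start in mid and the sign convention of the time deltas are inferred from the table examples; the UTF-8 encoding in hash_hex, the UTC default for strings without an offset, and the now parameter (added so results are reproducible) are assumptions.

```python
import hashlib
from datetime import datetime, timezone


def capitalize(text):
    """Capitalize every space-separated word."""
    return " ".join(word.capitalize() for word in text.split(" "))


def left(text, count):
    return text[:count]


def mid(text, start, count):
    # start is treated as a zero-based offset, matching the table example
    return text[start:start + count]


def right(text, count):
    return text[-count:] if count > 0 else ""


def hash_hex(text, hash_name="sha1"):
    """Hex digest of text (UTF-8 encoded, an assumption) using the named algorithm."""
    return hashlib.new(hash_name, text.encode("utf-8")).hexdigest()


def _to_datetime(dt):
    """Numbers are read as Unix timestamps, strings as ISO datetimes."""
    if isinstance(dt, (int, float)):
        return datetime.fromtimestamp(dt, tz=timezone.utc)
    parsed = datetime.fromisoformat(dt)
    # Assumption: a string without an offset is interpreted as UTC
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def time_delta_seconds(dt, now=None):
    """Seconds elapsed from dt to now; positive when dt is in the past."""
    now = now or datetime.now(timezone.utc)
    return int((now - _to_datetime(dt)).total_seconds())


def time_delta_days(dt, now=None):
    """Whole days elapsed from dt to now, rounded down."""
    now = now or datetime.now(timezone.utc)
    return (now - _to_datetime(dt)).days


greeting = "hello world!"
print(left(greeting, 5))       # hello
print(mid(greeting, 4, 3))     # o w
print(right(greeting, 6))      # world!
print(capitalize("john doe"))  # John Doe

# A fixed "now" exactly one year after dt keeps the result stable
fixed_now = datetime(2022, 10, 6, 18, 56, 16, 701670, tzinfo=timezone.utc)
print(time_delta_days("2021-10-06T18:56:16.701670+00:00", fixed_now))  # 365
```

Without the now argument, the time delta helpers measure against the current time, so their results change from run to run, as do the outputs shown for time_delta_days and time_delta_seconds in the table.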