Skip to main content

A Data Dependency Graph Framework and Executor

Project description

DataJet

A Data Dependency Graph Framework and Executor

Key Features

  • Lazy: Only Evaluate and return the data you need
  • Declarative: Declare Data Dependencies and transformations explicitly, using plain python
  • Dependency-Free: Just Python.

Installation

Requirements:

  • Python >=3.8

To Get Started, Install DataJet From pypi:

pip install datajet

Usage

DataJet dependencies are expressed as a dict. Each key-value pair in the dict corresponds to a piece of data and specifies how to calculate it:

from datajet import execute
datajet_map = {
    "dollars": [3.99, 10.47, 18.95, 15.16,],
    "units": [1, 3, 5, 4,],
    "prices": lambda dollars, units: [d/u for d, u in zip(dollars, units)],
    "average_price": lambda prices: sum(prices) / len(prices) * 1000 // 10 / 100
}
execute(datajet_map, fields=['prices', 'average_price'])
{'prices': [3.99, 3.49, 3.79, 3.79], 'average_price': 3.76}

You can also override any data declaration at "execute time":

execute(
        datajet_map, 
        context={"dollars": map(lambda x: x*2, [3.99, 10.47, 18.95, 15.16,])}, 
        fields=['average_price']
)
{'average_price': 7.52}

Keys can be any hashable. The value corresponding to each key can be a function or an object. The functions can have 0 or more parameters. The parameter names must correspond to other keys in the dict if no explicitly defined inputs to the callable are declared in the map. See data maps for how to explicitly define inputs.

You can also define multiple ways of calculating a piece of data via defining a list of functions as the value to the key. Again, each function's parameters must correspond to other keys in the dict, or else you can define which other keys should be inputs to the function via explicitly defining inputs.

The benefits of specifying your data like this may not be immediately apparent, but this gets powerful when you create many possible functions:

from datajet import execute 
departments =[
    {"tag": "GR", "value": "GROCERY"}, {"tag": "FZ", "value": "FROZEN"}
]

categories = [
    {"tag": "PB", "value": "Peanut Butter", "department_tag": "GR"},
    {"tag": "J", "value": "Jelly", "department_tag": "GR",},
    {"tag": "FZE", "value":"FROZEN ENTREES", "department_tag": "FZ"}
]

subcategories = [
    {"tag": "NPB", "value": "PEANUT BUTTER - NATURAL", "category_tag": "PB"},
    {"tag": "CPB", "value": "PEANUT BUTTER - CONVENTIONAL", "category_tag": "PB"}
]

datajet_map = {
    "category": lambda category_tag: dict((d['tag'], d['value']) for d in categories).get(category_tag),
    "category_tag": [
        lambda category: dict((d['value'], d['tag']) for d in categories).get(category),
        lambda subcategory: dict((d['value'], d['category_tag']) for d in subcategories).get(subcategory),
    ],
    "subcategory": lambda subcategory_tag: dict((d['tag'], d['value']) for d in subcategories).get(subcategory_tag),
    "department_tag": lambda category: dict((d['value'], d['department_tag']) for d in categories).get(category),
    "department": lambda department_tag: dict((d['tag'], d['value']) for d in departments).get(department_tag),
}
execute(datajet_map, context={"subcategory_tag": "NPB"}, fields={"department"})
{'department': 'GROCERY'}

In the above example, datajet.execute knows to traverse the path from subcategory_tag -> subcategory -> category -> department_tag -> department to find the corresponding department to the subcategory_tag passed into the context parameter. Datajet will find connections from "the data you have" to "the data you want" in the data dependency tree, execute the given functions, and return the result.

This framework frees you (the coder) from the need for more global knowledge about how pieces of data are connected when you request data. To define a datapoint you only need local knowledge of it's immediate inputs, and datajet finds the fastest path from the data you input to what you need.

Development

git clone
make install

This will start a poetry shell that has all the dev dependencies installed.

Development troubleshooting

If you see:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>

Go to /Applications/Python3.x and run 'Install Certificates.command'

Built on ideas inspired by

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datajet-0.1.0a2.tar.gz (7.6 kB view hashes)

Uploaded Source

Built Distribution

datajet-0.1.0a2-py3-none-any.whl (7.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page