Skip to main content

Small dependency resolution library for datasets

Project description

# Dependency resolution for datasets

![Travis status](https://travis-ci.org/njwilson23/depgraph.svg?branch=master)

`depgraph` is a tiny Python library for expressing networks of dependencies
required to construct datasets. Networks are declared in terms of the
relationships between source and target datasets (network graph edges).
`depgraph` can then report descendants and parents for any particular node and
instruct builds in a manner similar to `make`. When a `DependencyGraph` object
returns a dataset that must be built, it provides a reason, such as:

- the dataset is missing
- the dataset is out of date and required by another dataset
- the dataset is a target dataset

`depgraph` is intended to be a component for assembling dataset build tools.
Important considerations for such a build tool are that it must:

- permit reproducible analysis
- be documenting
- perform fast rebuilds to enable experimentation

`depgraph` contains the following classes:

- `depgraph.DependencyGraph`
- `depgraph.Dataset`
- `depgraph.DatasetGroup`
- `depgraph.Reason`

## Example

Declare a set of dependencies resembling the graph below:

R0 R1 R2 R3 [raw data]
\ / | |
DA0 DA1 /
\ / \ /
DB0 DB1
\ / | \
\ / | \
DC0 DC1 DC2 [products]


```python
from depgraph import Dataset, DependencyGraph

# Define Datasets
# use an optional keyword `tool` to provide a key instructing our build tool
# how to assemble this product
R0 = Dataset("data/raw0", tool="read_csv")
R1 = Dataset("data/raw1", tool="read_csv")
R2 = Dataset("data/raw2", tool="database_query")
R3 = Dataset("data/raw3", tool="read_hdf")

DA0 = Dataset("step1/da0", tool="merge_fish_counts")
DA1 = Dataset("step1/da1", tool="process_filter")

DB0 = Dataset("step2/db0", tool="join_counts")
DB1 = Dataset("step2/db1", tool="join_by_date")

DC0 = Dataset("results/dc0", tool="merge_model_obs")
DC1 = Dataset("results/dc1", tool="compute_uncertainty")
DC2 = Dataset("results/dc2", tool="make_plots")

graph = DependencyGraph()

# Declare relationships
graph.add_dataset(da0, (raw0, raw1))
graph.add_dataset(da1, (raw2,))
graph.add_dataset(db0, (da0, da1))
graph.add_dataset(db1, (da1, raw3))
graph.add_dataset(dc0, (db0, db1))
graph.add_dataset(dc1, (db1,))
graph.add_dataset(dc2, (db1,))

# Query buildsteps to build a product
while True:
targets = graph.buildable(DC1)

if len(targets) == 0:
break

for target, reason in targets:
# Each target is a dataset with a 'name' attribute and whatever
# additional keyword arguments where defined with it.
# The 'reason' is a depgraph.Reason object that codifies why a
# particular target is necessary (e.g. it's out of date, it's missing,
# and required by a subsequent target, etc.)
print("Building {0} with {1} because {2}".format(target.name,
target.tool,
reason))
# Call a function or start a subprocess that will result in the
# target being built and saved to a file
my_build_func(target.tool, target.name)
# [...]
```

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data-depgraph-0.1.dev0.tar.gz (6.4 kB view details)

Uploaded Source

File details

Details for the file data-depgraph-0.1.dev0.tar.gz.

File metadata

File hashes

Hashes for data-depgraph-0.1.dev0.tar.gz
Algorithm Hash digest
SHA256 37422d077869e5d111ae5708d62f7e78e24cbfa8c6e9a27f5b8296306f900a30
MD5 15fdddf73c21ac63e5f9f8e1f0ffe809
BLAKE2b-256 8f80a3e17af58dea3c583aadd0439db2751b48b8c41b93a23805740cecca5cde

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page