Advanced python 'what changed and what do we need to do' tracking

These details have not been verified by PyPI

Project links

homepage

Project description

pypipegraph2

[docs](https://tyberiusprime.github.io/pypipegraph2/)

Fine-grained tracking of what goes into generated artifacts, and when it's necessary to recalculate them.

Also, trivial parallelization.

Description

There's a bunch of 'pipeline' packages out there.

For scientific workflow management.

Often with a lot of 'magic'.

SnakeMake for example is popular.

Pypipegraph2 is a bit different.

You build a directed, acyclic graph (DAG) of jobs that you want done.

A job is, for example,

- a python function and the name of the file(s) it generates
- a python function
- a name and a set of parameters
- the name of an input file

Pypipegraph2 will hash all the things. Your input files, your (intermediary) output files. Your parameters. Even your python functions (source and bytecode).

When any of those change, or the number of inputs to a job changes, we recalculate it. Only if its output changed do we recalculate the immediate downstreams. Do they return their old output (because your change was inconsequential)? Then we do not recalculate their downstreams.

That's the big thing with respect to pypipegraph 1. When the input changed back then, everything downstream was recalculated.
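
The early-stopping behaviour described above can be illustrated with a small, self-contained toy. This is only a sketch and not how pypipegraph2 is implemented: `ToyJob`, `run`, the `version` string (standing in for a function hash) and the `history` dict are all hypothetical names made up for this example.

```python
import hashlib


def digest(value: str) -> str:
    """Hash a string; stands in for hashing files, parameters or functions."""
    return hashlib.sha256(value.encode()).hexdigest()


class ToyJob:
    """Hypothetical stand-in for a job: a function of its upstream outputs."""

    def __init__(self, name, func, upstreams=(), version="v1"):
        self.name = name
        self.func = func
        self.upstreams = list(upstreams)
        self.version = version  # pretend this is the hash of the function


def run(jobs, history):
    """Run jobs given in topological order; return the names that were recalculated."""
    recalculated = []
    for job in jobs:
        upstream_hashes = [history[up.name]["output_hash"] for up in job.upstreams]
        signature = digest(job.version + "|" + "|".join(upstream_hashes))
        previous = history.get(job.name)
        if previous is None or previous["signature"] != signature:
            output = job.func(*[history[up.name]["output"] for up in job.upstreams])
            history[job.name] = {
                "signature": signature,
                "output": output,
                "output_hash": digest(output),
            }
            recalculated.append(job.name)
    return recalculated


a = ToyJob("a", lambda: "Hello")
b = ToyJob("b", lambda text: text.upper(), [a])
c = ToyJob("c", lambda text: text + "!", [b])

history = {}
first = run([a, b, c], history)   # nothing recorded yet: everything runs
second = run([a, b, c], history)  # nothing changed: nothing runs

# Change job a in a way that does not affect b's output.
a.func = lambda: "hello"
a.version = "v2"
third = run([a, b, c], history)   # a and b rerun, c is left alone
```

In the third run, `b` still produces "HELLO", so the recorded input hash of `c` is unchanged and `c` is not recalculated.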

You can use this in notebooks. You can use it in scripts. You can use it in complicated scientific pipelines. It scales easily to a few hundred thousand jobs.

Here is the simplest example:

import pypipegraph2 as ppg2

ppg2.new()  # start a fresh graph

def do_it(output_path):
    # output_path is the target file ('hello.txt')
    output_path.write_text("Hello world")

job = ppg2.FileGeneratingJob('hello.txt', do_it)
ppg2.run()  # evaluate the graph

And you'll get two jobs: A FileGeneratingJob and a FunctionInvariant to go with it.

If you change do_it and rerun your script, the output will change.

def do_it(output_path):
	output_path.write_text("Hello world, how are you today")

If you delete the output file, introduce a dependency (say, by job.depends_on(ppg2.FunctionInvariant(my_other_function))), or remove such a dependency, 'hello.txt' will be rebuilt.

For interactive work, you can redefine jobs and rerun the same graph again.

If jobs fail, those downstream of them (that is, dependent on them) will not be evaluated. But everything outside of that part of the DAG will be.
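
As a rough illustration of that failure behaviour, here is a toy executor. It is a sketch under simplifying assumptions (sequential execution, jobs already in topological order) and does not use pypipegraph2; `run_graph` and the job names are hypothetical.

```python
def run_graph(jobs, upstreams):
    """Run a toy graph.

    jobs: dict of name -> callable, in topological order.
    upstreams: dict of name -> list of upstream job names.
    Returns a dict of name -> 'ok', 'failed' or 'skipped'.
    """
    status = {}
    for name, func in jobs.items():
        # A job is skipped if any upstream failed or was itself skipped.
        if any(status[up] != "ok" for up in upstreams.get(name, [])):
            status[name] = "skipped"
            continue
        try:
            func()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return status


def boom():
    raise ValueError("simulated failure")


jobs = {
    "load": lambda: None,
    "broken": boom,
    "after_broken": lambda: None,
    "independent": lambda: None,
}
upstreams = {
    "broken": ["load"],
    "after_broken": ["broken"],
    "independent": ["load"],
}
status = run_graph(jobs, upstreams)
```

Here `after_broken` is skipped because it depends on the failing job, while `independent` still runs.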

Jobs will run in parallel, using both multi-threading (for jobs modifying the currently running program) and multi-process (for FileGenerating jobs).

Jobs like AttributeLoading and TempFileGenerating have cleanups that run when their immediate downstreams have been processed. They also only run when they're required by a downstream, or when their inputs have changed.
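
The "clean up once the immediate downstreams are done" idea can be sketched with a simple counter per temporary job. This toy only models the cleanup timing, not the run-on-demand part, and `run_with_cleanup` and the job names are hypothetical, not part of pypipegraph2.

```python
def run_with_cleanup(order, downstreams, ephemeral):
    """Simulate a run and return a log of events.

    order: job names in topological order.
    downstreams: dict of name -> list of immediate downstream names.
    ephemeral: set of job names whose output is removed as soon as possible.
    """
    # How many immediate downstreams still need each ephemeral job's output.
    pending = {name: len(downstreams.get(name, [])) for name in ephemeral}
    upstream_of = {}
    for up, downs in downstreams.items():
        for down in downs:
            upstream_of.setdefault(down, []).append(up)

    log = []
    for name in order:
        log.append(f"run {name}")
        for up in upstream_of.get(name, []):
            if up in pending:
                pending[up] -= 1
                if pending[up] == 0:
                    log.append(f"cleanup {up}")
    return log


log = run_with_cleanup(
    order=["temp", "plot_a", "plot_b", "report"],
    downstreams={
        "temp": ["plot_a", "plot_b"],
        "plot_a": ["report"],
        "plot_b": ["report"],
    },
    ephemeral={"temp"},
)
```

The temporary output is removed right after `plot_b`, the last immediate downstream, and before `report` runs.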

FunctionInvariants are smart. They compare bytecode if you're using the same Python version, and fall back to comparing source code if the Python version has changed.
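
A minimal sketch of that comparison strategy, using only the standard library. This is an illustration of the idea, not the actual FunctionInvariant code; `function_fingerprint` and `is_unchanged` are hypothetical helpers, and the bytecode hash here only covers the instruction bytes and simple constants.

```python
import hashlib
import inspect
import sys


def function_fingerprint(func):
    """Record the Python version, a bytecode hash and (if available) a source hash."""
    code = func.__code__
    # Skip nested code objects: their repr contains memory addresses.
    consts = [c for c in code.co_consts if not inspect.iscode(c)]
    bytecode_hash = hashlib.sha256(code.co_code + repr(consts).encode()).hexdigest()
    try:
        source = inspect.getsource(func)
    except (OSError, TypeError):
        source = None  # e.g. functions defined in an interactive session
    source_hash = None
    if source is not None:
        source_hash = hashlib.sha256(source.encode()).hexdigest()
    return {
        "python": tuple(sys.version_info[:2]),
        "bytecode": bytecode_hash,
        "source": source_hash,
    }


def is_unchanged(old, new):
    """Compare bytecode on the same Python version, source code otherwise."""
    if old["python"] == new["python"]:
        return old["bytecode"] == new["bytecode"]
    if old["source"] is None or new["source"] is None:
        return False  # nothing comparable: treat the function as changed
    return old["source"] == new["source"]


def greet():
    return "Hello world"


def greet_changed():
    return "Hello world, how are you today"


before = function_fingerprint(greet)
same = is_unchanged(before, function_fingerprint(greet))
different = not is_unchanged(before, function_fingerprint(greet_changed))
```

Bytecode is not stable between Python versions, which is why a source comparison is the sensible fallback after an interpreter upgrade.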

Jobs available

- FileGeneratingJob(path, function) - generate one file
- MultiFileGeneratingJob(paths | dict of paths, function) - generate multiple files
- TempFileGeneratingJob(path, function) - generate one file, and remove it asap
- MultiTempFileGeneratingJob(paths | dict of paths, function) - generate multiple files, and remove them asap
- FunctionInvariant(name, function) - track the source/bytecode of a function
- FileInvariant(path) - track an input file
- ParameterInvariant(name, (something, {'other': True})) - track parameters
- DataLoadingJob(name, func) - run this function in the current process
- AttributeLoadingJob(name, object, attribute_name, func) - store the result of func on object
- CachedDataLoadingJob(path, calc_func, load_func) - run calc_func, cache its output in a file, and call load_func(output) if required
- CachedAttributeLoadingJob(path, object, attribute_name, func) - store the result of func in a file, and load it when necessary
- JobGeneratingJob(name, func) - generate more jobs (after the upstreams have run!)
- PlotJob(output_path, calc_func, plot_func) - generate some data, store it in a cache file, dump it to a spreadsheet, generate a plot from the data, and store it in output_path

Rust engine

Starting with version 3.0.0, the actual engine is written in Rust.

This is a complete rewrite of the inner workings. There were a small number of situations left where a graph would not evaluate, mostly involving failing jobs, and the python solution was very hard to follow - thanks to the 'run-on-demand' nature of temporary jobs.

The new Rust engine is based on the insight that while externally, we have a lot of job classes, for the evaluation only three kinds of jobs exist: Always, Output and Ephemeral.

This allowed a much more complete testing regime: the engine was tested against all possible graphs (minus isomorphic equivalents) up to 7 nodes, and all possible graphs with 1..n failures up to 6 nodes (and roughly a quarter of the possible 7-node graphs). This has increased my confidence that this implementation is finally correct.
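
To give a feel for this kind of exhaustive testing, here is a small sketch that enumerates every DAG shape on a handful of nodes and checks an invariant on each. It is not the project's test suite: `all_dags`, `evaluate` (a plain Kahn's algorithm standing in for the engine) and `check_all` are hypothetical, and isomorphic duplicates are not filtered out.

```python
from itertools import combinations, product


def all_dags(n):
    """Yield the edge list of every DAG on nodes 0..n-1 with edges from lower to higher index.

    Every DAG can be relabelled into this form via a topological order, so all
    shapes are covered (isomorphic duplicates included).
    """
    possible_edges = list(combinations(range(n), 2))
    for mask in product([False, True], repeat=len(possible_edges)):
        yield [edge for edge, keep in zip(possible_edges, mask) if keep]


def evaluate(n, edges):
    """Toy engine under test: Kahn's algorithm, returns the execution order."""
    remaining = {node: 0 for node in range(n)}
    downstream = {node: [] for node in range(n)}
    for up, down in edges:
        remaining[down] += 1
        downstream[up].append(down)
    ready = [node for node in range(n) if remaining[node] == 0]
    order = []
    while ready:
        node = ready.pop()
        order.append(node)
        for down in downstream[node]:
            remaining[down] -= 1
            if remaining[down] == 0:
                ready.append(down)
    return order


def check_all(n):
    """Check on every graph that each node runs once, and only after its upstreams."""
    checked = 0
    for edges in all_dags(n):
        order = evaluate(n, edges)
        position = {node: index for index, node in enumerate(order)}
        assert sorted(order) == list(range(n))
        assert all(position[up] < position[down] for up, down in edges)
        checked += 1
    return checked


checked = check_all(4)  # 2**6 == 64 edge subsets for 4 nodes
```

The number of graphs grows as 2^(n(n-1)/2), which is presumably why filtering isomorphic equivalents matters at 6 or 7 nodes.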

The drawback, of course, is that you need to install a binary wheel or build with maturin. The nix flake has a dev environment with everything set up.

Note

Differences to pypipegraph

- graphs can now be run multiple times
- calling a job will run the graph cut down to this job and its prerequisites.
  Some jobs, like PlotJobs, will return something useful from the call.

- FileGeneratingJob callbacks must now take the target filename as their first parameter.
  MultiFileGeneratingJob callbacks receive either their mapping or their list of output files.
  This is checked at definition time.
- MultiFileGeneratingJob may receive a dict of 'pretty name' -> 'filename'.
  Then you can depends_on(mfg['pretty name A']) to only invalidate when 'filenameA's content changes!
- PlotJob now returns a tuple: (PlotJob, Optional(CalcJob), Optional(TableJob)).
  This removes all the unintuitive ugliness of 'which job will depends_on add the
  dependency to'.
- PlotJob skip_table / skip_caching are now create_table and cache_calc (defaulting to True)
- CachedAttributeLoadingJob now returns a tuple (AttributeLoadingJob, FileGeneratingJob)
- .ignore_code_changes has been replaced by constructor argument depend_on_function (inverted!)
- ppg.run_pipegraph() / ppg.new_pipegraph() is now just ppg.run() / ppg.new()
- ppg.RunTimeError is now ppg.RunFailed

- Removed the following due to 'no usage':
	- class MemMappedDataLoadingJob
	- class DependencyInjectionJob 
	- class TempFilePlusGeneratingJob 
	- class NotebookJob
	- class CombinedPlotJob
	- class FinalJob (was only used in the Bil)
	- PlotJob.add_fiddle
	- class JobList (depends_on handles all use cases without this special class / and it was unused)

	
- a failed job's exceptions are no longer available as job.exception,
  they can now be found in ppg.global_pipegraph.last_run[job_id].error
  (last_run is also the result of ppg.run() if you set do_raise = False)

- CycleError is now NotADag
- ParameterInvariant no longer takes an 'accept_as_unchanged' function. I suppose it would be trivial to implement using compare_hashes, but I couldn't find any current usage.
- job + job (which returned a JobList) is no longer supported. Job.depends_on can be called with any number of jobs (this was already true in ppg1, but the + syntax was still around)
- job.is_in_dependency_chain has been removed
- the 'graph dumping' functionality has been removed for now
- passing the wrong type of argument (such as a non callable to FunctionInvariant) raises TypeError instead of ValueError
- FileGeneratingJob rejects empty outputs by default (this can be changed by passing empty_ok=True). The default is inverted for MultiFileGeneratingJob.
- In ppg1 if a file existed, a (new) FileGeneratingJob covering it was not run. PPG2 runs the FileGeneratingJob in order to record the right hash.
- A failing job no longer removes its output. We know to rerun it because we didn't record its new input hashes. This also means that rename_broken has been removed.
- TempFileGeneratingJob.do_cleanup_if_was_never_run is no more - I don't think it was ever used outside of testing
- Defining multiple jobs creating the same output raises JobOutputConflict (more specific than ValueError)
- The execution of 'useless' leaf jobs now usually happens at least once, due to them being invalidated by their FunctionInvariant
- JobDiedException is now called just 'JobDied'
- The various 'FileTimeInvariant/FileChecksumInvariant/RobustFileChecksumInvariant' forwarders have been removed. Use FileInvariant.
- MultiFileInvariant is gone. Adding/removing FileInvariants now triggers by itself, no need to stuff multiple into a MultiFileInvariant
- Pruning + running will no longer set ._pruned=pruning_job_id on downstream jobs, but .pruned_reason=pruning_job_id. Otherwise you could not unprune() and run again.
- ppg.util.global_pipegraph is now ppg.global_pipegraph
- Redefining a job in an incompatible way now raises JobRedefinitionError (instead of JobContractError)
- Calling the same PlotJob once with cache_calc/create_table = True and once with False no longer triggers an exception, even in strict (RunMode.CONSOLE) mode. The jobs do stick around though.
- interactive console mode
	- restart/reboot is now 'again/stop_and_again' to make it clearer
	- better progression, nice output 
	- some barely used commands were removed for now (runtimes (see log/runtimes.tsv), next, stay, errors, spy, spy-flame)

- new job kind: SharedMultiFileGeneratingJob
  This job's output folder is keyed by a hash of its inputs, and can be easily shared between multiple ppgs from multiple places (replaces mbf_externals.PreBuildJob, conceptually)
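
The "output folder keyed by a hash of its inputs" idea from the last item can be sketched as follows. This is a toy illustration only, not the SharedMultiFileGeneratingJob implementation: `output_folder_for`, `build_once` and the example inputs are hypothetical, and the sketch ignores concurrency and half-finished builds.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def output_folder_for(base_dir, inputs):
    """Derive a folder name from a hash of the (JSON-serialisable) inputs."""
    canonical = json.dumps(inputs, sort_keys=True)  # key order must not matter
    key = hashlib.sha256(canonical.encode()).hexdigest()
    return Path(base_dir) / key


def build_once(base_dir, inputs, build):
    """Build into the keyed folder unless an identical build already exists."""
    folder = output_folder_for(base_dir, inputs)
    if not folder.exists():
        folder.mkdir(parents=True)
        build(folder)
    return folder


calls = []


def build(folder):
    calls.append(folder)
    (folder / "index.txt").write_text("built")


with tempfile.TemporaryDirectory() as shared:
    first = build_once(shared, {"genome": "hg38", "version": 1}, build)
    second = build_once(shared, {"version": 1, "genome": "hg38"}, build)
    third = build_once(shared, {"genome": "hg38", "version": 2}, build)
    reused = first == second
    separate = first != third
    build_count = len(calls)
```

Two callers with identical inputs end up in the same folder and the build runs only once, while changed inputs get a folder of their own.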

Release history

3.4.2 - Nov 25, 2024 (this version)
3.4.1 - Nov 22, 2024
3.4.0 - Nov 21, 2024
3.3.1 - Nov 7, 2024
3.3.0 - Nov 4, 2024
3.1.4 - Sep 16, 2024
3.1.3 - Jul 17, 2024
3.1.2 - Jul 17, 2024
3.1.1 - Jul 16, 2024
3.1.0 - Jul 16, 2024
3.0.9 - Jul 15, 2024
3.0.7 - Apr 30, 2024
3.0.6 - Nov 23, 2023
3.0.5 - Jul 19, 2023
3.0.3 - Jul 13, 2023
3.0.2 - Jul 5, 2023
3.0.0 - May 17, 2023
2.4.2 - Dec 23, 2021
2.4.0 - Nov 11, 2021
2.3.1 - Oct 10, 2021
2.3.0 - Sep 22, 2021
2.2.2 - Sep 20, 2021
2.2.1 - Sep 20, 2021
2.2 - Sep 20, 2021
2.1.2 - Sep 20, 2021
2.1.1 - Sep 16, 2021
2.1 - Sep 16, 2021
2.0 - Sep 1, 2021

Download files

Source Distribution

pypipegraph2-3.4.2.tar.gz (2.9 MB)

Uploaded Nov 25, 2024 Source

Built Distribution

pypipegraph2-3.4.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.2 MB)

Uploaded Nov 25, 2024 CPython 3.8+manylinux: glibc 2.17+ x86-64

File details

Details for the file pypipegraph2-3.4.2.tar.gz.

File metadata

Download URL: pypipegraph2-3.4.2.tar.gz
Upload date: Nov 25, 2024
Size: 2.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.7.4

File hashes

Hashes for pypipegraph2-3.4.2.tar.gz
Algorithm	Hash digest
SHA256	`9491da57e49bc0d06dabf512dbd9d0a6e8608967866551e56caf6b3a492203cf`
MD5	`1d678d3e15e99aa65991cb5e97ef237e`
BLAKE2b-256	`9a59f79b9b0d62fb66da780fc07a3e41616719ce1298531e1de4c3937553022e`

File details

Details for the file pypipegraph2-3.4.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: pypipegraph2-3.4.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Nov 25, 2024
Size: 6.2 MB
Tags: CPython 3.8+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.7.4

File hashes

Hashes for pypipegraph2-3.4.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`5ba8f39ca8ef26680da07f0e38736550d1c1d9618627a39e14540619ea6fe4cd`
MD5	`94afb76df63c7902611be0572dd57539`
BLAKE2b-256	`a7c8469db1e2511e3573127c1bc61a6891b65705796644809d520ad63a989737`
