Distribution of tasks.
Project description
This little tool helps with scheduling, tracking and aggregating calculations and their results. It is the step that takes you from ‘a directory with working code for a job’ to ‘running dozens of jobs and getting their results easily’.
pip install fenpei
This is intended to be used to run multiple intensive computations on a (linux) cluster. At present, it assumes a shared file system on the cluster.
It takes a bit of work to integrate with your setup, but it is very flexible and should make your life easier once configured. Some features:
Jobs are defined in Python files, which keeps them short and extremely flexible.
It uses a command line interface (some shell experience required) to easily start, stop or monitor jobs.
Easy to use with existing code and easily reproducible, since it works by creating isolated job directories.
Can replace scheduling-queue functionality and start jobs through ssh, or can work with existing systems (slurm and qsub included, others implementable).
Flexibility for caching, preparation and result extraction.
Uses multi-processing, with optional caching for greater performance and symlinks to save space.
Note that:
You will have to write Python code for your specific job, as well as any analysis or visualization for the extracted data.
Except for the status-monitoring mode, it derives the state of the jobs on each run; it doesn’t keep a database that can get outdated or corrupted.
One example of how to run reproducible jobs with Fenpei (there are many ways):
Make a script that runs your code from source to completion for one set of parameters.
Subclass the ShJobSingle job and add all the files that you need in get_nosub_files (a sketch of such a subclass follows this list).
Replace all the parameters in the run script and other config files with {{ some_param_name }}. Add these files to get_sub_files.
Make a Python file (example below) for each analysis you want to run, and fill in all the some_param_name placeholders with the appropriate values.
From a shell, use python your_jobfile.py -s to see the status, then use other flags for more functionality (see below).
Implement is_complete and result in your job (and crash_reason if you want -t); other methods can be overridden too, if you require special behaviour.
Add analysis code to your job file if you want to visualize the results.
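To make the steps above more concrete, here is a minimal sketch of such a job subclass. It is an illustration only: the import path, method signatures, return values and the self.directory attribute are assumptions about the fenpei API and may differ in your version, and the file names (my_code.py, run.sh, config.in, results.txt, log.txt) are made up for this example.

    from os.path import isfile, join

    from fenpei.job_sh_single import ShJobSingle  # assumed import path


    class MyJob(ShJobSingle):
        """Hypothetical job class; check ShJobSingle for the exact interface."""

        def get_nosub_files(self):
            # Files copied (or symlinked) into the job directory unchanged,
            # e.g. your code. Depending on the fenpei version you may also
            # need to include the files of the parent class.
            return ['my_code.py']

        def get_sub_files(self):
            # Files in which {{ some_param_name }} placeholders are substituted,
            # e.g. a run script and a config file containing {{ alpha }} etc.
            return ['run.sh', 'config.in']

        def is_complete(self):
            # Decide whether the job finished, e.g. by checking for an output
            # file (self.directory is assumed to be the job's working directory).
            return isfile(join(self.directory, 'results.txt'))

        def result(self, *args, **kwargs):
            # Extract whatever you need from the job directory and return it;
            # returning None when there is nothing yet is a safe choice.
            if not self.is_complete():
                return None
            with open(join(self.directory, 'results.txt')) as fh:
                return {'raw': fh.read()}

        def crash_reason(self, *args, **kwargs):
            # Optional: return a short explanation of why a job failed (used by -t).
            log_path = join(self.directory, 'log.txt')
            if not isfile(log_path):
                return None
            with open(log_path) as fh:
                lines = fh.read().splitlines()
            return lines[-1] if lines else None

The values passed via subs in the job file below are the parameters that get substituted into these {{ }} placeholders.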
Example file to generate jobs:
    from os.path import basename, splitext
    # create_jobs, SlurmQueue and the job class (here a ShJobSingle subclass
    # named ShefJob) are imported from fenpei; the exact module paths depend
    # on your fenpei version.

    def generate_jobs():
        for alpha in [0.01, 0.10, 1.00]:
            for beta in range(0, 41):
                yield dict(name='a{0:.2f}_b{1:d}'.format(alpha, beta), subs=dict(
                    alpha=alpha,
                    beta=beta,
                    gamma=5,
                    delta='yes'
                ), use_symlink=True)

    def analyze(queue):
        results = queue.compare_results(('J', 'init_vib', 'init_rot',))
        # You now have the results for all jobs, indexed by the above three parameters.
        # Visualization is up to you, and will be run when the user adds -x.

    if __name__ == '__main__':
        jobs = create_jobs(JobCls=ShefJob, generator=generate_jobs(),
            default_batch=splitext(basename(__file__))[0])
        queue = SlurmQueue(partition='example', jobs=jobs, summary_func=analyze)
        queue.run_argv()
This file registers many jobs for combinations of alpha and beta parameters. You can now use the command line:
    usage: results.py [-h] [-v] [-f] [-e] [-a] [-d] [-l] [-p] [-c] [-w WEIGHT]
                      [-q LIMIT] [-k] [-r] [-g] [-s] [-m] [-x] [-t] [-j]
                      [--jobs JOBS] [--cmd ACTIONS]

    distribute jobs over available nodes

    optional arguments:
      -h, --help            show this help message and exit
      -v, --verbose         more information (can be used multiple times, -vv)
      -f, --force           force certain mistake-sensitive steps instead of
                            failing with a warning
      -e, --restart         with this, start and cleanup ignore complete
                            (/running) jobs
      -a, --availability    list all available nodes and their load (cache reload)
      -d, --distribute      distribute the jobs over available nodes
      -l, --list            show a list of added jobs
      -p, --prepare         prepare all the jobs
      -c, --calc            start calculating one jobs, or see -z/-w/-q
      -w WEIGHT, --weight WEIGHT
                            -c will start jobs with total WEIGHT running
      -q LIMIT, --limit LIMIT
                            -c will add jobs until a total LIMIT running
      -k, --kill            terminate the calculation of all the running jobs
      -r, --remove          clean up all the job files
      -g, --fix             fix jobs, check cache etc (e.g. after update)
      -s, --status          show job status
      -m, --monitor         show job status every few seconds
      -x, --result          run analysis code to summarize results
      -t, --whyfail         print a list of failed jobs with the reason why
                            they failed
      -j, --serial          job commands (start, fix, etc) may NOT be run in
                            parallel (parallel is faster but order of jobs and
                            output is inconsistent)
      --jobs JOBS           specify by name the jobs to (re)start, separated
                            by whitespace
      --cmd ACTIONS         run a shell command in the directories of each job
                            that has a dir ($NAME/$BATCH/$STATUS if --s)

    actions are executed (largely) in the order they are supplied; some actions
    may call others where necessary
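For example, assuming the job file above is saved as results.py, a plausible session could look like this (the weight value 8 is arbitrary; all flags used are documented in the help output above):

    # prepare all jobs, start them until a total weight of 8 is running,
    # then show the job status (actions run roughly in the order given)
    python results.py -p -c -w 8 -s

    # once jobs have completed, run the analysis code to summarize results
    python results.py -x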
Pull requests, extra documentation and bug reports are welcome! It’s released under the Revised BSD license, so you are free to use, modify and redistribute it with few restrictions.
Download files
Source Distribution
File details
Details for the file fenpei-2.7.2.tar.gz.
File metadata
- Download URL: fenpei-2.7.2.tar.gz
- Upload date:
- Size: 32.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: Python-urllib/2.7
File hashes
Algorithm   | Hash digest
SHA256      | 1f51064d6c4aeda0f4ab0d223df0774dfac03850af47ef8843898899c0a394b6
MD5         | 00602a9cf54fdfeb96c5ca278d54b0de
BLAKE2b-256 | b6c8a0a25678435d83944a9e903dda7941a52c2b52c3c14dd4f9198fd3e739f7