Job distribution for everyone
fjd makes it easy to run computational jobs on many CPUs.
There are several powerful tools for dispatching dynamic lists of computational jobs to multiple, possibly distributed CPUs. However, for simple use cases, the effort of installation and setup is often too high.
With fjd, the hurdle to get started is very low. Installation is easy. Pushing jobs into the queue only requires to put an executable script in a directory. Per default, all CPUs on your computer are used. Other computers can be used very easily, as well. Plus, your jobs can be written in any language.
You can call fjd directly:
$ fjd --exe "mktemp XXX.tmp" --repeat 5
This simple example creates five temporary files with random names. Each of the five jobs will be done on a different CPU (if you have that many).
You can also supply a list with parameters:
$ fjd --exe 'touch bla$1.txt' --parameters 1,2,3,4
This will create four files: bla1.txt, bla2.txt, bla3.txt, bla4.txt. Here, fjd will select by itself how many CPUs on your machine it should use. Note that if you use the placeholder $1, use single quotes around the --exe command.
To use several computers, you can configure a number of hosts in your network and how many CPUs should be running on each (see an example of this below).
For illustration, here is the session output from the first example:
$ fjd --exe "mktemp XXX.tmp" --repeat 5 [fjd-recruiter] Hired 5 workers in project "default". [fjd-dispatcher] Started on project "default". [fjd-dispatcher] Job queue is empty and all jobs have finished. [fjd-recruiter] Fired 5 worker(s) in project "default".
fjd can also be controlled in more detail.
Most importantly, you can write an executable bash script for each job and put it in the queue. You have total freedom what each job is doing and when you put it in the queue. In the usual case your queue directory is ~/.fjd/default/jobqueue (where default could be changed to a specific project name). Another option is to describe a job as a configuration file. We’ll have an example below.
Then, you can start one or more fjd-worker threads, like this:
$ fjd-recruiter hire [<number of workers>]
Per default, this starts n-1 worker threads, where n is the number of CPUs on your machine.
Finally, start a dispatcher:
Now the fjd-dispatcher assigns jobs to fjd-worker threads who are currently not busy, until the job queue is empty.
You can cancel the fjd-dispatcher process at any time (i.e. hit CTRL-C) and it will ask you if workers should be fired or not. By the way, if screen sessions are running and you want them to stop, then you can always fire workers by hand:
$ fjd-recruiter fire
or, to fire only the ones running one specific project (more on that option below):
$ fjd-recruiter --project <my-project> fire
If you start a new dispatcher, it will first clean up (“fire”) old screen sessions.
First, you need to have python 2.7, which the default python on almost all systems these days (note: python 3.x support is not there yet, but close; see issue #10 on github). Then:
$ pip install fjd
If you do not have enough privileges (look for something like “Permission denied” in the output), install locally (for your user account only):
$ pip install fjd --user
If you do not have pip installed (I can’t wait for everyone running Python 3.4), I made a small script, which should help to install all needed things. Download it and make it executable:
$ wget https://raw.github.com/nhoening/fjd/master/fjd/scripts/INSTALL $ chmod +x INSTALL
Now you can install system-wide:
or, if you do not have root privileges, you can also install locally:
$ source INSTALL --user
Note - If you installed locally, this should be added to your ~/.bashrc or ~/.profile file:
Note - Installing locally could be the better choice, actually, because it might save you from installing fjd on each machine you want to use. If they all share the home directory, they will all know about fjd once you are logged in.
We can tell fjd about other machines in the network and how many workers we’d like to employ on them. To do that, we place a file called remote.conf in the project’s directory. Here is an example:
[host1] name: localhost workers: 3 [host2] name: hyuga.sen.cwi.nl workers: 5
There is an advanced example in the github repo which you can run and inspect. Use the scripts run-example.sh or run-remote-example.sh to execute.
Note - fjd works under the assumption that all CPUs are in a local network and can access a shared home directory.
Note - If you normally have to type in a password to login to a remote machine via SSH, you’ll have to do this here, as well. You can configure passwordless login by putting a public key in ~/.ssh/authorized_keys. For the shared-home directory setting we use fjd for, this makes a lot of sense, as you stay within your LAN anyway. In general, some SSH configuration can go a long way to ease your life, e.g. by connection sharing through the ControlAuto option. Search the web or ask your local IT guy.
Workers are Unix screen sessions, you can see them by typing:
$ screen -ls
and inspect them if you want. As attaching to screen sessions is cumbersome and fjd can also close them before you have a chance to see what went wrong (this is an option you can set, see next example below), fjd logs screen output to ~/.fjd/<project>/screenlogs (each worker has its own log file).
Here is an example log from a screen session of a worker:
$ fjd-worker --project default [fjd-worker] Started with ID nics-macbook.fritz.box_1382522062.31. [fjd-worker] Worker nics-macbook.fritz.box_1382522062.31: I found a job. [fjd-worker] Worker nics-macbook.fritz.box_1382522062.31: Finished my job. [fjd-worker] Worker nics-macbook.fritz.box_1382522062.31: I found a job. [fjd-worker] Worker nics-macbook.fritz.box_1382522062.31: Finished my job.
Normally, the project directory is ~/.fjd/default. But you can tell fjd to use a different project identifier (this way, you could have several projects running without them getting into each other’s way, i.e. stopping one project wouldn’t stop the workers of the other and you wouldn’t override the first project if you start another).
For instance, you’d recruite workers and dispatch jobs with the --project parameter:
$ fjd-recruiter --project remote-example hire $ fjd-dispatcher --project remote-example
Or, if you call fjd from code:
recruiter = Recruiter(project=project) recruiter.hire() Dispatcher(project=project)
When a job is heavily parameterised, a bash script might not be convenient. I sometimes prefer a configuration file in such cases. In this case, a job file should adhere to the general INI-file standard. You can write in there what you want - fjd only has one requirement. Add a fjd section, and specify which command to execute. Here is an example:
[fjd] executable: python example/ajob.py [params] logfile: logfiles/job0.dat my_param: value0
Your executable script gets this configuration file passed as a command line argument, so this would be called on the shell:
python example/ajob.py <absolute path to the job file>
In addition, you can put other job-specific configuration in there for the executable to see, as I did here in the [params]-section (I repeat: only the [fjd]-section is required by fjd).
Take care to get relative paths correct (or simply make them absolute): If paths are relative, they should be relative to the directory in which you start the fjd-dispatcher.
The advanced example also shows how this works. You can use the script run-config-example.sh to run it.
Small files in your home directory are used to indicate which jobs have to be done (these are created by you) and which workers are available (these are created automatically). Files are also used by fjd to assign workers to jobs.
This simple file-based approach makes fjd very easy to use.
For CPUs from several machines to work on your job queue, we make one necessary assumption: We assume that there is a shared home directory for logged-in users, which all machines can access. This setting is very common now in universities and companies.
A little bit more detail about the fjd internals: The fjd-recruiter creates worker threads on one or more machines (a worker thread is a Unix screen session, which remains even if you log out). The fjd-worker processes announce themselves in the workerqueue directory. The fjd-dispatcher finds your jobs in the jobqueue directory and pairs a job with an available worker. It then removes those entries from the jobqueue and workerqueue directories and creates a new entry in jobpods, where workers will pick up their assignments.
Then, the dispatcher calls your executable script and passes the file that describes the job to it as parameter on the shell. Your script simply has to read the job file and act accordingly.
All of these directories mentioned above exist in ~/.fjd and will of course be created if they do not yet exist.