A tool for documentable and reproducible analysis and research
Project description
==================
About psicons.core
==================
.. |psicons| replace:: *psicons*
Background
----------
Scientific analysis can be problematic:
* It may involve multiple steps, each using the results of the previous stage.
Making a mistake often means repeating the whole series for safety.
* Sometimes analysis chains have to be repeated on different datasets.
Sometimes, even within a single analysis, the same manipulation or test has
to be repeated with slightly different parameters.
* Even immediately after the fact, it's easy to forget what was done. 9 months
later when responding to a referee's report, it may be impossible.
* Collaborators, clients or bosses may demand accountability.
* With long but routine tasks, it's easy to get bored and make mistakes.
It this light, |psicons| is a quick and dirty hack to subvert the scons build
system for scientific analysis. Every stage of analysis is a command-line call
to a script or executable that takes *inputs* and produces *outputs*. When
scons is called, dependencies between outputs and inputs are tested and only
those stages are run that are necessary to update outputs. In addition, the
exact sequence of analyses is recorded by the build file.
In summary, |psicons| provides:
* Repeatability: running a build file, reruns the same analysis
* Reproducibility: the build file (and custom scripts) document the steps of
the analysis
* Minimization of effort: if inputs or analysis steps are changed, only the
necessary (dependent) steps of the analysis are rerun
* Mistake-resistant: errors don't derail analysis due to reproducibility ("what
did I do") and minimization of effort (only dependent steps are repeated)
* Programmability: analysis tasks may be constructed with programatically
("repeat the analysis across this parameter range")
Status
------
*psicons* is very much a hack-and-see project, having been produced in the
aftermath of the 2009 H1N1 pandemic from the need for complex processing of
large amounts of sequence data using ad-hoc scripts and formats. It worked
well in that limited role, but is still an early release exploring the
approach. Functionality is limited and the API may change. There are other
more developed (and more specialised) alternatives. Comment is invited.
Installation
------------
The simplest way to install *psicons* is via ``easy_install`` [setuptools]_ or
an equivalent program::
% easy_install psicons.core
Alternatively the tarball can be downloaded, unpacked and ``setup.py`` run::
% tar zxvf psicons-core.tgz
% cd psicons-core
% python set.py install
It should work with just about any version of Python.
*psicons* requires that scons is installed, which is where things get tricky.
Scons by default installs itself in a sandboxed way with multiple versions
living side-by-side and thus being non-importable. Of course, |psicons| needs
to use the scons library, so a conventional installation must be forced.
Download the scons tarball, unpack it and install it like thus::
% python setup.py install --standard-lib
Using psicons
-------------
A full API is included in the source distribution.
Examples
~~~~~~~~
Psicons works just like scons. In fact, it is scons. More details are available
elsewhere but briefly, you run scons like this::
# look for a build file called "Sconstruct" by default
% scons
# looks for a named build file
% scons -f mybuildfile
This causes scons to execute the build file, which is just a Python script,
defining a series of tasks or *commands*::
# an scons build file
# some necessary administration - set up the build environment
env = Environment()
# compile two libraries and then combine into one program
first_libs = env.Object ('hello.c', CCFLAGS='-DHELLO')
second_libs = env.Object ('goodbye.c', CCFLAGS='-DGOODBYE')
env.Program (first_libs + second_libs)
The first time this file is executed, the first two commands build libraries,
while the third combines the libraries into a single executable. Dependencies
between the steps are automatically tracked: should one of the original source
files be changed (e.g. hello.c), when the file is rerun only the steps
"downstream" of it (e.g. recompilation of the first library, and the final
linking) are rerun.
Scons has a large number of commands for all sorts of software builds. Psicons
adds two new commands, so that local scripts or external programs can be used
to in a build. In this way, complex multi-step analyses can be constructed from
a series of interdependent commands, that "build" intermediate data and final
results::
from psicons.core import *
env = Environment()
# call a local script
IN_DATA = 'jg_08-10_2010.csv'
CLEAN_DATA = 'jg_08-10_2010-cleaned.csv'
make_clean_data = Script (env, 'clean_seqs.py',
args = ['--save-as', CLEAN_DATA],
infiles = [IN_DATA],
output = CLEAN_DATA,
)
# call an external command
EPI_DATA = 'jg-types.txt'
RESULT_DATA = 'results.tab'
type_data = External (env, 'treemaker',
args = ['--save-as', RESULT_DATA],
infiles = [CLEAN_DATA, EPI_DATA],
output = [RESULT_DATA],
)
The interfaces of these two commands are similar:
* what is being called?
* what inputs does it use (depend on)?
* what outputs does it produce?
When scons is run on this build file, it calls the script 'clean_seqs.py' on
`IN_DATA` to produce `CLEAN_DATA`. Then the external program `treemaker` is
called on `CLEAN_DATA` and `EPI_DATA` to produce `RESULT_DATA`. Should
`EPI_DATA` be edited, when scons is called again, only the second external
step will be run again as the first step and it's results is still up to date.
Thus:
* Analyses may be run (and rerun) easily
* If data changes (or scripts change - bug fixes), only the necessary steps are
rerun
* The actions taken are recorded in the build file
To ease renaming intermediate or output files in a rational way, *psicons*
offers a few utility functions for interpolating file names from parameters. To
illustrate::
# generate a new string from a template
>>> d = {'foo': '123', 'bar': '456'}
>>> interpolate ('ab{foo}cd{bar}ef', d)
'ab123cd456ef'
# name new file name from old by adding suffix to name
>>> interpolate_from_path ('mydata.csv', '{stem}-cleaned{ext}')
'mydata-cleaned.csv'
Limitations
-----------
Certainly, far, far more complicated reproducibility tools are out there (see `
here <http://csdl2.computer.org/comp/mags/cs/2009/01/mcs2009010005.pdf>`__) but
many are based around certain disciplines (e.g. geophysics, computational math),
require working through web interfaces or using very standard sets of analysis
tools. *psicons* is written from the point of view of a bioinformatician doing
sequence and phylogenetic analysis, working on the commandline using a lot of
custom scripts and an endlessly changing lineup of supplied tools. As sometimes
happens, other tools didn't fit, so I wrote one that did.
As with many quick hack tools, documentation is currently a bit thin.
The need for a modified scons installation is a blemish. Future versions of
|psicons| may need to directly incorporate scons for ease of installation.
Clearly, a set of standard tools for extracting, transforming and plotting
data would be a powerful addition to *psicons*. This doesn't exist as yet.
The process of installing a new tool or module for use by SCons is fiddly
[scons-tools]_, involves copying libraries in a tool directory, registering
their use in build files, voids the ease of using ``easy_install`` and makes
development a pain. Thus, the additional "commands" are real scons commands, so
much as functions that generate commands. But they are easy to use.
Credit
------
Thanks to the architects of Scons, of course.
While this project was started before encountering Madagascar [madagascar]_, it
has inevitably shaped development. It's a remarkably powerful system, although
ill-suited to my current purposes. You should check it out.
Only when writing this document did I become aware of sconstools [sconstools]_,
which seems to be following exactly the same direction as |psicons|.
References
----------
.. [psicons-home] `psicons home page <http://www.agapow.net/software/psicons-core>`__
.. [psicons-pypi] `psicons on PyPi <http://pypi.python.org/pypi/psicons-core>`__
.. [setuptools] `Setuptools & easy_install <http://packages.python.org/distribute/easy_install.html>`__
.. [psicons-github] `psicons on github <https://github.com/agapow/psicons.core>`__
.. [scons] `Scons <http://www.scons.org>`__
.. [madagascar] `Madagscar and Scons for reproducibility <http://reproducibility.org/wiki/Reproducible_computational_experiments_using_SCons>`__
.. [scons-custom] `Where To Put Your Custom Builders and Tools <http://scons.org/doc/production/HTML/scons-user/x3697.html>`__
.. [sconstools] `Sconstools <http://code.google.com/p/sconstools/>`__
=========
Changelog
=========
0.2dev (29072011)
-----------------
- First widely released version
- Added documentation
0.1dev (unreleased)
-------------------
- Initial release
About psicons.core
==================
.. |psicons| replace:: *psicons*
Background
----------
Scientific analysis can be problematic:
* It may involve multiple steps, each using the results of the previous stage.
Making a mistake often means repeating the whole series for safety.
* Sometimes analysis chains have to be repeated on different datasets.
Sometimes, even within a single analysis, the same manipulation or test has
to be repeated with slightly different parameters.
* Even immediately after the fact, it's easy to forget what was done. 9 months
later when responding to a referee's report, it may be impossible.
* Collaborators, clients or bosses may demand accountability.
* With long but routine tasks, it's easy to get bored and make mistakes.
It this light, |psicons| is a quick and dirty hack to subvert the scons build
system for scientific analysis. Every stage of analysis is a command-line call
to a script or executable that takes *inputs* and produces *outputs*. When
scons is called, dependencies between outputs and inputs are tested and only
those stages are run that are necessary to update outputs. In addition, the
exact sequence of analyses is recorded by the build file.
In summary, |psicons| provides:
* Repeatability: running a build file, reruns the same analysis
* Reproducibility: the build file (and custom scripts) document the steps of
the analysis
* Minimization of effort: if inputs or analysis steps are changed, only the
necessary (dependent) steps of the analysis are rerun
* Mistake-resistant: errors don't derail analysis due to reproducibility ("what
did I do") and minimization of effort (only dependent steps are repeated)
* Programmability: analysis tasks may be constructed with programatically
("repeat the analysis across this parameter range")
Status
------
*psicons* is very much a hack-and-see project, having been produced in the
aftermath of the 2009 H1N1 pandemic from the need for complex processing of
large amounts of sequence data using ad-hoc scripts and formats. It worked
well in that limited role, but is still an early release exploring the
approach. Functionality is limited and the API may change. There are other
more developed (and more specialised) alternatives. Comment is invited.
Installation
------------
The simplest way to install *psicons* is via ``easy_install`` [setuptools]_ or
an equivalent program::
% easy_install psicons.core
Alternatively the tarball can be downloaded, unpacked and ``setup.py`` run::
% tar zxvf psicons-core.tgz
% cd psicons-core
% python set.py install
It should work with just about any version of Python.
*psicons* requires that scons is installed, which is where things get tricky.
Scons by default installs itself in a sandboxed way with multiple versions
living side-by-side and thus being non-importable. Of course, |psicons| needs
to use the scons library, so a conventional installation must be forced.
Download the scons tarball, unpack it and install it like thus::
% python setup.py install --standard-lib
Using psicons
-------------
A full API is included in the source distribution.
Examples
~~~~~~~~
Psicons works just like scons. In fact, it is scons. More details are available
elsewhere but briefly, you run scons like this::
# look for a build file called "Sconstruct" by default
% scons
# looks for a named build file
% scons -f mybuildfile
This causes scons to execute the build file, which is just a Python script,
defining a series of tasks or *commands*::
# an scons build file
# some necessary administration - set up the build environment
env = Environment()
# compile two libraries and then combine into one program
first_libs = env.Object ('hello.c', CCFLAGS='-DHELLO')
second_libs = env.Object ('goodbye.c', CCFLAGS='-DGOODBYE')
env.Program (first_libs + second_libs)
The first time this file is executed, the first two commands build libraries,
while the third combines the libraries into a single executable. Dependencies
between the steps are automatically tracked: should one of the original source
files be changed (e.g. hello.c), when the file is rerun only the steps
"downstream" of it (e.g. recompilation of the first library, and the final
linking) are rerun.
Scons has a large number of commands for all sorts of software builds. Psicons
adds two new commands, so that local scripts or external programs can be used
to in a build. In this way, complex multi-step analyses can be constructed from
a series of interdependent commands, that "build" intermediate data and final
results::
from psicons.core import *
env = Environment()
# call a local script
IN_DATA = 'jg_08-10_2010.csv'
CLEAN_DATA = 'jg_08-10_2010-cleaned.csv'
make_clean_data = Script (env, 'clean_seqs.py',
args = ['--save-as', CLEAN_DATA],
infiles = [IN_DATA],
output = CLEAN_DATA,
)
# call an external command
EPI_DATA = 'jg-types.txt'
RESULT_DATA = 'results.tab'
type_data = External (env, 'treemaker',
args = ['--save-as', RESULT_DATA],
infiles = [CLEAN_DATA, EPI_DATA],
output = [RESULT_DATA],
)
The interfaces of these two commands are similar:
* what is being called?
* what inputs does it use (depend on)?
* what outputs does it produce?
When scons is run on this build file, it calls the script 'clean_seqs.py' on
`IN_DATA` to produce `CLEAN_DATA`. Then the external program `treemaker` is
called on `CLEAN_DATA` and `EPI_DATA` to produce `RESULT_DATA`. Should
`EPI_DATA` be edited, when scons is called again, only the second external
step will be run again as the first step and it's results is still up to date.
Thus:
* Analyses may be run (and rerun) easily
* If data changes (or scripts change - bug fixes), only the necessary steps are
rerun
* The actions taken are recorded in the build file
To ease renaming intermediate or output files in a rational way, *psicons*
offers a few utility functions for interpolating file names from parameters. To
illustrate::
# generate a new string from a template
>>> d = {'foo': '123', 'bar': '456'}
>>> interpolate ('ab{foo}cd{bar}ef', d)
'ab123cd456ef'
# name new file name from old by adding suffix to name
>>> interpolate_from_path ('mydata.csv', '{stem}-cleaned{ext}')
'mydata-cleaned.csv'
Limitations
-----------
Certainly, far, far more complicated reproducibility tools are out there (see `
here <http://csdl2.computer.org/comp/mags/cs/2009/01/mcs2009010005.pdf>`__) but
many are based around certain disciplines (e.g. geophysics, computational math),
require working through web interfaces or using very standard sets of analysis
tools. *psicons* is written from the point of view of a bioinformatician doing
sequence and phylogenetic analysis, working on the commandline using a lot of
custom scripts and an endlessly changing lineup of supplied tools. As sometimes
happens, other tools didn't fit, so I wrote one that did.
As with many quick hack tools, documentation is currently a bit thin.
The need for a modified scons installation is a blemish. Future versions of
|psicons| may need to directly incorporate scons for ease of installation.
Clearly, a set of standard tools for extracting, transforming and plotting
data would be a powerful addition to *psicons*. This doesn't exist as yet.
The process of installing a new tool or module for use by SCons is fiddly
[scons-tools]_, involves copying libraries in a tool directory, registering
their use in build files, voids the ease of using ``easy_install`` and makes
development a pain. Thus, the additional "commands" are real scons commands, so
much as functions that generate commands. But they are easy to use.
Credit
------
Thanks to the architects of Scons, of course.
While this project was started before encountering Madagascar [madagascar]_, it
has inevitably shaped development. It's a remarkably powerful system, although
ill-suited to my current purposes. You should check it out.
Only when writing this document did I become aware of sconstools [sconstools]_,
which seems to be following exactly the same direction as |psicons|.
References
----------
.. [psicons-home] `psicons home page <http://www.agapow.net/software/psicons-core>`__
.. [psicons-pypi] `psicons on PyPi <http://pypi.python.org/pypi/psicons-core>`__
.. [setuptools] `Setuptools & easy_install <http://packages.python.org/distribute/easy_install.html>`__
.. [psicons-github] `psicons on github <https://github.com/agapow/psicons.core>`__
.. [scons] `Scons <http://www.scons.org>`__
.. [madagascar] `Madagscar and Scons for reproducibility <http://reproducibility.org/wiki/Reproducible_computational_experiments_using_SCons>`__
.. [scons-custom] `Where To Put Your Custom Builders and Tools <http://scons.org/doc/production/HTML/scons-user/x3697.html>`__
.. [sconstools] `Sconstools <http://code.google.com/p/sconstools/>`__
=========
Changelog
=========
0.2dev (29072011)
-----------------
- First widely released version
- Added documentation
0.1dev (unreleased)
-------------------
- Initial release
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
psicons.core-0.1.tar.gz
(52.2 kB
view details)
File details
Details for the file psicons.core-0.1.tar.gz
.
File metadata
- Download URL: psicons.core-0.1.tar.gz
- Upload date:
- Size: 52.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 2982cf5e93cea47c65d1022a9dda47217046c50a4cdda241a827d56554d38310 |
|
MD5 | 3a47f0adeb56d8b0c2a7932b3aaee7b2 |
|
BLAKE2b-256 | 941b0ecfa7bce6024b51753518ea04f02ece0bb91d84bc9cb29ddb8d1c8847f7 |