A scientific reproducibility tool supporting existing workflows
Scimon
A scientific reproducibility tool supporting evolving experimental workflows
Overview
This tool aims to passively track the user's interactions with the computing environment through bash and to generate Makefiles that support reproducing any given version of any intermediate or result file produced in the experiment.
Project Architecture
Key Components
- Bash: Our way to track user actions passively. Every command run in an interactive bash shell is intercepted and run under strace instead, to capture the relevant system calls.
- Git: We use git's version control to keep track of file states. The commit hash also serves as a unique identifier for storing system calls and commands, and for showing file changes.
- SQLite3: The system calls and related data are then parsed and stored in the database.
- Python CLI: Various functionalities are available through the CLI, which is written in Python.
Data Flow
Installation
Dependencies
- SQLite3
- Python >= 3.9
- Strace
- Git
- Bash
# Install the CLI
pip install scimon
Then copy the contents of commandhook.sh into ~/.bashrc, and restart the bash session for the hooks to take effect.
Usage
# Adds current directory for monitoring (WIP)
scimon init
# Reproduce a given file, optionally at a specified commit hash; if no commit hash is specified, the latest version is reproduced
scimon reproduce [file] --git-hash=abc123
# Lists all directories currently being monitored (WIP)
scimon list
# Removes a directory from being monitored (WIP)
scimon remove [dir]
# Outputs a provenance graph for the given file (WIP)
scimon visualize [file] --git-hash=abc123
Running the source code
I would highly recommend using uv to manage the dependencies of this project. Conda may also work as an alternative.
- Navigate to project root
- Run pipx install .
- Done
Logic Overview
Bash Hooks
Pre-exec/Post-exec Hook:
The hooks that fire before and after each command are implemented with bash's DEBUG trap and PROMPT_COMMAND, respectively.
The DEBUG trap fires before each bash command is executed; it is set with trap [your-command] DEBUG.
PROMPT_COMMAND runs just before bash displays each prompt, so whenever a command finishes executing we can run our own logic from a bash function.
The code snippet below enables the hooks whenever an interactive bash session starts:
_init_hook() {
PROMPT_COMMAND='_post_command_git_check'
trap '_pre_command_git_check' DEBUG
}
PROMPT_COMMAND='_init_hook'
Git Checking/Committing:
For every command that the user enters into bash, we perform two checks through git:
- Before the command executes, committing any changes in the monitored repository
- After the command executes, committing any changes in the monitored repository
Taking these snapshots allows us to determine which commands produce side effects, which are the ones worth recording in our case.
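The before/after snapshot idea can be sketched in Python as follows. This is only an illustration of the logic, not the code in commandhook.sh (which does this in bash). The function names `has_changes`, `snapshot`, and `has_side_effect` are hypothetical, and the sketch assumes the monitored repository already has at least one commit.

```python
import subprocess


def has_changes(porcelain_output: str) -> bool:
    """Return True if `git status --porcelain` listed at least one path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def snapshot(repo_dir: str, message: str) -> str:
    """Commit pending changes (if any) and return the resulting HEAD hash."""
    status = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    if has_changes(status):
        subprocess.run(["git", "-C", repo_dir, "add", "-A"], check=True)
        subprocess.run(
            ["git", "-C", repo_dir, "commit", "-q", "-m", message], check=True
        )
    head = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return head.strip()


def has_side_effect(hash_before: str, hash_after: str) -> bool:
    """A command had side effects if the post-command snapshot moved HEAD."""
    return hash_before != hash_after
```

Calling `snapshot` once before and once after a command and comparing the two hashes with `has_side_effect` gives the "worth recording" decision described above.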
Strace Parsing
In the function _pre_command_git_check (I really should rename this already), we parse the command and execute it with strace instead. This gives us a list of the system calls made while executing the command, which is stored in ~/.scimon/strace.log. After the strace command stops running, we terminate the execution early so the original command doesn't get executed again!
Finally, we parse the output log with _parse_strace and store the relevant system calls in the proper tables.
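To give a feel for what parsing the log involves, here is a minimal Python sketch that extracts openat, execve, and clone calls from strace output lines. It is not the actual _parse_strace (which lives in commandhook.sh); the regular expressions, the `parse_strace_line` name, and the record layout are assumptions made for illustration, and the sketch assumes the usual `syscall(args) = result` line format, with a leading PID when strace follows child processes.

```python
import re
from typing import Optional

# Example line: 1234 openat(AT_FDCWD, "data/in.csv", O_RDONLY|O_CLOEXEC) = 3
SYSCALL_RE = re.compile(
    r'^(?:(?P<pid>\d+)\s+)?'             # optional PID prefix
    r'(?P<name>openat|execve|clone)\('   # only the calls we care about
    r'(?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)'
)
PATH_RE = re.compile(r'"([^"]*)"')


def parse_strace_line(line: str) -> Optional[dict]:
    """Return a record for a successful openat/execve/clone call, else None."""
    match = SYSCALL_RE.match(line.strip())
    if match is None:
        return None
    ret = int(match.group("ret"))
    if ret < 0:  # failed calls (e.g. ENOENT) carry no provenance information
        return None
    name, args = match.group("name"), match.group("args")
    pid = match.group("pid")
    record = {"pid": int(pid) if pid else None, "syscall": name}
    if name in ("openat", "execve"):
        path = PATH_RE.search(args)  # first quoted argument is the path
        record["path"] = path.group(1) if path else None
    if name == "openat":
        writes = "O_WRONLY" in args or "O_RDWR" in args
        record["mode"] = "write" if writes else "read"
    if name == "clone":
        record["child_pid"] = ret  # clone returns the child's PID
    return record
```

Lines for other system calls, failed calls, and split `<unfinished ...>` entries simply yield `None` in this sketch.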
Database Operations
Currently we have 5 tables in the SQL database:
- commands: Stores all commands that have a side effect, associated with the commit id before and after the command.
- executed_files: Stores all system calls of the execve flavour (see details in commandhook.sh: _parse_strace).
- file_changes: Stores a list of file changes associated with the commit id (most likely not needed; I created this in the very early stage of the project and haven't found a need for it yet).
- opened_files: Stores all system calls of the openat flavour, tracks file reads/writes.
- processes: Stores system calls of the clone flavour, not super useful at the moment but good to have.
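As a rough illustration of how two of these tables might relate, the sketch below builds an in-memory SQLite database and looks up the command that wrote a given file. Only the table names come from the list above; every column name, the `command_that_wrote` helper, and the sample rows are assumptions, and the real schema in db.py may differ.

```python
import sqlite3

# Hypothetical columns; only the table names are taken from the project.
SCHEMA = """
CREATE TABLE commands (
    id            INTEGER PRIMARY KEY,
    command       TEXT NOT NULL,
    commit_before TEXT NOT NULL,
    commit_after  TEXT NOT NULL
);
CREATE TABLE opened_files (
    id           INTEGER PRIMARY KEY,
    commit_after TEXT NOT NULL,
    path         TEXT NOT NULL,
    mode         TEXT NOT NULL CHECK (mode IN ('read', 'write'))
);
"""


def command_that_wrote(conn: sqlite3.Connection, path: str, commit: str):
    """Return the command recorded as writing `path` at `commit`, or None."""
    row = conn.execute(
        """
        SELECT c.command
        FROM commands AS c
        JOIN opened_files AS o ON o.commit_after = c.commit_after
        WHERE o.path = ? AND o.mode = 'write' AND c.commit_after = ?
        """,
        (path, commit),
    ).fetchone()
    return row[0] if row else None


def demo() -> sqlite3.Connection:
    """Create the sketch schema and insert one made-up command with its files."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO commands (command, commit_before, commit_after) VALUES (?, ?, ?)",
        ("python3 script.py", "aaa111", "bbb222"),
    )
    conn.executemany(
        "INSERT INTO opened_files (commit_after, path, mode) VALUES (?, ?, ?)",
        [("bbb222", "script.py", "read"), ("bbb222", "out/plot.png", "write")],
    )
    return conn
```

Keying both tables on the post-command commit hash mirrors the idea that the commit id is the unique identifier tying commands to the system calls they made.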
Python CLI
- scimon.py: the heart of the application, contains the main functionalities
- db.py: database operations
- utils.py: various functions that perform git commands that suit the needs of our application, might add other stuff later on
- models.py: class definitions for the provenance graph
Reproduce
Reproduce takes in a filename and optionally a git commit hash. If no commit hash is specified, the latest version is reproduced by default.
First, we generate a provenance graph based on the commit.
Then we perform a graph traversal from the file node that we want to reproduce to identify all dependencies needed. If there are no dependent files for the current file, that means the current file isn't produced by a command side-effect and a git restore command is sufficient. Otherwise, we recursively call reproduce on parent files.
Once we have a list of parent files identified, we then fetch the command used to produce the current file from the database and form a make rule with it. The rule is then appended into the makefile.
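The traversal and rule generation described above can be sketched as follows. This is a simplified illustration, not the code in scimon.py: the `FileNode` class, the `build_rules` and `render_makefile` functions, and the in-memory `graph` dictionary are hypothetical stand-ins for the provenance graph and the database lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class FileNode:
    """One file in a (hypothetical) provenance graph."""
    path: str
    command: Optional[str] = None  # command that produced the file, if any
    parents: List[str] = field(default_factory=list)  # files the command read


def build_rules(target: str, graph: Dict[str, FileNode], commit: str,
                rules: Optional[Dict[str, str]] = None,
                seen: Optional[Set[str]] = None) -> Dict[str, str]:
    """Collect make rules for `target` and its dependencies, parents first."""
    rules = {} if rules is None else rules
    seen = set() if seen is None else seen
    if target in seen:  # already handled; also guards against cycles
        return rules
    seen.add(target)
    node = graph[target]
    for parent in node.parents:
        build_rules(parent, graph, commit, rules, seen)
    if not node.parents:
        # Not produced by a command side effect: restoring from git suffices.
        rules[target] = f"{target}:\n\tgit restore --source={commit} -- {target}\n"
    else:
        deps = " ".join(node.parents)
        rules[target] = f"{target}: {deps}\n\t{node.command}\n"
    return rules


def render_makefile(target: str, graph: Dict[str, FileNode], commit: str) -> str:
    """Join the collected rules into Makefile text (recipes are tab-indented)."""
    return "".join(build_rules(target, graph, commit).values())
```

Because parents are visited before the rule for the current file is emitted, leaf files restored from git appear first in the output and the requested target appears last.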
Here's a very basic example of a Makefile generated by reproduce.
I prepared a mock experiment where script.py reads from data/digital_diet_mental_health.csv and generates a set of plots, then I modified script.py slightly so that screen_time_vs_digital_device_usage.png is changed.
Then I ran scimon reproduce out/screen_time_vs_digital_device_usage.png --git-hash=version-1-commit-id
We get
script.py:
	git restore --source=f985c1d438fa81273483ed4229c87f69c16b1eaa -- script.py
data/digital_diet_mental_health.csv:
	git restore --source=f985c1d438fa81273483ed4229c87f69c16b1eaa -- data/digital_diet_mental_health.csv
out/screen_time_vs_digital_device_usage.png: data/digital_diet_mental_health.csv script.py
	python3 script.py
In this example, the CSV file and the python script are not modified by bash command side-effects, therefore we perform git restore to check them out at their proper versions.
The plot is modified by python3 script.py, and therefore requires the parent files to be reproduced first before executing the captured command once again.
Then
$ make -B -f reproduce.mk out/screen_time_vs_digital_device_usage.png
git restore --source=f985c1d438fa81273483ed4229c87f69c16b1eaa -- script.py
git restore --source=f985c1d438fa81273483ed4229c87f69c16b1eaa -- data/digital_diet_mental_health.csv
python3 script.py
...script outputs...
branch
$
Looking at the git change list, we can see that the original plots are generated once again.
Contributing
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.