Reproducibility simplified.

Calkit

Calkit is a lightweight framework for doing reproducible research. It acts as a top-level layer to integrate and simplify the use of enabling technologies such as Git, DVC, Conda, and Docker. Calkit also adds a domain-specific data model such that all aspects of the research process can be fully described in a single repository and therefore easily consumed by others.

Our goal is to make reproducibility easier so it becomes more common. To do this, we try to make it easy for users to follow two simple rules:

  1. Keep everything in version control. This includes large files like datasets, enabled by DVC. The Calkit cloud serves as a simple default DVC remote storage location for those who do not want to manage their own infrastructure.
  2. Generate all important artifacts with a single pipeline. There should be no special instructions required to reproduce a project's artifacts. It should be as simple as calling calkit run. The DVC pipeline (in a project's dvc.yaml file) is therefore the main thing to "build" throughout a research project. Calkit provides helper functionality to build pipeline stages that keep computational environments up-to-date and label their outputs for convenient reuse.
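
To illustrate the second rule, here is a minimal Python sketch of the idea behind a pipeline that reruns a stage only when its dependencies have changed. It is not how DVC or Calkit is implemented; the `Stage` class, the `run_pipeline` function, and the in-memory `lock` dictionary are hypothetical names invented for this illustration.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


@dataclass
class Stage:
    """A hypothetical pipeline stage: dependencies in, one output out."""

    name: str
    deps: list[Path]
    out: Path
    run: Callable[[], None]


def run_pipeline(stages: list[Stage], lock: dict[str, list[str]]) -> list[str]:
    """Run stale stages in order and return the names of the stages that ran.

    `lock` maps stage names to the dependency hashes seen on the last run,
    loosely analogous to a lock file (an assumption made for this sketch).
    """
    executed = []
    for stage in stages:
        current = [file_hash(p) for p in stage.deps]
        if lock.get(stage.name) == current and stage.out.exists():
            continue  # Up to date: inputs unchanged and output present.
        stage.run()
        lock[stage.name] = current
        executed.append(stage.name)
    return executed


def demo() -> list[list[str]]:
    """Run a one-stage pipeline three times, editing the input before the last run."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.txt"
        out = Path(tmp) / "upper.txt"
        raw.write_text("hello")
        stage = Stage(
            name="process",
            deps=[raw],
            out=out,
            run=lambda: out.write_text(raw.read_text().upper()),
        )
        lock: dict[str, list[str]] = {}
        first = run_pipeline([stage], lock)
        second = run_pipeline([stage], lock)  # Nothing changed, so nothing runs.
        raw.write_text("changed")
        third = run_pipeline([stage], lock)  # Input changed, so the stage reruns.
        return [first, second, third]
```

Calling `demo()` shows the property that matters for reproducibility: repeating the same "run everything" call is safe, and work is redone only when an input actually changes.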

Tutorials

Why does reproducibility matter?

If your work is reproducible, that means that someone else can "run" it and calculate the same results or outputs. This is a major step towards addressing the replication crisis and has some major benefits for both you as an individual and the research community:

  1. You will avoid mistakes caused by, e.g., running an old version of a script and including a figure that wasn't created after fixing a bug in the data processing pipeline.
  2. Since your project is "runnable," it's more likely that someone else will be able to reuse part of your work to run it in a different context, thereby producing a bigger impact and accelerating the pace of discovery. If someone can take what you've done and use it to calculate a prediction, you have just produced truly useful knowledge.

Why another tool/platform?

Git, GitHub, DVC, Docker et al. are amazing tools/platforms, but their use involves multiple fairly difficult learning curves, and tying them together might mean developing something new for each project. Our goal is to provide a single tool and platform to unify all of these so that there is a single, gentle learning curve. However, it is not our goal to hide or replace these underlying components. Advanced users can use them directly, but new users aren't forced to, which helps them get up and running with less effort and training. Calkit should help users understand what is going on under the hood without forcing them to work at that lower level of abstraction.

Installation

Calkit requires Git and Python. If you want to use Docker containers, which is typically a good idea, Docker should also be installed. For Python, we recommend Mambaforge. If you're a Windows user and decide to install Mambaforge or any other Conda-based distribution, e.g., Anaconda, you'll probably want to ensure that the Conda environment is activated by default in Git Bash.

After Python is installed, run

pip install calkit-python
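
If you are unsure whether the other prerequisites are available, a small Python check along these lines can help. This is only a sketch: the `check_prerequisites` function is a hypothetical helper, not part of Calkit, and it assumes the tools are invoked as `git` and `docker` on your PATH.

```python
import shutil
import sys


def check_prerequisites(required=("git",), optional=("docker",)) -> dict[str, bool]:
    """Report which command-line tools can be found on the PATH."""
    status = {
        tool: shutil.which(tool) is not None for tool in (*required, *optional)
    }
    missing = [tool for tool in required if not status[tool]]
    if missing:
        print(f"Missing required tools: {', '.join(missing)}", file=sys.stderr)
    return status


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'not found'}")
```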

Cloud integration

The Calkit cloud platform (https://calkit.io) serves as a project management interface and a DVC remote for easily storing all versions of your data/code/figures/publications, interacting with your collaborators, reusing others' research artifacts, etc.

After signing up, visit the settings page and create a token. Then run

calkit config set token ${YOUR_TOKEN_HERE}

Then, inside a project repo you'd like to connect to the cloud, run

calkit config setup-remote

This will set up the Calkit DVC remote so that commands like dvc push will push versions of your data or pipeline outputs to the cloud for safe storage and sharing with your collaborators.

How it works

Calkit creates a simple human-readable "database" inside the calkit.yaml file, which serves as a way to store important information about the project, e.g., what question(s) it seeks to answer, what files should be considered datasets, figures, publications, etc. The Calkit cloud reads this database and registers the various entities as part of the entire ecosystem such that if a project is made public, other researchers can find and reuse your work to accelerate their own.
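
As a rough illustration of what such a "database" enables, the Python sketch below models the kind of information described above as a plain dictionary and lists the registered artifacts. The exact schema of calkit.yaml is not given here, so every key (`questions`, `datasets`, `figures`, `publications`, `path`, `title`), every example value, and the `list_artifacts` helper are assumptions made for illustration only.

```python
# Hypothetical in-memory form of a parsed calkit.yaml; the real schema may differ.
project_info = {
    "questions": ["Does the proposed model predict the measured drag?"],
    "datasets": [{"path": "data/raw.csv", "title": "Raw measurements"}],
    "figures": [{"path": "figures/drag.png", "title": "Drag vs. speed"}],
    "publications": [{"path": "paper/paper.pdf", "title": "Draft manuscript"}],
}


def list_artifacts(
    info: dict, kinds=("datasets", "figures", "publications")
) -> list[tuple[str, str]]:
    """Return (kind, path) pairs for every artifact registered in the project."""
    return [(kind, item["path"]) for kind in kinds for item in info.get(kind, [])]


if __name__ == "__main__":
    for kind, path in list_artifacts(project_info):
        print(f"{kind}: {path}")
```

Because the file is plain, human-readable data, a tool or a collaborator can enumerate a project's artifacts without running any of its code.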

Design/UX principles

  1. Be opinionated. Users should not be forced to make unimportant decisions. However, if they disagree, they should have the ability to change the default behavior. The most common use case should be default. Commands that are commonly executed as groups should be combined, but still available to be run individually if desired.
  2. Commits should ideally be made automatically as part of actions that make changes to the project repo. For example, if a new object is added via the CLI, a commit should be made right then unless otherwise specified. This saves the trouble of running multiple commands and encourages atomic commits.
  3. Pushes should require explicit input from the user. It is still TBD whether or not a pull should automatically be made, though in general we want to encourage trunk-based development, i.e., only working on a single branch. One exception might be for local experimentation that has a high likelihood of failure, in which case a branch can be a nice way to throw those changes away. Multiple branches should probably not live in the cloud, however, except for small, quickly merged pull requests.
  4. Idempotency is always a good thing. Unnecessary state is bad. For example, we should not encourage caching pipeline outputs for operations that are cheap. Caching should happen either for state that is valuable on its own, like a figure, or for an intermediate result that is expensive to generate.
  5. There should be as few frequently used commands as possible, and they should require as little memorization as possible, e.g., a user should be able to keep running calkit run, and that is all they need to do to make sure the project is up-to-date.
