git-like data management for arbitrary data trees: workspaces, checkouts, remotes, and sync

🌲 forest

CI · Docs · Release · Python 3.9+ · License: MIT · Ruff · mypy: strict

Forest is the data-side parallel to git's version control. Git tracks code in .git/; forest tracks large data — local layout plus remote sync — in .forest/. It borrows git's mental model and verbs (checkout, status, push, pull, remote, a HEAD-style pointer) so your git intuition carries over, but the two domains never overlap and neither requires the other.

Forest is domain-agnostic and self-contained: it manages any data trees, knows nothing about what the data means, and depends on no other project. It moves files and tracks their sync state; it does not validate or interpret their contents.

Model

Fixed-depth, no arbitrary nesting:

workspace → checkout → stage → unit → files
  • Workspace — a per-repo .forest/ control area (registry + active pointer).
  • Checkout — a named data view/focus registered in the workspace (like a git branch you stay rooted in). Switching is an O(1) pointer rewrite; data never moves.
  • Stage — a named data category inside a checkout, with a remote layout.
  • Unit — one addressable item within a stage (a subdirectory, a directory, or a file, per the stage's sync_by).
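
One way to picture how a stage breaks down into units is the Python sketch below. The helper name `discover_units` and its reading of the three `sync_by` modes are assumptions made for illustration; the text above only names the modes, so forest's real discovery rules may differ.

```python
from pathlib import Path


def discover_units(stage_root: Path, sync_by: str) -> list[str]:
    """Return unit names for a stage directory (hypothetical semantics)."""
    if sync_by == "subdirectory":
        # Assumed: each immediate child directory is one unit.
        return sorted(p.name for p in stage_root.iterdir() if p.is_dir())
    if sync_by == "directory":
        # Assumed: the stage directory itself is the single unit.
        return [stage_root.name]
    if sync_by == "file":
        # Assumed: each immediate child file is one unit.
        return sorted(p.name for p in stage_root.iterdir() if p.is_file())
    raise ValueError(f"unknown sync_by mode: {sync_by!r}")
```

Under these assumptions, a stage directory holding two sample folders yields two units in subdirectory mode and a single unit in directory mode.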

Install

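Forest needs rclone on $PATH for transfers; install it separately. To install forest itself in editable mode from a local clone in ./forest: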

pip install -e ./forest

Quick start

No config files are hand-written. Onboarding is a few commands:

forest init                      # create the nameless .forest/ workspace container
forest checkout demo             # create if absent, register + activate 'demo'
forest remote add origin s3://my-bucket/prefix   # allowed before any local data exists
forest add raw ./data/raw        # register stage 'raw' and bind it to a local path
forest push                      # sync every bound stage to the active remote

  • forest init creates only .forest/config.yaml (version: 1, checkouts: {}) and a managed .gitignore. No root config, no checkout, no active pointer.
  • forest checkout <name> switches to the named checkout. If the name is not registered, it first creates and registers the checkout (writing .forest/checkouts/<name>/forest.yaml), then activates it. forest checkout create <name> is the explicit form.
  • Remotes can be added before any local binding — useful when your data is remote-only at first.
  • forest add STAGE PATH registers a new stage and binds it to a local path in one step (use forest bind to rebind an existing stage).

Metadata layout

Everything forest owns lives under .forest/; your data does not.

.forest/
  config.yaml                     # workspace registry: version, checkouts{}
  HEAD                            # active checkout name (gitignored)
  checkouts/
    demo/
      forest.yaml                 # shared: stages, remotes, manifest
      local.yaml                  # user-local: active_remote, stage_paths (gitignored)
      sync_state.json             # user-local push/pull state (gitignored)

Shared metadata (config.yaml, each forest.yaml) is committed so a fresh clone bootstraps with bind + remote use + pull. User-local files (HEAD, local.yaml, sync_state.json) are gitignored.
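
The layout above suggests why switching checkouts is cheap: only the small HEAD file changes. The sketch below illustrates that idea in Python. The helper names are hypothetical, and storing HEAD as a plain-text checkout name is an assumption based on the layout comment, not a documented file format.

```python
from pathlib import Path

# File names come from the layout above; everything else is illustrative.
USER_LOCAL = {"HEAD", "local.yaml", "sync_state.json"}


def switch_checkout(workspace: Path, name: str) -> None:
    """Point HEAD at an existing checkout without touching any data."""
    checkout_dir = workspace / ".forest" / "checkouts" / name
    if not (checkout_dir / "forest.yaml").is_file():
        raise FileNotFoundError(f"checkout {name!r} is not set up")
    # Assumed format: HEAD holds just the checkout name.
    (workspace / ".forest" / "HEAD").write_text(name + "\n")


def active_checkout(workspace: Path) -> str:
    """Read the active checkout name back from HEAD."""
    return (workspace / ".forest" / "HEAD").read_text().strip()


def is_user_local(path: Path) -> bool:
    """True for the files the layout marks as gitignored."""
    return path.name in USER_LOCAL
```

Because the data directories are never touched, the cost of a switch in this sketch is one small file write regardless of data size.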

Commands

| Command | Purpose |
| --- | --- |
| forest init | Create the workspace container, or report setup status if it exists. |
| forest checkout create/adopt/list/current/remove <name> | Manage checkouts; bare forest checkout <name> switches, creating first if needed. remove --yes skips the prompt for scripts. |
| forest add STAGE PATH [--sync-by MODE] | Register a new stage and bind it to a local path; --sync-by picks unit discovery (subdirectory/directory/file). |
| forest bind [STAGE PATH] / forest unbind STAGE | Manage local stage↔path bindings. |
| forest remote add/remove/list/use/show | Manage remotes; use selects the active remote (optional while only one remote exists). |
| forest push / pull / status / diff / ls | Sync and inspect against the active remote. Bare push/pull/status/diff cover every bound stage (unbound stages warn and skip); --all requires all stages bound. |
| forest flow | Emit a Mermaid data-flow diagram of the active checkout. |
| forest migrate | Migrate a legacy biostore layout in place (see below). |

Run any command with -C <path> to operate on another repo without cd.

Forest syncs all files in a data unit, skipping OS junk (.DS_Store, AppleDouble ._*, *.tmp). It applies no content-based include/exclude rules.
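
The junk filter described above can be approximated with name-based glob matching. This is a minimal sketch using the three patterns listed; the function names are hypothetical, and forest's own matching rules (for example, case sensitivity) are not specified here.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns taken from the sentence above.
JUNK_PATTERNS = (".DS_Store", "._*", "*.tmp")


def is_junk(path: str) -> bool:
    """Match only on the final path component, case-sensitively."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in JUNK_PATTERNS)


def syncable(paths: list[str]) -> list[str]:
    """Keep every path that is not OS junk; no content-based rules."""
    return [p for p in paths if not is_junk(p)]
```

For example, `syncable(["s1/reads.fastq", "s1/.DS_Store", "s1/._reads.fastq"])` keeps only `s1/reads.fastq`.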

Config reference

Checkout forest.yaml (shared, committed):

project: demo
remotes:
  origin:
    url: s3://my-bucket/prefix
    region: us-east-2          # optional; also endpoint, profile, key_file, known_hosts
stages:
  raw:
    remote_path: demo/raw      # optional; defaults to <checkout>/<stage>
    sync_by: subdirectory      # subdirectory | directory | file

Checkout local.yaml (per-machine, gitignored):

active_remote: origin
stage_paths:
  raw: ../data/raw             # relative paths resolve from the workspace root
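
The two defaults in these files (a stage's remote_path falling back to <checkout>/<stage>, and relative stage paths resolving from the workspace root) can be sketched as follows. The function names are hypothetical, and treating the workspace root as the directory that contains .forest/ is an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def remote_path_for(checkout: str, stage: str, remote_path: Optional[str] = None) -> str:
    """Use the configured remote_path, else the <checkout>/<stage> default."""
    return remote_path if remote_path else f"{checkout}/{stage}"


def local_stage_dir(workspace_root: Path, stage_path: str) -> Path:
    """Resolve a stage_paths entry; absolute paths are returned unchanged."""
    candidate = Path(stage_path)
    if candidate.is_absolute():
        return candidate
    # Collapse ".." segments lexically, without touching the filesystem.
    return Path(os.path.normpath(workspace_root / candidate))
```

With the example files above, `remote_path_for("demo", "raw")` gives `demo/raw`, which matches the explicit value shown in forest.yaml.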

Environment variables

All optional, all off by default — forest is silent and sends nothing anywhere unless configured. Copy .env.example for a commented template; operational guides live in docs/runbooks/.

| Variable | Default | Effect |
| --- | --- | --- |
| FOREST_LOG_FILE | unset | Append structured logs (JSON lines) to this file. |
| FOREST_LOG_FORMAT | json | json or text; set without FOREST_LOG_FILE to log to stderr. |
| FOREST_LOG_LEVEL | INFO | Standard logging level name. |
| FOREST_METRICS_FILE | unset | Append metric samples as JSON lines for external collectors. |
| FOREST_ANALYTICS_FILE | unset | Opt-in local usage analytics (JSON lines); nothing leaves the machine. |
| FOREST_SENTRY_DSN | unset | Sentry error tracking; needs pip install "forest-cli[observability]". |
| FOREST_ALERT_WEBHOOK | unset | POST failure alerts to this HTTPS endpoint (Slack/Mattermost compatible). |
| FOREST_TRANSFER_RETRIES | 2 | Extra attempts for transient rclone failures; 0 disables. |
| FOREST_RETRY_BASE_DELAY | 0.5 | Initial retry backoff in seconds; doubles per attempt. |
| FOREST_BREAKER_THRESHOLD | 5 | Consecutive transfer failures before the circuit opens; 0 disables. |
| FOREST_BREAKER_RESET_SECONDS | 60 | Cool-down before an open circuit allows a probe operation. |
| FOREST_FLAGS | unset | Comma-separated feature flags; raw-logs disables log secret-scrubbing. |
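
To make the retry and circuit-breaker settings concrete, here is a rough Python model of the behavior the table describes. The `retry_delays` function and `CircuitBreaker` class are illustrative only, not forest's code; whether forest adds jitter, caps the delay, or counts failures differently is not stated, so none of that is modeled.

```python
import os


def retry_delays(env=os.environ) -> list[float]:
    """Backoff before each extra attempt: base delay, doubling each time."""
    retries = int(env.get("FOREST_TRANSFER_RETRIES", "2"))
    base = float(env.get("FOREST_RETRY_BASE_DELAY", "0.5"))
    return [base * 2 ** attempt for attempt in range(retries)]


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; 0 disables it."""

    def __init__(self, threshold: int = 5, reset_seconds: float = 60.0):
        self.threshold = threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None  # time the circuit last opened, if open

    def allow(self, now: float) -> bool:
        if self.threshold == 0 or self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let a probe operation through.
        return now - self.opened_at >= self.reset_seconds

    def record(self, ok: bool, now: float) -> None:
        if ok:
            # Any success closes the circuit and resets the count.
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.threshold and self.failures >= self.threshold:
            self.opened_at = now
```

With the defaults from the table, `retry_delays({})` returns `[0.5, 1.0]`: two extra attempts, waiting half a second and then one second.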

Dogfood: this repo runs forest

This repository manages its own examples/ tree with forest — a live demonstration that .forest/ and .git/ coexist without overlapping. It was set up with exactly the quick-start commands:

forest init
forest checkout demo
forest remote add origin s3://forest-test-542222635421-us-east-2-an --region us-east-2
forest add examples ./examples
forest push

Inspect the result:

git ls-files .forest        # what a clone gets: config.yaml + checkouts/demo/forest.yaml
cat .gitignore              # forest-managed: HEAD, local.yaml, sync_state.json stay local
forest status               # sync state of the examples stage

A fresh clone bootstraps the local half with forest bind examples ./examples followed by forest pull (the single configured remote is used automatically). Pulling needs AWS credentials for the bucket; the layout is the demonstration.

Migrating from biostore

Existing biostore repos use .biostore/ and biostore.yaml. Migrate in place:

forest migrate

This renames .biostore/ → .forest/ and each biostore.yaml → forest.yaml, rewrites the managed .gitignore patterns, and verifies the registry parses. It refuses to run if a .forest/ already exists.
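
The rename half of that migration can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions: `migrate_layout` is a hypothetical name, the .gitignore rewrite and the registry check are left out, and the real command may order or guard its steps differently.

```python
from pathlib import Path


def migrate_layout(repo: Path) -> list[Path]:
    """Rename .biostore/ to .forest/ and each biostore.yaml to forest.yaml."""
    old, new = repo / ".biostore", repo / ".forest"
    if new.exists():
        raise FileExistsError(".forest/ already exists; refusing to migrate")
    if not old.is_dir():
        raise FileNotFoundError("no .biostore/ directory to migrate")
    old.rename(new)
    renamed = []
    # sorted() materializes the matches before any file is renamed.
    for legacy in sorted(new.rglob("biostore.yaml")):
        target = legacy.with_name("forest.yaml")
        legacy.rename(target)
        renamed.append(target)
    return renamed
```

Checking for an existing .forest/ before touching anything mirrors the refusal described above, so a second run fails without changing the tree.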

Notes

  • Single active machine (v1). HEAD/local.yaml/sync_state.json are git-invisible but may be synced by a file-syncing tool; forest assumes one active machine and uses atomic writes plus a per-checkout flock for intra-machine write races.
  • Real filenames. Forest stores data under real paths, not a content-addressed blob store.
  • See docs/adr/ for the design decisions behind the workspace/checkout model.
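
The combination of atomic writes and a per-checkout flock mentioned in the first note is a common POSIX pattern. The Python sketch below shows one way it can look; the lock file name `.lock` and both helper names are assumptions, and forest's actual locking code is not shown in this document.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def checkout_lock(checkout_dir: Path):
    """Hold an exclusive advisory lock for one checkout (POSIX only)."""
    lock_path = checkout_dir / ".lock"  # hypothetical lock file name
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def atomic_write(path: Path, text: str) -> None:
    """Write a temp file in the same directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic on the same filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The lock serializes writers on one machine, while the rename ensures a reader sees either the old file or the new one, never a partial write. Neither mechanism coordinates two machines, which is consistent with the single-active-machine assumption.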

Download files


Source Distribution

forest_cli-0.1.1.tar.gz (169.2 kB)

Uploaded Source

Built Distribution


forest_cli-0.1.1-py3-none-any.whl (63.7 kB)

Uploaded Python 3

File details

Details for the file forest_cli-0.1.1.tar.gz.

File metadata

  • Download URL: forest_cli-0.1.1.tar.gz
  • Upload date:
  • Size: 169.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for forest_cli-0.1.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | e780c1080978e4614d3d13fa271a75d4f0f05e862322b48501c3310d634fa319 |
| MD5 | a82218a08453c6e24ed994d4171c12c7 |
| BLAKE2b-256 | 7aa5b869430f2afb9df6f0b1cecef622e98b47b969a1ae37d460bc88164136e3 |


Provenance

The following attestation bundles were made for forest_cli-0.1.1.tar.gz:

Publisher: publish.yaml on tmsincomb/forest

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file forest_cli-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: forest_cli-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 63.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for forest_cli-0.1.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6058c5e772d7a967996e892805c27020a00971433aea1cad146d9b897497469f |
| MD5 | 9f30cff5195f4f6104c8c3aedf1b7686 |
| BLAKE2b-256 | 4a43fbe0add0d928831bc6e61e0e23d7d89b4721c1a46dea85f96454439811e2 |


Provenance

The following attestation bundles were made for forest_cli-0.1.1-py3-none-any.whl:

Publisher: publish.yaml on tmsincomb/forest

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
