SQLite-backed metadata database service for Seamless

seamless-database

seamless-database is the checksum-based metadata and caching service for the Seamless framework. It acts as the distributed computation cache that allows Seamless workflows to avoid recomputing identical transformations, both within a single session and across the entire cluster.

How it works

Seamless uses content-addressed storage: every piece of data (buffers, code, parameters) is identified by its checksum. When a transformation (computation) is submitted, its inputs are hashed into a transformation checksum. Before executing the computation, Seamless components (such as seamless-dask) query the database: "has this transformation been computed before?" If a cached result is found, the result checksum is returned immediately, skipping the computation entirely.
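The lookup-before-compute idea can be illustrated with a minimal in-memory sketch. It is not the Seamless implementation: the hash function (SHA-256), the dictionary-based stores, and the names `checksum_of`, `store`, and `run_cached` are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical stand-ins for the database and the buffer storage
transformation_cache = {}  # transformation checksum -> result checksum
buffers = {}               # checksum -> bytes


def checksum_of(data: bytes) -> str:
    """Hex checksum of a buffer (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def store(data: bytes) -> str:
    """Store a buffer under its own checksum and return that checksum."""
    cs = checksum_of(data)
    buffers[cs] = data
    return cs


def run_cached(code: str, inputs: dict, compute):
    """Return (result_checksum, was_cached) for one transformation."""
    # Describe the transformation purely by the checksums of its parts
    description = {
        "code": store(code.encode()),
        "inputs": {k: store(json.dumps(v).encode()) for k, v in inputs.items()},
    }
    tf_checksum = checksum_of(json.dumps(description, sort_keys=True).encode())

    # "Has this transformation been computed before?"
    cached = transformation_cache.get(tf_checksum)
    if cached is not None:
        return cached, True

    # Cache miss: compute, store the result buffer, record the mapping
    result = compute(**inputs)
    result_checksum = store(json.dumps(result).encode())
    transformation_cache[tf_checksum] = result_checksum
    return result_checksum, False


calls = []


def add(a, b):
    calls.append((a, b))  # lets us observe how often the computation runs
    return a + b


first = run_cached("a + b", {"a": 2, "b": 3}, add)
second = run_cached("a + b", {"a": 2, "b": 3}, add)
```

The second call returns the same result checksum without invoking `add` again, because the identical code and inputs hash to the same transformation checksum.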

The database stores the following kinds of records:

| Table | Purpose |
| --- | --- |
| Transformation | Maps a transformation checksum to its result checksum |
| RevTransformation | Reverse lookup: finds which transformations produced a given result |
| BufferInfo | Stores buffer metadata (length, dtype, encoding, etc.) for a checksum |
| SyntacticToSemantic | Maps between syntactic and semantic checksums per celltype |
| Expression | Caches expression evaluation results (input checksum + path + celltype → result checksum) |
| MetaData | Stores execution metadata for transformations (executor, environment, timing) |
| IrreproducibleTransformation | Records transformations whose results are not reproducible |

All data is persisted in a single SQLite file (typically seamless.db).
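As a rough illustration of how such records fit into SQLite, the sketch below stores transformation-to-result mappings with Python's standard `sqlite3` module. The table name, column names, and helper functions are hypothetical; the actual schema used by seamless-database is not described here and may differ.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; the real service uses a file
conn = sqlite3.connect(":memory:")

# Hypothetical schema, for illustration only
conn.execute(
    "CREATE TABLE transformation ("
    "checksum TEXT PRIMARY KEY, result TEXT NOT NULL)"
)


def put_result(tf_checksum, result_checksum):
    """Record the result checksum of a transformation."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO transformation (checksum, result) "
            "VALUES (?, ?)",
            (tf_checksum, result_checksum),
        )


def get_result(tf_checksum):
    """Return the cached result checksum, or None on a cache miss."""
    row = conn.execute(
        "SELECT result FROM transformation WHERE checksum = ?",
        (tf_checksum,),
    ).fetchone()
    return row[0] if row else None


def transformations_for(result_checksum):
    """Reverse lookup, analogous in spirit to RevTransformation."""
    rows = conn.execute(
        "SELECT checksum FROM transformation WHERE result = ? ORDER BY checksum",
        (result_checksum,),
    ).fetchall()
    return [r[0] for r in rows]


# Placeholder checksums (not real data)
put_result("aa" * 32, "cc" * 32)
put_result("bb" * 32, "cc" * 32)
```

In this sketch the reverse lookup is a query on the same table; a dedicated reverse table, as the RevTransformation record suggests, would serve the same purpose.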

Role in the Seamless ecosystem

Other Seamless components interact with the database over HTTP:

  • seamless-dask checks the database cache before scheduling a transformation on the Dask cluster, and writes results back after computation.
  • seamless-remote provides the DatabaseClient / DatabaseLaunchedClient classes that other components use to communicate with the database server.
  • seamless-config defines the launch template for the database server (port range, host, timeout, read/write mode).

The server exposes a JSON-over-HTTP protocol: clients send {"type": "<record_type>", "checksum": "<hex>", ...} via GET (read) or PUT (write) requests.
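A client request following this description might be built as in the sketch below. Several details are assumptions, not documented behavior: that the JSON object travels in the request body, that the server is addressed at its root URL, that the record type is spelled `"transformation"`, and that a written value goes in a `"value"` field. In practice, the DatabaseClient classes from seamless-remote are the intended way to talk to the server.

```python
import json
import urllib.request


def build_request(base_url, record_type, checksum, value=None):
    """Build a GET (read) or PUT (write) request carrying a JSON payload.

    The payload layout beyond "type" and "checksum" is an assumption.
    """
    payload = {"type": record_type, "checksum": checksum}
    method = "GET"
    if value is not None:
        payload["value"] = value  # hypothetical field name
        method = "PUT"
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method=method,
    )


# Placeholder checksums and a placeholder address
read_req = build_request("http://localhost:5522", "transformation", "ab" * 32)
write_req = build_request(
    "http://localhost:5522", "transformation", "ab" * 32, value="cd" * 32
)
```

Sending either request is then a matter of `urllib.request.urlopen(read_req)` against a running server; a PUT only succeeds if the server was started with --writable.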

Installation

pip install seamless-database

Usage

# Start a writable database server on a random free port between 5520 and 5530
seamless-database seamless.db --port-range 5520 5530 --writable

# Start a read-only server on a fixed port
seamless-database seamless.db --port 5522

If --port and --port-range are both omitted, seamless-database picks a random free port in the dynamic/private range (49152-65535).
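Picking a random free port from an inclusive range can be sketched as follows. This shows the general technique, not the code seamless-database uses; `pick_free_port` is a hypothetical helper.

```python
import random
import socket


def pick_free_port(start, end, host="127.0.0.1"):
    """Return a random port in [start, end] that can currently be bound."""
    candidates = list(range(start, end + 1))
    random.shuffle(candidates)
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port
    raise RuntimeError(f"no free port in range {start}-{end}")


# Default behaviour described above: the dynamic/private range
port = pick_free_port(49152, 65535)
```

The probe socket is closed before the port is returned, so another process could claim the port in the meantime. A server that binds its real listening socket directly and retries on failure avoids that race.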

Status-file protocol

seamless-database does not require a status file. If --status-file is omitted, it runs independently.

If --status-file is provided, the file is used for two things:

  1. Report the chosen port, especially when --port-range is used.
  2. Report whether startup succeeded ("running") or failed ("failed").

The status-file protocol is simple. seamless-database performs the following steps:

  1. Wait for the status file to exist and parse it as JSON.
  2. Reuse the existing JSON object as the base payload. An empty JSON object {} is sufficient.
  3. Choose or validate its listening port.
  4. Once the HTTP server is up, rewrite the same file with "status": "running" and the selected "port".
  5. If startup fails before the server is running, rewrite the file with "status": "failed" instead.

If remote-http-launcher is used, it may pre-populate the JSON with fields such as the PID, workdir, or "status": "starting". seamless-database preserves such fields when it writes back the final status.
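The steps above can be sketched from the server's side as follows. The function names, the polling interval, and the write-then-rename strategy are assumptions of this sketch; only the "status" and "port" fields and the preservation of existing fields come from the description above.

```python
import json
import os
import tempfile
import time


def wait_for_status_file(path, timeout=10.0, interval=0.1):
    """Step 1: wait until the file exists and parses as a JSON object."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                payload = json.load(f)
            if isinstance(payload, dict):
                return payload
        except (FileNotFoundError, json.JSONDecodeError):
            pass  # not there yet, or only partially written
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no usable status file at {path}")
        time.sleep(interval)


def report_status(path, base_payload, status, port=None):
    """Steps 2, 4 and 5: reuse the existing payload and write back the status."""
    payload = dict(base_payload)  # keeps fields such as the PID or workdir
    payload["status"] = status
    if port is not None:
        payload["port"] = port
    # Write to a temporary file and rename it, so that a reader never
    # sees a half-written file (a design choice of this sketch)
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)
    return payload
```

A server would call `report_status(path, base, "running", port=chosen_port)` once it is listening, and `report_status(path, base, "failed")` if startup raises an error first.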

CLI options

| Option | Description |
| --- | --- |
| database_file | Path to the SQLite file (created if it doesn't exist and --writable is set) |
| --port PORT | Fixed network port |
| --port-range START END | Pick a random free port from an inclusive range |
| --host HOST | Bind address (default: 0.0.0.0) |
| --writable | Allow PUT requests; opens the database in read/write mode |
| --status-file FILE | JSON file used to report server status (for process managers) |
| --timeout SECONDS | Stop the server after this many seconds of inactivity |

CLI scripts

Installing seamless-database also provides:

  • seamless-database
