The Modular Autonomous Discovery for Science (MADSci) Data Manager.

MADSci Data Manager

The Data Manager captures, stores, and queries data generated during experiments, including both JSON values and files.

[Figure: MADSci Data Manager diagram]

Features

  • DataPoint storage: JSON values and files with metadata
  • Flexible storage: Local filesystem or S3-compatible object storage (MinIO, AWS S3, GCS)
  • Rich metadata: Ownership info, timestamps, custom labels
  • Queryable: Search by value and metadata
  • Cloud integration: Multi-provider cloud storage support

Installation

See the main README for installation options. This package is available as:

Dependencies: MongoDB database, optional MinIO/S3 storage (see example_lab)

Usage

Quick Start

Use the example_lab as a starting point:

# Start with working example
docker compose up  # From repo root
# Data Manager available at http://localhost:8004/docs

# Or run standalone
python src/madsci_data_manager/madsci/data_manager/data_server.py

Manager Setup

For custom deployments, see example_data.manager.yaml for configuration options.

Data Client

Use DataClient to store and retrieve experimental data:

from madsci.client.data_client import DataClient
from madsci.common.types.datapoint_types import DataPoint, DataPointTypeEnum

client = DataClient(data_server_url="http://localhost:8004")

# Store JSON data
value_dp = DataPoint(
    label="Temperature Reading",
    data_type=DataPointTypeEnum.JSON,
    value={"temperature": 23.5, "unit": "Celsius"}
)
submitted = client.submit_datapoint(value_dp)

# Store files
file_dp = DataPoint(
    label="Experiment Log",
    data_type=DataPointTypeEnum.FILE,
    path="/path/to/data.txt"
)
submitted_file = client.submit_datapoint(file_dp)

# Retrieve data
retrieved = client.get_datapoint(submitted.datapoint_id)

# Save file locally
client.save_datapoint_value(submitted_file.datapoint_id, "/local/save/path.txt")

Examples: See experiment_notebook.ipynb for data management workflows.

Storage Configuration

Local Storage (Default)

  • Files stored on filesystem with date-based hierarchy
  • Simple setup, no additional dependencies
  • File paths stored in MongoDB database
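
The date-based hierarchy can be pictured with a short sketch. The exact directory layout used by the Data Manager is not specified here, so the year/month/day nesting, the `build_dated_path` helper, and the `storage_root` parameter are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path


def build_dated_path(storage_root: str, filename: str, when: datetime) -> Path:
    """Return a path nested by year/month/day under storage_root (hypothetical layout)."""
    return (
        Path(storage_root)
        / f"{when.year:04d}"
        / f"{when.month:02d}"
        / f"{when.day:02d}"
        / filename
    )


# Example: a file captured on 7 March 2025
example_path = build_dated_path("/data/madsci", "data.txt", datetime(2025, 3, 7))
```

Grouping files by date keeps any single directory from growing without bound and makes it easy to locate or archive the output of a given day.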

Object Storage (S3-Compatible)

Supports cloud and self-hosted storage providers:

  • AWS S3
  • Google Cloud Storage (with HMAC keys)
  • MinIO (self-hosted or cloud)
  • Any S3-compatible service

Benefits:

  • Automatic upload with fallback to local storage
  • Better for large files and distributed setups
  • Built-in metadata and versioning support
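
The "upload with fallback" behavior can be sketched as a try-then-fall-back pattern. This is not the Data Manager's actual implementation; `store_with_fallback` and the two callables passed to it are hypothetical stand-ins for an object storage upload and a local save.

```python
from typing import Callable, Tuple


def store_with_fallback(
    path: str,
    upload: Callable[[str], str],
    save_locally: Callable[[str], str],
) -> Tuple[str, str]:
    """Try object storage first; if the upload raises, fall back to local storage.

    Returns (backend, location). Both callables are hypothetical stand-ins.
    """
    try:
        return "object_storage", upload(path)
    except Exception:  # broad on purpose: any upload failure triggers the fallback
        return "local", save_locally(path)
```

Returning the backend name alongside the location lets the caller record where the file actually ended up.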

Quick Setup

# Use example_lab with pre-configured MinIO
docker compose up  # From repo root
# MinIO Console: http://localhost:9001 (minioadmin/minioadmin)

Configuration Examples

AWS S3:

from madsci.client.data_client import DataClient
from madsci.common.types.datapoint_types import ObjectStorageSettings

aws_config = ObjectStorageSettings(
    endpoint="s3.amazonaws.com",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=True,
    default_bucket="my-bucket",
    region="us-east-1"
)
client = DataClient(object_storage_settings=aws_config)

Google Cloud Storage:

# ObjectStorageSettings is imported as in the AWS S3 example
gcs_config = ObjectStorageSettings(
    endpoint="storage.googleapis.com",
    access_key="YOUR_HMAC_ACCESS_KEY",
    secret_key="YOUR_HMAC_SECRET",
    secure=True,
    default_bucket="my-gcs-bucket"
)
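
The placeholder keys in the examples above should not be hard-coded in real deployments. One option, sketched here, is to read them from environment variables and build the keyword arguments for `ObjectStorageSettings`. The environment variable names and the `storage_kwargs_from_env` helper are hypothetical and not defined by MADSci; only the dictionary keys mirror the fields used in the examples.

```python
import os
from typing import Mapping, Optional


def storage_kwargs_from_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect ObjectStorageSettings keyword arguments from environment variables.

    The variable names are hypothetical; the keys mirror the fields shown above.
    """
    env = os.environ if env is None else env
    return {
        "endpoint": env["OBJECT_STORAGE_ENDPOINT"],
        "access_key": env["OBJECT_STORAGE_ACCESS_KEY"],
        "secret_key": env["OBJECT_STORAGE_SECRET_KEY"],
        # Default to TLS unless explicitly disabled
        "secure": env.get("OBJECT_STORAGE_SECURE", "true").lower() == "true",
        "default_bucket": env["OBJECT_STORAGE_BUCKET"],
    }
```

The result could then be passed along as `ObjectStorageSettings(**storage_kwargs_from_env())`. A missing required variable raises `KeyError`, which surfaces configuration mistakes early.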

Direct Object Storage DataPoints

from madsci.common.types.datapoint_types import DataPoint, DataPointTypeEnum

storage_dp = DataPoint(
    label="Large Dataset",
    data_type=DataPointTypeEnum.OBJECT_STORAGE,
    path="/path/to/data.parquet",
    bucket_name="my-bucket",
    object_name="datasets/data.parquet",
    custom_metadata={"version": "v2.1"}
)
# Assumes client is a DataClient configured with object storage settings
uploaded = client.submit_datapoint(storage_dp)

Authentication: Use IAM users/service accounts with appropriate storage permissions. See cloud provider documentation for detailed setup.

Database Migration Tools

MADSci Data Manager includes automated MongoDB migration tools that handle schema changes and version tracking for the data management system.

Features

  • Version Compatibility Checking: Automatically detects mismatches between the MADSci package version and the MongoDB schema version
  • Automated Backup: Creates MongoDB dumps using mongodump before applying migrations to enable rollback on failure
  • Schema Management: Creates collections and indexes based on schema definitions
  • Index Management: Ensures required indexes exist for optimal query performance
  • Location Independence: Auto-detects schema files or accepts explicit paths
  • Safe Migration: All changes are applied transactionally with automatic rollback on failure

Usage

Standard Usage

# Run migration for data database (auto-detects schema file)
python -m madsci.common.mongodb_migration_tool --database madsci_data

# Migrate with explicit database URL
python -m madsci.common.mongodb_migration_tool --db-url mongodb://localhost:27017 --database madsci_data

# Use custom schema file
python -m madsci.common.mongodb_migration_tool --database madsci_data --schema-file /path/to/schema.json

# Create backup only
python -m madsci.common.mongodb_migration_tool --database madsci_data --backup-only

# Restore from backup
python -m madsci.common.mongodb_migration_tool --database madsci_data --restore-from /path/to/backup

# Check version compatibility without migrating
python -m madsci.common.mongodb_migration_tool --database madsci_data --check-version
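
When the migration tool is driven from a script rather than a shell, the commands above can be assembled programmatically. The sketch below only builds the argument list from the flags shown above; the `build_migration_command` helper and its parameters are hypothetical and not part of MADSci.

```python
import sys
from typing import List, Optional


def build_migration_command(
    database: str,
    db_url: Optional[str] = None,
    schema_file: Optional[str] = None,
    mode: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the migration tool from the flags shown above."""
    allowed_modes = {None, "--backup-only", "--check-version"}
    if mode not in allowed_modes:
        raise ValueError(f"unsupported mode: {mode}")
    # Use the current interpreter so the tool runs in the same environment
    cmd = [sys.executable, "-m", "madsci.common.mongodb_migration_tool"]
    if db_url:
        cmd += ["--db-url", db_url]
    cmd += ["--database", database]
    if schema_file:
        cmd += ["--schema-file", schema_file]
    if mode:
        cmd.append(mode)
    return cmd
```

The returned list can be handed to `subprocess.run(..., check=True)`, which avoids shell quoting issues with URLs and paths.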

Docker Usage

When running in Docker containers, use Docker Compose to execute migration commands:

# Run migration for data database in Docker
docker-compose run --rm data-manager python -m madsci.common.mongodb_migration_tool --db-url 'mongodb://mongodb:27017' --database 'madsci_data' --schema-file '/app/madsci/data_manager/schema.json'

# Create backup only in Docker
docker-compose run --rm data-manager python -m madsci.common.mongodb_migration_tool --db-url 'mongodb://mongodb:27017' --database 'madsci_data' --schema-file '/app/madsci/data_manager/schema.json' --backup-only

# Check version compatibility in Docker
docker-compose run --rm data-manager python -m madsci.common.mongodb_migration_tool --db-url 'mongodb://mongodb:27017' --database 'madsci_data' --schema-file '/app/madsci/data_manager/schema.json' --check-version

Server Integration

The Data Manager server automatically checks version compatibility on startup. If it detects a mismatch, the server refuses to start and displays migration instructions:

DATABASE INITIALIZATION REQUIRED! SERVER STARTUP ABORTED!
The database exists but needs version tracking setup.
To resolve this issue, run the migration tool and restart the server.

Schema File Location

The migration tool automatically searches for schema files in:

  • madsci/data_manager/schema.json

Backup Location

Backups are stored in .madsci/mongodb/backups/ with timestamped filenames:

  • Format: madsci_data_backup_YYYYMMDD_HHMMSS
  • Can be restored using the --restore-from option
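
The timestamped naming format can be reproduced and parsed with the standard library, which is handy for finding the most recent backup to pass to --restore-from. This is a sketch of the naming convention only; `backup_name` and `parse_backup_time` are hypothetical helpers, not functions provided by the migration tool.

```python
from datetime import datetime


def backup_name(database: str, when: datetime) -> str:
    """Format a backup name as <database>_backup_YYYYMMDD_HHMMSS."""
    return f"{database}_backup_{when.strftime('%Y%m%d_%H%M%S')}"


def parse_backup_time(name: str) -> datetime:
    """Recover the timestamp from the last two underscore-separated fields."""
    date_part, time_part = name.rsplit("_", 2)[-2:]
    return datetime.strptime(f"{date_part}_{time_part}", "%Y%m%d_%H%M%S")


def latest_backup(names: list) -> str:
    """Return the backup name with the most recent timestamp."""
    return max(names, key=parse_backup_time)
```

Because the timestamp fields are zero-padded and ordered from year to second, names for the same database also sort chronologically as plain strings.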

Requirements

  • MongoDB server running and accessible
  • MongoDB tools (mongodump, mongorestore) installed
  • Appropriate database permissions for the specified user
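
Before running a migration, it can help to confirm that the MongoDB tools are actually on the PATH. The following is a minimal pre-flight check, assuming only the two tool names listed above; the `missing_tools` helper is hypothetical and not part of MADSci.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str] = ("mongodump", "mongorestore"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of required executables that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

An empty list means both tools were found; otherwise the list names what still needs to be installed. The `which` parameter exists so the lookup can be replaced in tests.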
