The Modular Autonomous Discovery for Science (MADSci) Data Manager.

# MADSci Data Manager

Handles capturing, storing, and querying data, in either JSON value or file form, created during the course of an experiment (either collected by instruments, or synthesized during analysis).

## Notable Features
- Collects and stores data generated in the course of an experiment as "datapoints"
- Currently supported datapoint types:
  - Values, stored as JSON-serializable data
  - Files, stored as-is
- Datapoints include metadata such as ownership info and timestamps
- Datapoints are queryable and searchable by both value and metadata
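To make the value-and-metadata query model concrete, here is a small sketch in plain Python. The field names (`label`, `value`, `ownership`, `data_timestamp`) follow the client examples later in this README, but the record layout here is illustrative, not the actual MADSci schema:

```python
# Illustrative datapoint records; the dict layout is a sketch,
# not the actual MADSci storage schema.
datapoints = [
    {"label": "Temperature Reading",
     "value": {"temperature": 23.5, "unit": "Celsius"},
     "ownership": {"experiment_id": "exp_001"},
     "data_timestamp": "2024-01-15T10:30:00"},
    {"label": "Pressure Reading",
     "value": {"pressure": 101.3, "unit": "kPa"},
     "ownership": {"experiment_id": "exp_002"},
     "data_timestamp": "2024-01-15T11:00:00"},
]

# A query can filter on metadata (ownership) and on the value itself:
matches = [d for d in datapoints
           if d["ownership"]["experiment_id"] == "exp_001"
           and d["value"].get("temperature", 0) > 20.0]
print(matches[0]["label"])  # Temperature Reading
```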
## Installation

The MADSci Data Manager is available via the Python Package Index, and can be installed via:

```bash
pip install madsci.data_manager
```

This Python package is also included as part of the madsci Docker image. You can see an example configuration in the example compose file.

Note that you will also need a MongoDB database (included in the example compose file).
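A minimal compose sketch of the two required services is shown below. The image tag, service names, and ports here are assumptions for illustration; consult the example compose file in the MADSci repository for the authoritative version:

```yaml
# Sketch only: image name and versions are assumptions, not the
# official compose file.
services:
  mongodb:
    image: mongo:7
    ports:
      - "27017:27017"
  data_manager:
    image: ghcr.io/ad-sdl/madsci  # image name is an assumption
    ports:
      - "8004:8004"
    depends_on:
      - mongodb
```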
## Usage

### Manager

To create and run a new MADSci Data Manager, do the following in your MADSci lab directory:

- If you're not using docker compose, provision and configure a MongoDB instance.
- If you're using docker compose, define your data manager and mongodb services based on the example compose file.

```bash
# Create a Data Manager Definition
madsci manager add -t data_manager

# Start the database and Data Manager Server
docker compose up
# OR
python -m madsci.data_manager.data_server
```
You should see a REST server start on the configured host and port. Navigate in your browser to the URL you configured (default: http://localhost:8004/) to confirm it's working.

You can see up-to-date documentation on the endpoints provided by your data manager, and try them out, via the Swagger page served at http://your-data-manager-url-here/docs.
### Client

You can use MADSci's `DataClient` (`madsci.client.data_client.DataClient`) in your Python code to save, get, or query datapoints.

Here are some examples of using the `DataClient` to interact with the Data Manager:
```python
from datetime import datetime

from madsci.client.data_client import DataClient
from madsci.common.types.datapoint_types import FileDataPoint, ValueDataPoint

# Initialize the DataClient
client = DataClient(url="http://localhost:8004")

# Create a ValueDataPoint
value_datapoint = ValueDataPoint(
    label="Temperature Reading",
    value={"temperature": 23.5, "unit": "Celsius"},
    data_timestamp=datetime.now(),
)

# Submit the ValueDataPoint
submitted_value_datapoint = client.submit_datapoint(value_datapoint)
print(f"Submitted ValueDataPoint: {submitted_value_datapoint}")

# Retrieve the ValueDataPoint by ID
retrieved_value_datapoint = client.get_datapoint(submitted_value_datapoint.datapoint_id)
print(f"Retrieved ValueDataPoint: {retrieved_value_datapoint}")

# Create a FileDataPoint
file_datapoint = FileDataPoint(
    label="Experiment Log",
    path="/path/to/experiment_log.txt",
    data_timestamp=datetime.now(),
)

# Submit the FileDataPoint
submitted_file_datapoint = client.submit_datapoint(file_datapoint)
print(f"Submitted FileDataPoint: {submitted_file_datapoint}")

# Retrieve the FileDataPoint by ID
retrieved_file_datapoint = client.get_datapoint(submitted_file_datapoint.datapoint_id)
print(f"Retrieved FileDataPoint: {retrieved_file_datapoint}")

# Save the file from the FileDataPoint to a local path
client.save_datapoint_value(submitted_file_datapoint.datapoint_id, "/local/path/to/save/experiment_log.txt")
print("File saved successfully.")
```
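Since value datapoints are stored as JSON, their values must be JSON-serializable. A quick round-trip check with the standard library (independent of MADSci) catches problems before submission:

```python
import json
from datetime import datetime

# A round-trip through json confirms the value can be stored as-is.
value = {"temperature": 23.5, "unit": "Celsius"}
assert json.loads(json.dumps(value)) == value

# Non-serializable values (e.g. a raw datetime object) fail fast:
try:
    json.dumps({"timestamp": datetime.now()})
    serializable = True
except TypeError:
    serializable = False
print(serializable)  # False: convert to an ISO string or number first
```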
## Object Storage Integration

The MADSci Data Manager supports optional MinIO object storage for efficient handling of large files. When configured, file datapoints are automatically stored in object storage instead of the local filesystem. See the MinIO documentation for details on running MinIO itself.
### How It Works

With object storage configured:

- File datapoints are uploaded to MinIO object storage during submission
- Object storage metadata (bucket name, object name, public URL, etc.) is stored in the database
- The datapoint type automatically changes from `file` to `object_storage`
- If the object storage upload fails, storage automatically falls back to the local filesystem

Without object storage (default behavior):

- File datapoints are stored locally on the filesystem
- File paths are stored in the database
- Existing behavior is preserved with no changes required
### Configuration

Enable object storage by adding MinIO configuration to your Data Manager definition:

```yaml
# example_data_manager.manager.yaml
name: example_data_manager
db_url: mongodb://localhost:27017
host: localhost
port: 8004
file_storage_path: ./data

# Add MinIO object storage configuration
minio_client_config:
  endpoint: "localhost:9000"
  access_key: "minioadmin"
  secret_key: "minioadmin"
  secure: false
  default_bucket: "madsci-data"
```
### Docker Compose Setup

The /MADSci/compose.yaml includes a pre-configured MinIO service:

```bash
# Start all services including MinIO
docker compose up

# Access the MinIO Console
open http://localhost:9001
# Login: minioadmin / minioadmin
```

MinIO will be available at:

- API Endpoint: http://localhost:9000
- Web Console: http://localhost:9001
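The `secure` flag in the configuration above controls whether the client speaks HTTP or HTTPS to the endpoint. A hypothetical helper (not part of the MADSci API) illustrates the mapping:

```python
def object_storage_url(endpoint: str, secure: bool) -> str:
    """Build the base URL for an S3-compatible endpoint.

    Illustrative helper only; MADSci's clients derive this internally.
    """
    scheme = "https" if secure else "http"
    return f"{scheme}://{endpoint}"

print(object_storage_url("localhost:9000", secure=False))   # http://localhost:9000
print(object_storage_url("s3.amazonaws.com", secure=True))  # https://s3.amazonaws.com
```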
## Cloud Storage Integration

The MADSci Data Client supports multiple cloud storage providers through S3-compatible APIs, allowing you to store large files efficiently across different cloud platforms.
### Supported Providers

- Amazon Web Services (AWS) S3
- Google Cloud Storage (GCS), using S3-compatible HMAC authentication
- MinIO (self-hosted or cloud)
- Any S3-compatible storage service
### Configuration

#### AWS S3

```python
from madsci.client.data_client import DataClient
from madsci.common.types.datapoint_types import ObjectStorageDefinition

aws_config = ObjectStorageDefinition(
    endpoint="s3.amazonaws.com",
    access_key="AKIAIOSFODNN7EXAMPLE",  # Your AWS Access Key ID
    secret_key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",  # Your AWS Secret Access Key
    secure=True,
    default_bucket="my-madsci-bucket",
    region="us-east-1",  # Specify your AWS region
)

client = DataClient(object_storage_config=aws_config)
```
#### Google Cloud Storage (GCS)

GCS requires HMAC keys for S3-compatible access:

```python
from madsci.client.data_client import DataClient
from madsci.common.types.datapoint_types import ObjectStorageDefinition

gcs_config = ObjectStorageDefinition(
    endpoint="storage.googleapis.com",
    access_key="GOOGTS7C7FIS2E4U4RBGEXAMPLE",  # Your GCS HMAC Access Key
    secret_key="bGoa+V7g/yqDXvKRqq+JTFn4uQZbPiQJo8rkEXAMPLE",  # Your GCS HMAC Secret
    secure=True,
    default_bucket="my-gcs-bucket",
)

client = DataClient(object_storage_config=gcs_config)
```
### Authentication Setup

#### AWS S3 Authentication

1. IAM User Method (Recommended):

   Create an IAM user with S3 permissions, then get the Access Key ID and Secret Access Key from the AWS Console.

2. Environment Variables:

   ```bash
   export AWS_ACCESS_KEY_ID="your-access-key"
   export AWS_SECRET_ACCESS_KEY="your-secret-key"
   ```

3. AWS CLI Profile:

   ```bash
   aws configure --profile madsci
   # Then reference the profile in your application
   ```
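With the environment-variable method, credentials can be read at runtime rather than hard-coded. A sketch of that pattern (the variable names follow the standard AWS convention; the bucket name is an example):

```python
import os

# Read credentials from the environment rather than hard-coding them.
access_key = os.environ.get("AWS_ACCESS_KEY_ID", "")
secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY", "")

config_kwargs = {
    "endpoint": "s3.amazonaws.com",
    "access_key": access_key,
    "secret_key": secret_key,
    "secure": True,
    "default_bucket": "my-madsci-bucket",  # example bucket name
}
# Pass these to ObjectStorageDefinition(**config_kwargs)
# as in the configuration examples above.
```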
#### Google Cloud Storage Authentication

1. Generate HMAC Keys:

   In the Google Cloud Console: Storage > Settings > Interoperability > Create Key.

2. Service Account Method:

   Create a service account with the Storage Admin role, then generate an HMAC key for the service account.
### Usage Examples

```python
from madsci.client.data_client import DataClient
from madsci.common.types.datapoint_types import ObjectStorageDataPoint

# Assumes a client configured with object storage, as shown above
client = DataClient(url="http://localhost:8004")

# Create an object storage datapoint directly
storage_datapoint = ObjectStorageDataPoint(
    label="Preprocessed Data",
    path="/path/to/local-file.parquet",
    bucket_name="my-bucket",
    object_name="datasets/processed_data.parquet",
    storage_endpoint="s3.amazonaws.com",
    public_endpoint="s3.amazonaws.com",
    content_type="application/octet-stream",
    custom_metadata={
        "dataset_version": "v2.1",
        "processing_date": "2024-01-15",
    },
)

uploaded = client.submit_datapoint(storage_datapoint)
```
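The `object_name` above acts as the key within the bucket and is independent of the local `path`. One common convention is to derive it from the local filename under a prefix; this helper is hypothetical (MADSci does not require any particular naming scheme):

```python
from pathlib import Path

def make_object_name(prefix: str, local_path: str) -> str:
    """Derive a bucket object name from a local file path.

    Hypothetical convenience helper, not part of the MADSci API.
    """
    return f"{prefix.rstrip('/')}/{Path(local_path).name}"

print(make_object_name("datasets", "/path/to/local-file.parquet"))
# datasets/local-file.parquet
```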
### Regional Endpoints

#### AWS S3 Regional Endpoints

```python
# US East (N. Virginia) - Default
endpoint = "s3.amazonaws.com"

# US West (Oregon)
endpoint = "s3.us-west-2.amazonaws.com"

# Europe (Ireland)
endpoint = "s3.eu-west-1.amazonaws.com"

# Asia Pacific (Tokyo)
endpoint = "s3.ap-northeast-1.amazonaws.com"
```
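The endpoints above follow a regular `s3.<region>.amazonaws.com` pattern, with us-east-1 using the legacy global endpoint. A small helper capturing that pattern (illustrative only, not part of MADSci):

```python
def s3_endpoint(region: str = "us-east-1") -> str:
    """Return the AWS S3 endpoint for a region.

    us-east-1 maps to the legacy global endpoint; all other regions
    follow the s3.<region>.amazonaws.com pattern shown above.
    """
    if region == "us-east-1":
        return "s3.amazonaws.com"
    return f"s3.{region}.amazonaws.com"

print(s3_endpoint("us-west-2"))  # s3.us-west-2.amazonaws.com
```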