DigitalTwin - Dataspace is a Python package that provides a simple and efficient way to create, manage, and query data spaces.

DigitalTwin - Data Space

DigitalTwin Data Space is a Python package for creating, managing, and querying data spaces, with a focus on modular data pipelines and digital twin applications. It provides a flexible framework to define, schedule, and run data collectors, harvesters, and handlers, including workflows in which components depend on one another.


Features

  • Modular Components: Define custom Collectors, Harvesters, and Handlers for your data workflows.
  • Configuration-Driven: Easily configure your data pipeline using TOML files.
  • Dependency Management: Automatically resolves and schedules component dependencies.
  • CLI Interface: Run, schedule, and manage your data pipeline from the command line.
  • Extensible: Add new data sources, processing steps, or outputs by implementing new components.
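To illustrate what resolving component dependencies involves, here is a minimal sketch of dependency ordering using the standard library's graphlib (Python 3.9+). It is not the package's own scheduler: the component names and the resolve_order helper are hypothetical, and the package may resolve dependencies differently.

```python
from graphlib import TopologicalSorter

# Hypothetical component names; each key maps to the components it depends on.
dependencies = {
    "raw_bikes": [],
    "bike_counts": ["raw_bikes"],
    "bike_trends": ["bike_counts"],
    "weather": [],
}


def resolve_order(deps):
    """Return component names so that every dependency precedes its dependents."""
    # TopologicalSorter treats each dict value as the predecessors of its key.
    return list(TopologicalSorter(deps).static_order())


order = resolve_order(dependencies)
print(order)
```

A circular dependency makes static_order() raise graphlib.CycleError, which is a convenient place to report a misconfigured pipeline.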

Installation

pip install digitaltwin_dataspace

Or, for development:

git clone https://github.com/GaspardMerten/digitaltwin_dataspace.git
cd digitaltwin_dataspace
pip install -e .

Dependencies:

  • Python 3.8+
  • requests
  • SQLAlchemy
  • azure-storage-blob
  • schedule
  • dotenv

(See pyproject.toml for the full list.)


Usage

Command Line Interface

The main entry point is the dt-dataspace CLI:

dt-dataspace --config-folder path/to/config [options]

Key options:

  • --config-folder: Path to the configuration folder (default: config)
  • --init-dependencies: Run all harvesters in dependency order
  • --handlers: List of handler names to run
  • --collectors: List of collector names to run
  • --harvesters: List of harvester names to run
  • --now: Run harvesters or collectors once and exit
  • --port: Port for the handlers server (default: 8888)
  • --host: Host for the handlers server (default: localhost)
  • --allowed-hosts: Allowed hosts for the handlers server
  • --log-level: Set logging level (DEBUG, INFO, etc.)
  • --parquetize: List of harvester names to run for parquet output
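The options above can be mirrored with a small argparse parser, which is handy for checking how a given command line would be interpreted. This is a stand-in sketch, not the package's cli.py: it assumes that list options take space-separated names, that --init-dependencies and --now are plain flags, and that the default log level is INFO. None of these are confirmed by the README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (stand-in, not the real CLI)."""
    parser = argparse.ArgumentParser(prog="dt-dataspace")
    parser.add_argument("--config-folder", default="config")
    # Assumption: boolean flags that take no value.
    parser.add_argument("--init-dependencies", action="store_true")
    parser.add_argument("--now", action="store_true")
    # Assumption: lists are given as space-separated component names.
    parser.add_argument("--handlers", nargs="*", default=[])
    parser.add_argument("--collectors", nargs="*", default=[])
    parser.add_argument("--harvesters", nargs="*", default=[])
    parser.add_argument("--parquetize", nargs="*", default=[])
    parser.add_argument("--port", type=int, default=8888)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--allowed-hosts", nargs="*", default=[])
    # Assumption: INFO as the default level.
    parser.add_argument("--log-level", default="INFO")
    return parser


# "bikes" and "weather" are hypothetical collector names.
args = build_parser().parse_args(
    ["--config-folder", "config", "--collectors", "bikes", "weather", "--now"]
)
print(args.collectors, args.now, args.host, args.port)
```

Run dt-dataspace --help to see the syntax the installed version actually accepts.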

Project Structure

digitaltwin_dataspace/
│
├── components/
│   ├── collector.py   # Base Collector class
│   ├── handler.py     # Base Handler class
│   └── harvester.py   # Base Harvester class
│
├── configuration/
│   ├── load.py        # Loads and parses component configuration
│   └── model.py       # Configuration data models
│
├── data/
│   ├── sync_db.py     # Database sync logic
│   ├── retrieve.py    # Data retrieval utilities
│   └── ...            # Other data management modules
│
├── cli.py             # Command-line interface
└── ...

Components

  • Collector:
    Gathers data from external sources. Implement the Collector abstract class and its run() method.

  • Harvester:
    Processes or transforms collected data. Implement the Harvester abstract class and its run() method.

  • Handler:
    Serves or exposes processed data, e.g., via an API. Implement the Handler abstract class and its run() method.

You can add your own components by subclassing these base classes and registering them in your configuration.
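As a rough sketch of the subclassing pattern, the example below defines local stand-ins for the Collector and Harvester base classes and implements run() on each. The stand-in classes, the StaticCollector and CountHarvester names, the bytes return type, and the argument passed to the harvester's run() are all assumptions made for illustration; check components/collector.py and components/harvester.py for the real signatures.

```python
import json
from abc import ABC, abstractmethod


class Collector(ABC):
    """Local stand-in for the package's base Collector (real signature may differ)."""

    @abstractmethod
    def run(self):
        """Gather data from an external source."""


class Harvester(ABC):
    """Local stand-in for the package's base Harvester (real signature may differ)."""

    @abstractmethod
    def run(self, source):
        """Process or transform previously collected data."""


class StaticCollector(Collector):
    """Hypothetical collector that returns fixed records instead of calling an API."""

    def __init__(self, records):
        self.records = records

    def run(self):
        # Assumption: collected data is handed over as raw bytes.
        return json.dumps(self.records).encode("utf-8")


class CountHarvester(Harvester):
    """Hypothetical harvester that counts the collected records."""

    def run(self, source):
        return {"count": len(json.loads(source))}


collected = StaticCollector([{"id": 1}, {"id": 2}]).run()
result = CountHarvester().run(collected)
print(result)
```

With the real package installed, the same subclasses would inherit from its base classes instead of the stand-ins and be referenced from the configuration by their import path.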


Configuration

Configuration is done via TOML files in your config folder (default: config/).
Each file can define multiple collectors, harvesters, and handlers, specifying:

  • DATA_TYPE, DATA_FORMAT
  • PATH (Python import path to your component)
  • SCHEDULE (optional, for scheduling)
  • SOURCE, DEPENDENCIES (for workflow chaining)
  • Other custom parameters

See digitaltwin_dataspace/configuration/load.py for all supported options.


Author

Gaspard Merten
gaspard@norse.be


License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

