Juniper Data

Dataset generation and management service for the Juniper ecosystem.

Overview

Juniper Data provides a centralized service for generating, storing, and serving datasets used by the Juniper neural network projects. It ships eight generator types (listed under Project Structure), including the classic two-spiral classification problem.

Ecosystem Compatibility

This service is part of the Juniper ecosystem. Verified compatible versions:

juniper-data  juniper-cascor  juniper-canopy  data-client  cascor-client  cascor-worker
0.4.x         0.3.x           0.2.x           >=0.3.1      >=0.1.0        >=0.1.0

For full-stack Docker deployment and integration tests, see juniper-deploy.

Architecture

JuniperData is the foundational data layer of the Juniper ecosystem. JuniperCascor and juniper-canopy both call JuniperData to generate and retrieve datasets.

┌─────────────────────┐     REST+WS      ┌──────────────────────┐
│   juniper-canopy    │ ◄──────────────► │    JuniperCascor     │
│   Dashboard         │                  │    Training Svc      │
│   Port 8050         │                  │    Port 8200         │
└──────────┬──────────┘                  └──────────┬───────────┘
           │ REST                                    │ REST
           ▼                                         ▼
┌──────────────────────────────────────────────────────────────┐
│                      JuniperData  ◄── (this service)          │
│                   Dataset Service  ·  Port 8100               │
└──────────────────────────────────────────────────────────────┘

Data contract: datasets are served as NPZ archives with keys X_train, y_train, X_test, y_test, X_full, y_full (all float32).
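A consumer can check a downloaded artifact against this contract before training on it. The sketch below assumes NumPy is installed. `validate_npz` is a hypothetical helper, not part of juniper-data, and the small in-memory archive stands in for a real artifact.

```python
import io

import numpy as np

# Keys named in the data contract above.
REQUIRED_KEYS = ("X_train", "y_train", "X_test", "y_test", "X_full", "y_full")


def validate_npz(source) -> dict:
    """Load an NPZ archive and check it against the data contract.

    `source` is a file path or a binary file-like object.
    Hypothetical helper, not part of the juniper-data package.
    """
    with np.load(source) as archive:
        missing = [key for key in REQUIRED_KEYS if key not in archive.files]
        if missing:
            raise ValueError(f"missing keys: {missing}")
        # Indexing reads each array into memory before the archive closes.
        arrays = {key: archive[key] for key in REQUIRED_KEYS}
    wrong = [key for key, arr in arrays.items() if arr.dtype != np.float32]
    if wrong:
        raise ValueError(f"non-float32 arrays: {wrong}")
    return arrays


# Stand-in for a downloaded artifact: a tiny archive built in memory.
buffer = io.BytesIO()
placeholder = {key: np.zeros((4, 2), dtype=np.float32) for key in REQUIRED_KEYS}
np.savez(buffer, **placeholder)
buffer.seek(0)
arrays = validate_npz(buffer)
```

Failing fast on a missing key or an unexpected dtype tends to give a clearer error than a shape or type mismatch deep inside a training loop.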

Related Services

Service              Relationship                                 Environment Variable
juniper-cascor       Consumes JuniperData for training datasets   JUNIPER_DATA_URL=http://localhost:8100
juniper-canopy       Consumes JuniperData for visualization data  JUNIPER_DATA_URL=http://localhost:8100
juniper-data-client  PyPI client library for this service         pip install juniper-data-client

Service Configuration

Variable                Default  Description
JUNIPER_DATA_HOST       0.0.0.0  Listen address
JUNIPER_DATA_PORT       8100     Service port
JUNIPER_DATA_LOG_LEVEL  INFO     Log verbosity
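As an illustration of how these variables and their defaults fit together, here is a minimal standard-library sketch. `ServiceSettings` and `load_settings` are hypothetical names used for this example only; the service's own settings code may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSettings:
    """Hypothetical container for the three documented variables."""

    host: str
    port: int
    log_level: str


def load_settings(env=None) -> ServiceSettings:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ServiceSettings(
        host=env.get("JUNIPER_DATA_HOST", "0.0.0.0"),
        port=int(env.get("JUNIPER_DATA_PORT", "8100")),
        log_level=env.get("JUNIPER_DATA_LOG_LEVEL", "INFO"),
    )


# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"JUNIPER_DATA_PORT": "9000"})
print(settings)
```

Any variable that is not set keeps its default, so overriding only the port leaves the listen address and log level unchanged.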

Docker Deployment

# Full stack with all three services:
git clone https://github.com/pcalnon/juniper-deploy.git  # (private repository)
cd juniper-deploy && docker compose up --build

Dependency Lockfile

The requirements.lock file pins exact dependency versions for reproducible Docker builds. The pyproject.toml retains flexible >= ranges for local development.

Regenerate after changing dependencies in pyproject.toml:

uv pip compile pyproject.toml --extra api --extra observability -o requirements.lock

Installation

Basic Installation

pip install -e .

With API Support

pip install -e ".[api]"

Development Installation

pip install -e ".[dev]"

Full Installation

pip install -e ".[all]"

Quick Start

Generate a Spiral Dataset

from juniper_data.generators.spiral import SpiralGenerator

generator = SpiralGenerator()
dataset = generator.generate(n_points=100, n_spirals=2, noise=0.1)
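
For readers unfamiliar with the two-spiral problem, the standalone sketch below shows one common way to build such data with only the standard library. It is not the SpiralGenerator implementation: the function name, the `turns` and `seed` parameters, and the interpretation of `n_points` (per spiral) and `noise` (standard deviation of Gaussian jitter) are assumptions made for illustration.

```python
import math
import random


def make_spirals(n_points=100, n_spirals=2, noise=0.1, turns=1.5, seed=0):
    """Return (points, labels) for interleaved Archimedean spirals.

    Illustrative only; the library's generator may differ in every detail.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label in range(n_spirals):
        # Rotate each arm by an equal share of the full circle.
        offset = 2 * math.pi * label / n_spirals
        for i in range(n_points):
            t = (i + 1) / n_points  # radius grows linearly up to 1
            angle = turns * 2 * math.pi * t + offset
            x = t * math.cos(angle) + rng.gauss(0, noise)
            y = t * math.sin(angle) + rng.gauss(0, noise)
            points.append((x, y))
            labels.append(label)
    return points, labels


points, labels = make_spirals(n_points=100, n_spirals=2, noise=0.1)
print(len(points), sorted(set(labels)))
```

Because the arms wind around each other, no straight line separates the classes, which is why the problem is a popular benchmark for constructive networks.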

Start the API Server

uvicorn juniper_data.api.app:app --reload

API Endpoints

Endpoint                      Method  Description
/v1/health                    GET     Health check
/v1/health/live               GET     Liveness probe
/v1/health/ready              GET     Readiness probe (checks storage)
/v1/generators                GET     List all generators with schemas
/v1/generators/{name}/schema  GET     Get parameter schema for a generator
/v1/datasets                  POST    Create dataset (or return cached dataset)
/v1/datasets                  GET     List dataset IDs
/v1/datasets/filter           GET     Filter metadata by generator/tags/date/name/version
/v1/datasets/stats            GET     Aggregate dataset statistics
/v1/datasets/versions         GET     List all versions for a logical dataset name
/v1/datasets/latest           GET     Get latest version for a logical dataset name
/v1/datasets/batch-create     POST    Create multiple datasets
/v1/datasets/batch-delete     POST    Delete multiple datasets
/v1/datasets/batch-tags       PATCH   Update tags on multiple datasets
/v1/datasets/batch-export     POST    Export multiple datasets as ZIP
/v1/datasets/cleanup-expired  POST    Delete expired datasets
/v1/datasets/{id}             GET     Get dataset metadata
/v1/datasets/{id}             DELETE  Delete a dataset
/v1/datasets/{id}/artifact    GET     Download NPZ artifact
/v1/datasets/{id}/preview     GET     Preview first N samples as JSON
/v1/datasets/{id}/tags        PATCH   Add/remove tags on one dataset

See docs/api/JUNIPER_DATA_API.md for full endpoint documentation including filtering, batch operations, and tagging.
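
The read-only endpoints can be reached with nothing more than the standard library. The sketch below is a rough starting point, not the juniper-data-client API: the helper names are hypothetical, `abc123` is a placeholder dataset ID, and no response schema is assumed. The two functions that perform HTTP requests are defined but not called, so the block runs without a live server.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8100"  # default address from the sections above


def endpoint(path: str, base_url: str = BASE_URL, **query) -> str:
    """Join the base URL, a /v1 path, and optional query parameters."""
    url = f"{base_url.rstrip('/')}/v1/{path.lstrip('/')}"
    return f"{url}?{urlencode(query)}" if query else url


def artifact_url(dataset_id: str, base_url: str = BASE_URL) -> str:
    """URL of the NPZ artifact for one dataset ID."""
    return endpoint(f"datasets/{quote(dataset_id, safe='')}/artifact", base_url)


def get_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the JSON body (performs a real HTTP request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def download_artifact(dataset_id: str, dest: str, base_url: str = BASE_URL) -> None:
    """Save the NPZ artifact to `dest` (performs a real HTTP request)."""
    with urllib.request.urlopen(artifact_url(dataset_id, base_url)) as response:
        with open(dest, "wb") as out:
            out.write(response.read())


# URL construction only; nothing is sent over the network here.
print(endpoint("health/ready"))
print(artifact_url("abc123"))
```

With the service running, `get_json(endpoint("health/ready"))` would query the readiness probe and `download_artifact("abc123", "dataset.npz")` would fetch an artifact, given a real dataset ID.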

Named Dataset Versioning

POST /v1/datasets supports logical names for versioned datasets:

  • Set name to group related datasets into a version series.
  • Persisted creates with the same name auto-increment meta.dataset_version (1, 2, 3, ...).
  • Repeating an identical request returns the cached dataset and keeps its existing version.
  • Use GET /v1/datasets/versions?name=<dataset_name> to view history and GET /v1/datasets/latest?name=<dataset_name> to resolve the latest.
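
The rules above can be modeled with a toy in-memory store. This is an illustration of the described behavior, not the service's storage layer: the class, the request shape (`generator`, `params`), and the way identical requests are detected are all assumptions.

```python
import json


class VersionedStore:
    """Toy in-memory model of the named-versioning rules."""

    def __init__(self):
        self._cache = {}     # request fingerprint -> stored record
        self._versions = {}  # dataset name -> records, oldest first

    def create(self, name: str, generator: str, params: dict) -> dict:
        # Identical requests share a fingerprint and return the cached record.
        key = json.dumps([name, generator, params], sort_keys=True)
        if key in self._cache:
            return self._cache[key]
        history = self._versions.setdefault(name, [])
        record = {
            "name": name,
            "generator": generator,
            "params": params,
            "dataset_version": len(history) + 1,  # 1, 2, 3, ...
        }
        history.append(record)
        self._cache[key] = record
        return record

    def versions(self, name: str) -> list:
        """Version numbers recorded for a name, oldest first."""
        return [r["dataset_version"] for r in self._versions.get(name, [])]

    def latest(self, name: str) -> dict:
        """Most recently created record for a name."""
        return self._versions[name][-1]


store = VersionedStore()
v1 = store.create("spirals", "spiral", {"n_points": 100})
v2 = store.create("spirals", "spiral", {"n_points": 200})
again = store.create("spirals", "spiral", {"n_points": 100})  # cached, no new version
print(v1["dataset_version"], v2["dataset_version"], again["dataset_version"])
```

In this model, changing any parameter under the same name produces a new version, while repeating an earlier request returns the earlier record unchanged.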

Project Structure

juniper-data/
├── juniper_data/
│   ├── core/           # Core functionality and base classes
│   ├── generators/     # Dataset generators (8 types)
│   │   ├── spiral/     # Multi-spiral classification
│   │   ├── xor/        # XOR classification
│   │   ├── gaussian/   # Mixture of Gaussians
│   │   ├── circles/    # Concentric circles
│   │   ├── checkerboard/ # 2D checkerboard pattern
│   │   ├── csv_import/ # CSV/JSON file import
│   │   ├── mnist/      # MNIST / Fashion-MNIST
│   │   └── arc_agi/    # ARC-AGI visual reasoning
│   ├── storage/        # Dataset persistence layer
│   ├── api/            # FastAPI application
│   │   └── routes/     # API route handlers
│   └── tests/          # Test suite
│       ├── unit/       # Unit tests
│       └── integration/ # Integration tests
├── pyproject.toml      # Project configuration
└── README.md           # This file

Development

Running Tests

pytest

Running Tests with Coverage

pytest --cov=juniper_data --cov-report=html

Code Formatting

ruff format juniper_data tests
ruff check --fix juniper_data tests

Type Checking

mypy juniper_data

Juniper Ecosystem

Repository             Description
juniper-data           Dataset generation service (this repo)
juniper-cascor         CasCor neural network training service
juniper-canopy         Real-time monitoring dashboard
juniper-data-client    PyPI: juniper-data-client
juniper-cascor-client  PyPI: juniper-cascor-client
juniper-cascor-worker  PyPI: juniper-cascor-worker

License

MIT License - Copyright (c) 2024-2026 Paul Calnon

