A comprehensive training service library for AI models in the Nedo Vision platform

Nedo Vision Training Service

A distributed AI model training service for the Nedo Vision platform. It manages training workflows, progress monitoring, and the training lifecycle for computer vision models built on the RF-DETR architecture.

Features

  • Configurable Training Service: Automated training with customizable intervals and parameters
  • gRPC Communication: Reliable communication with the vision manager and other services
  • Distributed Training: Support for multi-GPU and distributed training scenarios
  • Real-time Monitoring: System resource monitoring and training progress tracking
  • Cloud Integration: AWS S3 integration for model storage and dataset management
  • Message Queue Support: RabbitMQ integration for task queue management

Installation

Install the package from PyPI:

pip install nedo-vision-training

For GPU support with CUDA 12.1:

pip install "nedo-vision-training[gpu]" --extra-index-url https://download.pytorch.org/whl/cu121
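
After installing the GPU extra, it can help to confirm that PyTorch actually sees a CUDA device. The following is a minimal sketch and not part of nedo-vision-training: the function name describe_gpu_support is hypothetical, and the only PyTorch calls it relies on are torch.cuda.is_available() and torch.version.cuda.

```python
import importlib.util


def describe_gpu_support():
    """Report whether PyTorch is installed and whether it can see a CUDA device."""
    # Looking up the module spec first avoids importing torch when it is absent.
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda_available": False, "cuda_version": None}

    import torch

    return {
        "torch_installed": True,
        "cuda_available": torch.cuda.is_available(),
        # torch.version.cuda is None for CPU-only builds of PyTorch.
        "cuda_version": torch.version.cuda,
    }


if __name__ == "__main__":
    print(describe_gpu_support())
```

If cuda_available is False on a machine with an NVIDIA GPU, a CPU-only PyTorch wheel or a driver mismatch is a likely cause.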

For development with all tools:

pip install "nedo-vision-training[dev]"

Quick Start

Using the CLI

After installation, you can use the training service CLI:

# Show CLI help
nedo-trainer --help

# Start training service with authentication token
nedo-trainer --token YOUR_TOKEN

# Start with custom server configuration
nedo-trainer --token YOUR_TOKEN --server-host custom.server.com --server-port 60000

# Start with custom system usage reporting interval (in seconds)
nedo-trainer --token YOUR_TOKEN --system-usage-interval 60

# Start with custom latency monitoring interval (in seconds)
nedo-trainer --token YOUR_TOKEN --latency-check-interval 15

Configuration Options

The CLI accepts the following options:

  • --token: Authentication token for secure communication
  • --server-host: gRPC server host (default: localhost)
  • --server-port: gRPC server port (default: 50051)
  • --system-usage-interval: System usage reporting interval in seconds (default: 30)
  • --latency-check-interval: Latency monitoring interval in seconds (default: 10)
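
To make the defaults above concrete, here is a hypothetical argparse parser that mirrors the documented flags. It is a sketch for illustration only, not the parser that nedo-trainer ships; treating --token as required and the numeric options as integers are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented nedo-trainer flags and defaults."""
    parser = argparse.ArgumentParser(prog="nedo-trainer")
    # Assumption: the token is mandatory; the README does not say so explicitly.
    parser.add_argument("--token", required=True, help="Authentication token")
    parser.add_argument("--server-host", default="localhost", help="gRPC server host")
    parser.add_argument("--server-port", type=int, default=50051, help="gRPC server port")
    parser.add_argument("--system-usage-interval", type=int, default=30,
                        help="System usage reporting interval in seconds")
    parser.add_argument("--latency-check-interval", type=int, default=10,
                        help="Latency monitoring interval in seconds")
    return parser


# Options that are not passed keep their documented defaults.
args = build_parser().parse_args(["--token", "YOUR_TOKEN", "--server-port", "60000"])
print(args.server_host, args.server_port,
      args.system_usage_interval, args.latency_check_interval)
# prints: localhost 60000 30 10
```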

Architecture

Core Components

  • TrainingService: Main service orchestrator for training workflows
  • RFDETRTrainer: RF-DETR algorithm implementation with PyTorch backend
  • TrainerLogger: Real-time training progress logging via gRPC
  • ResourceMonitor: System resource monitoring (GPU, CPU, memory)
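
The interval-based reporting that a component like ResourceMonitor performs can be sketched with a background thread. The class below is a hypothetical illustration using only the standard library; PeriodicReporter and sample_usage are made-up names, and the real ResourceMonitor may be structured quite differently.

```python
import threading
import time


class PeriodicReporter:
    """Calls sample() every `interval` seconds and passes the result to report()."""

    def __init__(self, sample, report, interval):
        self._sample = sample
        self._report = report
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns True as soon
        # as stop() is called, which ends the loop promptly.
        while not self._stop.wait(self._interval):
            self._report(self._sample())

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()


def sample_usage():
    """Placeholder sampler; a real one might query CPU, memory, and GPU usage."""
    return {"timestamp": time.time()}


if __name__ == "__main__":
    reporter = PeriodicReporter(sample_usage, print, interval=0.2)
    reporter.start()
    time.sleep(0.7)
    reporter.stop()
```

Injecting the sampler and the reporting callback keeps the timing logic independent of how usage is measured or where it is sent (for example, a gRPC call).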

Dependencies

The service relies on several key technologies:

  • PyTorch: Deep learning framework with CUDA support
  • RF-DETR: Roboflow's Real-time Detection Transformer
  • gRPC: High-performance RPC framework
  • RabbitMQ: Message queue for distributed task management
  • AWS SDK: Cloud storage integration
  • NVIDIA ML: GPU monitoring and management

Development Setup

Troubleshooting

Common Issues

  1. gRPC Connection Timeouts: Check that --server-host and --server-port point to a reachable gRPC server
  2. CUDA Out of Memory: Reduce batch size or use gradient accumulation
  3. Missing Dependencies: Reinstall with pip install --upgrade nedo-vision-training
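
Gradient accumulation, mentioned in item 2, trades memory for extra steps: several small micro-batches are processed one after another and their gradients are summed before a single optimizer update. The plain-Python sketch below shows the arithmetic on a toy model y = w * x; all names are illustrative, and whether nedo-trainer exposes batch size or accumulation settings is not stated here.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients so they match one full-batch gradient."""
    if len(batch) % micro_batch_size != 0:
        raise ValueError("batch size must be a multiple of micro_batch_size")
    steps = len(batch) // micro_batch_size
    total = 0.0
    for i in range(steps):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        # Scaling each micro-batch gradient by 1/steps makes the running sum
        # equal to the mean gradient over the full batch.
        total += grad_mse(w, micro) / steps
    return total


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, data)
accumulated = accumulated_grad(0.5, data, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)
```

In a PyTorch training loop the same idea usually appears as dividing the loss by the number of accumulation steps and calling optimizer.step() only once every that many micro-batches.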

Support

For issues and questions:

  • Check the logs for detailed error information
  • Ensure your token is valid and not expired
  • Verify network connectivity to the training manager
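
A quick way to check network connectivity is to attempt a plain TCP connection to the configured host and port. The helper below is a minimal sketch using only the standard library; can_reach is a hypothetical name. A successful connection shows only that the port is open, not that the gRPC service is healthy or that the token is accepted.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # localhost and 50051 are the documented defaults for --server-host and --server-port.
    print(can_reach("localhost", 50051))
```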

License

This project is part of the Nedo Vision platform. Please refer to the main project license for usage terms.
