
🚀 Data Proxy Service (data-stream)

A Python-based tool that allows you to stream data from a remote server to your local compute resources. This service is particularly useful when you need to train models on large datasets stored on a remote server but don't have sufficient storage on your local compute node.

This package wraps the sshtunnel library and uses FastAPI to run a simple HTTP server that streams the data.

✨ Features

  • 🔒 Stream data securely from a remote server using SSH tunneling
  • 📝 Support for SSH config aliases and direct SSH parameters
  • ⚡ FastAPI-powered HTTP endpoint for data access
  • 🤖 Automatic management of remote Python HTTP server
  • 🏥 Health check endpoint for monitoring
  • 🔑 Support for both SSH key and password authentication
  • ⚙️ Configurable ports for local and remote connections
  • 🛑 Graceful shutdown handling

📦 Installation

Install the package using pip:

pip install data-stream

Alternatively, clone the repository and install from source:

   git clone https://github.com/yourusername/data-proxy-service.git
   cd data-proxy-service
   pip install -e .

🔧 Usage: Command-line Interface

To start the Data Proxy Service, use one of the following methods:

1. Using SSH Config Alias 📋

If you have an SSH config file (~/.ssh/config) with your server details:

data-stream --ssh-host-alias myserver --data-path /path/to/remote/data

Here is an example of an SSH config file:

Host myserver
    HostName example.com
    User mouloud
    IdentityFile ~/.ssh/id_rsa

2. Using Direct SSH Parameters 🔑

data-stream \
  --ssh-host example.com \
  --ssh-username myusername \
  --ssh-key-path ~/.ssh/id_rsa \
  --data-path /path/to/remote/data

Optional Parameters ⚙️

  • --local-port: Local port for SSH tunnel (default: 8000)
  • --remote-port: Remote port for HTTP server (default: 8001)
  • --fastapi-port: FastAPI server port (default: 5001)
  • --ssh-password: SSH password (if not using key-based authentication)

Example with all parameters:

data-stream \
  --ssh-host example.com \
  --ssh-username john \
  --data-path /home/john/datasets \
  --ssh-key-path ~/.ssh/id_rsa \
  --local-port 8000 \
  --remote-port 8001 \
  --fastapi-port 5000

3. Using Environment Variables 🔧

You can also configure the service using environment variables:

  • PROXY_SSH_HOST_ALIAS: SSH host alias (for SSH config)
  • PROXY_SSH_HOST: SSH host (remote server hostname)
  • PROXY_SSH_USERNAME: SSH username
  • PROXY_DATA_PATH: Path to the data on the remote server
  • PROXY_SSH_KEY_PATH: Path to SSH key
  • PROXY_SSH_PASSWORD: SSH password (if not using key)
  • PROXY_LOCAL_PORT: Local port for SSH tunnel
  • PROXY_REMOTE_PORT: Remote port for HTTP server
  • PROXY_FASTAPI_PORT: FastAPI server port
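For example, the SSH-config-alias invocation shown earlier could equivalently be written as a small shell fragment (a sketch; the variable names are the ones listed above, and the paths are placeholders):

```shell
# Equivalent to: data-stream --ssh-host-alias myserver --data-path /path/to/remote/data
export PROXY_SSH_HOST_ALIAS=myserver
export PROXY_DATA_PATH=/path/to/remote/data
export PROXY_FASTAPI_PORT=5000   # optional override of the default (5001)

data-stream
```

This is convenient in SLURM scripts, where the exports can live in a shared environment file.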

🖥️ HPC Usage

When using data-stream on an HPC (High-Performance Computing) system:

⚠️ Important: Always start the service on a compute node, not on the login node. Login nodes are shared resources and aren't suitable for running services.

Example using SLURM:

#!/bin/bash
#SBATCH --job-name=data-stream
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=24:00:00

data-stream \
  --ssh-host-alias myserver \
  --data-path /path/to/remote/data

📊 Integration Examples

WebDataset Integration 📦

data-stream works seamlessly with WebDataset for efficient data loading in machine learning pipelines:

import webdataset as wds
from torch.utils.data import DataLoader

# Start the data-stream service first (as shown above)

# Create a WebDataset pipeline over the proxied tar shards
dataset = wds.WebDataset('http://localhost:5000/data/path/to/tarfiles/{000000..999999}.tar')

# batch_size=None: WebDataset handles sample assembly itself
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)

# Each sample is a dict keyed by the file extensions inside the tar;
# add .decode(...) / .to_tuple(...) to the pipeline as your data requires
for sample in dataloader:
    # Your training code here
    pass

📂 Accessing Data

Once the service is running, files under your remote data path are available at (here assuming --fastapi-port 5000, as in the example above; the default is 5001):

http://localhost:5000/data/path/to/file

You can test the stream by downloading a file (replace shard_0001.tar with a file that actually exists under your data path):

curl http://localhost:5000/data/shard_0001.tar -o test.tar
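The same transfer can be done from Python with only the standard library, streaming to disk without loading the file into memory (a sketch; the service must already be running, and the file name is illustrative):

```python
import shutil
import urllib.request


def download_file(remote_rel_path: str, dest_path: str,
                  base_url: str = "http://localhost:5000") -> None:
    """Stream one file from the proxy's /data endpoint to a local file."""
    url = f"{base_url}/data/{remote_rel_path}"
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)  # chunked copy, constant memory
```

For example, `download_file("shard_0001.tar", "test.tar")` mirrors the curl command above.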

🏥 Health Check

You can verify the service status using:

curl http://localhost:5000/health

This will return:

{
  "status": "OK",
  "connection": {
    "hostname": "example.com",
    "username": "myusername",
    "using_ssh_config": true
  }
}
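If you are scripting against the service, the same check can be done from Python with the standard library (a sketch; assumes the service is reachable on localhost:5000):

```python
import json
import urllib.request


def check_health(base_url: str = "http://localhost:5000",
                 timeout: float = 5.0) -> dict:
    """Fetch the service's /health endpoint and return the parsed JSON payload."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode())
```

Once the tunnel is up, `check_health()["status"]` should be "OK"; polling this in a loop is a simple way to wait for the service before starting training.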

🐍 Using as a Python Package

You can also use data-stream in your Python code:

import asyncio

from data_stream import DataProxyService, Settings

# Using an SSH config alias
settings = Settings(
    ssh_host_alias="myserver",
    data_path="/path/to/remote/data"
)

# Or using direct parameters
settings = Settings(
    ssh_host="example.com",
    ssh_username="myusername",
    ssh_key_path="~/.ssh/id_rsa",
    data_path="/path/to/remote/data"
)

# start() and stop() are coroutines, so run them inside an event loop
async def main():
    service = DataProxyService(settings)
    await service.start()
    try:
        ...  # use the service
    finally:
        await service.stop()

asyncio.run(main())

📋 Requirements

  • Python 3.7+
  • SSH access to the remote server
  • Python installation on the remote server

🔧 Troubleshooting

Common Issues

  1. 🚫 Permission Denied

    • Verify your username and SSH key are correct
    • Check if your user has access to the data directory on the remote server
  2. ⚠️ Port Already in Use

    • Try different ports using --local-port, --remote-port, or --fastapi-port
    • Check if another instance of data-stream is already running
    • On HPC, ensure no other jobs are using the same ports (which is why it is important to run on a compute node)
  3. 🔌 Remote Server Issues

    • Ensure Python is installed on the remote server
    • Check if the data path exists and is accessible
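For the "port already in use" case, a quick way to check a port before launching is a small standard-library helper (an illustrative sketch, not part of the package):

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0
```

For example, check `port_in_use(8000)` before passing --local-port 8000, and pick a different port if it returns True.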

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
