Helix Connect Python SDK

Official Python SDK for Helix Connect Data Marketplace - a secure, scalable platform for exchanging datasets between producers and consumers.

🚀 Features

  • Consumer API: Download and subscribe to datasets
  • Producer API: Upload and manage datasets (includes all consumer features)
  • Admin API: Platform management (includes all producer + consumer features)
  • Secure: AWS SigV4 authentication + AES-256-GCM envelope encryption
  • Efficient: Compress-then-encrypt pipeline with ~90% space savings on text formats such as JSON, CSV, and XML
  • Progress Tracking: Real-time upload/download progress callbacks
  • Notifications: SQS-based dataset update notifications with long-polling
  • Type-Safe: Full type hints with mypy support

📦 Installation

pip install helix-connect

Development Installation

git clone https://github.com/helix-tools/helix-connect-sdk-python.git
cd helix-connect-sdk-python
pip install -e ".[dev]"

🔧 Prerequisites

  • Python 3.8 or higher
  • AWS credentials (provided during customer onboarding)
  • Helix Connect customer ID (UUID format)
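
Because the customer ID is expected in UUID format, it can be worth checking it before constructing a client. The sketch below uses only the standard library; `is_valid_customer_id` is a hypothetical helper, not part of the SDK, and it assumes the canonical hyphenated form is required.

```python
import uuid


def is_valid_customer_id(value: str) -> bool:
    """Return True if value is a canonical hyphenated UUID string."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes, and unhyphenated
    # hex, so compare against the canonical form to reject those variants.
    return str(parsed) == value.lower()


print(is_valid_customer_id("a1b2c3d4-e5f6-7890-abcd-ef1234567890"))
print(is_valid_customer_id("your-customer-id"))
```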

📖 Quick Start

Consumer: Download Datasets

from helix_connect import HelixConsumer

# Initialize consumer
consumer = HelixConsumer(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    customer_id="your-customer-id",
    api_endpoint="https://api.helix-connect.com"  # optional
)

# List available datasets
datasets = consumer.list_datasets()
for ds in datasets:
    print(f"{ds['name']}: {ds['description']}")

# Download a dataset
consumer.download_dataset(
    dataset_id="a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    output_path="./data/my_dataset.csv"
)

# Subscribe to dataset updates
consumer.subscribe_to_dataset(dataset_id="...")

# Poll for notifications (long-polling with auto-download)
notifications = consumer.poll_notifications(
    max_messages=10,
    wait_time=20,  # seconds
    auto_download=True,
    output_dir="./downloads"
)
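
A single `poll_notifications` call performs one long-poll, so a long-running consumer would typically call it in a loop. The following is a minimal sketch of such a loop. `poll_forever` is a hypothetical helper, and it assumes that each poll returns a list of notifications (the return type is not documented above), so it accepts any zero-argument callable rather than depending on the SDK directly.

```python
import time
from typing import Callable, List, Optional


def poll_forever(
    poll_once: Callable[[], List[dict]],
    handle: Callable[[dict], None],
    max_rounds: Optional[int] = None,
    error_backoff: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call poll_once repeatedly and pass each notification to handle.

    Returns the number of notifications handled. max_rounds bounds the
    loop; None means run until interrupted.
    """
    handled = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        try:
            notifications = poll_once() or []
        except Exception as exc:  # keep the loop alive on transient failures
            print(f"poll failed: {exc}; retrying in {error_backoff}s")
            sleep(error_backoff)
            continue
        for note in notifications:
            handle(note)
            handled += 1
    return handled
```

With the consumer from the Quick Start, this could be called as `poll_forever(lambda: consumer.poll_notifications(max_messages=10, wait_time=20), print)`.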

Producer: Upload Datasets

from helix_connect import HelixProducer

# Initialize producer (inherits all consumer capabilities)
producer = HelixProducer(
    aws_access_key_id="your-access-key",
    aws_secret_access_key="your-secret-key",
    customer_id="your-customer-id"
)

# Upload a dataset with progress tracking
def progress_callback(bytes_transferred, total_bytes):
    # Guard against zero-byte files to avoid ZeroDivisionError
    percent = (bytes_transferred / total_bytes) * 100 if total_bytes else 100.0
    print(f"Progress: {percent:.1f}%")

producer.upload_dataset(
    file_path="./data/my_dataset.csv",
    dataset_name="my-awesome-dataset",
    description="Q4 2024 sales data",
    data_freshness="daily",
    progress_callback=progress_callback
)

# Update existing dataset
producer.update_dataset(
    dataset_id="...",
    file_path="./data/updated_dataset.csv"
)

# List your uploaded datasets
my_datasets = producer.list_my_datasets()
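
For large files a callback that prints on every invocation can flood the terminal. The sketch below builds a callback with the same `(bytes_transferred, total_bytes)` signature used above, but reports at most once per percentage step. `make_progress_printer` is a hypothetical helper, not an SDK function.

```python
from typing import Callable


def make_progress_printer(
    step_percent: int = 10,
    emit: Callable[[str], None] = print,
) -> Callable[[int, int], None]:
    """Build a progress callback that reports once per step_percent increment."""
    last_bucket = -1

    def callback(bytes_transferred: int, total_bytes: int) -> None:
        nonlocal last_bucket
        if total_bytes <= 0:
            return  # size unknown or empty file: nothing meaningful to report
        percent = min(100.0, bytes_transferred / total_bytes * 100)
        bucket = int(percent // step_percent)
        if bucket > last_bucket:
            last_bucket = bucket
            emit(f"Progress: {percent:.1f}%")

    return callback
```

The result can be passed as `progress_callback=make_progress_printer(step_percent=5)`.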

Admin: Platform Management

from helix_connect import HelixAdmin

# Initialize admin (inherits producer + consumer capabilities)
admin = HelixAdmin(
    aws_access_key_id="admin-access-key",
    aws_secret_access_key="admin-secret-key",
    customer_id="admin-customer-id"
)

# Create new customer
customer = admin.create_customer(
    customer_name="Acme Corp",
    contact_email="data@acme.com"
)

# List all customers
customers = admin.list_customers()

# Get platform statistics
stats = admin.get_platform_stats()
print(f"Total datasets: {stats['total_datasets']}")
print(f"Total customers: {stats['total_customers']}")

Admin: JWT Token Generation

Generate schema-compliant JWT tokens for testing, development, or service-to-service communication:

from helix_connect import HelixAdmin

admin = HelixAdmin(
    aws_access_key_id="admin-access-key",
    aws_secret_access_key="admin-secret-key",
    customer_id="admin-customer-id"
)

# Generate a user token
token = admin.generate_token(
    sub="user@example.com",
    customer_id="company-123",
    email="user@example.com",
    customer_type="consumer",  # "producer", "consumer", or "both"
    tier="starter",
)

# Generate an admin token (convenience method)
admin_token = admin.generate_admin_token(
    sub="admin@helix.tools",
    customer_id="company-admin",
    email="admin@helix.tools",
    customer_type="both",
)

# Token with custom expiry and all claims
token = admin.generate_token(
    sub="user@example.com",
    customer_id="company-123",
    email="user@example.com",
    customer_type="producer",
    role="user",
    tier="enterprise",
    login_method="oauth",
    expiry_minutes=120,
)

JWT Secret Resolution Order:

  1. Explicit secret argument
  2. HELIX_JWT_SECRET environment variable
  3. SSM Parameter Store (/{env}/customers/{customer_id}/jwt_secret)
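
The resolution order above can be sketched as a simple fallback chain. This is an illustration of the documented order, not the SDK's actual implementation. `resolve_jwt_secret` and the `fetch_from_ssm` callable are hypothetical names; the SSM lookup is injected so the sketch makes no assumptions about how the SDK talks to AWS.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_jwt_secret(
    explicit_secret: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
    fetch_from_ssm: Optional[Callable[[], Optional[str]]] = None,
) -> str:
    """Return the first available secret: argument, then env var, then SSM."""
    if explicit_secret:
        return explicit_secret
    env_secret = env.get("HELIX_JWT_SECRET")
    if env_secret:
        return env_secret
    if fetch_from_ssm is not None:
        ssm_secret = fetch_from_ssm()
        if ssm_secret:
            return ssm_secret
    raise LookupError("no JWT secret found in argument, environment, or SSM")
```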

Token Claims:

  • Required: sub, customer_id, email, customer_type, role, iss, iat, exp
  • Optional: tier, authenticated_at, login_method, nbf
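
When debugging, it can help to inspect which claims a token actually carries. The sketch below assumes that `generate_token` returns a standard compact JWT string (three dot-separated base64url segments), which is not stated above. It decodes the payload without verifying the signature, so it is suitable for inspection only, never for authorization decisions. Both function names are hypothetical.

```python
import base64
import json
from typing import Dict, List

# Required claims as listed above
REQUIRED_CLAIMS = ("sub", "customer_id", "email", "customer_type",
                   "role", "iss", "iat", "exp")


def decode_claims_unverified(token: str) -> Dict[str, object]:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def missing_required_claims(claims: Dict[str, object]) -> List[str]:
    """Return the required claim names that are absent from claims."""
    return [name for name in REQUIRED_CLAIMS if name not in claims]
```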

๐Ÿ—๏ธ Architecture

Class Hierarchy

HelixConsumer (base class)
    ↓
HelixProducer (adds upload capabilities)
    ↓
HelixAdmin (adds platform management)

Each class inherits all capabilities from its parent, so:

  • Producers can also consume data
  • Admins can produce and consume data

Security & Encryption

The SDK implements a compress-then-encrypt pipeline with envelope encryption:

  1. Compression: Gzip compression (configurable levels 1-9)
  2. Envelope Encryption:
    • Generates random 256-bit AES key
    • Encrypts data with AES-256-GCM
    • Encrypts AES key with AWS KMS
    • Packages as: [key_len][encrypted_key][iv][tag][encrypted_data]

This approach:

  • ✅ Supports files of any size: KMS encrypts only the 256-bit data key, so its 4 KB limit does not apply to the data itself
  • ✅ Achieves ~90% space savings through compression
  • ✅ Provides authenticated encryption with GCM
  • ✅ Uses AWS KMS for secure key management
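
To make the package layout `[key_len][encrypted_key][iv][tag][encrypted_data]` concrete, here is a sketch that packs and parses such an envelope with the standard library. The field order comes from the description above; the field widths are assumptions (a 4-byte big-endian `key_len`, a 12-byte IV, and a 16-byte GCM tag) and may differ from what the SDK actually writes. The sketch handles framing only and performs no encryption.

```python
import struct
from typing import NamedTuple

IV_LEN = 12   # assumption: standard 96-bit GCM nonce
TAG_LEN = 16  # assumption: full 128-bit GCM tag


class Envelope(NamedTuple):
    encrypted_key: bytes   # data key, encrypted by KMS
    iv: bytes
    tag: bytes
    encrypted_data: bytes  # AES-256-GCM ciphertext


def pack_envelope(env: Envelope) -> bytes:
    if len(env.iv) != IV_LEN or len(env.tag) != TAG_LEN:
        raise ValueError("unexpected IV or tag length")
    # assumption: key_len is a 4-byte big-endian unsigned integer
    return (struct.pack(">I", len(env.encrypted_key))
            + env.encrypted_key + env.iv + env.tag + env.encrypted_data)


def unpack_envelope(blob: bytes) -> Envelope:
    if len(blob) < 4:
        raise ValueError("envelope too short")
    (key_len,) = struct.unpack_from(">I", blob, 0)
    offset = 4
    if len(blob) < offset + key_len + IV_LEN + TAG_LEN:
        raise ValueError("envelope truncated")
    encrypted_key = blob[offset:offset + key_len]
    offset += key_len
    iv = blob[offset:offset + IV_LEN]
    offset += IV_LEN
    tag = blob[offset:offset + TAG_LEN]
    offset += TAG_LEN
    return Envelope(encrypted_key, iv, tag, blob[offset:])
```

The length prefix is needed because the KMS-encrypted key is variable-length, while the IV and tag have fixed sizes.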

Network Configuration

  • API Timeouts: 10s connect, 30s read (configurable)
  • Download Timeouts: 10s connect, unlimited read (for large files)
  • Credential Validation: Fail-fast with STS on initialization

📚 Examples

See the examples/ directory for usage examples (consumer_example.py, producer_example.py, and admin_example.py).

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=helix_connect --cov-report=html

# Run specific test suite
pytest tests/test_encryption_compression.py -v

# Run standalone pipeline test
python tests/test_pipeline_standalone.py

Test Results

The SDK includes comprehensive tests for the encryption/compression pipeline:

✓ test_compress_data - 90.9% compression on JSON data
✓ test_envelope_encryption_decryption - AES-256-GCM envelope format
✓ test_full_pipeline_compress_then_encrypt - End-to-end verification
✓ test_wrong_order_encrypt_then_compress - Proves old order was broken
✓ 10 tests total, all passing

โš™๏ธ Configuration

Environment Variables

# Required
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export HELIX_CUSTOMER_ID="your-customer-id"

# Optional
export HELIX_API_ENDPOINT="https://api-go.helix.tools"
export HELIX_COMPRESSION_LEVEL="6"  # 1-9, default: 6

Programmatic Configuration

consumer = HelixConsumer(
    aws_access_key_id="...",
    aws_secret_access_key="...",
    customer_id="...",
    api_endpoint="https://api-go.helix.tools",
    region="us-east-1",
    compression_level=6  # 1=fastest, 9=best compression
)
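
The two configuration styles can be bridged with a small loader that turns the environment variables listed above into constructor keyword arguments. This is a sketch: `load_helix_config` is a hypothetical helper, and whether the SDK already reads these variables on its own is not stated here.

```python
import os
from typing import Dict, Mapping, Union


def load_helix_config(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[str, int]]:
    """Build HelixConsumer keyword arguments from environment variables."""
    required = {
        "aws_access_key_id": "AWS_ACCESS_KEY_ID",
        "aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
        "customer_id": "HELIX_CUSTOMER_ID",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing required environment variables: {', '.join(missing)}")
    config: Dict[str, Union[str, int]] = {
        kwarg: env[var] for kwarg, var in required.items()
    }
    if env.get("HELIX_API_ENDPOINT"):
        config["api_endpoint"] = env["HELIX_API_ENDPOINT"]
    # Default of 6 and the 1-9 range follow the values documented above
    level = int(env.get("HELIX_COMPRESSION_LEVEL", "6"))
    if not 1 <= level <= 9:
        raise ValueError("HELIX_COMPRESSION_LEVEL must be between 1 and 9")
    config["compression_level"] = level
    return config
```

A client could then be created with `HelixConsumer(**load_helix_config())`.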

๐Ÿ” Security Best Practices

  1. Never commit credentials to version control
  2. Use environment variables or AWS Secrets Manager
  3. Rotate credentials regularly
  4. Use IAM roles when running on AWS infrastructure
  5. Validate data integrity after downloads
  6. Monitor CloudWatch logs for anomalies
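
For item 5, one common way to validate a download is to compare its SHA-256 digest with a value the producer shares out of band. The sketch below assumes such a digest is available; the SDK is not documented here as publishing one, and both function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large datasets fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected_digest(path: str, expected_hex: str) -> bool:
    # compare_digest runs in constant time regardless of where strings differ
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())
```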

๐Ÿ› Error Handling

The SDK provides specific exceptions for different error scenarios:

from helix_connect.exceptions import (
    AuthenticationError,
    PermissionDeniedError,
    DatasetNotFoundError,
    RateLimitError,
    UploadError,
    DownloadError,
    HelixError  # Base exception
)

try:
    consumer.download_dataset(dataset_id="...", output_path="...")
except AuthenticationError:
    print("Invalid AWS credentials")
except PermissionDeniedError:
    print("No access to this dataset - subscribe first")
except DatasetNotFoundError:
    print("Dataset doesn't exist")
except RateLimitError as e:
    print(f"Rate limit exceeded - retry after {e.retry_after}s")
except HelixError as e:
    print(f"General error: {e}")
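
Since `RateLimitError` exposes `retry_after`, a caller can wait and retry instead of only logging. The sketch below shows one way to do that. `call_with_rate_limit_retry` is a hypothetical helper; the exception class is passed in as a parameter so the sketch runs without the SDK installed, and it falls back to a default delay if `retry_after` is missing.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def call_with_rate_limit_retry(
    func: Callable[[], T],
    rate_limit_error: Type[Exception],
    max_attempts: int = 3,
    default_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, waiting and retrying when rate_limit_error is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except rate_limit_error as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Use the server-suggested delay when present
            delay = getattr(exc, "retry_after", None) or default_delay
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

For example: `call_with_rate_limit_retry(lambda: consumer.download_dataset(dataset_id="...", output_path="..."), RateLimitError)`.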

📊 Performance

Compression Benchmarks

Based on real-world testing with JSON, CSV, and XML data:

Data Type          Original Size   Compressed   Savings
JSON (user data)   92 KB           8 KB         90.9%
CSV (sales data)   150 KB          18 KB        88.0%
XML (config)       45 KB           6 KB         86.7%

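Note: Encrypting before compressing (the order used by the old, broken code) yields roughly 0% compression, because ciphertext is effectively random and leaves gzip no redundancy to exploit.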
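
The effect of ordering is easy to reproduce with the standard library. In the sketch below, `os.urandom` stands in for ciphertext (well-encrypted data is indistinguishable from random bytes), and the JSON records are synthetic, so the exact percentages will differ from the benchmark table. `gzip_savings` is a hypothetical helper.

```python
import gzip
import json
import os


def gzip_savings(data: bytes, level: int = 6) -> float:
    """Fraction of bytes saved by gzip (negative if the output grew)."""
    return 1 - len(gzip.compress(data, compresslevel=level)) / len(data)


# Synthetic, repetitive JSON similar in spirit to user records
records = [{"id": i, "name": f"user{i}", "active": True} for i in range(2000)]
plaintext = json.dumps(records).encode("utf-8")

# Random bytes of the same length stand in for encrypted data
ciphertext_like = os.urandom(len(plaintext))

print(f"JSON plaintext:  {gzip_savings(plaintext):.1%} saved")
print(f"random bytes:    {gzip_savings(ciphertext_like):.1%} saved")
```

Compressing first keeps the savings; compressing after encryption gains nothing and can even add a few bytes of gzip framing.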

Network Performance

  • Chunked uploads: 8MB chunks for large files
  • Parallel downloads: Multi-threaded for multiple datasets
  • Progress callbacks: Real-time feedback without performance impact
  • Connection pooling: Reuses HTTP connections for efficiency
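
Chunked transfer keeps memory use flat regardless of file size. The sketch below shows the general pattern of reading a stream in fixed-size pieces; it is not the SDK's upload code, `iter_chunks` is a hypothetical helper, and it interprets "8MB" as 8 MiB.

```python
from typing import BinaryIO, Iterator

CHUNK_SIZE = 8 * 1024 * 1024  # assumption: "8MB" means 8 MiB


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk
```

Typical use is `with open(path, "rb") as f: for chunk in iter_chunks(f): ...`, where only one chunk is held in memory at a time.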

๐Ÿ› ๏ธ Development

Build & Validate

# Build package
python -m build

# Run build script (includes validation)
./scripts/build.sh

# Lint, format, and type-check code
flake8 helix_connect/
black helix_connect/
mypy helix_connect/

Project Structure

helix-connect-sdk-python/
├── helix_connect/          # SDK source code
│   ├── __init__.py         # Package exports
│   ├── consumer.py         # Consumer API
│   ├── producer.py         # Producer API
│   ├── admin.py            # Admin API
│   └── exceptions.py       # Custom exceptions
├── tests/                  # Test suite
│   ├── test_encryption_compression.py
│   └── test_pipeline_standalone.py
├── examples/               # Usage examples
│   ├── consumer_example.py
│   ├── producer_example.py
│   └── admin_example.py
├── scripts/                # Build scripts
│   └── build.sh
├── pyproject.toml          # Package configuration
└── README.md               # This file

๐Ÿค Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Run tests (pytest)
  4. Commit changes (git commit -m 'Add amazing feature')
  5. Push to branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

Code Standards

  • Style: Follow PEP 8 (enforced by black)
  • Types: Include type hints for all functions
  • Tests: Maintain >80% coverage
  • Docs: Update docstrings for public APIs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🔗 Links

📝 Changelog

v1.0.0 (2024-10-14)

✨ Features

  • Initial release with Consumer, Producer, and Admin APIs
  • AES-256-GCM envelope encryption for unlimited file sizes
  • Compress-then-encrypt pipeline with ~90% space savings
  • Real-time progress tracking for uploads/downloads
  • SQS-based dataset update notifications
  • Long-polling support with auto-download
  • Comprehensive test suite (10 tests, all passing)

🔧 Improvements

  • Network timeouts (API: 30s, Downloads: unlimited)
  • Credential validation on initialization (fail-fast)
  • Proper exception handling throughout
  • Type hints for all public APIs

๐Ÿ› Bug Fixes

  • Fixed KMS 4KB limit with envelope encryption
  • Fixed compress-then-encrypt order (was reversed)
  • Removed all emojis (encoding issues)
  • Fixed bare except clauses

💬 Support

For questions, issues, or feature requests:

๐Ÿ™ Acknowledgments

Built with:


Made with ❤️ by the Helix Tools team
