Automated connector wrapper for streaming data securely from a private MinIO Data Lake

Project description

Terrafox Data Lake Connector

A small wrapper module for securely streaming files from your private data lake into remote notebook runtimes.

terrafox-datalake

A lightweight connector for streaming datasets securely from private MinIO and other S3-compatible data lakes directly into pandas DataFrames.

Instead of mapping the bucket as a file system through s3fs/fsspec, this package reads objects directly with boto3. The goal is to avoid the background directory scanning those wrappers perform, which can trigger 403 Forbidden credential collisions, and to work around network-edge bottlenecks and Cloudflare proxy payload limits.
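The direct-read approach described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the helper name `read_csv_object` is hypothetical, and it assumes only boto3's documented `get_object` call and the fact that `pandas.read_csv` accepts a file-like object.

```python
import pandas as pd


def read_csv_object(client, bucket, key, **read_csv_kwargs):
    """Read one CSV object into pandas without listing the bucket.

    `client` is any object exposing boto3's S3 `get_object` call.
    (Hypothetical helper, not part of terrafox-datalake.)
    """
    response = client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a file-like stream; pandas pulls bytes from it
    # through .read() while parsing, so no bucket listing is needed.
    return pd.read_csv(response["Body"], **read_csv_kwargs)


# Usage with a real client (needs network access and valid credentials):
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="https://minio.terrafoxai.com")
#   df = read_csv_object(s3, "bigdata", "vehicles.csv")
```

Passing `chunksize=...` through `read_csv_kwargs` makes pandas return an iterator of smaller DataFrames instead of one large one, which is what keeps memory bounded for multi-gigabyte files.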


Key Features

  • Stream-Native Engine: Reads multi-gigabyte datasets (e.g., CSVs of 1.3 GiB or more) sequentially in byte-stream network chunks, keeping memory consumption low on your local machine or in Google Colab.
  • Bypasses Proxy Blocks: Sidesteps common reverse-proxy constraints, such as the 100 MiB client max body size limit of Cloudflare Tunnel, during reads.
  • Fully Universal & Repurposable: No hardcoded endpoints. Works out of the box with your configured defaults, or targets any custom local or cloud data lake cluster at run time.
  • Zero Configuration Conflict: Keeps botocore configuration arguments, addressing style, and signature parameters out of your notebooks.
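For context, the kind of botocore configuration the last bullet refers to might look like the sketch below. The helper names and the specific values (SigV4 signing, path-style addressing, region `us-east-1`) are assumptions based on a typical MinIO setup, not settings confirmed by this package.

```python
def s3_client_kwargs(endpoint, user, password):
    """Collect keyword arguments for boto3.client("s3", ...). Hypothetical helper."""
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": user,
        "aws_secret_access_key": password,
        "region_name": "us-east-1",  # assumption: MinIO's default region
    }


# Options for botocore.config.Config (assumed values, typical for MinIO):
# Signature Version 4 and path-style bucket addressing.
CONFIG_OPTIONS = {"signature_version": "s3v4", "s3": {"addressing_style": "path"}}


def make_client(endpoint, user, password):
    """Build an S3 client pointed at a MinIO endpoint."""
    # Imported lazily so the helpers above can be inspected without boto3 installed.
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3",
        config=Config(**CONFIG_OPTIONS),
        **s3_client_kwargs(endpoint, user, password),
    )
```

Path-style addressing keeps the bucket name in the URL path rather than the hostname, which is usually what a single-host MinIO deployment expects.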

Terrafox Data Lake

A lightweight Python package for securely connecting to and streaming data from private MinIO-based data lake environments.

Installation

pip install terrafox-datalake

Quick Start

1. Connecting Natively via Interactive Prompt

If no background credentials are found, calling connect() will securely prompt you for your data lake credentials.

import terrafox_datalake as dl

# Initialize the data lake client context securely
dl.connect()
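The fallback behavior described above (use stored credentials if present, otherwise prompt) can be pictured roughly as follows. This is an illustrative sketch only; `resolve_credentials` is a hypothetical name, and the package's real lookup order and prompt text may differ.

```python
import getpass
import os


def resolve_credentials(env=None, ask=input, ask_secret=getpass.getpass):
    """Return (user, password), prompting only for values missing from `env`.

    Hypothetical helper; the variable names match the ones used in this README.
    """
    env = os.environ if env is None else env
    user = env.get("MINIO_USER") or ask("MinIO user: ")
    # getpass hides the typed password instead of echoing it in the notebook.
    password = env.get("MINIO_PASSWORD") or ask_secret("MinIO password: ")
    return user, password
```

Taking the prompt functions as parameters keeps the sketch easy to exercise in scripts where no terminal is attached.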

2. Silent Credentials Injection (Automated Workflows)

For automated scripts, CI/CD pipelines, headless environments, or to bypass the interactive login prompt in Google Colab, set your credentials as environment variables before initializing the connection.

import os
import terrafox_datalake as dl

# Pre-populate session credentials
os.environ["MINIO_USER"] = "admin"
os.environ["MINIO_PASSWORD"] = "your_secure_password"
os.environ["MINIO_ENDPOINT"] = "https://minio.terrafoxai.com"

# Initialize the connection
dl.connect()
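In a CI job or other headless run, an interactive prompt would stall the job, so it can help to fail early when a variable is missing. The check below is a small sketch under the assumption that all three variables shown above are required; the package may in fact fall back to a default endpoint. `missing_credentials` is a hypothetical helper.

```python
import os

# Variable names taken from the example above; treating all three as
# mandatory is an assumption made for this sketch.
REQUIRED_VARS = ("MINIO_USER", "MINIO_PASSWORD", "MINIO_ENDPOINT")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Typical use before connecting in an automated script:
#   missing = missing_credentials()
#   if missing:
#       raise SystemExit("Missing environment variables: " + ", ".join(missing))
#   dl.connect()
```

In practice the values would come from the CI system's secret store rather than being written into the script.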

3. Advanced Usage: Connecting to Different Infrastructures

The endpoint is not fixed. Pass endpoint to connect() to switch between production environments, staging clusters, or local development instances.

import terrafox_datalake as dl

# Connect to an alternate cluster or local MinIO instance
dl.connect(endpoint="https://local-testing-cluster.local:9000")

# Read data from a different environment
df = dl.read_csv(
    bucket="test-bucket",
    key="metrics.csv"
)
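When several clusters are in regular use, one option is to keep their endpoints in a single lookup table and select one by name. The sketch below is illustrative: the environment labels and the `endpoint_for` helper are hypothetical, and only the two URLs come from the examples in this README.

```python
# Hypothetical mapping of environment labels to endpoints.
ENDPOINTS = {
    "production": "https://minio.terrafoxai.com",
    "local": "https://local-testing-cluster.local:9000",
}


def endpoint_for(name):
    """Look up an endpoint by label, with a readable error for typos."""
    try:
        return ENDPOINTS[name]
    except KeyError:
        known = ", ".join(sorted(ENDPOINTS))
        raise ValueError(
            f"unknown environment {name!r}; expected one of: {known}"
        ) from None


# Usage:
#   dl.connect(endpoint=endpoint_for("local"))
```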

Example: Reading Data from a Data Lake

import terrafox_datalake as dl

dl.connect()

df = dl.read_csv(
    bucket="bigdata",
    key="vehicles.csv"
)

print(df.head())

Architecture Requirements

  • Python: 3.7 or higher
  • Supported Storage: MinIO (S3-compatible object storage)

Dependencies

  • pandas
  • boto3
  • s3fs
  • fsspec

Features

  • Secure interactive authentication
  • Environment variable support for automation
  • Native MinIO integration
  • S3-compatible object storage access
  • Simple DataFrame-based data retrieval
  • Flexible infrastructure switching between environments

License

MIT License

Download files

Source Distribution

terrafox_datalake-0.1.3.tar.gz (3.7 kB view details)

Uploaded Source

Built Distribution

terrafox_datalake-0.1.3-py3-none-any.whl (3.9 kB view details)

Uploaded Python 3

File details

Details for the file terrafox_datalake-0.1.3.tar.gz.

File metadata

  • Download URL: terrafox_datalake-0.1.3.tar.gz
  • Size: 3.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for terrafox_datalake-0.1.3.tar.gz
Algorithm Hash digest
SHA256 4cbf9bbb55b0d83442c753066bfa94c9d0718a2453ae099387412b68a3fbd020
MD5 1b3b9f150cc7d4ba5523c537ba8c783c
BLAKE2b-256 f197977e195c28e8d5a7a6f3aa434c1a40525ebf007a0151c6d9bbe672c88bc7
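
To check a downloaded file against the SHA256 digest listed above, the digest can be recomputed locally with Python's standard hashlib module. This is a generic sketch; `sha256_of` is a hypothetical helper, and the file path in the usage comment assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes object read() returns at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage:
#   expected = "4cbf9bbb55b0d83442c753066bfa94c9d0718a2453ae099387412b68a3fbd020"
#   assert sha256_of("terrafox_datalake-0.1.3.tar.gz") == expected
```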

File details

Details for the file terrafox_datalake-0.1.3-py3-none-any.whl.

File hashes

Hashes for terrafox_datalake-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 77dc1db0de8ee841c3f24fc0ea75f0362e616956c375beaf1e1eff59dea41c05
MD5 e73611196b39c5a8fb2a13a2c6f37021
BLAKE2b-256 c2c138d5e20116b39edf4be6d0ddf01a9e0cb520e5e7c9f5a2d791747aca7bd0
