Automated connector wrapper for streaming data securely from a private MinIO Data Lake

Terrafox Data Lake Connector

A simple, secure wrapper module to stream files out of your private data lake into remote notebook runtimes seamlessly.

terrafox-datalake

A lightweight, universal, stream-native connector wrapper designed to stream datasets securely from private MinIO and S3-compatible Data Lakes straight into Pandas dataframes.

By streaming objects directly with boto3 instead of mapping the bucket as a file system (s3fs/fsspec), this package is designed to avoid network edge bottlenecks, Cloudflare proxy payload limits, and 403 Forbidden credential collisions caused by background directory scanning.
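As a rough illustration of the idea, direct object streaming amounts to a single GetObject request whose body is handed to pandas, with no bucket listing involved. The sketch below is not this package's source: `read_csv_from_object` and `FakeS3Client` are hypothetical names, and `client` is assumed to be a boto3 S3 client or anything with a compatible `get_object` method.

```python
import io

import pandas as pd


def read_csv_from_object(client, bucket, key, **read_csv_kwargs):
    """Fetch one object with a single GetObject call and parse it as CSV.

    Hypothetical helper for illustration; `client` is assumed to behave
    like a boto3 S3 client.
    """
    response = client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a file-like stream, so pandas can read from it
    # directly; no directory listing is requested along the way.
    return pd.read_csv(response["Body"], **read_csv_kwargs)


class FakeS3Client:
    """In-memory stand-in so the sketch runs without a live endpoint."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


client = FakeS3Client({("bigdata", "vehicles.csv"): b"id,make\n1,Saab\n2,Volvo\n"})
df = read_csv_from_object(client, "bigdata", "vehicles.csv")
print(df.shape)
```

With a real boto3 client in place of the stand-in, the same function would issue one request per file, which is the behavior the paragraph above contrasts with file-system style wrappers.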


Key Features

  • Stream-Native Engine: Reads multi-gigabyte datasets (e.g., 1.3 GiB+ CSVs) sequentially as a chunked byte stream, which helps keep memory consumption low on a local machine or in Google Colab.
  • Works Around Proxy Limits: Designed to sidestep common reverse-proxy constraints (such as the 100 MiB client max body size limit on a Cloudflare Tunnel) during reads.
  • Fully Universal & Repurposable: No hardcoded endpoints. Works out of the box with your configured defaults, or can target any custom local or cloud data lake cluster at connect time.
  • Zero Configuration Conflict: Keeps botocore configuration arguments, addressing-style settings, and signature parameters out of your notebooks.
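To give a sense of what the last point hides, here is a minimal sketch of the kind of boto3 setup a MinIO-style endpoint typically needs. The function names, the path-style addressing, the SigV4 signature setting, and the region are all assumptions for illustration; the package's actual internal configuration is not documented here.

```python
def build_client_kwargs(endpoint, access_key, secret_key):
    """Collect the settings a MinIO-style endpoint typically needs."""
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        # Assumed values: path-style addressing and SigV4 are common choices
        # for MinIO, but this package's real settings may differ.
        "addressing_style": "path",
        "signature_version": "s3v4",
        # Placeholder region; many S3-compatible servers accept this default.
        "region_name": "us-east-1",
    }


def make_s3_client(endpoint, access_key, secret_key):
    """Create a boto3 S3 client from the kwargs above (imports boto3 lazily)."""
    import boto3
    from botocore.config import Config

    kwargs = build_client_kwargs(endpoint, access_key, secret_key)
    # Addressing style and signature version travel inside a botocore Config
    # object rather than as direct client arguments.
    config = Config(
        signature_version=kwargs.pop("signature_version"),
        s3={"addressing_style": kwargs.pop("addressing_style")},
    )
    return boto3.client("s3", config=config, **kwargs)
```

Keeping this in one place means a notebook only has to supply an endpoint and credentials.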

Installation

pip install terrafox-datalake

Quick Start

1. Connecting Natively via Interactive Prompt

If no credentials are found in the environment, calling connect() will securely prompt you for your data lake credentials.

import terrafox_datalake as dl

# Initialize the data lake client context securely
dl.connect()

2. Silent Credentials Injection (Automated Workflows)

For automated scripts, CI/CD pipelines, headless environments, or to bypass the interactive login prompt in Google Colab, set your credentials as environment variables before initializing the connection.

import os
import terrafox_datalake as dl

# Pre-populate session credentials
os.environ["MINIO_USER"] = "admin"
os.environ["MINIO_PASSWORD"] = "your_secure_password"
os.environ["MINIO_ENDPOINT"] = "https://minio.terrafoxai.com"

# Initialize the connection
dl.connect()
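
The lookup order described in sections 1 and 2 (environment first, interactive prompt as a fallback) could be sketched roughly as follows. This is an assumption about the behavior, not the package's code: `resolve_credentials` is a hypothetical helper, the prompt texts are invented, and whether the real prompt also asks for an endpoint is not stated. Only the three environment variable names come from the example above.

```python
import getpass
import os


def resolve_credentials(env=None, ask=input, ask_secret=getpass.getpass):
    """Return (user, password, endpoint), prompting only for missing values."""
    env = os.environ if env is None else env
    # Each value is taken from the environment when set and non-empty;
    # otherwise the matching prompt function is called.
    user = env.get("MINIO_USER") or ask("MinIO user: ")
    password = env.get("MINIO_PASSWORD") or ask_secret("MinIO password: ")
    endpoint = env.get("MINIO_ENDPOINT") or ask("MinIO endpoint URL: ")
    return user, password, endpoint


def _no_prompt(message):
    # Used below to show that nothing is prompted when all variables are set.
    raise AssertionError("prompt should not be reached: " + message)


creds = resolve_credentials(
    env={
        "MINIO_USER": "admin",
        "MINIO_PASSWORD": "s3cret",
        "MINIO_ENDPOINT": "https://minio.example.invalid",
    },
    ask=_no_prompt,
    ask_secret=_no_prompt,
)
print(creds[0], creds[2])
```

Passing the prompt functions as parameters keeps the sketch usable in headless runs, where an unexpected call to `input` would otherwise block.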

3. Advanced Usage: Connecting to Different Infrastructures

Terrafox Data Lake is designed to be dynamic and reusable. Switch seamlessly between production environments, staging clusters, or local development instances.

import terrafox_datalake as dl

# Connect to an alternate cluster or local MinIO instance
dl.connect(endpoint="https://local-testing-cluster.local:9000")

# Read data from a different environment
df = dl.read_csv(
    bucket="test-bucket",
    key="metrics.csv"
)

Example: Reading Data from a Data Lake

import terrafox_datalake as dl

dl.connect()

df = dl.read_csv(
    bucket="bigdata",
    key="vehicles.csv"
)

print(df.head())
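
The call above returns one complete DataFrame. For files in the multi-gigabyte range mentioned under Key Features, processing the stream in pieces can keep peak memory lower. The sketch below uses plain pandas rather than this package's API, since whether `dl.read_csv` accepts a chunk size is not documented here; `iter_csv_chunks` is a hypothetical helper, and `body` stands for any file-like byte stream, such as the `Body` of a boto3 `get_object` response.

```python
import io

import pandas as pd


def iter_csv_chunks(body, rows_per_chunk=100_000, **read_csv_kwargs):
    """Yield DataFrames of at most `rows_per_chunk` rows from a file-like body."""
    # With chunksize set, pandas returns an iterator of DataFrames instead of
    # parsing the whole stream into one frame.
    yield from pd.read_csv(body, chunksize=rows_per_chunk, **read_csv_kwargs)


# An in-memory stream stands in for a remote object here.
body = io.BytesIO(b"id,speed\n1,10\n2,20\n3,30\n")
total = sum(chunk["speed"].sum() for chunk in iter_csv_chunks(body, rows_per_chunk=2))
print(total)
```

Aggregating per chunk, as in the running sum here, means only one chunk needs to be held in memory at a time.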

Architecture Requirements

  • Python: 3.7 or higher
  • Supported Storage: MinIO (S3-compatible object storage)

Dependencies

  • pandas
  • boto3
  • s3fs
  • fsspec

Features

  • Secure interactive authentication
  • Environment variable support for automation
  • Native MinIO integration
  • S3-compatible object storage access
  • Simple DataFrame-based data retrieval
  • Flexible infrastructure switching between environments

License

MIT License

Download files

Source Distribution

terrafox_datalake-0.1.4.tar.gz (3.7 kB)

Uploaded Source

Built Distribution

terrafox_datalake-0.1.4-py3-none-any.whl (3.9 kB)

Uploaded Python 3

File details

Details for the file terrafox_datalake-0.1.4.tar.gz.

File metadata

  • Download URL: terrafox_datalake-0.1.4.tar.gz
  • Size: 3.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.1

File hashes

Hashes for terrafox_datalake-0.1.4.tar.gz

Algorithm    Hash digest
SHA256       09952fc83375bdafb726ceea43083f0ea0f172fa22f33917f721f7ce2867eec6
MD5          2bf9e196c39d4695e8bdaa50e1479fd3
BLAKE2b-256  434b6b1a0250d2e7215eea4ae2e45fc272bfbf8af171d65d0c51db953c9cbf7e

File details

Details for the file terrafox_datalake-0.1.4-py3-none-any.whl.

File hashes

Hashes for terrafox_datalake-0.1.4-py3-none-any.whl

Algorithm    Hash digest
SHA256       b51b8ef7d2709c799d021fe13fe269ba803ecdb8f3d8d1a529315dd7fc39bcbb
MD5          fe58ee4989cbe651a6020e88e7dbc292
BLAKE2b-256  7a2a8226e6ada29942fc42dcef0ee38cc0d1acfc537fe760c772308318410e0c
