Automated connector wrapper for streaming data securely from a private MinIO Data Lake

Terrafox Data Lake Connector

A simple, secure wrapper module to stream files out of your private data lake into remote notebook runtimes seamlessly.

terrafox-datalake

A lightweight, universal, stream-native connector wrapper designed to stream datasets securely from private MinIO and S3-compatible Data Lakes straight into Pandas dataframes.

This package streams objects directly via boto3 instead of going through file-system mapping layers (s3fs/fsspec). That avoids the background directory scanning that can cause 403 Forbidden credential collisions, and it is designed to sidestep network edge bottlenecks and Cloudflare proxy payload limits.


Key Features

  • Stream-Native Engine: Reads multi-gigabyte datasets (e.g., 1.3 GiB+ CSVs) linearly using high-performance byte-stream network chunks, keeping your local or Google Colab memory consumption minimal.
  • Bypasses Proxy Blocks: Sidesteps common reverse-proxy constraints (such as the 100 MiB client max body size limit behind a Cloudflare Tunnel) during reads.
  • Universal and Repurposable: No hardcoded endpoints. Works out of the box with your configured defaults, or targets any custom local or cloud data lake cluster.
  • No Configuration Conflicts: Keeps botocore configuration arguments, addressing styles, and signature parameters out of your notebooks.
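
The chunked reading described above can be sketched with the standard library alone. This illustrates the general technique and is not the package's internal code: `iter_chunks` and `count_lines` are hypothetical helpers, and an in-memory `io.BytesIO` stands in for the streaming body that boto3's `get_object` returns under the `Body` key, which also exposes `read(amt)`.

```python
import io


def iter_chunks(body, chunk_size=1024 * 1024):
    """Yield successive byte chunks from any object with a read(n) method."""
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk


def count_lines(body, chunk_size=1024 * 1024):
    """Count newline-terminated records while holding one chunk at a time."""
    return sum(chunk.count(b"\n") for chunk in iter_chunks(body, chunk_size))


# Stand-in for a remote object body (assumption: real code would pass
# the "Body" value of a boto3 get_object response here instead).
fake_body = io.BytesIO(b"id,speed\n1,40\n2,55\n")
line_count = count_lines(fake_body, chunk_size=4)
```

Only one chunk is resident at any moment, so peak memory is bounded by `chunk_size` rather than by the object size. A parsed DataFrame still has to fit in memory once it is built.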

Installation

pip install terrafox-datalake

Quick Start

1. Connecting Natively via Interactive Prompt

If no background credentials are found, calling connect() will securely prompt you for your data lake credentials.

import terrafox_datalake as dl

# Initialize the data lake client context securely
dl.connect()

2. Silent Credentials Injection (Automated Workflows)

For automated scripts, CI/CD pipelines, headless environments, or to bypass the interactive login prompt in Google Colab, set your credentials as environment variables before initializing the connection.

import os
import terrafox_datalake as dl

# Pre-populate session credentials (placeholder values; substitute your own)
os.environ["MINIO_USER"] = "admin"
os.environ["MINIO_PASSWORD"] = "your_secure_password"
os.environ["MINIO_ENDPOINT"] = "https://minio.terrafoxai.com"

# Initialize the connection
dl.connect()
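
The two connection modes above amount to "use the environment if it is populated, otherwise ask". A minimal sketch of that fallback follows, assuming the `MINIO_USER` and `MINIO_PASSWORD` variable names shown above. `resolve_credentials` is a hypothetical helper, not a function of this package, and the actual lookup order inside `connect()` may differ.

```python
import getpass
import os


def resolve_credentials(env=os.environ, ask=input, ask_secret=getpass.getpass):
    """Return (user, password), prompting only for values missing from env."""
    # `or` short-circuits, so no prompt is shown when the variable is set.
    user = env.get("MINIO_USER") or ask("MinIO user: ")
    password = env.get("MINIO_PASSWORD") or ask_secret("MinIO password: ")
    return user, password


# With both variables present, nothing is prompted.
creds = resolve_credentials(
    env={"MINIO_USER": "admin", "MINIO_PASSWORD": "your_secure_password"}
)
```

Using `getpass.getpass` for the secret keeps the password from being echoed into notebook output.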

3. Advanced Usage: Connecting to Different Infrastructures

Terrafox Data Lake is designed to be dynamic and reusable. Switch seamlessly between production environments, staging clusters, or local development instances.

import terrafox_datalake as dl

# Connect to an alternate cluster or local MinIO instance
dl.connect(endpoint="https://local-testing-cluster.local:9000")

# Read data from a different environment
df = dl.read_csv(
    bucket="test-bucket",
    key="metrics.csv"
)
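
For readers curious what an endpoint override looks like at the boto3 level, here is a rough sketch. `client_kwargs` and `make_client` are hypothetical names, and the path-style addressing and `s3v4` signature settings are ones commonly used with MinIO. Whether this package uses exactly these settings is an assumption.

```python
def client_kwargs(endpoint, user, password):
    """Keyword arguments for boto3.client("s3", ...) aimed at a custom endpoint."""
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": user,
        "aws_secret_access_key": password,
    }


def make_client(endpoint, user, password):
    """Build an S3 client for a MinIO-style endpoint (requires boto3)."""
    # Imported lazily so client_kwargs stays usable without boto3 installed.
    import boto3
    from botocore.config import Config

    # Assumed settings: path-style URLs and SigV4 signing.
    config = Config(signature_version="s3v4", s3={"addressing_style": "path"})
    return boto3.client("s3", config=config, **client_kwargs(endpoint, user, password))


kwargs = client_kwargs(
    "https://local-testing-cluster.local:9000", "admin", "your_secure_password"
)
```

Keeping these arguments in one place is what lets notebook code stay limited to a single `connect(endpoint=...)` call.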

Example: Reading Data from a Data Lake

import terrafox_datalake as dl

dl.connect()

df = dl.read_csv(
    bucket="bigdata",
    key="vehicles.csv"
)

print(df.head())
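
Conceptually, a call like the one above can be pictured as "fetch the object, then hand its streaming body to a CSV parser". The sketch below is an assumption about the general shape, not the package's source. `read_csv` here is a standalone hypothetical function, `FakeS3Client` is a test double, and the default parser relies on the common pattern of passing a file-like body to `pandas.read_csv`.

```python
import io


def read_csv(client, bucket, key, parser=None, **parser_kwargs):
    """Fetch one object and pass its streaming body to a CSV parser."""
    if parser is None:
        import pandas as pd  # lazy import: only needed for the default parser

        parser = pd.read_csv
    response = client.get_object(Bucket=bucket, Key=key)
    return parser(response["Body"], **parser_kwargs)


class FakeS3Client:
    """Test double mimicking the single boto3 call used above."""

    def __init__(self, objects):
        self._objects = objects

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


fake = FakeS3Client({("bigdata", "vehicles.csv"): b"make,year\nvolvo,2019\n"})
# A trivial parser is injected so the sketch runs without pandas.
raw = read_csv(fake, "bigdata", "vehicles.csv", parser=lambda body: body.read())
```

With a real boto3 client and pandas installed, omitting `parser` would return a DataFrame instead of raw bytes.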

Architecture Requirements

  • Python: 3.7 or higher
  • Supported Storage: MinIO (S3-compatible object storage)

Dependencies

  • pandas
  • boto3
  • s3fs
  • fsspec

Features

  • Secure interactive authentication
  • Environment variable support for automation
  • Native MinIO integration
  • S3-compatible object storage access
  • Simple DataFrame-based data retrieval
  • Flexible infrastructure switching between environments

License

MIT License
