Python SDK for PixCrawler image dataset platform - simple, lightweight, ML-ready

PixCrawler Python SDK

A simple, lightweight Python SDK for accessing PixCrawler datasets. Designed for ML workflows with minimal API surface and maximum ease of use.

Installation

pip install pixcrawler

Or install from source:

cd sdk
pip install -e .

Quick Start

import pixcrawler as pix

# Set authentication (optional if using environment variables)
pix.auth(token="your_api_key")

# Load dataset into memory
project = pix.project("project-id")
dataset = project.dataset("dataset-id-123")
data = dataset.load()

# Iterate over items
for item in dataset:
    print(item)

Authentication

The SDK supports three authentication methods (in priority order):

1. Programmatic Authentication (Recommended for Scripts)

import pixcrawler as pix

pix.auth(token="your_api_key")

2. Environment Variables (Recommended for Production)

export PIXCRAWLER_SERVICE_KEY="your_api_key"
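
When a script depends on the environment variable, it can help to fail early with a clear message if the variable is missing. A minimal sketch, assuming the variable name shown above; `read_service_key` is a hypothetical helper for illustration and not part of the SDK:

```python
import os


def read_service_key(env=None, var="PIXCRAWLER_SERVICE_KEY"):
    """Return the API key from the environment, or raise if it is unset or blank.

    Hypothetical helper for illustration; not part of the pixcrawler SDK.
    """
    env = os.environ if env is None else env
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running this script")
    return key
```

The returned value could then be passed on explicitly, for example as the `token` argument of `pix.auth()`.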

3. Per-Request Configuration

import pixcrawler as pix

dataset = pix.dataset(
    "dataset-id-123",
    config={"api_key": "your_api_key", "project_id": "project-id-123"}
).load()

API Reference

auth(token, base_url=None)

Set global authentication token for the session.

Parameters:

  • token (str): API token or JWT token from Supabase Auth
  • base_url (str, optional): Override API base URL (default: https://api.pixcrawler.com/v1)

Example:

import pixcrawler as pix

pix.auth(token="your_api_key")
# All subsequent calls will use this token

dataset(dataset_id, config=None)

Load dataset into memory for iteration.

Parameters:

  • dataset_id (str): UUID of the dataset
  • config (dict, optional): Configuration with 'api_key' and 'base_url'

Returns:

  • Dataset: In-memory dataset object

Raises:

  • AuthenticationError: If authentication fails
  • NotFoundError: If dataset not found
  • RuntimeError: If dataset exceeds memory limit (300MB)

Example:

import pixcrawler as pix

# Load dataset
dataset = pix.dataset("dataset-id-123")

# Iterate over items
for item in dataset:
    image_url = item['url']
    label = item['label']
    print(f"{label}: {image_url}")
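
Once items can be iterated, a common first step in an ML workflow is to group image URLs by label. The sketch below assumes each item is a dict with `'url'` and `'label'` keys, as in the example above; `group_urls_by_label` and the sample items are made up for illustration:

```python
from collections import defaultdict


def group_urls_by_label(items):
    """Map each label to the list of image URLs that carry it."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["label"]].append(item["url"])
    return dict(grouped)


# Made-up items shaped like the ones in the example above.
sample_items = [
    {"url": "https://example.com/a.jpg", "label": "cat"},
    {"url": "https://example.com/b.jpg", "label": "dog"},
    {"url": "https://example.com/c.jpg", "label": "cat"},
]
by_label = group_urls_by_label(sample_items)
```

Any iterable of such dicts should work the same way, so a loaded dataset could be passed in place of `sample_items`, provided its items have that shape.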

datasets(config=None)

List user's datasets with pagination.

Parameters:

  • config (dict, optional): Configuration with 'api_key' and 'base_url'

Returns:

  • List[dict]: List of dataset metadata dictionaries

Raises:

  • AuthenticationError: If authentication fails
  • APIError: If API request fails

Example:

import pixcrawler as pix

pix.auth(token="your_api_key")

# List all datasets
datasets = pix.datasets()

for dataset in datasets:
    print(f"{dataset['id']}: {dataset['name']} ({dataset['image_count']} images)")
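
Because the result is a plain list of metadata dicts, it can be filtered and sorted with ordinary Python. A small sketch, assuming the `'image_count'` key used in the example above; `largest_datasets` and the sample metadata are hypothetical:

```python
def largest_datasets(datasets, min_images=0, limit=5):
    """Return up to `limit` metadata dicts with at least `min_images` images, largest first."""
    eligible = [d for d in datasets if d.get("image_count", 0) >= min_images]
    return sorted(eligible, key=lambda d: d.get("image_count", 0), reverse=True)[:limit]


# Made-up metadata shaped like the dicts in the example above.
sample_metadata = [
    {"id": "a", "name": "Small", "image_count": 40},
    {"id": "b", "name": "Large", "image_count": 9000},
    {"id": "c", "name": "Medium", "image_count": 800},
]
top = largest_datasets(sample_metadata, min_images=100)
```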

get_dataset_info(dataset_id, config=None)

Get dataset metadata without downloading.

Parameters:

  • dataset_id (str): UUID of the dataset
  • config (dict, optional): Configuration with 'api_key' and 'base_url'

Returns:

  • dict: Dataset metadata (image_count, size_mb, labels, etc.)

Raises:

  • AuthenticationError: If authentication fails
  • NotFoundError: If dataset not found

Example:

import pixcrawler as pix

# Get metadata without downloading the dataset
info = pix.get_dataset_info("dataset-id-123")

print(f"Name: {info['name']}")
print(f"Images: {info['image_count']}")
print(f"Size: {info['size_mb']} MB")
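
The `size_mb` field listed under Returns can be checked against the 300MB in-memory limit before deciding whether to load a dataset or download it. A minimal sketch, assuming the metadata is a dict with a numeric `'size_mb'` value; `fits_in_memory` is a hypothetical helper, not an SDK function:

```python
MEMORY_LIMIT_MB = 300  # in-memory limit stated in this README


def fits_in_memory(info, limit_mb=MEMORY_LIMIT_MB):
    """Return True when the reported size does not exceed the limit."""
    return info["size_mb"] <= limit_mb


# Made-up metadata dicts for illustration.
small = {"name": "Wildlife Images", "size_mb": 120.5}
large = {"name": "Street Scenes", "size_mb": 2048}
```

A caller might load the dataset when this returns True and fall back to downloading the archive otherwise.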

download_dataset(dataset_id, output_path, config=None)

Download dataset archive to local file.

Parameters:

  • dataset_id (str): UUID of the dataset
  • output_path (str): Local file path (e.g., "./wildlife.zip")
  • config (dict, optional): Configuration with 'api_key' and 'base_url'

Returns:

  • str: Absolute path to downloaded file

Raises:

  • AuthenticationError: If authentication fails
  • NotFoundError: If dataset not found
  • PixCrawlerError: If download fails

Example:

import pixcrawler as pix

pix.auth(token="your_api_key")

# Download to file (doesn't load into memory)
path = pix.download_dataset("dataset-id-123", "./my_dataset.zip")
print(f"Downloaded to: {path}")
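
After a download it may be worth confirming that the file is a readable archive before processing it. The sketch below assumes the download is a ZIP file, as the `.zip` paths in this README suggest; `check_archive` is a hypothetical helper built only on the standard library:

```python
import zipfile


def check_archive(path):
    """Return the number of members if `path` is a readable ZIP; raise ValueError otherwise."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a ZIP archive")
    with zipfile.ZipFile(path) as zf:
        bad = zf.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            raise ValueError(f"corrupt member in archive: {bad}")
        return len(zf.namelist())
```

Calling `check_archive(path)` on the returned path would raise before any per-file processing starts if the archive is truncated or damaged.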

Exception Handling

The SDK provides custom exceptions for different error scenarios:

import pixcrawler as pix
from pixcrawler import (
    PixCrawlerError,  # Base exception
    APIError,  # API returned error
    AuthenticationError,  # Auth failed
    NotFoundError,  # Resource not found
    RateLimitError,  # Rate limit exceeded
)

try:
    dataset = pix.dataset("dataset-id-123")
except AuthenticationError:
    print("Authentication failed. Check your API key.")
except NotFoundError:
    print("Dataset not found.")
except RateLimitError:
    print("Rate limit exceeded. Please try again later.")
except APIError as e:
    print(f"API error {e.status_code}: {e.message}")
except PixCrawlerError as e:
    print(f"SDK error: {e}")
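
Rate-limit errors are often transient, so one option is to retry with exponential backoff instead of giving up at once. This is a generic sketch, not SDK behavior: `call_with_retry` is a hypothetical helper, and the exception type to retry on is passed in by the caller:

```python
import time


def call_with_retry(fn, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); when it raises `retry_on`, wait base_delay * 2**n and try again.

    Re-raises the last exception once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the SDK this might be used as `call_with_retry(lambda: pix.dataset("dataset-id-123"), retry_on=RateLimitError)`. The `sleep` parameter exists so that tests can replace the real delay.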

Complete Examples

Example 1: Load and Process Dataset

import pixcrawler as pix

# Authenticate
pix.auth(token="your_api_key")

# Load dataset
dataset = pix.dataset("dataset-id-123")

# Process items
for item in dataset:
    # Your ML preprocessing here
    image_url = item['url']
    label = item['label']
    # Download image, apply transforms, etc.

Example 2: List and Download Datasets

import pixcrawler as pix

pix.auth(token="your_api_key")

# List all datasets
datasets = pix.datasets()

# Find specific dataset
target_dataset = next(
    (d for d in datasets if d['name'] == 'Wildlife Images'),
    None
)

if target_dataset:
    # Get detailed info
    info = pix.dataset(target_dataset['id']).info()
    print(f"Found dataset: {info['name']} ({info['image_count']} images)")
    
    # Download to file
    path = pix.download_dataset(target_dataset['id'], "./wildlife.zip")
    print(f"Downloaded to: {path}")

Example 3: Environment-Based Authentication

# Set environment variable first:
# export PIXCRAWLER_SERVICE_KEY="your_api_key"

import pixcrawler as pix

# No need to call auth() - uses environment variable
dataset = pix.dataset("dataset-id-123")

for item in dataset:
    print(item)

Example 4: Custom Base URL (Testing)

import pixcrawler as pix

# Use a custom API URL (e.g., for testing); the URL below is a placeholder
pix.auth(
    token="your_api_key",
    base_url="http://localhost:8000/v1",
)

datasets = pix.datasets()

Memory Considerations

The dataset() function loads data into memory and has a 300MB limit to prevent memory issues. For larger datasets:

  1. Use dataset().download() to save to disk
  2. Process the downloaded file in chunks
  3. Or use the API directly for streaming

import zipfile

import pixcrawler as pix

# For large datasets, download to file instead
path = pix.dataset("large-dataset-id").download("./large_dataset.zip")

# Then process the ZIP file in chunks
with zipfile.ZipFile(path, 'r') as zf:
    # Process files one at a time
    for filename in zf.namelist():
        with zf.open(filename) as f:
            # Process file
            pass
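
The "Process file" placeholder above can read each member in fixed-size chunks, so that no single file is ever held in memory whole. A sketch using SHA-256 hashing as a stand-in for real processing; `hash_members` and the 64 KiB chunk size are illustrative choices, not part of the SDK:

```python
import hashlib
import zipfile


def hash_members(archive, chunk_size=64 * 1024):
    """Return {member name: SHA-256 hex digest}, reading each member in chunks."""
    digests = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries have no content to read
            h = hashlib.sha256()
            with zf.open(info) as f:
                # f.read() returns b"" at end of member, which stops the iterator
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            digests[info.filename] = h.hexdigest()
    return digests
```

The same loop structure applies to other per-file work, such as decoding images one at a time.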

Requirements

  • Python 3.8+
  • requests
  • python-dotenv

License

MIT License
