Python SDK for PixCrawler image dataset platform - simple, lightweight, ML-ready
PixCrawler Python SDK
A simple, lightweight Python SDK for accessing PixCrawler datasets. Designed for ML workflows with minimal API surface and maximum ease of use.
Installation
pip install pixcrawler
Or install from source:
cd sdk
pip install -e .
Quick Start
import pixcrawler as pix
# Set authentication (optional if using environment variables)
pix.auth(token="your_api_key")
# Load dataset into memory
project = pix.project("project-id")
dataset = project.dataset("dataset-id-123")
data = dataset.load()
# Iterate over items
for item in dataset:
    print(item)
Authentication
The SDK supports three authentication methods (in priority order):
1. Programmatic Authentication (Recommended for Scripts)
import pixcrawler as pix
pix.auth(token="your_api_key")
2. Environment Variables (Recommended for Production)
export PIXCRAWLER_SERVICE_KEY="your_api_key"
3. Per-Request Configuration
import pixcrawler as pix
dataset = pix.dataset(
"dataset-id-123",
config={"api_key": "your_api_key", "project_id": "project-id-123"}
).load()
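The priority order listed above can be illustrated with a small resolver. This is a hypothetical sketch, not the SDK's internal code: the function name `resolve_api_key` and its arguments are assumptions made for illustration. Only the `PIXCRAWLER_SERVICE_KEY` variable name and the `api_key` config key come from this README, and the real SDK may resolve credentials differently.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    global_token: Optional[str] = None,
    config: Optional[Mapping[str, str]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first credential found, following the order listed above.

    Hypothetical helper: it mirrors the documented order, not the SDK source.
    """
    environ = os.environ if environ is None else environ
    candidates = [
        global_token,                           # 1. token passed to pix.auth()
        environ.get("PIXCRAWLER_SERVICE_KEY"),  # 2. environment variable
        (config or {}).get("api_key"),          # 3. per-request config
    ]
    for value in candidates:
        if value:
            return value
    return None
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.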
API Reference
auth(token, base_url=None)
Set global authentication token for the session.
Parameters:
- token (str): API token or JWT token from Supabase Auth
- base_url (str, optional): Override API base URL (default: https://api.pixcrawler.com/v1)
Example:
import pixcrawler as pix
pix.auth(token="your_api_key")
# All subsequent calls will use this token
dataset(dataset_id, config=None)
Load dataset into memory for iteration.
Parameters:
- dataset_id (str): UUID of the dataset
- config (dict, optional): Configuration with 'api_key' and 'base_url'
Returns:
Dataset: In-memory dataset object
Raises:
- AuthenticationError: If authentication fails
- NotFoundError: If dataset not found
- RuntimeError: If dataset exceeds memory limit (300MB)
Example:
import pixcrawler as pix
# Load dataset
dataset = pix.dataset("dataset-id-123")
# Iterate over items
for item in dataset:
    image_url = item['url']
    label = item['label']
    print(f"{label}: {image_url}")
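Before training, it is often useful to check how items are distributed across labels. The sketch below assumes each item is a dict with `'url'` and `'label'` keys, as in the example above. `count_labels` is a hypothetical helper, not part of the SDK, and the sample items are stand-ins so the snippet runs without network access.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_labels(items: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count how many items carry each label."""
    return dict(Counter(item["label"] for item in items))


# Stand-in items; with the SDK you would pass the loaded dataset instead.
sample_items = [
    {"url": "https://example.com/1.jpg", "label": "cat"},
    {"url": "https://example.com/2.jpg", "label": "dog"},
    {"url": "https://example.com/3.jpg", "label": "cat"},
]
label_counts = count_labels(sample_items)
print(label_counts)
```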
datasets(config=None)
List user's datasets with pagination.
Parameters:
config(dict, optional): Configuration with 'api_key' and 'base_url'
Returns:
List[dict]: List of dataset metadata dictionaries
Raises:
- AuthenticationError: If authentication fails
- APIError: If API request fails
Example:
import pixcrawler as pix
pix.auth(token="your_api_key")
# List all datasets
project = pix.project(project_id="project-id")
datasets = project.datasets()
for dataset in datasets:
    print(f"{dataset['id']}: {dataset['name']} ({dataset['image_count']} images)")
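Because the listing is a plain list of metadata dicts, ordinary Python is enough to summarize it. The sketch below assumes each dict has the `'id'`, `'name'`, and `'image_count'` keys used in the example above. The helper names and the sample data are hypothetical and not part of the SDK.

```python
from typing import Dict, List, Optional


def total_images(datasets: List[Dict]) -> int:
    """Sum the image counts across all listed datasets."""
    return sum(d["image_count"] for d in datasets)


def largest_dataset(datasets: List[Dict]) -> Optional[Dict]:
    """Return the metadata dict with the most images, or None if the list is empty."""
    return max(datasets, key=lambda d: d["image_count"], default=None)


# Stand-in metadata shaped like the listing above.
sample_datasets = [
    {"id": "ds-1", "name": "Wildlife Images", "image_count": 1200},
    {"id": "ds-2", "name": "Street Signs", "image_count": 450},
]
print(total_images(sample_datasets))
print(largest_dataset(sample_datasets)["name"])
```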
get_dataset_info(dataset_id, config=None)
Get dataset metadata without downloading.
Parameters:
- dataset_id (str): UUID of the dataset
- config (dict, optional): Configuration with 'api_key' and 'base_url'
Returns:
dict: Dataset metadata (image_count, size_mb, labels, etc.)
Raises:
- AuthenticationError: If authentication fails
- NotFoundError: If dataset not found
Example:
import pixcrawler as pix
# Get metadata
dataset = pix.dataset("dataset-id-123")
print(f"Name: {dataset.name}")
print(f"Images: {dataset.image_count}")
print(f"Size: {dataset.size_mb} MB")
download_dataset(dataset_id, output_path, config=None)
Download dataset archive to local file.
Parameters:
- dataset_id (str): UUID of the dataset
- output_path (str): Local file path (e.g., "./wildlife.zip")
- config (dict, optional): Configuration with 'api_key' and 'base_url'
Returns:
str: Absolute path to downloaded file
Raises:
- AuthenticationError: If authentication fails
- NotFoundError: If dataset not found
- PixCrawlerError: If download fails
Example:
import pixcrawler as pix
pix.auth(token="your_api_key")
# Download to file (doesn't load into memory)
path = pix.dataset("dataset-id-123").download("./my_dataset.zip")
print(f"Downloaded to: {path}")
Exception Handling
The SDK provides custom exceptions for different error scenarios:
import pixcrawler as pix
from pixcrawler import (
    PixCrawlerError,      # Base exception
    APIError,             # API returned error
    AuthenticationError,  # Auth failed
    NotFoundError,        # Resource not found
    RateLimitError,       # Rate limit exceeded
)
try:
    dataset = pix.dataset("dataset-id-123")
except AuthenticationError:
    print("Authentication failed. Check your API key.")
except NotFoundError:
    print("Dataset not found.")
except RateLimitError:
    print("Rate limit exceeded. Please try again later.")
except APIError as e:
    print(f"API error {e.status_code}: {e.message}")
except PixCrawlerError as e:
    print(f"SDK error: {e}")
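Rate-limit errors are usually transient, so one common pattern is to retry with exponential backoff rather than fail immediately. The sketch below is an assumption about how you might wrap SDK calls, not something the SDK provides. `call_with_backoff` is a hypothetical helper, and the local `RateLimitError` class is a stand-in so the snippet runs without the SDK installed. In real code you would import `RateLimitError` from `pixcrawler` instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Local stand-in so the sketch runs without the SDK installed."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))  # base_delay, then 2x, 4x, ...
    raise ValueError("max_attempts must be at least 1")
```

With the SDK installed, usage might look like `call_with_backoff(lambda: pix.dataset("dataset-id-123"))`. The `sleep` parameter exists only so tests can avoid real waiting.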
Complete Examples
Example 1: Load and Process Dataset
import pixcrawler as pix
# Authenticate
pix.auth(token="your_api_key")
# Load dataset
dataset = pix.dataset("dataset-id-123")
# Process items
for item in dataset:
    # Your ML preprocessing here
    image_url = item['url']
    label = item['label']
    # Download image, apply transforms, etc.
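A typical preprocessing step is splitting items into training and validation sets. The sketch below shows one way to do this deterministically by hashing each item's URL, so the same item always lands in the same split across runs. It assumes items are dicts with a `'url'` key, as in the example above. `split_by_url` is a hypothetical helper, not part of the SDK.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Item = Dict[str, str]


def split_by_url(
    items: Iterable[Item], val_fraction: float = 0.2
) -> Tuple[List[Item], List[Item]]:
    """Assign each item to train or validation based on a hash of its URL."""
    train: List[Item] = []
    val: List[Item] = []
    for item in items:
        digest = hashlib.sha256(item["url"].encode("utf-8")).digest()
        # Map the first 4 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:4], "big") / 2**32
        (val if bucket < val_fraction else train).append(item)
    return train, val
```

Hashing avoids keeping a random seed or an index file around, at the cost of the split sizes being only approximately `val_fraction`.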
Example 2: List and Download Datasets
import pixcrawler as pix
pix.auth(token="your_api_key")
# List all datasets
datasets = pix.datasets()
# Find specific dataset
target_dataset = next(
    (d for d in datasets if d['name'] == 'Wildlife Images'),
    None
)
if target_dataset:
    # Get detailed info
    info = pix.dataset(target_dataset['id']).info()
    print(f"Found dataset: {info['name']} ({info['image_count']} images)")
    # Download to file
    path = pix.download_dataset(target_dataset['id'], "./wildlife.zip")
    print(f"Downloaded to: {path}")
Example 3: Environment-Based Authentication
# Set environment variable first:
# export PIXCRAWLER_SERVICE_KEY="your_api_key"
import pixcrawler as pix
# No need to call auth() - uses environment variable
dataset = pix.dataset("dataset-id-123")
for item in dataset:
    print(item)
Example 4: Custom Base URL (Testing)
import pixcrawler as pix
# Use custom API URL (e.g., for testing)
pix.auth(
    token="your_api_key",
    base_url="https://staging.example.com/v1",  # placeholder; replace with your test API URL
)
datasets = pix.datasets()
Memory Considerations
The dataset() function loads data into memory and has a 300MB limit to prevent memory issues. For larger datasets:
- Use dataset().download() to save to disk
- Process the downloaded file in chunks
- Or use the API directly for streaming
import zipfile
import pixcrawler as pix
# For large datasets, download to file instead
path = pix.dataset("large-dataset-id").download("./large_dataset.zip")
# Then process the ZIP file in chunks
with zipfile.ZipFile(path, 'r') as zf:
    # Process files one at a time
    for filename in zf.namelist():
        with zf.open(filename) as f:
            # Process file
            pass
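To make the "process in chunks" idea concrete, the sketch below reads each archive member in fixed-size chunks and computes its SHA-256 digest, so no file is ever held in memory as a whole. It uses only the standard library. `hash_zip_members` is a hypothetical helper, and the in-memory archive is a stand-in for a downloaded dataset ZIP whose actual layout is not specified in this README.

```python
import hashlib
import io
import zipfile
from typing import BinaryIO, Dict, Union


def hash_zip_members(
    archive: Union[str, BinaryIO], chunk_size: int = 64 * 1024
) -> Dict[str, str]:
    """Return {member name: SHA-256 hex digest}, reading chunk_size bytes at a time."""
    digests: Dict[str, str] = {}
    with zipfile.ZipFile(archive, "r") as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries have no content to hash
            hasher = hashlib.sha256()
            with zf.open(info) as member:
                # iter() with a sentinel stops when read() returns empty bytes.
                for chunk in iter(lambda: member.read(chunk_size), b""):
                    hasher.update(chunk)
            digests[info.filename] = hasher.hexdigest()
    return digests


# Stand-in archive built in memory; pass the downloaded file path in real use.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cat/1.jpg", b"fake image bytes")
buffer.seek(0)
digests = hash_zip_members(buffer, chunk_size=4)
```

The same loop structure works for any per-chunk processing: replace the hasher with a decoder or a writer as needed.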
Requirements
- Python 3.8+
- requests
- python-dotenv
License
MIT License
Support
- Documentation: https://docs.pixcrawler.com
- Issues: https://github.com/pixcrawler/pixcrawler/issues
- Email: support@pixcrawler.com
File details
Details for the file pixcrawler_sdk-0.2.0.dev1.tar.gz.
File metadata
- Download URL: pixcrawler_sdk-0.2.0.dev1.tar.gz
- Upload date:
- Size: 26.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dc7ebe9c747bbc8f07831a7f1044cee9ce7c11fb5b3d75bd076d0353fd631de9 |
| MD5 | 5ba2bddcb091d28ead5735ef8ceec4cf |
| BLAKE2b-256 | 1bba7d484fbac5b7add19b6e1d4ae7bba2cba1f0f048490cdafbe6ce1475c048 |
File details
Details for the file pixcrawler_sdk-0.2.0.dev1-py2.py3-none-any.whl.
File metadata
- Download URL: pixcrawler_sdk-0.2.0.dev1-py2.py3-none-any.whl
- Upload date:
- Size: 10.5 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | df7d34fe07d52c6d4aa24daa0c62c2d54184f8cba1e0d7d6c3706e4115b348d8 |
| MD5 | e8d775a4e15d867d83469620b535653f |
| BLAKE2b-256 | d09959dc0fd43fe33c6630d438b294783507986de2cbb5fc394e54d3d792d2ef |