Python SDK for the Discovery Engine API
Discovery Engine Python API
The Discovery Engine Python API lets you run analyses from Python instead of the web dashboard. Rather than uploading datasets and configuring analyses through the UI, you can automate your discovery workflows from your own code or scripts.
All analyses run through the API are fully integrated with your Discovery Engine account. Results are automatically displayed in the dashboard, where you can view detailed reports, explore patterns, and share findings with your team. Your account management, credit balance, and subscription settings are all handled through the dashboard.
Installation
pip install discovery-engine-api
For pandas DataFrame support:
pip install discovery-engine-api[pandas]
For Jupyter notebook support:
pip install discovery-engine-api[jupyter]
This installs nest-asyncio, which is required to use engine.run() in Jupyter notebooks. Alternatively, you can use await engine.run_async() directly in Jupyter notebooks without installing the jupyter extra.
Configuration
API Keys
Get your API key from the Developers page in your Discovery Engine dashboard.
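Rather than hard-coding the key in source files, one common approach is to read it from an environment variable. The sketch below assumes a variable named DISCOVERY_ENGINE_API_KEY; both that name and the load_api_key helper are illustrative choices for this example, not something the SDK defines.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name chosen for this example; the SDK does not define it.
ENV_VAR = "DISCOVERY_ENGINE_API_KEY"


def load_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set the {ENV_VAR} environment variable first")
    return key
```

The returned string can then be passed to the constructor, as in Engine(api_key=load_api_key()).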
Quick Start
from discovery import Engine
# Initialize engine
engine = Engine(api_key="your-api-key")
# Run analysis on a dataset and wait for results
result = engine.run(
    file="data.csv",
    target_column="diagnosis",
    description="Rare diseases dataset",
    excluded_columns=["patient_id"],  # Exclude ID column from analysis
    wait=True  # Wait for completion and return full results
)
print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Found {len(result.patterns)} patterns")
Examples
Working with Pandas DataFrames
import pandas as pd
from discovery import Engine
df = pd.read_csv("data.csv")
# or create DataFrame directly
engine = Engine(api_key="your-api-key")
result = engine.run(
    file=df,  # Pass DataFrame directly
    target_column="outcome",
    column_descriptions={
        "age": "Patient age in years",
        "heart rate": None
    },
    excluded_columns=["id", "timestamp"],  # Exclude ID and timestamp columns from analysis
    wait=True
)
Async Workflow
import asyncio
from discovery import Engine
async def run_analysis():
    async with Engine(api_key="your-api-key") as engine:
        # Start analysis without waiting
        result = await engine.run_async(
            file="data.csv",
            target_column="target",
            wait=False
        )
        print(f"Started run: {result.run_id}")
        # Later, get results
        result = await engine.get_results(result.run_id)
        # Or wait for completion
        result = await engine.wait_for_completion(result.run_id, timeout=1200)
        return result

result = asyncio.run(run_analysis())
Using in Jupyter Notebooks
In Jupyter notebooks, you have two options:
Option 1: Install the jupyter extra (recommended)
pip install discovery-engine-api[jupyter]
Then use engine.run() as normal:
from discovery import Engine
engine = Engine(api_key="your-api-key")
result = engine.run(file="data.csv", target_column="target", wait=True)
Option 2: Use async directly
from discovery import Engine
engine = Engine(api_key="your-api-key")
result = await engine.run_async(file="data.csv", target_column="target", wait=True)
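The two options exist because a notebook kernel already runs an asyncio event loop, so a blocking call either needs nested-loop support (nest-asyncio) or has to be awaited. As a rough illustration, the standard-library check below reports whether the calling code is inside a running loop; the in_running_event_loop helper is hypothetical and not part of the SDK.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True when called from code already running inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True
```

In a plain script this returns False, which is the situation where engine.run() works without the jupyter extra; inside a notebook cell it would typically return True.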
Configuration Options
The run() and run_async() methods accept the following parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | str, Path, or DataFrame | Required | Dataset file path or pandas DataFrame |
| target_column | str | Required | Name of column to predict |
| depth_iterations | int | 1 | Analysis depth: number of iterative feature-removal cycles. Higher values find more subtle patterns but use more credits. The maximum useful value is num_columns - 2; values above that are capped server-side. |
| title | str | None | Optional dataset title |
| description | str | None | Optional dataset description |
| column_descriptions | Dict[str, str] | None | Optional column name → description mapping |
| excluded_columns | List[str] | None | Optional list of column names to exclude from analysis (e.g., IDs, timestamps) |
| visibility | "public" / "private" | "public" | Dataset visibility. Public runs are free but always use depth 1. Private runs require credits and support higher depth. |
| auto_report_use_llm_evals | bool | True | Use LLM for pattern descriptions and citations |
| author | str | None | Optional dataset author attribution |
| source_url | str | None | Optional source URL for dataset attribution |
| wait | bool | False | Wait for analysis to complete and return full results |
| wait_timeout | float | None | Maximum seconds to wait for completion (only if wait=True) |
Note on depth and visibility: Public runs always use depth_iterations=1 regardless of settings. To use depth_iterations > 1, set visibility="private". Private runs consume credits based on file size × depth.
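The depth rules above can be sketched as a small client-side helper that estimates which depth a run would actually use. This is only an illustration: effective_depth is a hypothetical function, it assumes num_columns counts every column in the dataset (whether excluded columns count is not stated), and the floor of 1 for very narrow datasets is an assumption.

```python
def effective_depth(requested: int, num_columns: int, visibility: str = "public") -> int:
    """Estimate the depth a run would use under the documented rules."""
    if requested < 1:
        raise ValueError("depth_iterations must be at least 1")
    if visibility == "public":
        return 1  # public runs always use depth 1
    # Private runs: values above num_columns - 2 are capped server-side.
    cap = max(1, num_columns - 2)  # assumption: never report less than 1
    return min(requested, cap)


print(effective_depth(20, num_columns=10, visibility="private"))
```

Since the real cap is applied server-side, a helper like this is only useful for estimating credit usage before submitting a run.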
File Size Limits
The SDK supports file uploads up to 1 GB. Files are uploaded directly to cloud storage using presigned URLs, so there is no HTTP body size restriction.
Supported file formats: CSV, Parquet.
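A local pre-flight check can catch oversized or unsupported files before an upload starts. The sketch below is a hypothetical helper, not part of the SDK; it assumes "1 GB" means 2**30 bytes and that files use the usual .csv and .parquet extensions.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 1024 ** 3  # assumption: "1 GB" read as 2**30 bytes
SUPPORTED_SUFFIXES = {".csv", ".parquet"}  # assumption: usual extensions


def check_upload(path: str) -> Path:
    """Raise ValueError if the file is missing, too large, or has an unexpected extension."""
    p = Path(path)
    if not p.is_file():
        raise ValueError(f"{p} is not a file")
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported format {p.suffix!r}; expected CSV or Parquet")
    size = p.stat().st_size
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"{p} is {size} bytes, above the {MAX_UPLOAD_BYTES}-byte limit")
    return p
```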
Credits and Pricing
If you don't have enough credits for a private run, the SDK will raise a ValueError with a message like:
Insufficient credits. You need X credits but only have Y available.
Solutions:
- Make your dataset public (set visibility="public"), which is completely free
- Visit https://disco.leap-labs.com/account to:
  - Purchase additional credits
  - Upgrade to a subscription plan that includes more credits
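If a script needs to react to this error, the numbers can be pulled out of the message text. The sketch below is a guess based on the example message shown above: the exact wording may differ between SDK versions, and parse_credit_shortfall is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Pattern modelled on the example message; the real wording may differ.
_CREDITS_RE = re.compile(r"need (\d+(?:\.\d+)?) credits but only have (\d+(?:\.\d+)?)")


def parse_credit_shortfall(message: str) -> Optional[Tuple[float, float]]:
    """Return (needed, available) if the message looks like a credit error, else None."""
    match = _CREDITS_RE.search(message)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```

A caller could wrap engine.run() in try/except ValueError, pass str(exc) to this helper, and re-raise when it returns None, since other ValueErrors are presumably possible too.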
Return Value
The run() and run_async() methods return an EngineResult object with the following fields:
EngineResult
@dataclass
class EngineResult:
    # Identifiers
    run_id: str  # Unique run identifier
    report_id: Optional[str]  # Report ID (if report created)
    status: str  # "pending", "processing", "completed", "failed"
    # Dataset metadata
    dataset_title: Optional[str]
    dataset_description: Optional[str]
    total_rows: Optional[int]  # Number of rows in dataset
    target_column: Optional[str]  # Name of target column
    task: Optional[str]  # "regression", "binary_classification", or "multiclass_classification"
    # LLM-generated summary
    summary: Optional[Summary]
    # Discovered patterns
    patterns: List[Pattern]
    # Column/feature information
    columns: List[Column]  # List of columns with statistics and importance
    # Correlation matrix
    correlation_matrix: List[CorrelationEntry]  # Feature correlations
    # Global feature importance
    feature_importance: Optional[FeatureImportance]  # Feature importance scores
    # Job tracking
    job_id: Optional[str]
    job_status: Optional[str]
    error_message: Optional[str]
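As an illustration of how these fields fit together, the sketch below builds a one-line status summary. It uses a stand-in object with made-up values rather than a real EngineResult, and it assumes "completed" and "failed" are the only terminal statuses, based on the four values listed above; describe_result is a hypothetical helper.

```python
from types import SimpleNamespace

# Assumption: these are the only statuses after which a run stops changing.
TERMINAL_STATUSES = {"completed", "failed"}


def describe_result(result) -> str:
    """Build a one-line status summary from run_id, status, error_message and patterns."""
    if result.status not in TERMINAL_STATUSES:
        return f"Run {result.run_id} is still {result.status}"
    if result.status == "failed":
        return f"Run {result.run_id} failed: {result.error_message or 'no error message'}"
    return f"Run {result.run_id} completed with {len(result.patterns)} patterns"


# Stand-in object with made-up values, used only to exercise the helper.
demo = SimpleNamespace(run_id="run-123", status="failed", error_message="bad file", patterns=[])
print(describe_result(demo))
```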
Pattern
@dataclass
class Pattern:
    id: str
    task: str  # "regression", "binary_classification", "multiclass_classification"
    target_column: str
    target_change_direction: str  # "max" (increases target) or "min" (decreases target)
    p_value: float  # FDR-adjusted p-value (lower = more significant)
    conditions: List[Dict]  # Conditions defining the pattern (see below)
    abs_target_change: float  # Absolute change in target (always positive, magnitude of effect)
    support_count: int  # Number of rows matching pattern
    support_percentage: float  # Percentage of dataset matching pattern
    novelty_type: str  # "novel" or "confirmatory"
    target_score: float  # Effect size score
    description: str  # Human-readable description
    novelty_explanation: str  # Why the pattern is novel or confirmatory
    target_class: Optional[str]  # For classification tasks
    target_mean: Optional[float]  # Target mean within pattern (regression)
    target_std: Optional[float]  # Target std within pattern (regression)
    citations: List[Dict]  # Academic citations if available
    p_value_raw: Optional[float]  # Raw p-value before FDR adjustment
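One way to work with these fields is to shortlist patterns that are both novel and statistically significant, then order them by effect magnitude. The sketch below uses stand-in objects with made-up values in place of real Pattern instances; top_novel_patterns and the 0.05 threshold are illustrative choices, not SDK features.

```python
from types import SimpleNamespace
from typing import Iterable, List


def top_novel_patterns(patterns: Iterable, alpha: float = 0.05, limit: int = 5) -> List:
    """Keep novel patterns with p_value below alpha, largest abs_target_change first."""
    selected = [
        p for p in patterns
        if p.novelty_type == "novel" and p.p_value < alpha
    ]
    selected.sort(key=lambda p: p.abs_target_change, reverse=True)
    return selected[:limit]


# Stand-in objects with made-up values; real code would pass result.patterns.
demo = [
    SimpleNamespace(id="a", novelty_type="novel", p_value=0.01, abs_target_change=0.2),
    SimpleNamespace(id="b", novelty_type="confirmatory", p_value=0.001, abs_target_change=0.9),
    SimpleNamespace(id="c", novelty_type="novel", p_value=0.20, abs_target_change=0.5),
    SimpleNamespace(id="d", novelty_type="novel", p_value=0.03, abs_target_change=0.4),
]
print([p.id for p in top_novel_patterns(demo)])  # prints ['d', 'a']
```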
Pattern Conditions
Each condition in pattern.conditions is a dict with a type field:
Continuous condition — a numeric range:
{
    "type": "continuous",
    "feature": "age",
    "min_value": 45.0,
    "max_value": 65.0,
    "min_q": 0.35,  # quantile of min_value
    "max_q": 0.72  # quantile of max_value
}
Categorical condition — a set of values:
{
    "type": "categorical",
    "feature": "region",
    "values": ["north", "east"]
}
Datetime condition — a time range:
{
    "type": "datetime",
    "feature": "date",
    "min_value": 1609459200000,  # epoch ms
    "max_value": 1640995200000,
    "min_datetime": "2021-01-01",  # human-readable
    "max_datetime": "2022-01-01"
}
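To see how condition dicts of this shape might be applied, the sketch below tests a single row (a plain dict) against a list of conditions. It rests on several assumptions not confirmed by the documentation: all conditions must hold together, range bounds are inclusive, and datetime values in the row are given as epoch milliseconds. The matches function is a hypothetical helper.

```python
from typing import Any, Dict, List


def matches(row: Dict[str, Any], conditions: List[Dict[str, Any]]) -> bool:
    """Return True if the row satisfies every condition (assumes AND, inclusive bounds)."""
    for cond in conditions:
        value = row.get(cond["feature"])
        if value is None:
            return False  # a missing value cannot satisfy a condition
        if cond["type"] == "categorical":
            if value not in cond["values"]:
                return False
        elif cond["type"] in ("continuous", "datetime"):
            # Datetime rows are expected as epoch ms, like min_value/max_value.
            if not cond["min_value"] <= value <= cond["max_value"]:
                return False
        else:
            raise ValueError(f"Unknown condition type: {cond['type']!r}")
    return True
```

For example, a row with age 50 and region "north" satisfies the continuous and categorical conditions shown above, while the same row with age 70 does not.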
Summary
@dataclass
class Summary:
    overview: str  # High-level summary of findings
    key_insights: List[str]  # Main takeaways
    novel_patterns: PatternGroup  # Novel pattern IDs and explanation
    selected_pattern_id: Optional[str]  # Featured pattern ID
Note: The data_insights field from v0.1.x has been removed. Use result.feature_importance and result.correlation_matrix directly instead; these provide the raw computed values without LLM summarization artifacts.
Column
@dataclass
class Column:
    id: str
    name: str
    display_name: str
    type: str  # "continuous" or "categorical"
    data_type: str  # "int", "float", "string", "boolean", "datetime"
    enabled: bool
    description: Optional[str]
    # Statistics (for numeric columns)
    mean: Optional[float]
    median: Optional[float]
    std: Optional[float]
    min: Optional[float]
    max: Optional[float]
    iqr_min: Optional[float]
    iqr_max: Optional[float]
    mode: Optional[str]  # Statistical mode (None if all values unique)
    approx_unique: Optional[int]
    null_percentage: Optional[float]
    # Feature importance
    feature_importance_score: Optional[float]  # Signed importance score (see FeatureImportance)
FeatureImportance
Feature importance is computed using Hierarchical Perturbation (HiPe), an efficient ablation-based method. Scores are signed to indicate direction:
- Positive: feature increases the prediction / supports predicted class
- Negative: feature decreases the prediction / works against predicted class
@dataclass
class FeatureImportance:
    kind: str  # "global"
    baseline: float  # Baseline model output (mean prediction)
    scores: List[FeatureImportanceScore]

@dataclass
class FeatureImportanceScore:
    feature: str  # Feature/column name
    score: float  # Signed importance score
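Because the scores are signed, ranking features usually means sorting by magnitude while keeping the sign for interpretation. The sketch below uses stand-in objects with made-up values in place of real FeatureImportanceScore instances; rank_by_magnitude is a hypothetical helper.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def rank_by_magnitude(scores: Iterable) -> List[Tuple[str, float]]:
    """Sort (feature, score) pairs by absolute score, strongest first."""
    pairs = [(s.feature, s.score) for s in scores]
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Stand-in objects with made-up values; real code would pass result.feature_importance.scores.
demo_scores = [
    SimpleNamespace(feature="age", score=0.12),
    SimpleNamespace(feature="bmi", score=-0.30),
    SimpleNamespace(feature="region", score=0.05),
]
for feature, score in rank_by_magnitude(demo_scores):
    direction = "raises" if score > 0 else "lowers"
    print(f"{feature}: {direction} the prediction ({score:+.2f})")
```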
CorrelationEntry
@dataclass
class CorrelationEntry:
    feature_x: str
    feature_y: str
    value: float  # Correlation coefficient (-1 to 1)
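A flat list of entries like this can be scanned for strongly correlated feature pairs. The sketch below uses stand-in objects with made-up values and assumes, without confirmation from the documentation, that the list may contain diagonal entries and both orderings of a pair; strong_pairs and the 0.8 threshold are illustrative.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def strong_pairs(entries: Iterable, threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return distinct feature pairs whose absolute correlation meets the threshold."""
    seen = set()
    pairs = []
    for e in entries:
        if e.feature_x == e.feature_y:
            continue  # skip the diagonal
        key = frozenset((e.feature_x, e.feature_y))
        if key in seen or abs(e.value) < threshold:
            continue  # skip mirrored duplicates and weak correlations
        seen.add(key)
        pairs.append((e.feature_x, e.feature_y, e.value))
    return pairs


# Stand-in objects with made-up values; real code would pass result.correlation_matrix.
demo = [
    SimpleNamespace(feature_x="a", feature_y="a", value=1.0),
    SimpleNamespace(feature_x="a", feature_y="b", value=0.9),
    SimpleNamespace(feature_x="b", feature_y="a", value=0.9),
    SimpleNamespace(feature_x="a", feature_y="c", value=-0.85),
    SimpleNamespace(feature_x="b", feature_y="c", value=0.1),
]
print(strong_pairs(demo))
```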