Python SDK for the Discovery Engine API

Discovery Engine Python SDK

Find novel, statistically validated patterns in tabular data — feature interactions, subgroup effects, and conditional relationships that correlation analysis and LLMs miss.

Installation

pip install discovery-engine-api

For pandas DataFrame support:

pip install "discovery-engine-api[pandas]"

Quick Start

import asyncio

from discovery import Engine


async def main():
    engine = Engine(api_key="disco_...")

    # Upload the dataset and wait for the run to finish
    result = await engine.discover(
        file="data.csv",
        target_column="outcome",
    )

    # Print patterns that are both significant and novel
    for pattern in result.patterns:
        if pattern.p_value < 0.05 and pattern.novelty_type == "novel":
            print(f"{pattern.description} (p={pattern.p_value:.4f})")

    print(f"Full report: {result.report_url}")


asyncio.run(main())

Get your API key from the Developers page (https://disco.leap-labs.com/developers).

Parameters

await engine.discover(
    file: str | Path | pd.DataFrame,  # Dataset to analyze
    target_column: str,                 # Column to predict/analyze
    depth_iterations: int = 1,          # 1=fast, higher=deeper search
    visibility: str = "public",         # "public" (free) or "private" (credits)
    title: str | None = None,           # Dataset title
    description: str | None = None,     # Dataset description
    column_descriptions: dict[str, str] | None = None,  # Improves pattern explanations
    excluded_columns: list[str] | None = None,           # Columns to exclude (e.g., IDs)
    timeout: float = 1800,              # Max seconds to wait
)

Tip: Providing column_descriptions significantly improves pattern explanations. If your columns have non-obvious names, always describe them.

Depth and visibility: Public runs are always depth_iterations=1 regardless of settings. To use depth_iterations > 1, set visibility="private". Private runs consume credits based on file size × depth.
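
The depth rule above can be sketched as a small local check to run before submitting. This is an illustrative helper, not part of the SDK: the function name `effective_depth` and its validation are assumptions, and only the public/private rule itself comes from the text above.

```python
def effective_depth(visibility: str, depth_iterations: int = 1) -> int:
    """Return the depth a run would use under the rule described above.

    Hypothetical helper (not an SDK function): public runs are treated as
    depth 1 whatever was requested; private runs keep the requested depth.
    """
    if visibility not in ("public", "private"):
        raise ValueError(f"unknown visibility: {visibility!r}")
    if depth_iterations < 1:
        raise ValueError("depth_iterations must be at least 1")
    return 1 if visibility == "public" else depth_iterations


print(effective_depth("public", 3))   # requested depth is ignored for public runs
print(effective_depth("private", 3))  # private runs keep the requested depth
```

A check like this makes it obvious when a requested depth would be silently reduced because the run is public.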

Examples

Working with Pandas DataFrames

import pandas as pd
from discovery import Engine

df = pd.read_csv("data.csv")

engine = Engine(api_key="disco_...")
result = await engine.discover(
    file=df,
    target_column="outcome",
    column_descriptions={
        "age": "Patient age in years",
        "bmi": "Body mass index",
    },
    excluded_columns=["patient_id", "timestamp"],
)

Inspecting Columns Before Running

If you need to see the dataset's columns before choosing a target column, upload first and inspect:

# Upload once and get the server's parsed column list
upload = await engine.upload_file(file="data.csv", title="My dataset")
print(upload["columns"])   # [{"name": "col1", "type": "continuous", ...}, ...]
print(upload["rowCount"])  # e.g., 5000

# Pass the result to avoid re-uploading
result = await engine.run_async(
    file="data.csv",
    target_column="col1",
    wait=True,
    upload_result=upload,  # skips the upload step
)
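
Once the column list is available, picking candidate targets is ordinary list filtering. The sketch below works on a plain dict shaped like the `upload` result printed above (`"columns"` entries with `"name"` and `"type"`, plus `"rowCount"`). The sample data and the helper name `columns_of_type` are made up for illustration.

```python
def columns_of_type(upload: dict, column_type: str) -> list:
    """Return the names of columns whose "type" matches column_type."""
    return [
        col["name"]
        for col in upload.get("columns", [])
        if col.get("type") == column_type
    ]


# Sample data in the shape shown above (values are invented)
sample_upload = {
    "columns": [
        {"name": "col1", "type": "continuous"},
        {"name": "col2", "type": "categorical"},
        {"name": "col3", "type": "continuous"},
    ],
    "rowCount": 5000,
}

print(columns_of_type(sample_upload, "continuous"))
```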

Running in the Background

Runs take 3–15 minutes. If you need to do other work while Discovery Engine runs:

import asyncio
from discovery import Engine

async def main():
    async with Engine(api_key="disco_...") as engine:
        # Submit without waiting
        run = await engine.run_async(
            file="data.csv",
            target_column="outcome",
            wait=False,
        )
        print(f"Submitted run {run.run_id}, continuing...")

        # ... do other work ...

        # Check back later
        result = await engine.wait_for_completion(run.run_id, timeout=1800)
        return result

result = asyncio.run(main())
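
The same submit-then-wait pattern extends to several target columns. The sketch below is only an illustration: `StubEngine` is a stand-in that mimics the two methods used above (`run_async` with `wait=False`, and `wait_for_completion`) so the example runs without the SDK or network access. Whether the service accepts several concurrent runs per account is not stated here, so treat that as an assumption.

```python
import asyncio
from types import SimpleNamespace


class StubEngine:
    """Stand-in with the two methods used above. NOT the real SDK client."""

    def __init__(self):
        self._submitted = 0

    async def run_async(self, file, target_column, wait=False):
        self._submitted += 1
        return SimpleNamespace(run_id=f"run-{self._submitted}")

    async def wait_for_completion(self, run_id, timeout=1800):
        await asyncio.sleep(0)  # pretend to poll
        return SimpleNamespace(run_id=run_id, status="completed")


async def run_many(engine, file, target_columns, timeout=1800):
    """Submit one run per target column, then wait for all of them together."""
    runs = [
        await engine.run_async(file=file, target_column=target, wait=False)
        for target in target_columns
    ]
    results = await asyncio.gather(
        *(engine.wait_for_completion(run.run_id, timeout=timeout) for run in runs)
    )
    return dict(zip(target_columns, results))


results = asyncio.run(run_many(StubEngine(), "data.csv", ["outcome", "cost"]))
for target, res in results.items():
    print(target, res.run_id, res.status)
```

With the real client, `StubEngine()` would be replaced by an `Engine` instance inside an `async with` block as in the example above.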

Synchronous Usage

For scripts and Jupyter notebooks:

from discovery import Engine

engine = Engine(api_key="disco_...")
result = engine.run(
    file="data.csv",
    target_column="outcome",
    wait=True,
)

For Jupyter notebooks, install the jupyter extra for engine.run() compatibility:

pip install "discovery-engine-api[jupyter]"

Or use await engine.discover(...) / await engine.run_async(...) directly in async notebook cells.

Working with Results

# Filter for significant novel patterns
novel = [p for p in result.patterns
         if p.p_value < 0.05 and p.novelty_type == "novel"]

# Get patterns that increase the target
increasing = [p for p in result.patterns if p.target_change_direction == "max"]

# Inspect conditions
for pattern in result.patterns:
    for cond in pattern.conditions:
        print(f"  {cond['feature']}: {cond}")

# Feature importance, largest absolute score first
if result.feature_importance:
    top = sorted(result.feature_importance.scores,
                 key=lambda s: abs(s.score), reverse=True)
    for s in top[:5]:
        print(f"  {s.feature}: {s.score:+.3f}")

# Share the interactive report
print(f"Explore: {result.report_url}")

Credits and Pricing

  • Public runs: Free. Results are published to the public gallery. Locked to depth_iterations=1.
  • Private runs: 1 credit per MB per depth iteration, at $1.00 per credit.
  • Formula: credits = max(1, ceil(file_size_mb * depth_iterations))

# Estimate cost before running
estimate = await engine.estimate(
    file_size_mb=10.5,
    num_columns=25,
    depth_iterations=2,
    visibility="private",
)
# estimate["cost"]["credits"] -> 21
# estimate["cost"]["free_alternative"] -> True
# estimate["account"]["sufficient"] -> True/False
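
The pricing formula can also be evaluated locally as a sanity check on the server's estimate. The helper below is a sketch, not an SDK function: its name and the choice to return 0 credits for public runs are assumptions based on the bullets above. The authoritative figure is whatever `engine.estimate` returns.

```python
import math


def estimate_credits(file_size_mb: float, depth_iterations: int = 1,
                     visibility: str = "private") -> int:
    """Apply credits = max(1, ceil(file_size_mb * depth_iterations)).

    Hypothetical local helper. Public runs are free per the list above,
    so they are modelled here as costing 0 credits.
    """
    if visibility == "public":
        return 0
    return max(1, math.ceil(file_size_mb * depth_iterations))


credits = estimate_credits(10.5, depth_iterations=2)
print(credits, f"${credits * 1.00:.2f}")  # $1.00 per credit
```

For the 10.5 MB, depth 2 example above, this gives the same 21 credits shown in the estimate output.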

Manage credits and plans at disco.leap-labs.com/account.

File Size Limits

Uploads up to 5 GB. Files are uploaded directly to cloud storage using presigned URLs.

Supported formats: CSV, TSV, Excel (.xlsx), JSON, Parquet, ARFF, Feather.
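
A quick pre-flight check can catch an unsupported file before a long upload starts. The sketch below is illustrative only: the suffix list is inferred from the format names above (only `.xlsx` is spelled out in the text), and reading "5 GB" as 5 × 1024³ bytes is an assumption.

```python
from pathlib import Path

# Assumed mapping from the format names above to file suffixes
SUPPORTED_SUFFIXES = {".csv", ".tsv", ".xlsx", ".json", ".parquet", ".arff", ".feather"}
MAX_UPLOAD_BYTES = 5 * 1024**3  # assumption: "5 GB" read as 5 GiB


def check_upload(path, size_bytes=None) -> list:
    """Return a list of problems found; an empty list means the file looks OK.

    size_bytes may be passed explicitly, otherwise it is read from disk.
    """
    problems = []
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported file type: {p.suffix or '(none)'}")
    if size_bytes is None:
        if not p.exists():
            problems.append(f"file not found: {p}")
            return problems
        size_bytes = p.stat().st_size
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the upload limit")
    return problems


print(check_upload("data.csv", size_bytes=1_000))
print(check_upload("notes.txt", size_bytes=1_000))
```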

Return Value

EngineResult

@dataclass
class EngineResult:
    run_id: str
    status: str                                    # "pending", "processing", "completed", "failed"
    summary: Summary | None                        # LLM-generated insights
    patterns: list[Pattern]                        # Discovered patterns (the core output)
    columns: list[Column]                          # Feature info and statistics
    feature_importance: FeatureImportance | None   # Global importance scores
    correlation_matrix: list[CorrelationEntry]     # Feature correlations
    report_url: str | None                         # Shareable link to interactive web report
    task: str | None                               # "regression", "binary_classification", "multiclass_classification"
    total_rows: int | None
    error_message: str | None

Pattern

@dataclass
class Pattern:
    id: str
    description: str                    # Human-readable description
    conditions: list[dict]              # Conditions defining the pattern
    p_value: float                      # FDR-adjusted p-value
    p_value_raw: float | None           # Raw p-value before adjustment
    novelty_type: str                   # "novel" or "confirmatory"
    novelty_explanation: str            # Why this is novel or confirmatory
    citations: list[dict]               # Academic citations
    target_change_direction: str        # "max" (increases target) or "min" (decreases)
    abs_target_change: float            # Magnitude of effect
    support_count: int                  # Rows matching this pattern
    support_percentage: float           # Percentage of dataset
    target_mean: float | None           # For regression tasks
    target_std: float | None

Pattern Conditions

Each condition in pattern.conditions is a dict with a type field:

Continuous condition — a numeric range:

{
    "type": "continuous",
    "feature": "age",
    "min_value": 45.0,
    "max_value": 65.0,
    "min_q": 0.35,   # quantile of min_value
    "max_q": 0.72    # quantile of max_value
}

Categorical condition — a set of values:

{
    "type": "categorical",
    "feature": "region",
    "values": ["north", "east"]
}

Datetime condition — a time range:

{
    "type": "datetime",
    "feature": "date",
    "min_value": 1609459200000,   # epoch ms
    "max_value": 1640995200000,
    "min_datetime": "2021-01-01", # human-readable
    "max_datetime": "2022-01-01"
}
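
Since every condition carries a `type` field, a small dispatcher can turn the three shapes above into readable text. This is a sketch rather than SDK functionality: the function names and the output wording are assumptions, and the code makes no claim about whether range bounds are inclusive.

```python
def describe_condition(cond: dict) -> str:
    """Render one condition dict (shapes as shown above) as a short phrase."""
    kind = cond.get("type")
    feature = cond["feature"]
    if kind == "continuous":
        return f"{feature} between {cond['min_value']} and {cond['max_value']}"
    if kind == "categorical":
        values = ", ".join(str(v) for v in cond["values"])
        return f"{feature} in {{{values}}}"
    if kind == "datetime":
        return f"{feature} from {cond['min_datetime']} to {cond['max_datetime']}"
    return f"{feature}: {cond}"  # unknown type: fall back to the raw dict


def describe_conditions(conditions: list) -> str:
    """Join all conditions of one pattern with AND."""
    return " AND ".join(describe_condition(c) for c in conditions)


example = [
    {"type": "continuous", "feature": "age", "min_value": 45.0, "max_value": 65.0},
    {"type": "categorical", "feature": "region", "values": ["north", "east"]},
]
print(describe_conditions(example))
```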

Summary

@dataclass
class Summary:
    overview: str                       # High-level summary of findings
    key_insights: list[str]             # Main takeaways
    novel_patterns: PatternGroup        # Novel pattern IDs and explanation

Column

@dataclass
class Column:
    id: str
    name: str
    display_name: str
    type: str                           # "continuous" or "categorical"
    data_type: str                      # "int", "float", "string", "boolean", "datetime"
    enabled: bool
    description: str | None
    mean: float | None
    median: float | None
    std: float | None
    min: float | None
    max: float | None
    feature_importance_score: float | None  # Signed importance score

FeatureImportance

Computed using Hierarchical Perturbation (HiPe), an ablation-based method. Scores are signed — positive means the feature increases the prediction, negative means it decreases it.

@dataclass
class FeatureImportance:
    kind: str                           # "global"
    baseline: float                     # Baseline model output
    scores: list[FeatureImportanceScore]

@dataclass
class FeatureImportanceScore:
    feature: str
    score: float                        # Signed importance score

Error Handling

from discovery import (
    Engine,
    AuthenticationError,
    InsufficientCreditsError,
    RateLimitError,
    RunFailedError,
    PaymentRequiredError,
)

try:
    result = await engine.discover(file="data.csv", target_column="target")
except AuthenticationError as e:
    print(e.suggestion)  # "Check your API key at https://disco.leap-labs.com/developers"
except InsufficientCreditsError as e:
    print(f"Need {e.credits_required}, have {e.credits_available}")
    print(e.suggestion)  # "Purchase credits or run publicly for free"
except RateLimitError as e:
    print(f"Retry after {e.retry_after} seconds")
except RunFailedError as e:
    print(f"Run {e.run_id} failed: {e}")
except TimeoutError:
    pass  # Retrieve later with engine.wait_for_completion(run_id)

All errors include a suggestion field with actionable instructions.
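
One way to act on `retry_after` is a small retry wrapper. The sketch below is not part of the SDK: `retry_on_rate_limit` is a hypothetical helper that takes the exception class as an argument, and the demo uses a stand-in `FakeRateLimitError` so it runs without network access. The only behaviour it relies on is the `retry_after` attribute shown above.

```python
import asyncio


async def retry_on_rate_limit(make_call, rate_limit_error, max_attempts=3):
    """Await make_call(), sleeping exc.retry_after seconds between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except rate_limit_error as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = getattr(exc, "retry_after", None)
            await asyncio.sleep(1.0 if delay is None else float(delay))


# --- Demo with stand-ins (hypothetical, not SDK classes) ---
class FakeRateLimitError(Exception):
    def __init__(self, retry_after):
        super().__init__("rate limited")
        self.retry_after = retry_after


calls = {"count": 0}


async def flaky_call():
    """Fail twice with a rate-limit error, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError(retry_after=0.01)
    return "ok"


outcome = asyncio.run(retry_on_rate_limit(flaky_call, FakeRateLimitError))
print(outcome, calls["count"])
```

With the real client, the call would presumably look like `await retry_on_rate_limit(lambda: engine.discover(file="data.csv", target_column="target"), RateLimitError)`.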

MCP Server

Discovery Engine is available as an MCP server with tools for the full discovery lifecycle — estimate, analyze, check status, get results, manage account.

{
  "mcpServers": {
    "discovery-engine": {
      "url": "https://disco.leap-labs.com/mcp",
      "env": { "DISCOVERY_API_KEY": "disco_..." }
    }
  }
}
