Python SDK for the Discovery Engine API

Discovery Engine Python API

The Discovery Engine Python API is a programmatic interface for running analyses from Python, as an alternative to the web dashboard. Instead of uploading datasets and configuring analyses through the UI, you can automate your discovery workflows from your own code or scripts.

All analyses run through the API are fully integrated with your Discovery Engine account. Results are automatically displayed in the dashboard, where you can view detailed reports, explore patterns, and share findings with your team. Your account management, credit balance, and subscription settings are all handled through the dashboard.

Installation

pip install discovery-engine-api

For pandas DataFrame support:

pip install "discovery-engine-api[pandas]"

For Jupyter notebook support:

pip install "discovery-engine-api[jupyter]"

This installs nest-asyncio, which is required to use engine.run() in Jupyter notebooks. Alternatively, you can use await engine.run_async() directly in Jupyter notebooks without installing the jupyter extra.
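
Why a synchronous wrapper needs help inside a notebook can be shown with the standard library alone. The sketch below assumes that engine.run() drives its coroutine with something like asyncio.run(), which is an assumption about the SDK's internals; fetch_value is a hypothetical stand-in for the real request. Jupyter already runs an event loop, and asyncio.run() refuses to start inside one, whereas a plain await works.

```python
import asyncio


async def fetch_value():
    # Hypothetical stand-in for an SDK coroutine such as engine.run_async().
    return 42


async def call_blocking_wrapper_inside_loop():
    # We are inside a running event loop here, as a Jupyter cell would be.
    coro = fetch_value()
    try:
        asyncio.run(coro)  # what a naive synchronous wrapper might do
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return None


async def call_with_await_inside_loop():
    # Awaiting directly is fine in a running loop.
    return await fetch_value()


error_message = asyncio.run(call_blocking_wrapper_inside_loop())
awaited_value = asyncio.run(call_with_await_inside_loop())
print(error_message)
print(awaited_value)
```

nest-asyncio patches the loop so that nested calls of this kind are allowed, which is why the jupyter extra makes engine.run() usable in a notebook cell.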

Configuration

API Keys

Get your API key from the Developers page in your Discovery Engine dashboard.
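
To keep the key out of source control, one option is to read it from an environment variable before constructing the Engine. This is a minimal sketch: the variable name DISCOVERY_API_KEY and the helper load_api_key are hypothetical, not something the SDK defines or reads on its own.

```python
import os


def load_api_key(env=None, var_name="DISCOVERY_API_KEY"):
    """Return the API key from an environment mapping, or raise if it is unset.

    DISCOVERY_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} before creating the Engine")
    return key


# Demonstration with an explicit mapping so the example runs anywhere.
api_key = load_api_key({"DISCOVERY_API_KEY": "your-api-key"})
print(len(api_key))
# engine = Engine(api_key=api_key)  # then construct the Engine as usual
```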

Quick Start

from discovery import Engine

# Initialize engine
engine = Engine(api_key="your-api-key")

# Run analysis on a dataset and wait for results
result = engine.run(
    file="data.csv",
    target_column="diagnosis",
    mode="fast",
    description="Rare diseases dataset",
    excluded_columns=["patient_id"],  # Exclude ID column from analysis
    wait=True  # Wait for completion and return full results
)

print(f"Run ID: {result.run_id}")
print(f"Status: {result.status}")
print(f"Found {len(result.patterns)} patterns")

Examples

Working with Pandas DataFrames

import pandas as pd
from discovery import Engine

df = pd.read_csv("data.csv")
# or create DataFrame directly

engine = Engine(api_key="your-api-key")
result = engine.run(
    file=df,  # Pass DataFrame directly
    target_column="outcome",
    column_descriptions={
        "age": "Patient age in years",
        "heart rate": None
    },
    excluded_columns=["id", "timestamp"],  # Exclude ID and timestamp columns from analysis
    wait=True
)

Async Workflow

import asyncio
from discovery import Engine

async def run_analysis():
    async with Engine(api_key="your-api-key") as engine:
        # Start analysis without waiting
        result = await engine.run_async(
            file="data.csv",
            target_column="target",
            wait=False
        )
        print(f"Started run: {result.run_id}")

        # Later, get results
        result = await engine.get_results(result.run_id)
        
        # Or wait for completion
        result = await engine.wait_for_completion(result.run_id, timeout=1200)
        return result

result = asyncio.run(run_analysis())
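
The start-then-wait pattern above amounts to polling a run's status until it reaches a terminal state or a deadline passes. The sketch below illustrates that idea with a hypothetical fake_get_status coroutine in place of the API. It is not the SDK's implementation of wait_for_completion; poll_until_done, the poll interval, and the set of terminal states are assumptions based on the status values documented for EngineResult.

```python
import asyncio

# Assumed terminal states, taken from the documented status values.
TERMINAL_STATES = {"completed", "failed"}


async def poll_until_done(get_status, run_id, timeout, interval=0.01):
    """Poll get_status(run_id) until a terminal state, or until timeout seconds pass."""

    async def _loop():
        while True:
            status = await get_status(run_id)
            if status in TERMINAL_STATES:
                return status
            await asyncio.sleep(interval)

    # asyncio.wait_for raises asyncio.TimeoutError if the deadline passes first.
    return await asyncio.wait_for(_loop(), timeout=timeout)


def make_fake_status(sequence):
    """Return a hypothetical status coroutine that walks through a fixed sequence."""
    remaining = list(sequence)

    async def fake_get_status(run_id):
        # Consume the sequence, then keep returning its last element.
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return fake_get_status


final_status = asyncio.run(
    poll_until_done(
        make_fake_status(["pending", "processing", "completed"]),
        "run-123",
        timeout=5,
    )
)
print(final_status)
```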

Using in Jupyter Notebooks

In Jupyter notebooks, you have two options:

Option 1: Install the jupyter extra (recommended)

pip install "discovery-engine-api[jupyter]"

Then use engine.run() as normal:

from discovery import Engine

engine = Engine(api_key="your-api-key")
result = engine.run(file="data.csv", target_column="target", wait=True)

Option 2: Use async directly

from discovery import Engine

engine = Engine(api_key="your-api-key")
result = await engine.run_async(file="data.csv", target_column="target", wait=True)

Configuration Options

The run() and run_async() methods accept the following parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | str, Path, or DataFrame | Required | Dataset file path or pandas DataFrame |
| target_column | str | Required | Name of column to predict |
| mode | "fast" / "deep" | "fast" | Analysis depth |
| title | str | None | Optional dataset title |
| description | str | None | Optional dataset description |
| column_descriptions | Dict[str, str] | None | Optional column name -> description mapping |
| excluded_columns | List[str] | None | Optional list of column names to exclude from analysis (e.g., IDs, timestamps) |
| visibility | "public" / "private" | "public" | Dataset visibility (private requires credits) |
| auto_report_use_llm_evals | bool | True | Use LLM for pattern descriptions |
| author | str | None | Optional dataset author attribution |
| source_url | str | None | Optional source URL for dataset attribution |
| wait | bool | False | Wait for analysis to complete and return full results |
| wait_timeout | float | None | Maximum seconds to wait for completion (only if wait=True) |
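
Because several of these options take a fixed set of values, it can help to check them locally before submitting a run. The following is a sketch only: build_run_kwargs is a hypothetical helper, not part of the SDK, and the check that the target column is not also excluded is this example's own sanity rule rather than documented behavior.

```python
ALLOWED_MODES = {"fast", "deep"}
ALLOWED_VISIBILITY = {"public", "private"}


def build_run_kwargs(file, target_column, *, mode="fast", visibility="public",
                     excluded_columns=None, wait=False, wait_timeout=None):
    """Check a few documented options locally and return keyword arguments for run()."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {sorted(ALLOWED_MODES)}, got {mode!r}")
    if visibility not in ALLOWED_VISIBILITY:
        raise ValueError(
            f"visibility must be one of {sorted(ALLOWED_VISIBILITY)}, got {visibility!r}"
        )
    excluded = list(excluded_columns or [])
    # This sketch's own sanity check, not a documented SDK rule.
    if target_column in excluded:
        raise ValueError("target_column is also listed in excluded_columns")

    kwargs = {
        "file": file,
        "target_column": target_column,
        "mode": mode,
        "visibility": visibility,
        "wait": wait,
    }
    if excluded:
        kwargs["excluded_columns"] = excluded
    # wait_timeout is documented as applying only when wait=True, so drop it otherwise.
    if wait and wait_timeout is not None:
        kwargs["wait_timeout"] = float(wait_timeout)
    return kwargs


run_kwargs = build_run_kwargs(
    "data.csv",
    "diagnosis",
    mode="deep",
    excluded_columns=["patient_id"],
    wait=True,
    wait_timeout=600,
)
print(run_kwargs)
# result = engine.run(**run_kwargs)
```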

Credits and Pricing

If you don't have enough credits for a private run, the SDK will raise an httpx.HTTPStatusError with an error message like:

Insufficient credits. You need X credits but only have Y available.

Solutions:

  1. Make your dataset public (set visibility="public") - completely free
  2. Visit https://disco.leap-labs.com/account to:
    • Purchase additional credits
    • Upgrade to a subscription plan that includes more credits
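
If you want to react to this error programmatically, the numbers can be pulled out of the message text. The sketch below assumes the message follows the wording shown above; parse_credit_shortfall is a hypothetical helper, and where the text appears on the caught httpx.HTTPStatusError (for example in exc.response.text) is also an assumption.

```python
import re

# Pattern built from the example message; the real wording may differ.
CREDIT_ERROR = re.compile(
    r"Insufficient credits\. You need (\d+(?:\.\d+)?) credits "
    r"but only have (\d+(?:\.\d+)?) available"
)


def parse_credit_shortfall(message):
    """Return (needed, available, shortfall) as floats, or None if the text does not match."""
    match = CREDIT_ERROR.search(message)
    if match is None:
        return None
    needed = float(match.group(1))
    available = float(match.group(2))
    return needed, available, needed - available


shortfall = parse_credit_shortfall(
    "Insufficient credits. You need 10 credits but only have 3 available."
)
print(shortfall)
```

A caller could use a non-None result to decide between retrying with visibility="public" and prompting for a credit purchase.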

Return Value

The run() and run_async() methods return an EngineResult object with the following fields:

EngineResult

@dataclass
class EngineResult:
    # Identifiers
    run_id: str                    # Unique run identifier
    report_id: Optional[str]       # Report ID (if report created)
    status: str                    # "pending", "processing", "completed", "failed"
    
    # Dataset metadata
    dataset_title: Optional[str]           # Dataset title
    dataset_description: Optional[str]    # Dataset description
    total_rows: Optional[int]              # Number of rows in dataset
    target_column: Optional[str]           # Name of target column
    task: Optional[str]                    # "regression", "binary_classification", or "multiclass_classification"
    
    # LLM-generated summary
    summary: Optional[Summary]             # Summary object with overview, insights, etc.
    
    # Discovered patterns
    patterns: List[Pattern]                 # List of discovered patterns
    
    # Column/feature information
    columns: List[Column]                  # List of columns with statistics and importance
    
    # Correlation matrix
    correlation_matrix: List[CorrelationEntry]  # Feature correlations
    
    # Global feature importance
    feature_importance: Optional[FeatureImportance]  # Feature importance scores
    
    # Job tracking
    job_id: Optional[str]           # Job ID for tracking processing
    job_status: Optional[str]      # Job status
    error_message: Optional[str]   # Error message if analysis failed
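
Since many fields are optional and a run started with wait=False may still be in progress, it is worth branching on status before reading results. A minimal sketch, using only the fields listed above; describe_run is a hypothetical helper, and SimpleNamespace objects stand in for real EngineResult instances so the example runs without the SDK.

```python
from types import SimpleNamespace


def describe_run(result):
    """Return a one-line summary of a result based on its status."""
    if result.status == "completed":
        return f"Run {result.run_id}: completed with {len(result.patterns)} patterns"
    if result.status == "failed":
        reason = result.error_message or "no error message provided"
        return f"Run {result.run_id}: failed ({reason})"
    # "pending" or "processing": results are not ready yet.
    return f"Run {result.run_id}: still {result.status}"


# Stand-ins for EngineResult objects, carrying only the fields used here.
done = SimpleNamespace(run_id="r1", status="completed", patterns=["p1", "p2"], error_message=None)
failed = SimpleNamespace(run_id="r2", status="failed", patterns=[], error_message="bad target")
running = SimpleNamespace(run_id="r3", status="processing", patterns=[], error_message=None)

for item in (done, failed, running):
    print(describe_run(item))
```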

Summary

@dataclass
class Summary:
    overview: str                          # High-level explanation of findings
    key_insights: List[str]                # List of main takeaways
    novel_patterns: PatternGroup           # Novel pattern explanations
    surprising_findings: PatternGroup      # Surprising findings
    statistically_significant: PatternGroup  # Statistically significant patterns
    data_insights: Optional[DataInsights]  # Important features, correlations
    selected_pattern_id: Optional[str]     # ID of selected pattern

Pattern

@dataclass
class Pattern:
    id: str                                # Pattern identifier
    task: str                              # Task type
    target_column: str                     # Target column name
    direction: str                         # "min" or "max"
    p_value: float                         # Statistical p-value
    conditions: List[Dict]                 # Pattern conditions (continuous, categorical, datetime)
    lift_value: float                      # Lift value (how much the pattern increases/decreases target)
    support_count: int                     # Number of rows matching pattern
    support_percentage: float              # Percentage of rows matching pattern
    pattern_type: str                      # "validated" or "speculative"
    novelty_type: str                      # "novel" or "confirmatory"
    target_score: float                    # Target score for this pattern
    description: str                       # Human-readable description
    novelty_explanation: str               # Explanation of novelty
    target_class: Optional[str]            # Target class (for classification)
    target_mean: Optional[float]           # Target mean (for regression)
    target_std: Optional[float]            # Target standard deviation
    citations: List[Dict]                  # Academic citations
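
A common next step is to narrow the pattern list to the most interesting entries. The sketch below filters on p_value, novelty_type, and support_count, then sorts by the magnitude of lift_value. PatternStub is a hypothetical stand-in holding a subset of the fields above, select_patterns is not an SDK function, and the 0.05 cutoff is a conventional default chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PatternStub:
    # Hypothetical stand-in with a subset of the documented Pattern fields.
    id: str
    p_value: float
    lift_value: float
    novelty_type: str
    support_count: int


def select_patterns(patterns, max_p_value=0.05, novelty_type="novel", min_support=1):
    """Keep patterns that pass all three filters, strongest lift (by magnitude) first."""
    kept = [
        p for p in patterns
        if p.p_value <= max_p_value
        and p.novelty_type == novelty_type
        and p.support_count >= min_support
    ]
    return sorted(kept, key=lambda p: abs(p.lift_value), reverse=True)


patterns = [
    PatternStub("a", 0.01, 1.5, "novel", 40),
    PatternStub("b", 0.20, 3.0, "novel", 10),          # fails the p-value filter
    PatternStub("c", 0.03, -2.0, "novel", 25),
    PatternStub("d", 0.001, 4.0, "confirmatory", 60),  # fails the novelty filter
]
selected_ids = [p.id for p in select_patterns(patterns)]
print(selected_ids)
```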

Column

@dataclass
class Column:
    id: str                                # Column identifier
    name: str                              # Column name
    display_name: str                      # Display name
    type: str                              # "continuous" or "categorical"
    data_type: str                         # "int", "float", "string", "boolean", "datetime"
    enabled: bool                          # Whether column is enabled
    description: Optional[str]              # Column description
    
    # Statistics
    mean: Optional[float]                  # Mean value
    median: Optional[float]                 # Median value
    std: Optional[float]                   # Standard deviation
    min: Optional[float]                   # Minimum value
    max: Optional[float]                   # Maximum value
    iqr_min: Optional[float]               # IQR minimum
    iqr_max: Optional[float]               # IQR maximum
    mode: Optional[str]                    # Mode value
    approx_unique: Optional[int]           # Approximate unique count
    null_percentage: Optional[float]      # Percentage of null values
    
    # Feature importance
    feature_importance_score: Optional[float]  # Feature importance score
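
The per-column feature_importance_score can be used to rank features. This sketch skips disabled columns and columns without a score, and assumes that a larger score means a more important feature, which the field list does not state explicitly. rank_columns is a hypothetical helper, and SimpleNamespace objects stand in for Column instances.

```python
from types import SimpleNamespace


def rank_columns(columns, top_n=None):
    """Return (name, score) pairs for enabled, scored columns, highest score first."""
    scored = [
        c for c in columns
        if c.enabled and c.feature_importance_score is not None
    ]
    ranked = sorted(scored, key=lambda c: c.feature_importance_score, reverse=True)
    # Slicing with None keeps the whole list.
    return [(c.name, c.feature_importance_score) for c in ranked[:top_n]]


# Stand-ins for Column objects, carrying only the fields used here.
columns = [
    SimpleNamespace(name="age", enabled=True, feature_importance_score=0.4),
    SimpleNamespace(name="patient_id", enabled=True, feature_importance_score=None),
    SimpleNamespace(name="bmi", enabled=True, feature_importance_score=0.7),
    SimpleNamespace(name="zip_code", enabled=False, feature_importance_score=0.9),
]
ranking = rank_columns(columns)
print(ranking)
```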

FeatureImportance

@dataclass
class FeatureImportance:
    kind: str                              # Feature importance type: "global" 
    baseline: float                        # Baseline model output
    scores: List[FeatureImportanceScore]   # List of feature scores

CorrelationEntry

@dataclass
class CorrelationEntry:
    feature_x: str                         # First feature name
    feature_y: str                         # Second feature name
    value: float                           # Correlation value (-1 to 1)
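
A flat list of entries like this is easy to scan for the strongest relationships. The sketch below drops self-correlations and mirrored duplicates before applying a threshold on the absolute value. Whether the SDK's list actually contains diagonal or mirrored entries is an assumption, so the de-duplication is defensive; strongest_correlations is a hypothetical helper, and SimpleNamespace objects stand in for CorrelationEntry instances.

```python
from types import SimpleNamespace


def strongest_correlations(entries, threshold=0.5):
    """Return unique (feature_x, feature_y, value) triples with |value| >= threshold."""
    seen = set()
    pairs = []
    for entry in entries:
        if entry.feature_x == entry.feature_y:
            continue  # skip the diagonal
        key = frozenset((entry.feature_x, entry.feature_y))
        if key in seen:
            continue  # skip the mirrored (y, x) copy of a pair already seen
        seen.add(key)
        if abs(entry.value) >= threshold:
            pairs.append((entry.feature_x, entry.feature_y, entry.value))
    return sorted(pairs, key=lambda triple: abs(triple[2]), reverse=True)


# Stand-ins for CorrelationEntry objects.
entries = [
    SimpleNamespace(feature_x="age", feature_y="age", value=1.0),
    SimpleNamespace(feature_x="age", feature_y="bmi", value=0.3),
    SimpleNamespace(feature_x="bmi", feature_y="age", value=0.3),
    SimpleNamespace(feature_x="age", feature_y="bp", value=-0.8),
    SimpleNamespace(feature_x="bmi", feature_y="bp", value=0.6),
]
strong = strongest_correlations(entries)
print(strong)
```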

Pattern

@dataclass
class Pattern:
    id: str
    task: str                   # regression/classification
    target_column: str          # target column
    direction: str              # "min" or "max"
    p_value: float              # p-value
    conditions: List[Dict]      # Continuous, categorical, or datetime conditions
    lift_value: float           # change in average target
    support_count: int          # how many rows contain the pattern
    support_percentage: float   # how many rows contain the pattern as % of total data
    novelty_type: str           # "novel" or "confirmatory"
    target_score: float         # effect size
    description: str            # short description of the pattern
    novelty_explanation: str    # generated explanation
    citations: List[Dict]       # relevant literature
