
Xelytics-Core

Python package for automated analytics with a lazy, graph-aware execution engine.


Status: v0.3.0 documentation update in progress | v0.2.x APIs remain supported while the lazy graph execution model becomes the recommended path.


What It Does

Xelytics-Core is a zero-configuration analytics engine that analyzes your data and produces professional insights, statistical tests, interactive visualizations, and predictions, all with a single function call.

One-line analysis:

from xelytics import analyze
import pandas as pd

df = pd.read_csv("data.csv")
result = analyze(df)  # That's it!

for insight in result.insights:
    print(f"๐Ÿ“Š {insight.title}: {insight.description}")

Output includes:

  • ✅ 50+ statistical tests (parametric & non-parametric)
  • ✅ Time series decomposition & forecasting (ARIMA, Exponential Smoothing)
  • ✅ Anomaly detection & change point detection
  • ✅ Clustering analysis (K-Means, DBSCAN, Hierarchical)
  • ✅ Interactive Plotly visualizations
  • ✅ Human-readable insights (with optional LLM narration)
  • ✅ Professional HTML, PDF, PowerPoint, and Jupyter reports

What's New in v0.3.0

v0.3.0 evolves Xelytics-Core from an eager, mostly linear DataFrame analysis pipeline into a lazy, graph-aware analytics engine. The existing analyze(df) workflow is still supported for v0.2.x compatibility, while new projects should prefer the chainable Xelytics API when they need lazy data binding, execution planning, SQL pushdown, lineage, or plugin extension points.

Area | v0.2.x Behavior | v0.3.0 Behavior | Compatibility
Entry point | analyze(df) runs the pipeline directly | Xelytics().dataset(df).analyze().run() builds then executes a plan | analyze(df) remains supported
Data model | DataFrame-first; connector results usually materialized | Unified Dataset abstraction with materialized and lazy datasets | Existing DataFrame inputs still work
Execution | Eager pipeline with optional parallel tasks | Lazy ExecutionPlan DAG with scan, transform, and analysis nodes | Eager behavior is preserved through the legacy API
SQL sources | Query first, then analyze the returned DataFrame | Filter/project nodes can be pushed into SQL where supported | Connector APIs remain available
Transformations | Custom pipeline steps execute before analysis | Transformations can be represented as graph nodes | Pipeline remains supported
Caching | Result/intermediate cache for analysis stages | Node-level cache support for transformation graph nodes | Existing cache backends remain supported
Metadata | Run metadata plus optional sampling/parallel fields | Adds trace, profiling, lineage, cache, and analyzer outputs | Existing result fields remain stable
Extensibility | Pipelines, exporters, LLM providers | Registries for analyzers, transformations, and output formats | Existing extension patterns remain valid

Legacy API (v0.2.x Compatible)

from xelytics import analyze

result = analyze(df)

Recommended v0.3.0 API

from xelytics import Xelytics

result = (
    Xelytics()
      .dataset(df)
      .filter("revenue > 1000")
      .analyze()
      .run()
)

load_dataframe(df) is also available as an explicit DataFrame-loading alias in the current implementation; the documentation uses dataset(df) as the recommended v0.3.0 entry point.


Optional Extras

The following pip extras gate optional dependencies (install with, e.g., pip install "xelytics-core[advanced,llm]"):

Extra | Provides
advanced | advanced time series dependencies such as ruptures and pmdarima
connectors | database, cloud storage, Excel, and Parquet connector dependencies
export | PDF, PowerPoint, notebook, and static chart export dependencies
llm | OpenAI and Groq provider dependencies
large_data | Dask dataframe support
dev | test, lint, type-check, and formatting tools

Quick Start

v0.2-Compatible API

Use this for simple one-shot DataFrame analysis.

import pandas as pd
from xelytics import AnalysisConfig, analyze

df = pd.read_csv("sales.csv")

config = AnalysisConfig(
    enable_llm_insights=False,
    generate_visualizations=False,
)

result = analyze(df, config=config)

print(result.summary.row_count)
print(result.metadata.tests_executed)

for insight in result.insights[:5]:
    print(f"{insight.severity.value}: {insight.title}")

result.export_to("analysis.json")

Recommended v0.3.0 API

Use the chainable API when you want to bind data first, record operations, and execute only when .run() is called.

import pandas as pd
from xelytics import AnalysisConfig, Xelytics

df = pd.read_csv("sales.csv")

result = (
    Xelytics(config=AnalysisConfig(enable_llm_insights=False))
      .dataset(df)
      .filter("revenue > 1000")
      .analyze()
      .run()
)

print(result.summary.row_count)
print(result.trace.print_trace() if result.trace else "No trace")

load_dataframe(df) and from_dataset(dataset) are also available as aliases for explicit binding.
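A minimal sketch of the two aliases, assuming both are chainable Xelytics methods (the MaterializedDataset import path is taken from the transformation-graph example later in this README):

from xelytics import Xelytics
from xelytics.dataset import MaterializedDataset

# Both chains are assumed equivalent to .dataset(df) above.
r1 = Xelytics().load_dataframe(df).analyze().run()
r2 = Xelytics().from_dataset(MaterializedDataset(df)).analyze().run()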

The full runnable notebook for this release is examples/xelytics_core_v0_3_0_complete.ipynb. It uses generated data and local files only, so it can be executed without API keys or database credentials.

What Changed in v0.3.0

Area | v0.2.x | v0.3.0
Entry point | analyze(df) | analyze(df) still works; Xelytics().dataset(df).analyze().run() is recommended for lazy workflows
Data model | DataFrame-first | Dataset, MaterializedDataset, LazyDataset, and TransformedDataset
Execution | eager pipeline | ExecutionPlan, PlanNode, PlanBuilder, and DAG execution
Connectors | mostly materialized DataFrames | database connectors can back lazy datasets
SQL behavior | query then analyze | filter/project/limit plan nodes can use SQL pushdown when supported
Transformations | eager Pipeline preprocessing | TransformGraph, graph nodes, node cache, and lineage APIs
Analysis outputs | stats, visualizations, insights, time series, clustering | adds correlation, trend_anomaly, and segmentation analyzer outputs
Observability | logs and metadata | TraceCollector and ExecutionProfiler attached to results
Extensibility | pipelines/exporters/providers | registries for analyzers, transformations, and output formats
Compatibility | v0.2.x public API | no public v0.2.x API removed

See MIGRATION_GUIDE_v0.2_to_v0.3.md for the full migration guide.

Implemented v0.3.0 Story Map

The v0.3.0 implementation is organized around the story set in aidlc-docs/inception/v0.3.0.

Epic | Implemented surface | Main modules
Epic 1: Data Connectivity Engine | source abstraction, schema inference, lazy data binding, connector timeouts/retries/sampling hints | xelytics.dataset, xelytics.schemas.schema, xelytics.connectors, xelytics.schemas.config
Epic 2: Execution Engine | execution plans, lazy execution, SQL pushdown helpers, chunked planning support | xelytics.execution, xelytics.engine
Epic 3: Transformation Graph Engine | graph nodes, graph execution, node cache, schema hooks, lineage records | xelytics.graph
Epic 4: Analysis and Insight Engine | profiling, correlation, trend/anomaly, segmentation, ranked and deduplicated insights | xelytics.analyzers, xelytics.insights
Epic 5: Output Layer and Python API | structured JSON, optional visualizations, chainable Xelytics API, result export | xelytics.api, xelytics.schemas.outputs, xelytics.export
Epic 6: Observability and Debugging | execution logs, trace collection, node profiling, trace/profile serialization | xelytics.observability, xelytics.engine
Epic 7: Extensibility System | custom analyzer, transformation, and output-format registries | xelytics.extension

Regression coverage for these surfaces lives in tests/test_epic1_connectivity.py through tests/test_epic7.py, plus compatibility tests for earlier APIs.

Feature Overview

Capability | Status
Automatic statistical test planning and execution | supported
Dataset summaries and column profiling | supported
Rule-based insights and ranked insights | supported
Plotly-compatible visualization specs | supported
Time series detection, decomposition, forecasting, anomalies, and change points | supported through xelytics.timeseries and v0.3 analyzer outputs
K-Means, DBSCAN, hierarchical clustering, and cluster profiling | supported
PostgreSQL, MySQL, SQLite, Snowflake, BigQuery, S3, Azure Blob, GCS, and file connectors | supported through optional extras
File and Redis caching | supported
Large dataset summary and sample analysis | supported through analyze_large_dataset()
HTML, PDF, PowerPoint, Jupyter notebook, and JSON export | supported through xelytics.export
CLI for CSV and Excel analysis | supported through the xelytics command
Optional LLM provider integrations | OpenAI and Groq dependencies available through the llm extra

1️⃣ Statistical Analysis

Automated statistical testing over your DataFrame. Control which columns are analyzed:

# Define which columns to analyze
config = AnalysisConfig(
    include_columns=["age", "income", "purchase_frequency"],
    exclude_columns=["customer_id", "timestamp"],
    categorical_max_categories=50,  # Skip columns with >50 unique values
)

result = analyze(df, config=config)

Statistics Covered (a minimal result-reading sketch follows the list):

  • ✅ Descriptive: mean, median, variance, skewness, kurtosis
  • ✅ Hypothesis tests: t-tests, ANOVA, Welch's test, Mann-Whitney U, Kruskal-Wallis
  • ✅ Correlation: Pearson, Spearman, Kendall Tau
  • ✅ Chi-square tests for categorical associations
  • ✅ Effect sizes: Cohen's d, Cramér's V, Eta-squared
  • ✅ Assumption checks: Normality (Shapiro-Wilk), Homogeneity of variance (Levene)
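The per-test objects live on result.statistics, a stable field per the compatibility guarantee later in this README; their individual attributes are not spelled out here, so this sketch just prints each entry as-is:

result = analyze(df, config=config)

# result.statistics is named in the compatibility guarantee below;
# per-test field names are not documented here, so print entries whole.
for test in result.statistics[:10]:
    print(test)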

2๏ธโƒฃ Time Series Analysis (NEW in v0.2.0)

Complete time series toolkit: detection, decomposition, forecasting, anomalies.

Time Series Detection

from xelytics import analyze, AnalysisConfig

# Option 1: Auto-detect time series columns
config = AnalysisConfig(enable_time_series=True)
result = analyze(df, config=config)

# Option 2: Specify datetime column
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
)
result = analyze(df, config=config)

# Check which columns were detected as time series
for ts in result.time_series_analysis:
    print(f"{ts.column_name}:")
    print(f"  Type: {ts.series_type.value}")
    print(f"  Frequency: {ts.frequency}")
    print(f"  Has trend: {ts.has_trend}")
    print(f"  Has seasonality: {ts.has_seasonality}")
    if ts.has_seasonality:
        print(f"  Seasonal period: {ts.seasonal_period}")

Time Series Decomposition

# Automatically decompose into trend, seasonal, residual
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    decomposition_method="additive",  # or "multiplicative", "stl"
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.decomposition:
        print(f"{ts.column_name} decomposition:")
        print(f"  Trend strength: {ts.decomposition.trend_strength:.3f}")
        print(f"  Seasonal strength: {ts.decomposition.seasonal_strength:.3f}")

Forecasting

# ARIMA and Exponential Smoothing forecasting
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    forecast_periods=30,  # Forecast next 30 periods
    forecast_methods=["arima", "exponential_smoothing"],
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.forecasts:
        print(f"\n{ts.column_name} - Next 30 periods forecast:")
        for forecast in ts.forecasts[:5]:  # Show first 5
            print(f"  Period {forecast.period}: {forecast.value:.2f} "
                  f"(95% CI: {forecast.lower_bound:.2f}-{forecast.upper_bound:.2f})")

Anomaly Detection

# Multiple detection methods: Z-score, IQR, MAD, Isolation Forest
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,  # 95th percentile threshold
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.anomalies:
        print(f"\n{ts.column_name} - Anomalies detected:")
        for anomaly in ts.anomalies[:3]:
            print(f"  Index {anomaly.index}: {anomaly.value:.2f} "
                  f"(severity: {anomaly.severity}, confidence: {anomaly.confidence:.2f})")

Change Point Detection

# Detect structural breaks (CUSUM algorithm)
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    detect_change_points=True,
    change_point_sensitivity=0.05,
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.change_points:
        print(f"\n{ts.column_name} - Change points:")
        for cp in ts.change_points:
            print(f"  At index {cp.index}: magnitude={cp.magnitude:.2f}, "
                  f"confidence={cp.confidence:.2f}")

3๏ธโƒฃ Clustering & Segmentation (NEW in v0.2.0)

Unsupervised learning for customer segmentation, market clustering, etc.

Basic Clustering

from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=8,
    exclude_columns=["customer_id", "name"],
)
result = analyze(df, config=config)

# View clusters
print(f"Algorithm used: {result.clusters[0].algorithm}")
for cluster in result.clusters:
    print(f"\nCluster {cluster.cluster_id}:")
    print(f"  Size: {cluster.size} members ({cluster.size/result.summary.row_count*100:.1f}%)")
    print(f"  Silhouette score: {cluster.silhouette_score:.3f}")
    print(f"  Profile: {cluster.profile}")

K-Means (with Automatic K Selection)

# K-Means tries multiple K values and picks the best
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=10,
    k_selection_method="elbow",  # elbow, silhouette, gap_statistic
)
result = analyze(df, config=config)

# View metrics for each K
for cluster in result.clusters:
    print(f"K={cluster.algorithm_params['n_clusters']}: "
          f"silhouette={cluster.silhouette_score:.3f}")

DBSCAN (Density-Based)

# DBSCAN finds natural clusters and noise points
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="dbscan",
    dbscan_eps=0.5,  # Auto-estimated if not provided
    dbscan_min_samples=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    noise_label = "Noise" if cluster.cluster_id == -1 else f"Cluster {cluster.cluster_id}"
    print(f"{noise_label}: {cluster.size} points")

Hierarchical Clustering

# Produces dendrograms and tree-based clusters
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="hierarchical",
    hierarchical_linkage="ward",  # ward, complete, average, single
    max_clusters=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    print(f"Cluster {cluster.cluster_id}: {cluster.size} members")

4๏ธโƒฃ Data Connectors (NEW in v0.2.0)

Analyze data directly from databases and cloud storageโ€”no manual data export needed.

File Connector

from pathlib import Path
from xelytics.connectors import connect_to_source

output_dir = Path(".cache/xelytics_readme")
output_dir.mkdir(parents=True, exist_ok=True)

csv_path = output_dir / "sales.csv"
df.to_csv(csv_path, index=False)

file_dataset = connect_to_source("file", path=str(csv_path))
print(file_dataset.to_pandas().head())

Database pattern (PostgreSQL):

from xelytics import AnalysisConfig, Xelytics

result = (
    Xelytics(config=AnalysisConfig(enable_llm_insights=False))
      .connect(
          "postgresql",
          host="localhost",
          database="analytics",
          user="reader",
          password="secret",
          query="SELECT * FROM sales",
      )
      .filter("revenue > 1000")
      .analyze()
      .run()
)

Cache APIs

API | Purpose
Cache(backend="file", **kwargs) | Direct cache instance
Cache.get(key) | Read cached value
Cache.set(key, value, ttl=None) | Store cached value
Cache.delete(key) | Delete key
Cache.clear(pattern=None) | Clear backend
Cache.cached(ttl=None) | Decorator for function caching
get_cache(backend, **kwargs) | Create/get global cache
clear_cache(pattern=None) | Clear global cache
NodeCache.get(node_id, input_dfs, func) | Read transform-node output
NodeCache.set(node_id, input_dfs, func, result) | Store transform-node output
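A short sketch against the table above; the Cache import path and the backend keyword arguments are assumptions (cache_dir mirrors the FileCache example later in this README):

from xelytics.cache import Cache  # import path assumed

cache = Cache(backend="file", cache_dir=".cache/xelytics")  # kwargs assumed
cache.set("profile:sales", {"rows": 150_432}, ttl=3600)
print(cache.get("profile:sales"))

# Decorator form from the table
@cache.cached(ttl=600)
def summarize(path):
    import pandas as pd
    return pd.read_csv(path).describe()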
BigQuery

connector = connect_to_source(
    source_type="bigquery",
    project_id="my-project",
    credentials_path="/path/to/service-account.json",
)

df = connector.query("""
    SELECT * FROM `my-project.dataset.events`
    WHERE event_date >= '2025-01-01'
    LIMIT 100000
""")
result = analyze(df)

Snowflake

import os

connector = connect_to_source(
    source_type="snowflake",
    account="xy12345",
    warehouse="COMPUTE",
    database="ANALYTICS",
    schema="PUBLIC",
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
)

df = connector.query("SELECT * FROM CUSTOMER_DATA")
result = analyze(df)

S3 / Cloud Storage

import os

# Amazon S3
connector = connect_to_source(
    source_type="s3",
    bucket="my-analytics-bucket",
    key="data/sales.parquet",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
)
df = connector.query()  # Returns DataFrame
result = analyze(df)

# Azure Blob Storage
connector = connect_to_source(
    source_type="azure_blob",
    container_name="data",
    blob_name="sales.csv",
    connection_string=os.getenv("AZURE_CONN_STRING"),
)
df = connector.query()
result = analyze(df)

# Google Cloud Storage
connector = connect_to_source(
    source_type="gcs",
    bucket="my-bucket",
    key="data/sales.csv",
    credentials_path="/path/to/gcp-key.json",
)
df = connector.query()
result = analyze(df)

5๏ธโƒฃ Report Generation (NEW in v0.2.0)

Generate professional, interactive reports in multiple formats.

HTML Report

from xelytics.pipeline import Pipeline, correlation_analysis, normalize, pca, remove_outliers

pipeline = Pipeline(name="demo")
pipeline.add_step(
    remove_outliers,
    name="remove_outliers",
    inputs=["df"],
    outputs=["df"],
    columns=["revenue"],
    method="iqr",
    threshold=3.0,
)
pipeline.add_step(
    normalize,
    name="normalize",
    inputs=["df"],
    outputs=["normalized"],
    columns=["revenue", "cost"],
    method="minmax",
)

context = pipeline.execute({"df": df})
print(context["normalized"].head())
print(pca(df[["revenue", "cost"]], n_components=2).head())
print(correlation_analysis(df[["revenue", "cost", "profit"]]))

Transformation Graph, Lineage, Trace, and Profiling

from xelytics.dataset import MaterializedDataset
from xelytics.graph.graph import TransformGraph
from xelytics.graph.lineage import LineageTracker
from xelytics.graph.node import DataSourceNode, TransformNode
from xelytics.observability.profiler import ExecutionProfiler
from xelytics.observability.trace import TraceCollector, TraceEntry

graph = TransformGraph()
graph.add_node(DataSourceNode(id="source", dataset=MaterializedDataset(df)))
graph.add_node(
    TransformNode(
        id="filter",
        name="filter",
        func=lambda frame: frame.query("revenue > 1000"),
        inputs=["source"],
    )
)
graph.add_edge("source", "filter")
graph.validate()
graph_df = graph.run()

lineage = LineageTracker()
lineage.record_execution("filter", {"source": "hash-a"}, "hash-b", 12.5)
print(lineage.get_record("filter"))
lineage.clear()

trace = TraceCollector()
trace.add(TraceEntry(step_name="demo", row_count=len(graph_df)))
print(trace.print_trace())

profiler = ExecutionProfiler()
profiler.start("node")
profiler.stop("node", operation="demo", rows_fetched=len(graph_df))
print(profiler.print_profile())
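NodeCache from the Cache APIs table can wrap graph-node execution by hand; a hypothetical sketch, assuming a no-argument constructor and a dict of input DataFrames (only the get/set signatures come from the table):

from xelytics.graph.cache import NodeCache  # module path from the Module Breakdown below

node_cache = NodeCache()  # constructor arguments assumed

def filter_fn(frame):
    return frame.query("revenue > 1000")

# get/set signatures taken from the Cache APIs table above
cached = node_cache.get("filter", {"source": df}, filter_fn)
if cached is None:
    cached = filter_fn(df)
    node_cache.set("filter", {"source": df}, filter_fn, cached)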

JSON Export

import json

# For programmatic access or storage
with open("analysis.json", "w") as f:
    json.dump(result.to_dict(), f, indent=2)

# Later, reconstruct from JSON
from xelytics.schemas.outputs import AnalysisResult
with open("analysis.json") as f:
    data = json.load(f)
    result = AnalysisResult(**data)

6๏ธโƒฃ Custom Pipelines (NEW in v0.2.0)

Pre-process data with custom steps before analysis.

from xelytics import AnalysisConfig, analyze
from xelytics.pipeline import Pipeline, normalize, remove_outliers

# Build a custom pipeline (same add_step API as the eager pipeline demo above)
pipeline = Pipeline(name="preprocess")
pipeline.add_step(
    remove_outliers,
    name="remove_outliers",
    inputs=["df"],
    outputs=["df"],
    method="iqr",
    threshold=1.5,
)
pipeline.add_step(
    normalize,
    name="normalize",
    inputs=["df"],
    outputs=["df"],
    method="minmax",
)

# Apply before analysis; pca and correlation_analysis can also be
# called directly, as shown in the pipeline demo above
df_processed = pipeline.execute({"df": df})["df"]
result = analyze(df_processed)

# Or attach the pipeline through AnalysisConfig
config = AnalysisConfig(
    run_custom_pipeline=True,
    custom_pipeline=pipeline,
)
result = analyze(df, config=config)

7๏ธโƒฃ Caching (NEW in v0.2.0)

Speed up repeated analyses on the same data.

File-Based Cache

from xelytics import analyze, AnalysisConfig
from xelytics.cache import FileCache

cache = FileCache(cache_dir="./cache")

config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

# First run: takes full time
result1 = analyze(df, config=config)

# Subsequent runs on same data: instant
result2 = analyze(df, config=config)  # Retrieved from cache!

Redis Cache (Distributed)

from xelytics.cache import RedisCache

cache = RedisCache(host="localhost", port=6379, db=0, ttl=3600)

config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

result = analyze(df, config=config)

Clear Cache

from xelytics.cache import clear_cache

# Clear all caches
clear_cache(pattern="*")

# Clear specific patterns
clear_cache(pattern="stats:*")  # Only clear stats caches

8๏ธโƒฃ CLI (Command-Line Interface)

Analyze without writing Python code.

# Basic analysis - outputs JSON
xelytics analyze data.csv

# Save to file
xelytics analyze data.csv --output results.json

# Set parameters
xelytics analyze data.csv \
  --format=json \
  --alpha 0.01 \
  --no-llm \
  --max-visualizations 20 \
  --datetime-column "date"

# Time series analysis
xelytics analyze data.csv \
  --enable-time-series \
  --datetime-column "date" \
  --forecast-periods 30

# Clustering
xelytics analyze data.csv \
  --enable-clustering \
  --clustering-algorithm kmeans \
  --max-clusters 5

# Show version
xelytics --version

# Help
xelytics --help

9๏ธโƒฃ LLM Integration (Optional)

Enhance insights with AI narration.

import os

from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",  # openai, groq, or azure (see below)
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

result = analyze(df, config=config)

# Insights now include AI-generated descriptions
for insight in result.insights:
    print(f"{insight.title}")
    print(f"  ๐Ÿ“ {insight.narrative}")  # AI-generated explanation

Multiple LLM Providers

# OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# Groq (fast inference for open models)
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="groq",
    llm_model="mixtral-8x7b",
    llm_api_key=os.getenv("GROQ_API_KEY"),
)

# Azure OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="azure",
    llm_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    llm_api_key=os.getenv("AZURE_OPENAI_KEY"),
)

🔟 Sampling & Parallel Execution

from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    # Auto-sample if > 1M rows
    sampling_strategy="auto",
    max_rows=1_000_000,

    # Or force sampling instead:
    # sampling_strategy="stratified",
    # sample_size=100_000,

    # Parallel execution
    parallel_execution=True,
    max_workers=4,
)

result = analyze(df, config=config)

Chunked Processing for Very Large Files

from xelytics import AnalysisConfig
from xelytics.engine import analyze_large_dataset

# Process 10M row file without loading into memory
result = analyze_large_dataset(
    source="huge_sales_data.csv",
    chunksize=50_000,
    sample_size=100_000,  # Take a sample for full analysis
    config=AnalysisConfig(),
)

โš™๏ธ Configuration Reference

from xelytics import AnalysisConfig

config = AnalysisConfig(
    # General
    significance_level=0.05,
    mode="automated",  # automated or semi-automated
    
    # Columns
    include_columns=None,  # [list] Include only these columns
    exclude_columns=None,  # [list] Exclude these columns
    datetime_column=None,  # [str] Column name for time series
    
    # Time Series
    enable_time_series=False,
    decomposition_method="additive",  # additive, multiplicative, stl
    forecast_periods=0,
    forecast_methods=["arima", "exponential_smoothing"],
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,
    detect_change_points=False,
    
    # Clustering
    enable_clustering=False,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=10,
    k_selection_method="elbow",
    
    # Performance
    parallel_execution=True,
    max_workers=4,
    sampling_strategy="auto",
    max_rows=1_000_000,
    
    # Caching
    enable_caching=False,
    cache_backend=None,
    
    # Reporting
    max_visualizations=15,
    run_custom_pipeline=False,
    custom_pipeline=None,
    
    # LLM
    enable_llm_insights=False,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=None,
    
    # Other
    random_seed=42,
    verbose=True,
)

Usage Examples

End-to-End Example

import json
import os
from datetime import datetime

import pandas as pd

from xelytics import AnalysisConfig, analyze
# Report helpers; import paths assumed from the xelytics.export package
from xelytics.export import HTMLReportGenerator, generate_pdf_report

# 1. LOAD DATA
print("📁 Loading data...")
df = pd.read_csv("sales.csv")
print(f"✓ Loaded {len(df):,} rows")

# 2. CONFIGURE ANALYSIS
print("\n⚙️  Configuring analysis...")
config = AnalysisConfig(
    significance_level=0.01,
    enable_time_series=True,
    datetime_column="date",
    forecast_periods=14,
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=5,
    parallel_execution=True,
    enable_caching=True,

    # Reporting
    max_visualizations=20,
    enable_llm_insights=True,
    llm_provider="openai",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# 3. RUN ANALYSIS
print("\n🔍 Running analysis...")
result = analyze(df, config=config)

# 4. EXPLORE RESULTS
print(f"\n✓ Analysis complete in {result.metadata.execution_time_ms}ms")
print(f"  • Tests: {result.metadata.tests_executed}")
print(f"  • Visualizations: {len(result.visualizations)}")
print(f"  • Insights: {len(result.insights)}")
print(f"  • Time series: {len(result.time_series_analysis)}")
print(f"  • Clusters: {len(result.clusters)}")

print("\n📊 Key Insights:")
for i, insight in enumerate(result.insights[:5], 1):
    print(f"  {i}. {insight.title}")
    if hasattr(insight, 'narrative'):
        print(f"     {insight.narrative[:100]}...")

# 5. GENERATE REPORTS
print("\n📄 Generating reports...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# HTML Report
html_generator = HTMLReportGenerator(
    theme="light",
    logo_text="Sales Analytics",
    company_name="ACME Corp"
)
html = html_generator.generate(
    result,
    title="Sales Analysis Report",
    author="Data Science Team"
)
html_path = f"reports/sales_analysis_{timestamp}.html"
os.makedirs("reports", exist_ok=True)
with open(html_path, "w") as f:
    f.write(html)
print(f"  โœ“ HTML: {html_path}")

# PDF Report
pdf_bytes = generate_pdf_report(
    result,
    title="Sales Analysis Report",
    author="Data Science Team"
)
pdf_path = f"reports/sales_analysis_{timestamp}.pdf"
with open(pdf_path, "wb") as f:
    f.write(pdf_bytes)
print(f"  โœ“ PDF:  {pdf_path}")

# JSON Export (json imported at the top of the script)
json_path = f"reports/sales_analysis_{timestamp}.json"
with open(json_path, "w") as f:
    json.dump(result.to_dict(), f, indent=2)
print(f"  ✓ JSON: {json_path}")

print("\nโœ… Analysis complete!")
print(f"Reports saved to: {os.path.abspath('reports')}")

Output:

๐Ÿ“ Loading data...
โœ“ Loaded 150,432 rows

โš™๏ธ  Configuring analysis...

๐Ÿ” Running analysis...

โœ“ Analysis complete in 3421ms
  โ€ข Tests: 47
  โ€ข Visualizations: 18
  โ€ข Insights: 12
  โ€ข Time Series Series: 2
  โ€ข Clusters: 5

๐Ÿ“Š Key Insights:
  1. Significant correlation detected: total_amount vs. customer_age
  2. Strong seasonality in Q4 sales
  3. Customer segmentation: 5 distinct groups identified
  4. Outliers detected in unit_price column
  5. Increasing trend in repeat customer rate

๐Ÿ“„ Generating reports...
  โœ“ HTML: reports/sales_analysis_20250307_143021.html
  โœ“ PDF:  reports/sales_analysis_20250307_143021.pdf
  โœ“ JSON: reports/sales_analysis_20250307_143021.json

โœ… Analysis complete!
Reports saved to: /home/user/reports

📈 Performance & Scaling

Dataset Size | Processing Time | Max Parallel Tasks
10K rows | 1–2 seconds | 3
100K rows | 5–10 seconds | 4
1M rows | 30–60 seconds | 4
10M rows | 3–5 minutes | 4 (chunked)
100M rows | 10–30 minutes | 4 (chunked + sampled)

Optimization Strategies:

  • ✅ Automatic sampling for datasets > 1M rows
  • ✅ Parallel execution (4 workers by default)
  • ✅ Result caching (file or Redis)
  • ✅ Progress callbacks for long-running analyses (see the sketch below)
  • ✅ Memory-aware warnings (logs a warning when estimated usage exceeds 1GB)
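Progress callbacks are listed above but not demonstrated anywhere in this README; a hypothetical sketch, assuming AnalysisConfig accepts a progress_callback callable (the field name and its (stage, fraction) signature are assumptions, not confirmed API):

from xelytics import AnalysisConfig, analyze

# Hypothetical: progress_callback and its signature are assumptions.
def on_progress(stage: str, fraction: float) -> None:
    print(f"[{fraction:6.1%}] {stage}")

config = AnalysisConfig(progress_callback=on_progress)
result = analyze(df, config=config)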

📊 Feature Comparison

Feature | v0.1.0 | v0.2.0
Statistical Analysis | ✅ | ✅
Automated test selection | ✅ | ✅
Effect size calculation | ✅ | ✅
Assumption checking | ✅ | ✅
Time Series (NEW) | — | ✅
Detection & decomposition | — | ✅
ARIMA & ES forecasting | — | ✅
Anomaly detection | — | ✅
Change point detection | — | ✅
Clustering (NEW) | — | ✅
K-Means | — | ✅
DBSCAN | — | ✅
Hierarchical | — | ✅
Cluster profiling | — | ✅
Performance (NEW) | — | ✅
Parallel execution | — | ✅
Result caching | — | ✅
Sampling strategies | — | ✅
Chunked processing | — | ✅
Connectors (NEW) | — | ✅
PostgreSQL | — | ✅
MySQL/MariaDB | — | ✅
SQLite | — | ✅
BigQuery | — | ✅
Snowflake | — | ✅
S3/Azure/GCS | — | ✅
Export (NEW) | — | ✅
HTML reports | — | ✅
PDF export | — | ✅
PowerPoint slides | — | ✅
Jupyter notebooks | — | ✅
JSON export | — | ✅
Other Features
Data profiling | ✅ | ✅
Rule-based insights | ✅ | ✅
LLM narration | ✅ | ✅
Custom pipelines | — | ✅
Progress callbacks | — | ✅
CLI interface | — | ✅
Backward compatible | — | ✅

🔧 Installation & Setup

System Requirements

  • Python: 3.9, 3.10, 3.11, 3.12
  • OS: Linux, macOS, Windows
  • RAM: 2GB minimum; 8GB+ recommended for large datasets

Basic Installation

# Minimal (core features only)
pip install -e .

# Development
pip install -e ".[dev]"

# Production (all features)
pip install -e ".[advanced,connectors,export,llm]"

# Everything (including dev tools)
pip install -e ".[advanced,connectors,export,llm,dev]"

Verify Installation

python -c "from xelytics import analyze; print('โœ“ Xelytics installed')"

# Check version
python -c "import xelytics; print(xelytics.__version__)"

# Test CLI
xelytics --version

📚 Documentation

Full documentation is available in the docs/ folder:

Topic | Location
🚀 Installation | docs/installation.md
📖 Quick Start | docs/quickstart.md
📊 Statistical Analysis | docs/guides/01_basic_analysis.md
⏱️ Time Series | docs/guides/02_time_series.md
🎯 Clustering | docs/guides/03_clustering.md
⚡ Performance | docs/guides/04_performance.md
🔗 Connectors | docs/guides/05_connectors.md
📄 Export & Reports | docs/guides/06_export_reports.md
🛠️ Custom Pipelines | docs/guides/07_custom_pipelines.md
💻 CLI Guide | docs/guides/08_cli.md
📡 Observability | docs/guides/09_observability.md
🧩 Extensibility | docs/guides/10_extensibility.md
🔍 API Reference | docs/api/
📋 Examples | examples/
📜 Migration Guide | docs/migration/v01_to_v02.md
📜 v0.2 → v0.3 Migration | MIGRATION_GUIDE_v0.2_to_v0.3.md
🏗️ Architecture | ARCHITECTURE.md
📑 API Contract | API_CONTRACT.md
📝 Comprehensive Docs | COMPREHENSIVE_DOCUMENTATION.md

๐Ÿ› ๏ธ Development

Setup Development Environment

# Clone repository
git clone https://github.com/xelytics/xelytics-core.git
cd xelytics-core

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dev dependencies
pip install -e ".[dev,advanced,connectors,export]"

Running Tests

# All tests
pytest tests/ -v

# Specific test file
pytest tests/test_clustering.py -v

# Tests matching pattern
pytest tests/ -k "test_kmeans" -v

# With coverage report
pytest tests/ --cov=xelytics --cov-report=html

# Only unit tests (exclude slow integration tests)
pytest tests/ -m "not integration" -v

# Only fast tests
pytest tests/ -m "not slow" -v

Code Formatting & Linting

# Format code with Black
black xelytics/ tests/ examples/

# Check formatting
black --check xelytics/ tests/

# Lint with Ruff
ruff check xelytics/ tests/ --fix

# Type checking with mypy
mypy xelytics/

Build & Publish

# Build package
pip install build
python -m build

# Publish to PyPI (requires credentials)
pip install twine
python -m twine upload dist/*

🧪 Testing & Quality Assurance

Test Coverage: 85%+ (307 tests)

Test Categories:

Category | Count | Status
Unit Tests | 200+ | ✅ Passing
Integration Tests | 50+ | ✅ Passing
Performance Tests | 20+ | ✅ Passing
Backward Compatibility Tests | 8 | ✅ Passing (v0.1.0 code works in v0.2.0)
Example Scripts | 5 | ✅ Working

Key Test Suites:

  • ✅ test_core.py - Data ingestion, profiling, feature detection
  • ✅ test_clustering.py - K-Means, DBSCAN, Hierarchical
  • ✅ test_timeseries_advanced.py - Decomposition, forecasting, anomalies
  • ✅ test_stats.py - Statistical tests, effect sizes, assumptions
  • ✅ test_connectors_integration.py - Database connectivity
  • ✅ test_export.py - HTML, PDF, PowerPoint, notebook export
  • ✅ test_caching.py - File and Redis caching
  • ✅ test_v02_backward_compatibility.py - v0.1.0 compatibility

Run Full Test Suite:

# Quick run (excludes slow tests)
pytest tests/ -m "not slow" --tb=short

# Full run (includes slow + integration)
pytest tests/ -v --tb=short

# With coverage
pytest tests/ --cov=xelytics --cov-report=term-missing

Architecture Evolution (v0.2.x → v0.3.0)

The v0.2.x architecture remains valid for simple DataFrame workflows: ingest data, detect schema/features, profile columns, run analysis modules, generate visualizations and insights, then export the result. v0.3.0 adds a planning layer in front of that pipeline rather than replacing it outright.

v0.2.x eager flow:
DataFrame -> ingestion -> profiling -> stats/time series/clustering -> insights -> exports

v0.3.0 lazy flow:
Dataset -> ExecutionPlan -> TransformGraph nodes -> executor -> analysis -> trace/profile/result

Layer | v0.2.x | v0.3.0
Public API | analyze(df) | analyze(df) plus Xelytics().dataset(...).analyze().run()
Data source | DataFrame or connector-loaded DataFrame | Dataset, MaterializedDataset, LazyDataset, connector-backed sources
Pipeline shape | Mostly linear, eager execution | DAG of plan nodes and transform nodes
Optimization | Parallel tasks, sampling, result cache | Execution planning, SQL pushdown, chunk-aware execution hooks, node cache
Metadata | RunMetadata | RunMetadata plus trace/profiling/lineage-capable metadata
Extensibility | Pipeline steps, exporters, LLM providers | Analyzer, transformation, and output registries

Compatibility guarantee: the v0.3.0 executor still materializes into the established AnalysisResult schema after planning. Existing code that reads summary, statistics, visualizations, insights, metadata, time_series_analysis, or clusters can continue to do so.

๐Ÿ—๏ธ Architecture

System Design

┌─────────────────────────────────┐
│     Public API Layer            │
│   analyze() / AnalysisConfig    │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│     Data Ingestion Layer        │
│  Connectors, DataFrames, Files  │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│     Processing Core             │
│   Type Detection, Sampling,     │
│   Feature Detection, Profiling  │
└───────────────┬─────────────────┘
                │
        ┌───────┴─────────┬──────────────┐
        │                 │              │
   ┌────▼────┐   ┌────────▼───┐   ┌──────▼─────┐
   │  Stats  │   │ TimeSeries │   │ Clustering │
   │ Engine  │   │   Engine   │   │   Engine   │
   └────┬────┘   └────────┬───┘   └──────┬─────┘
        │                 │              │
        └────────┬────────┴──────────────┘
                 │
       ┌─────────▼──────────┐
       │  Visualization &   │
       │  Insight Generator │
       └─────────┬──────────┘
                 │
       ┌─────────▼──────────┐
       │  Export Layer      │
       │  HTML/PDF/PPTX/etc │
       └────────────────────┘

Module Breakdown

xelytics-core/
├── xelytics/
│   ├── __init__.py               # Public API
│   ├── engine.py                 # Main analyze() function
│   ├── api.py                    # Chainable Xelytics API (v0.3.0)
│   ├── dataset.py                # Dataset abstraction: materialized/lazy/transformed (v0.3.0)
│   ├── exceptions.py             # Exception hierarchy
│   │
│   ├── core/                     # Data pipeline
│   │   ├── ingestion.py          # Type detection, validation
│   │   ├── profiler.py           # Column statistics
│   │   ├── features.py           # Feature detection
│   │   └── chunked.py            # Large dataset processing
│   │
│   ├── stats/                    # Statistical analysis
│   │   ├── engine.py             # Test selection & execution
│   │   ├── planner.py            # Analysis planning
│   │   └── ...
│   │
│   ├── timeseries/               # Time series (v0.2.0)
│   │   ├── detector.py           # Series detection
│   │   ├── decomposition.py      # Trend/seasonal separation
│   │   ├── forecasting.py        # ARIMA/ExpSmoothing
│   │   ├── anomaly.py            # Anomaly detection
│   │   └── change_points.py      # Change point detection
│   │
│   ├── clustering/               # Clustering (v0.2.0)
│   │   ├── kmeans.py             # K-Means
│   │   ├── dbscan.py             # DBSCAN
│   │   ├── hierarchical.py       # Hierarchical clustering
│   │   └── profiler.py           # Cluster profiling
│   │
│   ├── connectors/               # Data sources (v0.2.0)
│   │   ├── postgres.py           # PostgreSQL
│   │   ├── mysql.py              # MySQL/MariaDB
│   │   ├── database.py           # Base SQL class
│   │   ├── s3.py                 # AWS S3
│   │   ├── cloud.py              # Azure/GCS
│   │   └── ...
│   │
│   ├── export/                   # Report generation (v0.2.0)
│   │   ├── html.py               # HTML reports
│   │   ├── pdf.py                # PDF export
│   │   ├── pptx.py               # PowerPoint slides
│   │   ├── notebook.py           # Jupyter notebooks
│   │   └── ...
│   │
│   ├── cache/                    # Caching (v0.2.0)
│   │   ├── base.py               # Cache interface
│   │   ├── file.py               # File-based cache
│   │   └── redis.py              # Redis cache
│   │
│   ├── pipeline/                 # Custom pipelines (v0.2.0)
│   │   ├── __init__.py           # Pipeline class
│   │   └── steps.py              # Pre-built steps
│   │
│   ├── execution/                # Lazy execution planning (v0.3.0)
│   │   ├── plan.py               # ExecutionPlan and PlanNode
│   │   ├── builder.py            # PlanBuilder
│   │   ├── executor.py           # DAG executor with tracing/profiling
│   │   └── pushdown.py           # SQL pushdown helpers
│   │
│   ├── graph/                    # Transformation DAG (v0.3.0)
│   │   ├── graph.py              # TransformGraph
│   │   ├── node.py               # DataSourceNode, TransformNode, SinkNode
│   │   ├── cache.py              # NodeCache
│   │   └── lineage.py            # LineageTracker
│   │
│   ├── analyzers/                # Modular analyzers (v0.3.0)
│   │   ├── profiling.py          # ProfilingAnalyzer
│   │   ├── correlation.py        # CorrelationAnalyzer
│   │   ├── trend_anomaly.py      # TrendAnomalyAnalyzer
│   │   └── segmentation.py       # SegmentationAnalyzer
│   │
│   ├── observability/            # Tracing and profiling (v0.3.0)
│   │   ├── trace.py              # TraceCollector
│   │   └── profiler.py           # ExecutionProfiler
│   │
│   ├── extension/                # Plugin registries (v0.3.0)
│   │   ├── interfaces.py         # Analyzer, CustomTransform, OutputFormat
│   │   └── registry.py           # register_* decorators
│   │
│   ├── llm/                      # LLM integration
│   │   ├── openai.py             # OpenAI provider
│   │   ├── groq.py               # Groq provider
│   │   └── base.py               # Provider interface
│   │
│   ├── viz/                      # Visualizations
│   │   ├── generator.py          # Plotly spec generation
│   │   └── themes.py             # Color schemes
│   │
│   ├── insights/                 # Insight generation
│   │   ├── rules.py              # Rule-based insights
│   │   └── templates.py          # Insight templates
│   │
│   ├── schemas/                  # Type definitions
│   │   ├── config.py             # AnalysisConfig
│   │   └── outputs.py            # AnalysisResult & schemas
│   │
│   └── cli/                      # Command-line interface
│       └── main.py               # CLI entry point
│
├── tests/                        # 300+ tests
│   ├── test_core.py
│   ├── test_clustering.py
│   ├── test_timeseries_*.py
│   ├── test_connectors_integration.py
│   ├── test_export.py
│   └── ...
│
├── examples/                     # Example scripts
│   ├── quickstart.py
│   ├── forecasting_demo.py
│   └── ...
│
├── docs/                         # Full documentation
│   ├── guides/                   # Step-by-step guides
│   ├── api/                      # API reference
│   └── examples/                 # Example notebooks
│
└── pyproject.toml                # Dependencies & config

📋 API Classes & Functions

Core Classes

Use analyze() for one-shot, eager analysis:

from xelytics import analyze

result = analyze(df)

Adopt this when you need lazy data binding, plan inspection, graph transforms, observability, or extension registries:

from xelytics import Xelytics

result = Xelytics().dataset(df).analyze().run()

Migration notes:

  • analyze(df), AnalysisConfig, AnalysisResult, connectors, cache backends, exporters, pipelines, time series modules, and clustering modules remain supported.
  • v0.3.0 adds optional result fields: correlation, trend_anomaly, segmentation, trace, and profiling.
  • v0.2.x eager Pipeline remains supported for preprocessing.
  • Prefer Dataset.transform() or TransformGraph for transformations that need lineage, node caching, or plan visibility.
  • Prefer Xelytics or build_plan() for new lazy and graph-aware workflows.
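The extension registries named above are referenced but never shown in this README; a hypothetical sketch, assuming a register_analyzer decorator in xelytics.extension.registry (the module's register_* decorators appear in the Module Breakdown, but the decorator name and the analyzer contract here are assumptions):

from xelytics.extension.registry import register_analyzer  # name assumed

# Hypothetical analyzer contract: analyze(df) -> dict is an assumption.
@register_analyzer("null_share")
class NullShareAnalyzer:
    def analyze(self, df):
        return {"null_share": float(df.isna().mean().mean())}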

Documentation

Document | Purpose
examples/xelytics_core_v0_3_0_complete.ipynb | complete executable v0.3.0 notebook
docs/quickstart.md | copy-paste examples
docs/index.md | documentation index and feature matrix
docs/api/analyze.md | analyze(), Xelytics, and large-dataset API
docs/api/config.md | all AnalysisConfig fields
docs/api/result_schema.md | result dataclasses and serialization
docs/api/execution.md | Dataset, ExecutionPlan, TransformGraph, observability
docs/api/extensions.md | extension registry APIs
docs/guides/05_connectors.md | database and cloud source usage
docs/guides/06_export_reports.md | report export formats
MIGRATION_GUIDE_v0.2_to_v0.3.md | v0.2.x to v0.3.0 migration
ARCHITECTURE.md | package architecture
CHANGELOG.md | release history

Development

pip install -e ".[dev]"
pytest tests/

Focused v0.3.0 verification:

pytest tests/test_epic1_connectivity.py tests/test_epic2.py tests/test_epic3.py
pytest tests/test_epic4.py tests/test_epic5.py tests/test_epic6.py tests/test_epic7.py

The package supports Python 3.9 through 3.12.

Project Status

Xelytics-Core is beta software. v0.3.0 is compatibility-first: older v0.2.x DataFrame workflows remain valid while the package moves toward the lazy, graph-aware engine model. See CHANGELOG.md and API_CONTRACT.md for versioning and compatibility policy.

License

MIT, as declared in pyproject.toml.
