Pure analytics engine with lazy execution, graph-based transformations, and extensible analysis
Project description
Xelytics-Core
Python package for automated analytics with a lazy, graph-aware execution engine.
Status: v0.3.0 documentation update in progress | v0.2.x APIs remain supported while the lazy graph execution model becomes the recommended path.
What It Does
Xelytics-Core is a zero-configuration analytics engine that analyzes your data and produces professional insights, statistical tests, interactive visualizations, and predictions, all with a single function call.
One-line analysis:
from xelytics import analyze
import pandas as pd
df = pd.read_csv("data.csv")
result = analyze(df) # That's it!
for insight in result.insights:
print(f"{insight.title}: {insight.description}")
Output includes:
- 50+ statistical tests (parametric & non-parametric)
- Time series decomposition & forecasting (ARIMA, Exponential Smoothing)
- Anomaly detection & change point detection
- Clustering analysis (K-Means, DBSCAN, Hierarchical)
- Interactive Plotly visualizations
- Human-readable insights (with optional LLM narration)
- Professional HTML, PDF, PowerPoint, and Jupyter reports
What's New in v0.3.0
Added in v0.3.0
v0.3.0 evolves Xelytics-Core from an eager, mostly linear DataFrame analysis pipeline into a lazy, graph-aware analytics engine. The existing analyze(df) workflow is still supported for v0.2.x compatibility, while new projects should prefer the chainable Xelytics API when they need lazy data binding, execution planning, SQL pushdown, lineage, or plugin extension points.
| Area | v0.2.x Behavior | v0.3.0 Behavior | Compatibility |
|---|---|---|---|
| Entry point | analyze(df) runs the pipeline directly | Xelytics().dataset(df).analyze().run() builds then executes a plan | analyze(df) remains supported |
| Data model | DataFrame-first, connector results usually materialized | Unified Dataset abstraction with materialized and lazy datasets | Existing DataFrame inputs still work |
| Execution | Eager pipeline with optional parallel tasks | Lazy ExecutionPlan DAG with scan, transform, analysis nodes | Eager behavior is preserved through legacy API |
| SQL sources | Query first, then analyze returned DataFrame | Filter/project nodes can be pushed into SQL where supported | Connector APIs remain available |
| Transformations | Custom pipeline steps execute before analysis | Transformations can be represented as graph nodes | Pipeline remains supported |
| Caching | Result/intermediate cache for analysis stages | Node-level cache support for transformation graph nodes | Existing cache backends remain supported |
| Metadata | Run metadata plus optional sampling/parallel fields | Adds trace, profiling, lineage, cache, and analyzer outputs | Existing result fields remain stable |
| Extensibility | Pipelines, exporters, LLM providers | Registries for analyzers, transformations, and output formats | Existing extension patterns remain valid |
Legacy API (v0.2.x compatible, still supported)
from xelytics import analyze
result = analyze(df)
Recommended v0.3.0 API
Added in v0.3.0
from xelytics import Xelytics
result = (
Xelytics()
.dataset(df)
.filter("revenue > 1000")
.analyze()
.run()
)
load_dataframe(df) is also available as an explicit DataFrame-loading name in the current implementation. Documentation uses dataset(df) for the recommended v0.3.0 abstraction.
Optional Install Extras
| Extra | Includes |
|---|---|
| advanced | advanced time series dependencies such as ruptures and pmdarima |
| connectors | database, cloud storage, Excel, and Parquet connector dependencies |
| export | PDF, PowerPoint, notebook, and static chart export dependencies |
| llm | OpenAI and Groq provider dependencies |
| large_data | Dask dataframe support |
| dev | test, lint, type-check, and formatting tools |
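To check which optional extras are usable in the current environment, a small probe like the one below is enough. This is a minimal sketch: the pmdarima, dask, and openai names come from the table above, while the connectors and export entries are representative guesses; pyproject.toml remains the source of truth for the real dependency lists.
import importlib.util
# Representative dependency per extra; "sqlalchemy" and "reportlab" are assumptions,
# the other names come from the table above.
representative_deps = {
    "advanced": "pmdarima",
    "connectors": "sqlalchemy",
    "export": "reportlab",
    "llm": "openai",
    "large_data": "dask",
}
for extra, module_name in representative_deps.items():
    available = importlib.util.find_spec(module_name) is not None
    print(f"{extra}: {'available' if available else 'missing'} ({module_name})")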
Quick Start
v0.2-Compatible API
Use this for simple one-shot DataFrame analysis.
import pandas as pd
from xelytics import AnalysisConfig, analyze
df = pd.read_csv("sales.csv")
config = AnalysisConfig(
enable_llm_insights=False,
generate_visualizations=False,
)
result = analyze(df, config=config)
print(result.summary.row_count)
print(result.metadata.tests_executed)
for insight in result.insights[:5]:
print(f"{insight.severity.value}: {insight.title}")
result.export_to("analysis.json")
Recommended v0.3.0 API
Use the chainable API when you want to bind data first, record operations, and
execute only when .run() is called.
import pandas as pd
from xelytics import AnalysisConfig, Xelytics
df = pd.read_csv("sales.csv")
result = (
Xelytics(config=AnalysisConfig(enable_llm_insights=False))
.dataset(df)
.filter("revenue > 1000")
.analyze()
.run()
)
print(result.summary.row_count)
print(result.trace.print_trace() if result.trace else "No trace")
load_dataframe(df) and from_dataset(dataset) are also available aliases for
explicit binding.
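For example, the explicit alias drops into the same chain without any other change (df as loaded in the snippet above):
from xelytics import Xelytics
# load_dataframe(df) is the explicit DataFrame-binding alias for dataset(df)
result = Xelytics().load_dataframe(df).analyze().run()
print(result.summary.row_count)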
The full runnable notebook for this release is examples/xelytics_core_v0_3_0_complete.ipynb. It uses generated data and local files only, so it can be executed without API keys or database credentials.
What Changed in v0.3.0
| Area | v0.2.x | v0.3.0 |
|---|---|---|
| Entry point | analyze(df) | analyze(df) still works; Xelytics().dataset(df).analyze().run() is recommended for lazy workflows |
| Data model | DataFrame-first | Dataset, MaterializedDataset, LazyDataset, and TransformedDataset |
| Execution | eager pipeline | ExecutionPlan, PlanNode, PlanBuilder, and DAG execution |
| Connectors | mostly materialized DataFrames | database connectors can back lazy datasets |
| SQL behavior | query then analyze | filter/project/limit plan nodes can use SQL pushdown when supported |
| Transformations | eager Pipeline preprocessing | TransformGraph, graph nodes, node cache, and lineage APIs |
| Analysis outputs | stats, visualizations, insights, time series, clustering | adds correlation, trend_anomaly, and segmentation analyzer outputs |
| Observability | logs and metadata | TraceCollector and ExecutionProfiler attached to results |
| Extensibility | pipelines/exporters/providers | registries for analyzers, transformations, and output formats |
| Compatibility | v0.2.x public API | no public v0.2.x API removed |
See MIGRATION_GUIDE_v0.2_to_v0.3.md for the full migration guide.
Implemented v0.3.0 Story Map
The v0.3.0 implementation is organized around the story set in aidlc-docs/inception/v0.3.0.
| Epic | Implemented surface | Main modules |
|---|---|---|
| Epic 1: Data Connectivity Engine | source abstraction, schema inference, lazy data binding, connector timeouts/retries/sampling hints | xelytics.dataset, xelytics.schemas.schema, xelytics.connectors, xelytics.schemas.config |
| Epic 2: Execution Engine | execution plans, lazy execution, SQL pushdown helpers, chunked planning support | xelytics.execution, xelytics.engine |
| Epic 3: Transformation Graph Engine | graph nodes, graph execution, node cache, schema hooks, lineage records | xelytics.graph |
| Epic 4: Analysis and Insight Engine | profiling, correlation, trend/anomaly, segmentation, ranked and deduplicated insights | xelytics.analyzers, xelytics.insights |
| Epic 5: Output Layer and Python API | structured JSON, optional visualizations, chainable Xelytics API, result export | xelytics.api, xelytics.schemas.outputs, xelytics.export |
| Epic 6: Observability and Debugging | execution logs, trace collection, node profiling, trace/profile serialization | xelytics.observability, xelytics.engine |
| Epic 7: Extensibility System | custom analyzer, transformation, and output-format registries | xelytics.extension |
Regression coverage for these surfaces lives in tests/test_epic1_connectivity.py
through tests/test_epic7.py, plus compatibility tests for earlier APIs.
Feature Overview
| Capability | Status |
|---|---|
| Automatic statistical test planning and execution | supported |
| Dataset summaries and column profiling | supported |
| Rule-based insights and ranked insights | supported |
| Plotly-compatible visualization specs | supported |
| Time series detection, decomposition, forecasting, anomalies, and change points | supported through xelytics.timeseries and v0.3 analyzer outputs |
| K-Means, DBSCAN, hierarchical clustering, and cluster profiling | supported |
| PostgreSQL, MySQL, SQLite, Snowflake, BigQuery, S3, Azure Blob, GCS, and file connectors | supported through optional extras |
| File and Redis caching | supported |
| Large dataset summary and sample analysis | supported through analyze_large_dataset() |
| HTML, PDF, PowerPoint, Jupyter notebook, and JSON export | supported through xelytics.export |
| CLI for CSV and Excel analysis | supported through the xelytics command |
| Optional LLM provider integrations | OpenAI and Groq dependencies available through llm extra |
1. Statistical Analysis
Control which columns are analyzed:
# Define which columns to analyze
config = AnalysisConfig(
include_columns=["age", "income", "purchase_frequency"],
exclude_columns=["customer_id", "timestamp"],
categorical_max_categories=50, # Skip columns with >50 unique values
)
result = analyze(df, config=config)
Statistics Covered:
- Descriptive: mean, median, variance, skewness, kurtosis
- t-tests, ANOVA, Welch's test, Mann-Whitney U, Kruskal-Wallis
- Correlation: Pearson, Spearman, Kendall Tau
- Chi-square tests for categorical associations
- Effect sizes: Cohen's d, Cramér's V, Eta-squared
- Assumption checks: Normality (Shapiro-Wilk), Homogeneity of variance (Levene)
2. Time Series Analysis (NEW in v0.2.0)
Complete time series toolkit: detection, decomposition, forecasting, anomalies.
Time Series Detection
from xelytics import analyze, AnalysisConfig
# Option 1: Auto-detect time series columns
config = AnalysisConfig(enable_time_series=True)
result = analyze(df, config=config)
# Option 2: Specify datetime column
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
)
result = analyze(df, config=config)
# Check which columns were detected as time series
for ts in result.time_series_analysis:
print(f"{ts.column_name}:")
print(f" Type: {ts.series_type.value}")
print(f" Frequency: {ts.frequency}")
print(f" Has trend: {ts.has_trend}")
print(f" Has seasonality: {ts.has_seasonality}")
if ts.has_seasonality:
print(f" Seasonal period: {ts.seasonal_period}")
Time Series Decomposition
# Automatically decompose into trend, seasonal, residual
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
decomposition_method="additive", # or "multiplicative", "stl"
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.decomposition:
print(f"{ts.column_name} decomposition:")
print(f" Trend strength: {ts.decomposition.trend_strength:.3f}")
print(f" Seasonal strength: {ts.decomposition.seasonal_strength:.3f}")
Forecasting
# ARIMA and Exponential Smoothing forecasting
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
forecast_periods=30, # Forecast next 30 periods
forecast_methods=["arima", "exponential_smoothing"],
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.forecasts:
print(f"\n{ts.column_name} - Next 30 periods forecast:")
for forecast in ts.forecasts[:5]: # Show first 5
print(f" Period {forecast.period}: {forecast.value:.2f} "
f"(95% CI: {forecast.lower_bound:.2f}-{forecast.upper_bound:.2f})")
Anomaly Detection
# Multiple detection methods: Z-score, IQR, MAD, Isolation Forest
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
anomaly_detection_method="isolation_forest",
anomaly_sensitivity=0.95, # 95th percentile threshold
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.anomalies:
print(f"\n{ts.column_name} - Anomalies detected:")
for anomaly in ts.anomalies[:3]:
print(f" Index {anomaly.index}: {anomaly.value:.2f} "
f"(severity: {anomaly.severity}, confidence: {anomaly.confidence:.2f})")
Change Point Detection
# Detect structural breaks (CUSUM algorithm)
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
detect_change_points=True,
change_point_sensitivity=0.05,
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.change_points:
print(f"\n{ts.column_name} - Change points:")
for cp in ts.change_points:
print(f" At index {cp.index}: magnitude={cp.magnitude:.2f}, "
f"confidence={cp.confidence:.2f}")
3. Clustering & Segmentation (NEW in v0.2.0)
Unsupervised learning for customer segmentation, market clustering, etc.
Basic Clustering
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="auto", # auto, kmeans, dbscan, hierarchical
max_clusters=8,
exclude_columns=["customer_id", "name"],
)
result = analyze(df, config=config)
# View clusters
print(f"Algorithm used: {result.clusters[0].algorithm}")
for cluster in result.clusters:
print(f"\nCluster {cluster.cluster_id}:")
print(f" Size: {cluster.size} members ({cluster.size/result.summary.row_count*100:.1f}%)")
print(f" Silhouette score: {cluster.silhouette_score:.3f}")
print(f" Profile: {cluster.profile}")
K-Means (with Automatic K Selection)
# K-Means tries multiple K values and picks the best
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="kmeans",
max_clusters=10,
k_selection_method="elbow", # elbow, silhouette, gap_statistic
)
result = analyze(df, config=config)
# View metrics for each K
for cluster in result.clusters:
print(f"K={cluster.algorithm_params['n_clusters']}: "
f"silhouette={cluster.silhouette_score:.3f}")
DBSCAN (Density-Based)
# DBSCAN finds natural clusters and noise points
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="dbscan",
dbscan_eps=0.5, # Auto-estimated if not provided
dbscan_min_samples=5,
)
result = analyze(df, config=config)
for cluster in result.clusters:
noise_label = "Noise" if cluster.cluster_id == -1 else f"Cluster {cluster.cluster_id}"
print(f"{noise_label}: {cluster.size} points")
Hierarchical Clustering
# Produces dendrograms and tree-based clusters
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="hierarchical",
hierarchical_linkage="ward", # ward, complete, average, single
max_clusters=5,
)
result = analyze(df, config=config)
for cluster in result.clusters:
print(f"Cluster {cluster.cluster_id}: {cluster.size} members")
4. Data Connectors (NEW in v0.2.0)
Analyze data directly from databases and cloud storage, with no manual data export needed.
File and PostgreSQL Sources
from pathlib import Path
from xelytics.connectors import connect_to_source
output_dir = Path(".cache/xelytics_readme")
output_dir.mkdir(parents=True, exist_ok=True)
csv_path = output_dir / "sales.csv"
df.to_csv(csv_path, index=False)
file_dataset = connect_to_source("file", path=str(csv_path))
print(file_dataset.to_pandas().head())
Database pattern:
from xelytics import AnalysisConfig, Xelytics
result = (
Xelytics(config=AnalysisConfig(enable_llm_insights=False))
.connect(
"postgresql",
host="localhost",
database="analytics",
user="reader",
password="secret",
query="SELECT * FROM sales",
)
.filter("revenue > 1000")
.analyze()
.run()
)
Cache APIs
| API | Purpose |
|---|---|
| Cache(backend="file", **kwargs) | Direct cache instance |
| Cache.get(key) | Read cached value |
| Cache.set(key, value, ttl=None) | Store cached value |
| Cache.delete(key) | Delete key |
| Cache.clear(pattern=None) | Clear backend |
| Cache.cached(ttl=None) | Decorator for function caching |
| get_cache(backend, **kwargs) | Create/get global cache |
| clear_cache(pattern=None) | Clear global cache |
| NodeCache.get(node_id, input_dfs, func) | Read transform-node output |
| NodeCache.set(node_id, input_dfs, func, result) | Store transform-node output |
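A minimal sketch of the generic cache interface from the table above. The Cache import path and the cache_dir keyword are assumptions (only FileCache, RedisCache, and clear_cache imports appear elsewhere on this page); the method names follow the table.
from xelytics.cache import Cache  # import path assumed; methods below follow the table
cache = Cache(backend="file", cache_dir=".cache/xelytics_demo")  # cache_dir kwarg is an assumption
cache.set("profile:sales", {"rows": 150_432}, ttl=3600)  # store with a one-hour TTL
print(cache.get("profile:sales"))                        # read it back
@cache.cached(ttl=600)                                   # decorator form for function caching
def expensive_summary(path: str) -> dict:
    return {"path": path, "rows": 150_432}               # placeholder for real work
print(expensive_summary("sales.csv"))
cache.clear()                                            # clear the whole backend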
BigQuery
from xelytics.connectors import connect_to_source
connector = connect_to_source(
source_type="bigquery",
project_id="my-project",
credentials_path="/path/to/service-account.json",
)
df = connector.query("""
SELECT * FROM `my-project.dataset.events`
WHERE event_date >= '2025-01-01'
LIMIT 100000
""")
result = analyze(df)
Snowflake
import os
from xelytics.connectors import connect_to_source
connector = connect_to_source(
source_type="snowflake",
account="xy12345",
warehouse="COMPUTE",
database="ANALYTICS",
schema="PUBLIC",
user=os.getenv("SNOWFLAKE_USER"),
password=os.getenv("SNOWFLAKE_PASSWORD"),
)
df = connector.query("SELECT * FROM CUSTOMER_DATA")
result = analyze(df)
S3 / Cloud Storage
import os
from xelytics.connectors import connect_to_source
# Amazon S3
connector = connect_to_source(
source_type="s3",
bucket="my-analytics-bucket",
key="data/sales.parquet",
aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
)
df = connector.query() # Returns DataFrame
result = analyze(df)
# Azure Blob Storage
connector = connect_to_source(
source_type="azure_blob",
container_name="data",
blob_name="sales.csv",
connection_string=os.getenv("AZURE_CONN_STRING"),
)
df = connector.query()
result = analyze(df)
# Google Cloud Storage
connector = connect_to_source(
source_type="gcs",
bucket="my-bucket",
key="data/sales.csv",
credentials_path="/path/to/gcp-key.json",
)
df = connector.query()
result = analyze(df)
5. Report Generation (NEW in v0.2.0)
Generate professional, interactive reports in multiple formats.
HTML Report
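A short sketch of generating an HTML report from an existing AnalysisResult. The HTMLReportGenerator constructor and generate() arguments mirror the usage example later on this page; the xelytics.export import path is assumed from the module layout.
from xelytics.export import HTMLReportGenerator  # import path assumed from the module layout
# result is an AnalysisResult from analyze(df) or Xelytics().dataset(df).analyze().run()
generator = HTMLReportGenerator(theme="light", company_name="ACME Corp")
html = generator.generate(result, title="Sales Analysis Report", author="Data Science Team")
with open("sales_report.html", "w") as f:
    f.write(html)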
A related preprocessing example using the eager Pipeline API:
from xelytics.pipeline import Pipeline, correlation_analysis, normalize, pca, remove_outliers
pipeline = Pipeline(name="demo")
pipeline.add_step(
remove_outliers,
name="remove_outliers",
inputs=["df"],
outputs=["df"],
columns=["revenue"],
method="iqr",
threshold=3.0,
)
pipeline.add_step(
normalize,
name="normalize",
inputs=["df"],
outputs=["normalized"],
columns=["revenue", "cost"],
method="minmax",
)
context = pipeline.execute({"df": df})
print(context["normalized"].head())
print(pca(df[["revenue", "cost"]], n_components=2).head())
print(correlation_analysis(df[["revenue", "cost", "profit"]]))
Transformation Graph, Lineage, Trace, and Profiling
from xelytics.dataset import MaterializedDataset
from xelytics.graph.graph import TransformGraph
from xelytics.graph.lineage import LineageTracker
from xelytics.graph.node import DataSourceNode, TransformNode
from xelytics.observability.profiler import ExecutionProfiler
from xelytics.observability.trace import TraceCollector, TraceEntry
graph = TransformGraph()
graph.add_node(DataSourceNode(id="source", dataset=MaterializedDataset(df)))
graph.add_node(
TransformNode(
id="filter",
name="filter",
func=lambda frame: frame.query("revenue > 1000"),
inputs=["source"],
)
)
graph.add_edge("source", "filter")
graph.validate()
graph_df = graph.run()
lineage = LineageTracker()
lineage.record_execution("filter", {"source": "hash-a"}, "hash-b", 12.5)
print(lineage.get_record("filter"))
lineage.clear()
trace = TraceCollector()
trace.add(TraceEntry(step_name="demo", row_count=len(graph_df)))
print(trace.print_trace())
profiler = ExecutionProfiler()
profiler.start("node")
profiler.stop("node", operation="demo", rows_fetched=len(graph_df))
print(profiler.print_profile())
JSON Export
import json
# For programmatic access or storage
with open("analysis.json", "w") as f:
json.dump(result.to_dict(), f, indent=2)
# Later, reconstruct from JSON
from xelytics.schemas.outputs import AnalysisResult
with open("analysis.json") as f:
data = json.load(f)
result = AnalysisResult(**data)
6. Custom Pipelines (NEW in v0.2.0)
Pre-process data with custom steps before analysis.
from xelytics.pipeline import Pipeline, normalize, pca, remove_outliers, correlation_analysis
from xelytics import AnalysisConfig, analyze
# Build a custom pipeline
pipeline = Pipeline([
remove_outliers(method="iqr", threshold=1.5),
normalize(method="minmax"),
pca(n_components=10),
correlation_analysis(threshold=0.7),
])
# Apply before analysis
df_processed = pipeline.fit_transform(df)
result = analyze(df_processed)
# Or use in AnalysisConfig
config = AnalysisConfig(
run_custom_pipeline=True,
custom_pipeline=pipeline,
)
result = analyze(df, config=config)
7. Caching (NEW in v0.2.0)
Speed up repeated analyses on the same data.
File-Based Cache
from xelytics import analyze, AnalysisConfig
from xelytics.cache import FileCache
cache = FileCache(cache_dir="./cache")
config = AnalysisConfig(
enable_caching=True,
cache_backend=cache,
)
# First run: takes full time
result1 = analyze(df, config=config)
# Subsequent runs on same data: instant
result2 = analyze(df, config=config) # Retrieved from cache!
Redis Cache (Distributed)
from xelytics.cache import RedisCache
cache = RedisCache(host="localhost", port=6379, db=0, ttl=3600)
config = AnalysisConfig(
enable_caching=True,
cache_backend=cache,
)
result = analyze(df, config=config)
Clear Cache
from xelytics.cache import clear_cache
# Clear all caches
clear_cache(pattern="*")
# Clear specific patterns
clear_cache(pattern="stats:*") # Only clear stats caches
8. CLI (Command-Line Interface)
Analyze without writing Python code.
# Basic analysis - outputs JSON
xelytics analyze data.csv
# Save to file
xelytics analyze data.csv --output results.json
# Set parameters
xelytics analyze data.csv \
--format=json \
--alpha 0.01 \
--no-llm \
--max-visualizations 20 \
--datetime-column "date"
# Time series analysis
xelytics analyze data.csv \
--enable-time-series \
--datetime-column "date" \
--forecast-periods 30
# Clustering
xelytics analyze data.csv \
--enable-clustering \
--clustering-algorithm kmeans \
--max-clusters 5
# Show version
xelytics --version
# Help
xelytics --help
9. LLM Integration (Optional)
Enhance insights with AI narration.
import os
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="openai", # openai, groq, or local
llm_model="gpt-4",
llm_api_key=os.getenv("OPENAI_API_KEY"),
)
result = analyze(df, config=config)
# Insights now include AI-generated descriptions
for insight in result.insights:
print(f"{insight.title}")
print(f"  {insight.narrative}") # AI-generated explanation
Multiple LLM Providers
# OpenAI
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="openai",
llm_model="gpt-4",
llm_api_key=os.getenv("OPENAI_API_KEY"),
)
# Groq (fast, open source)
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="groq",
llm_model="mixtral-8x7b",
llm_api_key=os.getenv("GROQ_API_KEY"),
)
# Azure OpenAI
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="azure",
llm_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
llm_api_key=os.getenv("AZURE_OPENAI_KEY"),
)
Extension Registries and Custom Output
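The registries live in xelytics.extension (interfaces.py defines Analyzer, CustomTransform, and OutputFormat; registry.py provides register_* decorators). The snippet below is only a hedged sketch of what registering a custom analyzer might look like; the decorator name and the analyze() hook are assumptions, so see docs/guides/10_extensibility.md for the actual interfaces.
# Hedged sketch only: the decorator and hook names are assumptions based on the
# xelytics.extension module layout, not a confirmed API.
from xelytics.extension import registry
@registry.register_analyzer("row_count_check")      # assumed register_* decorator
class RowCountAnalyzer:
    """Toy analyzer that flags suspiciously small inputs."""
    def analyze(self, df):                           # assumed analyzer hook
        return {"rows": len(df), "too_small": len(df) < 100}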
Sampling and Parallel Execution
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
# Auto-sample if > 1M rows
sampling_strategy="auto",
max_rows=1_000_000,
# Or force sampling instead:
# sampling_strategy="stratified",
# sample_size=100_000,
# Parallel execution
parallel_execution=True,
max_workers=4,
)
result = analyze(df, config=config)
Chunked Processing for Very Large Files
from xelytics import AnalysisConfig
from xelytics.engine import analyze_large_dataset
# Process 10M row file without loading into memory
result = analyze_large_dataset(
source="huge_sales_data.csv",
chunksize=50_000,
sample_size=100_000, # Take a sample for full analysis
config=AnalysisConfig(),
)
Configuration Reference
from xelytics import AnalysisConfig
config = AnalysisConfig(
# General
significance_level=0.05,
mode="automated", # automated or semi-automated
# Columns
include_columns=None, # [list] Include only these columns
exclude_columns=None, # [list] Exclude these columns
datetime_column=None, # [str] Column name for time series
# Time Series
enable_time_series=False,
decomposition_method="additive", # additive, multiplicative, stl
forecast_periods=0,
forecast_methods=["arima", "exponential_smoothing"],
anomaly_detection_method="isolation_forest",
anomaly_sensitivity=0.95,
detect_change_points=False,
# Clustering
enable_clustering=False,
clustering_algorithm="auto", # auto, kmeans, dbscan, hierarchical
max_clusters=10,
k_selection_method="elbow",
# Performance
parallel_execution=True,
max_workers=4,
sampling_strategy="auto",
max_rows=1_000_000,
# Caching
enable_caching=False,
cache_backend=None,
# Reporting
max_visualizations=15,
run_custom_pipeline=False,
custom_pipeline=None,
# LLM
enable_llm_insights=False,
llm_provider="openai",
llm_model="gpt-4",
llm_api_key=None,
# Other
random_seed=42,
verbose=True,
)
Usage Examples
Configure Analysis
import os
from datetime import datetime
from xelytics import AnalysisConfig, analyze
from xelytics.export import HTMLReportGenerator, generate_pdf_report  # import path inferred from the xelytics.export module
config = AnalysisConfig(
significance_level=0.01,
enable_time_series=True,
datetime_column="date",
forecast_periods=14,
enable_clustering=True,
clustering_algorithm="kmeans",
max_clusters=5,
parallel_execution=True,
enable_caching=True,
# Reporting
max_visualizations=20,
# LLM narration
enable_llm_insights=True,
llm_provider="openai",
llm_api_key=os.getenv("OPENAI_API_KEY"),
)
result = analyze(df, config=config)
# 4. EXPLORE RESULTS
print(f"\nAnalysis complete in {result.metadata.execution_time_ms}ms")
print(f"  • Tests: {result.metadata.tests_executed}")
print(f"  • Visualizations: {len(result.visualizations)}")
print(f"  • Insights: {len(result.insights)}")
print(f"  • Time series: {len(result.time_series_analysis)}")
print(f"  • Clusters: {len(result.clusters)}")
print("\nKey Insights:")
for i, insight in enumerate(result.insights[:5], 1):
print(f" {i}. {insight.title}")
if hasattr(insight, 'narrative'):
print(f" {insight.narrative[:100]}...")
# 5. GENERATE REPORTS
print("\nGenerating reports...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# HTML Report
html_generator = HTMLReportGenerator(
theme="light",
logo_text="Sales Analytics",
company_name="ACME Corp"
)
html = html_generator.generate(
result,
title="Sales Analysis Report",
author="Data Science Team"
)
html_path = f"reports/sales_analysis_{timestamp}.html"
os.makedirs("reports", exist_ok=True)
with open(html_path, "w") as f:
f.write(html)
print(f"  HTML: {html_path}")
# PDF Report
pdf_bytes = generate_pdf_report(
result,
title="Sales Analysis Report",
author="Data Science Team"
)
pdf_path = f"reports/sales_analysis_{timestamp}.pdf"
with open(pdf_path, "wb") as f:
f.write(pdf_bytes)
print(f"  PDF: {pdf_path}")
# JSON Export
json_path = f"reports/sales_analysis_{timestamp}.json"
import json
with open(json_path, "w") as f:
json.dump(result.to_dict(), f, indent=2)
print(f"  JSON: {json_path}")
print("\nAnalysis complete!")
print(f"Reports saved to: {os.path.abspath('reports')}")
Output:
Loading data...
Loaded 150,432 rows
Configuring analysis...
Running analysis...
Analysis complete in 3421ms
  • Tests: 47
  • Visualizations: 18
  • Insights: 12
  • Time series: 2
  • Clusters: 5
Key Insights:
1. Significant correlation detected: total_amount vs. customer_age
2. Strong seasonality in Q4 sales
3. Customer segmentation: 5 distinct groups identified
4. Outliers detected in unit_price column
5. Increasing trend in repeat customer rate
Generating reports...
  HTML: reports/sales_analysis_20250307_143021.html
  PDF: reports/sales_analysis_20250307_143021.pdf
  JSON: reports/sales_analysis_20250307_143021.json
Analysis complete!
Reports saved to: /home/user/reports
Performance & Scaling
| Dataset Size | Processing Time | Max Parallel Tasks |
|---|---|---|
| 10K rows | 1-2 seconds | 3 |
| 100K rows | 5-10 seconds | 4 |
| 1M rows | 30-60 seconds | 4 |
| 10M rows | 3-5 minutes | 4 (chunked) |
| 100M rows | 10-30 minutes | 4 (chunked + sampled) |
Optimization Strategies:
- Automatic sampling for datasets > 1M rows
- Parallel execution (4 workers by default)
- Result caching (file or Redis)
- Progress callbacks for long-running analyses
- Memory-aware warnings (logs a warning above 1GB)
Feature Comparison
| Feature | v0.1.0 | v0.2.0 |
|---|---|---|
| Statistical Analysis | Yes | Yes |
| Automated test selection | Yes | Yes |
| Effect size calculation | Yes | Yes |
| Assumption checking | Yes | Yes |
| Time Series (NEW) | No | Yes |
| Detection & decomposition | No | Yes |
| ARIMA & ES forecasting | No | Yes |
| Anomaly detection | No | Yes |
| Change point detection | No | Yes |
| Clustering (NEW) | No | Yes |
| K-Means | No | Yes |
| DBSCAN | No | Yes |
| Hierarchical | No | Yes |
| Cluster profiling | No | Yes |
| Performance (NEW) | No | Yes |
| Parallel execution | No | Yes |
| Result caching | No | Yes |
| Sampling strategies | No | Yes |
| Chunked processing | No | Yes |
| Connectors (NEW) | No | Yes |
| PostgreSQL | No | Yes |
| MySQL/MariaDB | No | Yes |
| SQLite | No | Yes |
| BigQuery | No | Yes |
| Snowflake | No | Yes |
| S3/Azure/GCS | No | Yes |
| Export (NEW) | No | Yes |
| HTML reports | No | Yes |
| PDF export | No | Yes |
| PowerPoint slides | No | Yes |
| Jupyter notebooks | No | Yes |
| JSON export | No | Yes |
| Other Features | | |
| Data profiling | Yes | Yes |
| Rule-based insights | Yes | Yes |
| LLM narration | Yes | Yes |
| Custom pipelines | No | Yes |
| Progress callbacks | No | Yes |
| CLI interface | Yes | Yes |
| Backward compatible | n/a | Yes |
Installation & Setup
System Requirements
- Python: 3.9, 3.10, 3.11, 3.12
- OS: Linux, macOS, Windows
- RAM: 2GB minimum; 8GB+ recommended for large datasets
Basic Installation
# Minimal (core features only)
pip install -e .
# Development
pip install -e ".[dev]"
# Production (all features)
pip install -e ".[advanced,connectors,export,llm]"
# Everything (including dev tools)
pip install -e ".[advanced,connectors,export,llm,dev]"
Verify Installation
python -c "from xelytics import analyze; print('Xelytics installed')"
# Check version
python -c "import xelytics; print(xelytics.__version__)"
# Test CLI
xelytics --version
Documentation
Full documentation is available in the docs/ folder:
| Topic | Location |
|---|---|
| Installation | docs/installation.md |
| Quick Start | docs/quickstart.md |
| Statistical Analysis | docs/guides/01_basic_analysis.md |
| Time Series | docs/guides/02_time_series.md |
| Clustering | docs/guides/03_clustering.md |
| Performance | docs/guides/04_performance.md |
| Connectors | docs/guides/05_connectors.md |
| Export & Reports | docs/guides/06_export_reports.md |
| Custom Pipelines | docs/guides/07_custom_pipelines.md |
| CLI Guide | docs/guides/08_cli.md |
| Observability | docs/guides/09_observability.md |
| Extensibility | docs/guides/10_extensibility.md |
| API Reference | docs/api/ |
| Examples | examples/ |
| Migration Guide | docs/migration/v01_to_v02.md |
| v0.2 → v0.3 Migration | MIGRATION_GUIDE_v0.2_to_v0.3.md |
| Architecture | ARCHITECTURE.md |
| API Contract | API_CONTRACT.md |
| Comprehensive Docs | COMPREHENSIVE_DOCUMENTATION.md |
Development
Setup Development Environment
# Clone repository
git clone https://github.com/xelytics/xelytics-core.git
cd xelytics-core
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install dev dependencies
pip install -e ".[dev,advanced,connectors,export]"
Running Tests
# All tests
pytest tests/ -v
# Specific test file
pytest tests/test_clustering.py -v
# Tests matching pattern
pytest tests/ -k "test_kmeans" -v
# With coverage report
pytest tests/ --cov=xelytics --cov-report=html
# Only unit tests (exclude slow integration tests)
pytest tests/ -m "not integration" -v
# Only fast tests
pytest tests/ -m "not slow" -v
Code Formatting & Linting
# Format code with Black
black xelytics/ tests/ examples/
# Check formatting
black --check xelytics/ tests/
# Lint with Ruff
ruff check xelytics/ tests/ --fix
# Type checking with mypy
mypy xelytics/
Build & Publish
# Build package
pip install build
python -m build
# Publish to PyPI (requires credentials)
pip install twine
python -m twine upload dist/*
Testing & Quality Assurance
Test Coverage: 85%+ (307 tests)
Test Categories:
| Category | Count | Status |
|---|---|---|
| Unit Tests | 200+ | Passing |
| Integration Tests | 50+ | Passing |
| Performance Tests | 20+ | Passing |
| Backward Compatibility Tests | 8 | Passing (v0.1.0 code works in v0.2.0) |
| Example Scripts | 5 | Working |
Key Test Suites:
- test_core.py - Data ingestion, profiling, feature detection
- test_clustering.py - K-Means, DBSCAN, Hierarchical
- test_timeseries_advanced.py - Decomposition, forecasting, anomalies
- test_stats.py - Statistical tests, effect sizes, assumptions
- test_connectors_integration.py - Database connectivity
- test_export.py - HTML, PDF, PowerPoint, notebook export
- test_caching.py - File and Redis caching
- test_v02_backward_compatibility.py - v0.1.0 compatibility
Run Full Test Suite:
# Quick run (excludes slow tests)
pytest tests/ -m "not slow" --tb=short
# Full run (includes slow + integration)
pytest tests/ -v --tb=short
# With coverage
pytest tests/ --cov=xelytics --cov-report=term-missing
Architecture Evolution (v0.2.x → v0.3.0)
Added in v0.3.0
The v0.2.x architecture remains valid for simple DataFrame workflows: ingest data, detect schema/features, profile columns, run analysis modules, generate visualizations and insights, then export the result. v0.3.0 adds a planning layer in front of that pipeline rather than replacing it outright.
v0.2.x eager flow:
DataFrame -> ingestion -> profiling -> stats/time series/clustering -> insights -> exports
v0.3.0 lazy flow:
Dataset -> ExecutionPlan -> TransformGraph nodes -> executor -> analysis -> trace/profile/result
| Layer | v0.2.x | v0.3.0 |
|---|---|---|
| Public API | analyze(df) | analyze(df) plus Xelytics().dataset(...).analyze().run() |
| Data source | DataFrame or connector-loaded DataFrame | Dataset, MaterializedDataset, LazyDataset, connector-backed sources |
| Pipeline shape | Mostly linear, eager execution | DAG of plan nodes and transform nodes |
| Optimization | Parallel tasks, sampling, result cache | Execution planning, SQL pushdown, chunk-aware execution hooks, node cache |
| Metadata | RunMetadata | RunMetadata plus trace/profiling/lineage-capable metadata |
| Extensibility | Pipeline steps, exporters, LLM providers | Analyzer, transformation, and output registries |
Compatibility guarantee: the v0.3.0 executor still materializes into the established AnalysisResult schema after planning. Existing code that reads summary, statistics, visualizations, insights, metadata, time_series_analysis, or clusters can continue to do so.
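Concretely, code written against the eager result keeps working against a lazily planned run. A minimal sketch (tiny in-memory DataFrame, default config):
import pandas as pd
from xelytics import Xelytics, analyze
df = pd.DataFrame({"revenue": [1200, 800, 1500, 950], "cost": [300, 200, 450, 280]})
eager = analyze(df)                                   # v0.2.x-style eager path
lazy = Xelytics().dataset(df).analyze().run()         # v0.3.0 planned path
# Both paths materialize the same AnalysisResult schema
for result in (eager, lazy):
    print(result.summary.row_count, len(result.insights), len(result.visualizations))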
Architecture
System Design
+----------------------------------+
|         Public API Layer         |
|    analyze() / AnalysisConfig    |
+----------------+-----------------+
                 |
+----------------v-----------------+
|       Data Ingestion Layer       |
|  Connectors, DataFrames, Files   |
+----------------+-----------------+
                 |
+----------------v-----------------+
|         Processing Core          |
|     Type Detection, Sampling     |
|   Feature Detection, Profiling   |
+----------------+-----------------+
                 |
      +----------+-----------+
      |          |           |
+-----v----+ +---v-------+ +-v---------+
|  Stats   | | TimeSeries| | Clustering|
|  Engine  | |  Engine   | |  Engine   |
+-----+----+ +---+-------+ +-+---------+
      |          |           |
      +----------+-----------+
                 |
      +----------v-----------+
      |   Visualization &    |
      |  Insight Generator   |
      +----------+-----------+
                 |
      +----------v-----------+
      |     Export Layer     |
      |  HTML/PDF/PPTX/etc   |
      +----------------------+
Module Breakdown
xelytics-core/
├── xelytics/
│   ├── __init__.py              # Public API
│   ├── engine.py                # Main analyze() function
│   ├── api.py                   # Chainable Xelytics API (v0.3.0)
│   ├── dataset.py               # Dataset abstraction: materialized/lazy/transformed (v0.3.0)
│   ├── exceptions.py            # Exception hierarchy
│   │
│   ├── core/                    # Data pipeline
│   │   ├── ingestion.py         # Type detection, validation
│   │   ├── profiler.py          # Column statistics
│   │   ├── features.py          # Feature detection
│   │   └── chunked.py           # Large dataset processing
│   │
│   ├── stats/                   # Statistical analysis
│   │   ├── engine.py            # Test selection & execution
│   │   ├── planner.py           # Analysis planning
│   │   └── ...
│   │
│   ├── timeseries/              # Time series (v0.2.0)
│   │   ├── detector.py          # Series detection
│   │   ├── decomposition.py     # Trend/seasonal separation
│   │   ├── forecasting.py       # ARIMA/ExpSmoothing
│   │   ├── anomaly.py           # Anomaly detection
│   │   └── change_points.py     # Change point detection
│   │
│   ├── clustering/              # Clustering (v0.2.0)
│   │   ├── kmeans.py            # K-Means
│   │   ├── dbscan.py            # DBSCAN
│   │   ├── hierarchical.py      # Hierarchical clustering
│   │   └── profiler.py          # Cluster profiling
│   │
│   ├── connectors/              # Data sources (v0.2.0)
│   │   ├── postgres.py          # PostgreSQL
│   │   ├── mysql.py             # MySQL/MariaDB
│   │   ├── database.py          # Base SQL class
│   │   ├── s3.py                # AWS S3
│   │   ├── cloud.py             # Azure/GCS
│   │   └── ...
│   │
│   ├── export/                  # Report generation (v0.2.0)
│   │   ├── html.py              # HTML reports
│   │   ├── pdf.py               # PDF export
│   │   ├── pptx.py              # PowerPoint slides
│   │   ├── notebook.py          # Jupyter notebooks
│   │   └── ...
│   │
│   ├── cache/                   # Caching (v0.2.0)
│   │   ├── base.py              # Cache interface
│   │   ├── file.py              # File-based cache
│   │   └── redis.py             # Redis cache
│   │
│   ├── pipeline/                # Custom pipelines (v0.2.0)
│   │   ├── __init__.py          # Pipeline class
│   │   └── steps.py             # Pre-built steps
│   │
│   ├── execution/               # Lazy execution planning (v0.3.0)
│   │   ├── plan.py              # ExecutionPlan and PlanNode
│   │   ├── builder.py           # PlanBuilder
│   │   ├── executor.py          # DAG executor with tracing/profiling
│   │   └── pushdown.py          # SQL pushdown helpers
│   │
│   ├── graph/                   # Transformation DAG (v0.3.0)
│   │   ├── graph.py             # TransformGraph
│   │   ├── node.py              # DataSourceNode, TransformNode, SinkNode
│   │   ├── cache.py             # NodeCache
│   │   └── lineage.py           # LineageTracker
│   │
│   ├── analyzers/               # Modular analyzers (v0.3.0)
│   │   ├── profiling.py         # ProfilingAnalyzer
│   │   ├── correlation.py       # CorrelationAnalyzer
│   │   ├── trend_anomaly.py     # TrendAnomalyAnalyzer
│   │   └── segmentation.py      # SegmentationAnalyzer
│   │
│   ├── observability/           # Tracing and profiling (v0.3.0)
│   │   ├── trace.py             # TraceCollector
│   │   └── profiler.py          # ExecutionProfiler
│   │
│   ├── extension/               # Plugin registries (v0.3.0)
│   │   ├── interfaces.py        # Analyzer, CustomTransform, OutputFormat
│   │   └── registry.py          # register_* decorators
│   │
│   ├── llm/                     # LLM integration
│   │   ├── openai.py            # OpenAI provider
│   │   ├── groq.py              # Groq provider
│   │   └── base.py              # Provider interface
│   │
│   ├── viz/                     # Visualizations
│   │   ├── generator.py         # Plotly spec generation
│   │   └── themes.py            # Color schemes
│   │
│   ├── insights/                # Insight generation
│   │   ├── rules.py             # Rule-based insights
│   │   └── templates.py         # Insight templates
│   │
│   ├── schemas/                 # Type definitions
│   │   ├── config.py            # AnalysisConfig
│   │   └── outputs.py           # AnalysisResult & schemas
│   │
│   └── cli/                     # Command-line interface
│       └── main.py              # CLI entry point
│
├── tests/                       # 300+ tests
│   ├── test_core.py
│   ├── test_clustering.py
│   ├── test_timeseries_*.py
│   ├── test_connectors_integration.py
│   ├── test_export.py
│   └── ...
│
├── examples/                    # Example scripts
│   ├── quickstart.py
│   ├── forecasting_demo.py
│   └── ...
│
├── docs/                        # Full documentation
│   ├── guides/                  # Step-by-step guides
│   ├── api/                     # API reference
│   └── examples/                # Example notebooks
│
└── pyproject.toml               # Dependencies & config
API Classes & Functions
Core Classes
Start with the legacy entry point for simple, eager analysis:
from xelytics import analyze
result = analyze(df)
Adopt this when you need lazy data binding, plan inspection, graph transforms, observability, or extension registries:
from xelytics import Xelytics
result = Xelytics().dataset(df).analyze().run()
Migration notes:
- analyze(df), AnalysisConfig, AnalysisResult, connectors, cache backends, exporters, pipelines, time series modules, and clustering modules remain supported.
- v0.3.0 adds optional result fields: correlation, trend_anomaly, segmentation, trace, and profiling (see the sketch after this list).
- The v0.2.x eager Pipeline remains supported for preprocessing.
- Prefer Dataset.transform() or TransformGraph for transformations that need lineage, node caching, or plan visibility.
- Prefer Xelytics or build_plan() for new lazy and graph-aware workflows.
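A short sketch of probing the optional v0.3.0 fields named above; df is any pandas DataFrame, and only the field names (not their inner structure) are taken from this page:
from xelytics import Xelytics
result = Xelytics().dataset(df).analyze().run()
# Optional v0.3.0 result fields; expected to be absent or empty when an analyzer did not run
for field in ("correlation", "trend_anomaly", "segmentation", "trace", "profiling"):
    value = getattr(result, field, None)
    print(f"{field}: {'present' if value else 'not produced'}")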
Documentation
| Document | Purpose |
|---|---|
| examples/xelytics_core_v0_3_0_complete.ipynb | complete executable v0.3.0 notebook |
| docs/quickstart.md | copy-paste examples |
| docs/index.md | documentation index and feature matrix |
| docs/api/analyze.md | analyze(), Xelytics, and large-dataset API |
| docs/api/config.md | all AnalysisConfig fields |
| docs/api/result_schema.md | result dataclasses and serialization |
| docs/api/execution.md | Dataset, ExecutionPlan, TransformGraph, observability |
| docs/api/extensions.md | extension registry APIs |
| docs/guides/05_connectors.md | database and cloud source usage |
| docs/guides/06_export_reports.md | report export formats |
| MIGRATION_GUIDE_v0.2_to_v0.3.md | v0.2.x to v0.3.0 migration |
| ARCHITECTURE.md | package architecture |
| CHANGELOG.md | release history |
Development
pip install -e ".[dev]"
pytest tests/
Focused v0.3.0 verification:
pytest tests/test_epic1_connectivity.py tests/test_epic2.py tests/test_epic3.py
pytest tests/test_epic4.py tests/test_epic5.py tests/test_epic6.py tests/test_epic7.py
The package supports Python 3.9 through 3.12.
Project Status
Xelytics-Core is beta software. v0.3.0 is compatibility-first: older v0.2.x DataFrame workflows remain valid while the package moves toward the lazy, graph-aware engine model. See CHANGELOG.md and API_CONTRACT.md for versioning and compatibility policy.
License
MIT, as declared in pyproject.toml.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file xelytics_core-0.3.0.tar.gz.
File metadata
- Download URL: xelytics_core-0.3.0.tar.gz
- Upload date:
- Size: 224.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 944b9f600cc534f092af6890318398ca50616bde0329f52689b759ef982a4163 |
| MD5 | 285e64fd22cc9019b3e8005daca895a1 |
| BLAKE2b-256 | 6de154c790c8256e43977cbcbdfa55c1c718c2ffccf4ae22aa6b95819e01f671 |
File details
Details for the file xelytics_core-0.3.0-py3-none-any.whl.
File metadata
- Download URL: xelytics_core-0.3.0-py3-none-any.whl
- Upload date:
- Size: 182.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a3ce0871de74aa1253fa610b3b6524192a03da1884546de9f73e3bff164472a |
| MD5 | 1cb676474b7a3d667d77b923d1f9a7dc |
| BLAKE2b-256 | d259a0fb611e1fdeb1efc5c29fcca474c2c70b3bbe2b444e1771c3ce56b738cb |