Xelytics-Core: Complete Analytics Engine
Enterprise-grade pure analytics engine for automated statistical analysis, time series forecasting, clustering, and insight generation.
Status: Phases 1–3 complete ✅ | Foundation, Time Series Analysis, and Clustering fully implemented and tested.
What It Does
Xelytics-Core is a zero-configuration analytics engine that analyzes your data and produces professional insights, statistical tests, interactive visualizations, and predictions, all with a single function call.
One-line analysis:
from xelytics import analyze
import pandas as pd
df = pd.read_csv("data.csv")
result = analyze(df) # That's it!
for insight in result.insights:
    print(f"{insight.title}: {insight.description}")
Output includes:
- ✅ 50+ statistical tests (parametric & non-parametric)
- ✅ Time series decomposition & forecasting (ARIMA, Exponential Smoothing)
- ✅ Anomaly detection & change point detection
- ✅ Clustering analysis (K-Means, DBSCAN, Hierarchical)
- ✅ Interactive Plotly visualizations
- ✅ Human-readable insights (with optional LLM narration)
- ✅ Professional HTML, PDF, PowerPoint, and Jupyter reports
Core Principles
| Principle | Meaning |
|---|---|
| Zero Configuration | Works out-of-the-box with sensible defaults; optional parameters for advanced use |
| Pure Analytics | No HTTP, no databases, no authentication; just data in, results out |
| Type-Safe | All inputs and outputs are typed dataclasses with IDE autocomplete |
| Deterministic | Identical inputs always produce identical outputs |
| Backward Compatible | v0.1.0 code runs unchanged in v0.2.0+ |
| Extensible | Custom pipelines, connectors, exporters, and LLM providers |
| Production-Ready | Parallel execution, result caching, error handling, comprehensive testing |
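The Deterministic guarantee is easy to sanity-check from user code. A minimal sketch using only the documented analyze() and to_dict() calls (data.csv is a placeholder file):

import pandas as pd

from xelytics import analyze

df = pd.read_csv("data.csv")

# Identical inputs should serialize to identical results
assert analyze(df).to_dict() == analyze(df).to_dict()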
Quick Start (5 Minutes)
Installation
# Minimal install
pip install -e .
# With all features
pip install -e ".[advanced,connectors,export,llm,dev]"
Basic Analysis
from xelytics import analyze
import pandas as pd
# Load your data
df = pd.read_csv("sales.csv")
# Run comprehensive analysis in one line
result = analyze(df)
# Explore results
print(f"Rows: {result.summary.row_count}")
print(f"Tests executed: {result.metadata.tests_executed}")
# View key findings
for insight in result.insights[:3]:
    print(f"  • {insight.title}")
# Export as JSON
import json
with open("analysis.json", "w") as f:
json.dump(result.to_dict(), f, indent=2)
Output:
Rows: 1000
Tests executed: 47
  • Significant correlation detected: revenue vs. marketing_spend
  • Outliers detected in customer_age column
  • Data shows strong seasonality
Comprehensive Features & Usage
1️⃣ Statistical Analysis
Automatically runs relevant statistical tests based on data types and distributions.
Basic Usage
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
    significance_level=0.05,
    enable_llm_insights=False,
    max_visualizations=15,
)
result = analyze(df, config=config)

# View statistical tests
for test in result.statistics:
    print(f"{test.test_name}:")
    print(f"  p-value: {test.p_value:.4f}")
    print(f"  Significant: {test.significant}")
    print(f"  Effect size: {test.effect_size.value:.3f}")
Advanced: Custom Analysis Plan
# Define which columns to analyze
config = AnalysisConfig(
    include_columns=["age", "income", "purchase_frequency"],
    exclude_columns=["customer_id", "timestamp"],
    categorical_max_categories=50,  # Skip columns with >50 unique values
)
result = analyze(df, config=config)
Statistics Covered:
- ✅ Descriptive: mean, median, variance, skewness, kurtosis
- ✅ t-tests, ANOVA, Welch's test, Mann-Whitney U, Kruskal-Wallis
- ✅ Correlation: Pearson, Spearman, Kendall's tau
- ✅ Chi-square tests for categorical associations
- ✅ Effect sizes: Cohen's d, Cramér's V, eta-squared
- ✅ Assumption checks: normality (Shapiro-Wilk), homogeneity of variance (Levene)
2️⃣ Time Series Analysis (NEW in v0.2.0)
Complete time series toolkit: detection, decomposition, forecasting, anomalies.
Time Series Detection
from xelytics import analyze, AnalysisConfig
# Option 1: Auto-detect time series columns
config = AnalysisConfig(enable_time_series=True)
result = analyze(df, config=config)
# Option 2: Specify datetime column
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
)
result = analyze(df, config=config)

# Check which columns were detected as time series
for ts in result.time_series_analysis:
    print(f"{ts.column_name}:")
    print(f"  Type: {ts.series_type.value}")
    print(f"  Frequency: {ts.frequency}")
    print(f"  Has trend: {ts.has_trend}")
    print(f"  Has seasonality: {ts.has_seasonality}")
    if ts.has_seasonality:
        print(f"  Seasonal period: {ts.seasonal_period}")
Time Series Decomposition
# Automatically decompose into trend, seasonal, residual
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    decomposition_method="additive",  # or "multiplicative", "stl"
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.decomposition:
        print(f"{ts.column_name} decomposition:")
        print(f"  Trend strength: {ts.decomposition.trend_strength:.3f}")
        print(f"  Seasonal strength: {ts.decomposition.seasonal_strength:.3f}")
Forecasting
# ARIMA and Exponential Smoothing forecasting
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    forecast_periods=30,  # Forecast next 30 periods
    forecast_methods=["arima", "exponential_smoothing"],
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.forecasts:
        print(f"\n{ts.column_name} - Next 30 periods forecast:")
        for forecast in ts.forecasts[:5]:  # Show first 5
            print(f"  Period {forecast.period}: {forecast.value:.2f} "
                  f"(95% CI: {forecast.lower_bound:.2f}-{forecast.upper_bound:.2f})")
Anomaly Detection
# Multiple detection methods: Z-score, IQR, MAD, Isolation Forest
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,  # 95th percentile threshold
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.anomalies:
        print(f"\n{ts.column_name} - Anomalies detected:")
        for anomaly in ts.anomalies[:3]:
            print(f"  Index {anomaly.index}: {anomaly.value:.2f} "
                  f"(severity: {anomaly.severity}, confidence: {anomaly.confidence:.2f})")
Change Point Detection
# Detect structural breaks (CUSUM algorithm)
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    detect_change_points=True,
    change_point_sensitivity=0.05,
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.change_points:
        print(f"\n{ts.column_name} - Change points:")
        for cp in ts.change_points:
            print(f"  At index {cp.index}: magnitude={cp.magnitude:.2f}, "
                  f"confidence={cp.confidence:.2f}")
3️⃣ Clustering & Segmentation (NEW in v0.2.0)
Unsupervised learning for customer segmentation, market clustering, etc.
Basic Clustering
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=8,
    exclude_columns=["customer_id", "name"],
)
result = analyze(df, config=config)

# View clusters
print(f"Algorithm used: {result.clusters[0].algorithm}")
for cluster in result.clusters:
    print(f"\nCluster {cluster.cluster_id}:")
    print(f"  Size: {cluster.size} members ({cluster.size / result.summary.row_count * 100:.1f}%)")
    print(f"  Silhouette score: {cluster.silhouette_score:.3f}")
    print(f"  Profile: {cluster.profile}")
K-Means (with Automatic K Selection)
# K-Means tries multiple K values and picks the best
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=10,
    k_selection_method="elbow",  # elbow, silhouette, gap_statistic
)
result = analyze(df, config=config)

# View metrics for each K
for cluster in result.clusters:
    print(f"K={cluster.algorithm_params['n_clusters']}: "
          f"silhouette={cluster.silhouette_score:.3f}")
DBSCAN (Density-Based)
# DBSCAN finds natural clusters and noise points
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="dbscan",
    dbscan_eps=0.5,  # Auto-estimated if not provided
    dbscan_min_samples=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    noise_label = "Noise" if cluster.cluster_id == -1 else f"Cluster {cluster.cluster_id}"
    print(f"{noise_label}: {cluster.size} points")
Hierarchical Clustering
# Produces dendrograms and tree-based clusters
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="hierarchical",
    hierarchical_linkage="ward",  # ward, complete, average, single
    max_clusters=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    print(f"Cluster {cluster.cluster_id}: {cluster.size} members")
4️⃣ Data Connectors (NEW in v0.2.0)
Analyze data directly from databases and cloud storage; no manual data export needed.
PostgreSQL
import os

from xelytics import analyze
from xelytics.connectors import connect_to_source

connector = connect_to_source(
    source_type="postgresql",
    host="db.example.com",
    database="analytics",
    user="analyst",
    password=os.getenv("DB_PASSWORD"),
    port=5432,
)
try:
    connector.connect()
    df = connector.query("""
        SELECT customer_id, age, income, purchase_count, lifetime_value
        FROM customers
        WHERE signup_year >= 2023
    """)
finally:
    connector.disconnect()

result = analyze(df)
MySQL / MariaDB
connector = connect_to_source(
    source_type="mysql",
    host="db.example.com",
    database="analytics",
    user="analyst",
    password=os.getenv("DB_PASSWORD"),
)
df = connector.query("SELECT * FROM sales_data WHERE year = 2025")
result = analyze(df)
SQLite
connector = connect_to_source(
    source_type="sqlite",
    database="/path/to/analytics.db",
)
df = connector.query("SELECT * FROM daily_metrics")
result = analyze(df)
BigQuery
connector = connect_to_source(
    source_type="bigquery",
    project_id="my-project",
    credentials_path="/path/to/service-account.json",
)
df = connector.query("""
    SELECT * FROM `my-project.dataset.events`
    WHERE event_date >= '2025-01-01'
    LIMIT 100000
""")
result = analyze(df)
Snowflake
connector = connect_to_source(
    source_type="snowflake",
    account="xy12345",
    warehouse="COMPUTE",
    database="ANALYTICS",
    schema="PUBLIC",
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
)
df = connector.query("SELECT * FROM CUSTOMER_DATA")
result = analyze(df)
S3 / Cloud Storage
# Amazon S3
connector = connect_to_source(
    source_type="s3",
    bucket="my-analytics-bucket",
    key="data/sales.parquet",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
)
df = connector.query()  # Returns a DataFrame
result = analyze(df)

# Azure Blob Storage
connector = connect_to_source(
    source_type="azure_blob",
    container_name="data",
    blob_name="sales.csv",
    connection_string=os.getenv("AZURE_CONN_STRING"),
)
df = connector.query()
result = analyze(df)

# Google Cloud Storage
connector = connect_to_source(
    source_type="gcs",
    bucket="my-bucket",
    key="data/sales.csv",
    credentials_path="/path/to/gcp-key.json",
)
df = connector.query()
result = analyze(df)
5️⃣ Report Generation (NEW in v0.2.0)
Generate professional, interactive reports in multiple formats.
HTML Report
import os

from xelytics import analyze
from xelytics.export import HTMLReportGenerator

result = analyze(df)

generator = HTMLReportGenerator(
    theme="light",  # light or dark
    logo_text="ACME Corp",
    company_name="ACME Analytics",
)
html = generator.generate(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    include_raw_data=False,  # Don't embed the full dataset
)

with open("report.html", "w") as f:
    f.write(html)

# Open in browser (os.startfile is Windows-only; use webbrowser.open elsewhere)
os.startfile("report.html")
PDF Report
from xelytics.export import generate_pdf_report
pdf_bytes = generate_pdf_report(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    orientation="portrait",  # or "landscape"
)

with open("report.pdf", "wb") as f:
    f.write(pdf_bytes)
PowerPoint Presentation
from xelytics.export import generate_pptx_report
pptx = generate_pptx_report(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    theme="office",  # office, modern, minimal
    include_speaker_notes=True,
)
pptx.save("report.pptx")
Jupyter Notebook
import json

from xelytics.export import generate_notebook

notebook = generate_notebook(
    result,
    title="Q1 2025 Sales Analysis",
    include_code_cells=True,
    include_raw_result=True,
)

with open("analysis.ipynb", "w") as f:
    json.dump(notebook, f)
JSON Export
import json

# For programmatic access or storage
with open("analysis.json", "w") as f:
    json.dump(result.to_dict(), f, indent=2)

# Later, reconstruct from JSON
from xelytics.schemas.outputs import AnalysisResult

with open("analysis.json") as f:
    data = json.load(f)
result = AnalysisResult(**data)
6️⃣ Custom Pipelines (NEW in v0.2.0)
Pre-process data with custom steps before analysis.
from xelytics import analyze, AnalysisConfig
from xelytics.pipeline import Pipeline, correlation_analysis, normalize, pca, remove_outliers

# Build a custom pipeline
pipeline = Pipeline([
    remove_outliers(method="iqr", threshold=1.5),
    normalize(method="minmax"),
    pca(n_components=10),
    correlation_analysis(threshold=0.7),
])

# Apply before analysis
df_processed = pipeline.fit_transform(df)
result = analyze(df_processed)

# Or use in AnalysisConfig
config = AnalysisConfig(
    run_custom_pipeline=True,
    custom_pipeline=pipeline,
)
result = analyze(df, config=config)
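Custom steps can sit alongside the built-ins. A minimal sketch, assuming Pipeline accepts any object exposing a fit_transform(df) method, the way the pre-built steps are used above (the exact step interface is documented in docs/guides/07_custom_pipelines.md; DropConstantColumns is a hypothetical example):

import pandas as pd

class DropConstantColumns:
    # Hypothetical custom step: drop columns with a single unique value,
    # assuming steps only need to implement fit_transform(df) -> df
    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.loc[:, df.nunique(dropna=False) > 1]

pipeline = Pipeline([
    DropConstantColumns(),
    normalize(method="minmax"),
])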
7️⃣ Caching (NEW in v0.2.0)
Speed up repeated analyses on the same data.
File-Based Cache
from xelytics import analyze, AnalysisConfig
from xelytics.cache import FileCache

cache = FileCache(cache_dir="./cache")
config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

# First run: takes the full time
result1 = analyze(df, config=config)

# Subsequent runs on the same data: near-instant
result2 = analyze(df, config=config)  # Retrieved from cache!
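To confirm the cache is working, time the cold and warm calls with the standard library:

import time

t0 = time.perf_counter()
analyze(df, config=config)  # cold: runs the full analysis
t1 = time.perf_counter()
analyze(df, config=config)  # warm: served from the file cache
t2 = time.perf_counter()
print(f"cold: {t1 - t0:.2f}s, warm: {t2 - t1:.2f}s")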
Redis Cache (Distributed)
from xelytics.cache import RedisCache
cache = RedisCache(host="localhost", port=6379, db=0, ttl=3600)
config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)
result = analyze(df, config=config)
Clear Cache
from xelytics.cache import clear_cache
# Clear all caches
clear_cache(pattern="*")
# Clear specific patterns
clear_cache(pattern="stats:*") # Only clear stats caches
8️⃣ CLI (Command-Line Interface)
Analyze without writing Python code.
# Basic analysis - outputs JSON
xelytics analyze data.csv
# Save to file
xelytics analyze data.csv --output results.json
# Set parameters
xelytics analyze data.csv \
  --format=json \
  --alpha 0.01 \
  --no-llm \
  --max-visualizations 20 \
  --datetime-column "date"

# Time series analysis
xelytics analyze data.csv \
  --enable-time-series \
  --datetime-column "date" \
  --forecast-periods 30

# Clustering
xelytics analyze data.csv \
  --enable-clustering \
  --clustering-algorithm kmeans \
  --max-clusters 5
# Show version
xelytics --version
# Help
xelytics --help
9️⃣ LLM Integration (Optional)
Enhance insights with AI narration.
import os

from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",  # openai, groq, azure, or local
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)
result = analyze(df, config=config)

# Insights now include AI-generated descriptions
for insight in result.insights:
    print(insight.title)
    print(f"  {insight.narrative}")  # AI-generated explanation
Multiple LLM Providers
# OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# Groq (fast inference for open-weight models)
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="groq",
    llm_model="mixtral-8x7b",
    llm_api_key=os.getenv("GROQ_API_KEY"),
)

# Azure OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="azure",
    llm_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    llm_api_key=os.getenv("AZURE_OPENAI_KEY"),
)
Large Dataset Support
Analyze datasets with millions of rows.
from xelytics import analyze, AnalysisConfig
# Auto-sample when the data exceeds 1M rows
config = AnalysisConfig(
    sampling_strategy="auto",
    max_rows=1_000_000,
    parallel_execution=True,
    max_workers=4,
)

# Or force stratified sampling to a fixed size
config = AnalysisConfig(
    sampling_strategy="stratified",
    sample_size=100_000,
    parallel_execution=True,
    max_workers=4,
)

result = analyze(df, config=config)
Chunked Processing for Very Large Files
from xelytics.engine import analyze_large_dataset

# Process a 10M-row file without loading it all into memory
result = analyze_large_dataset(
    source="huge_sales_data.csv",
    chunksize=50_000,
    sample_size=100_000,  # Take a sample for the full analysis
    config=AnalysisConfig(),
)
Configuration Reference
from xelytics import AnalysisConfig
config = AnalysisConfig(
    # General
    significance_level=0.05,
    mode="automated",                 # automated or semi-automated

    # Columns
    include_columns=None,             # [list] Include only these columns
    exclude_columns=None,             # [list] Exclude these columns
    datetime_column=None,             # [str] Column name for time series

    # Time Series
    enable_time_series=False,
    decomposition_method="additive",  # additive, multiplicative, stl
    forecast_periods=0,
    forecast_methods=["arima", "exponential_smoothing"],
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,
    detect_change_points=False,

    # Clustering
    enable_clustering=False,
    clustering_algorithm="auto",      # auto, kmeans, dbscan, hierarchical
    max_clusters=10,
    k_selection_method="elbow",

    # Performance
    parallel_execution=True,
    max_workers=4,
    sampling_strategy="auto",
    max_rows=1_000_000,

    # Caching
    enable_caching=False,
    cache_backend=None,

    # Reporting
    max_visualizations=15,
    run_custom_pipeline=False,
    custom_pipeline=None,

    # LLM
    enable_llm_insights=False,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=None,

    # Other
    random_seed=42,
    verbose=True,
)
End-to-End Workflow Example
Complete analysis pipeline from data to report:
#!/usr/bin/env python3
"""Complete analysis workflow."""
import json
import os
from datetime import datetime

from xelytics import analyze, AnalysisConfig
from xelytics.connectors import connect_to_source
from xelytics.export import HTMLReportGenerator, generate_pdf_report

# 1. LOAD DATA
print("Loading data...")
connector = connect_to_source(
    source_type="postgresql",
    host="db.example.com",
    database="sales",
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
try:
    connector.connect()
    df = connector.query("""
        SELECT
            order_id, customer_id, order_date,
            product_category, quantity, unit_price, total_amount,
            customer_age, customer_region, is_returning_customer
        FROM orders
        WHERE order_date >= '2024-01-01'
    """)
    print(f"Loaded {len(df):,} rows")
finally:
    connector.disconnect()

# 2. CONFIGURE ANALYSIS
print("\nConfiguring analysis...")
config = AnalysisConfig(
    significance_level=0.05,
    # Time series
    enable_time_series=True,
    datetime_column="order_date",
    forecast_periods=30,
    # Clustering
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=5,
    # Performance
    parallel_execution=True,
    max_workers=4,
    # Cache for later
    enable_caching=True,
    # Reporting
    max_visualizations=20,
    enable_llm_insights=True,
    llm_provider="openai",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# 3. RUN ANALYSIS
print("\nRunning analysis...")
result = analyze(df, config=config)

# 4. EXPLORE RESULTS
print(f"\nAnalysis complete in {result.metadata.execution_time_ms}ms")
print(f"  • Tests: {result.metadata.tests_executed}")
print(f"  • Visualizations: {len(result.visualizations)}")
print(f"  • Insights: {len(result.insights)}")
print(f"  • Time series analyzed: {len(result.time_series_analysis)}")
print(f"  • Clusters: {len(result.clusters)}")

print("\nKey Insights:")
for i, insight in enumerate(result.insights[:5], 1):
    print(f"  {i}. {insight.title}")
    if hasattr(insight, "narrative"):
        print(f"     {insight.narrative[:100]}...")

# 5. GENERATE REPORTS
print("\nGenerating reports...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
os.makedirs("reports", exist_ok=True)

# HTML Report
html_generator = HTMLReportGenerator(
    theme="light",
    logo_text="Sales Analytics",
    company_name="ACME Corp",
)
html = html_generator.generate(
    result,
    title="Sales Analysis Report",
    author="Data Science Team",
)
html_path = f"reports/sales_analysis_{timestamp}.html"
with open(html_path, "w") as f:
    f.write(html)
print(f"  HTML: {html_path}")

# PDF Report
pdf_bytes = generate_pdf_report(
    result,
    title="Sales Analysis Report",
    author="Data Science Team",
)
pdf_path = f"reports/sales_analysis_{timestamp}.pdf"
with open(pdf_path, "wb") as f:
    f.write(pdf_bytes)
print(f"  PDF: {pdf_path}")

# JSON Export
json_path = f"reports/sales_analysis_{timestamp}.json"
with open(json_path, "w") as f:
    json.dump(result.to_dict(), f, indent=2)
print(f"  JSON: {json_path}")

print("\nAnalysis complete!")
print(f"Reports saved to: {os.path.abspath('reports')}")
Output:
Loading data...
Loaded 150,432 rows

Configuring analysis...

Running analysis...

Analysis complete in 3421ms
  • Tests: 47
  • Visualizations: 18
  • Insights: 12
  • Time series analyzed: 2
  • Clusters: 5

Key Insights:
  1. Significant correlation detected: total_amount vs. customer_age
  2. Strong seasonality in Q4 sales
  3. Customer segmentation: 5 distinct groups identified
  4. Outliers detected in unit_price column
  5. Increasing trend in repeat customer rate

Generating reports...
  HTML: reports/sales_analysis_20250307_143021.html
  PDF: reports/sales_analysis_20250307_143021.pdf
  JSON: reports/sales_analysis_20250307_143021.json

Analysis complete!
Reports saved to: /home/user/reports
Performance & Scaling
| Dataset Size | Processing Time | Max Parallel Tasks |
|---|---|---|
| 10K rows | 1–2 seconds | 3 |
| 100K rows | 5–10 seconds | 4 |
| 1M rows | 30–60 seconds | 4 |
| 10M rows | 3–5 minutes | 4 (chunked) |
| 100M rows | 10–30 minutes | 4 (chunked + sampled) |
Optimization Strategies:
- ✅ Automatic sampling for datasets > 1M rows
- ✅ Parallel execution (4 workers by default)
- ✅ Result caching (file or Redis)
- ✅ Progress callbacks for long-running analyses (see the sketch below)
- ✅ Memory-aware warnings (logs a warning above 1GB)
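The progress-callback hook is not listed in the configuration reference above, so the exact parameter is an assumption. A minimal sketch, assuming AnalysisConfig accepts a progress_callback argument (check docs/guides/04_performance.md for the actual name and signature):

from xelytics import analyze, AnalysisConfig

def on_progress(fraction: float, stage: str) -> None:
    # Hypothetical callback signature: completion fraction plus stage label
    print(f"[{fraction:6.1%}] {stage}")

config = AnalysisConfig(progress_callback=on_progress)  # hypothetical parameter
result = analyze(df, config=config)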
Feature Comparison
| Feature | v0.1.0 | v0.2.0 |
|---|---|---|
| Statistical Analysis | ✅ | ✅ |
| Automated test selection | ✅ | ✅ |
| Effect size calculation | ✅ | ✅ |
| Assumption checking | ✅ | ✅ |
| Time Series (NEW) | ❌ | ✅ |
| Detection & decomposition | ❌ | ✅ |
| ARIMA & ES forecasting | ❌ | ✅ |
| Anomaly detection | ❌ | ✅ |
| Change point detection | ❌ | ✅ |
| Clustering (NEW) | ❌ | ✅ |
| K-Means | ❌ | ✅ |
| DBSCAN | ❌ | ✅ |
| Hierarchical | ❌ | ✅ |
| Cluster profiling | ❌ | ✅ |
| Performance (NEW) | ❌ | ✅ |
| Parallel execution | ❌ | ✅ |
| Result caching | ❌ | ✅ |
| Sampling strategies | ❌ | ✅ |
| Chunked processing | ❌ | ✅ |
| Connectors (NEW) | ❌ | ✅ |
| PostgreSQL | ❌ | ✅ |
| MySQL/MariaDB | ❌ | ✅ |
| SQLite | ❌ | ✅ |
| BigQuery | ❌ | ✅ |
| Snowflake | ❌ | ✅ |
| S3/Azure/GCS | ❌ | ✅ |
| Export (NEW) | ❌ | ✅ |
| HTML reports | ❌ | ✅ |
| PDF export | ❌ | ✅ |
| PowerPoint slides | ❌ | ✅ |
| Jupyter notebooks | ❌ | ✅ |
| JSON export | ❌ | ✅ |
| Other Features | | |
| Data profiling | ✅ | ✅ |
| Rule-based insights | ✅ | ✅ |
| LLM narration | ✅ | ✅ |
| Custom pipelines | ❌ | ✅ |
| Progress callbacks | ❌ | ✅ |
| CLI interface | ❌ | ✅ |
| Backward compatible | ❌ | ✅ |
Installation & Setup
System Requirements
- Python: 3.9, 3.10, 3.11, 3.12
- OS: Linux, macOS, Windows
- RAM: 2GB minimum; 8GB+ recommended for large datasets
Basic Installation
# Minimal (core features only)
pip install -e .
# Development
pip install -e ".[dev]"
# Production (all features)
pip install -e ".[advanced,connectors,export,llm]"
# Everything (including dev tools)
pip install -e ".[advanced,connectors,export,llm,dev]"
Verify Installation
python -c "from xelytics import analyze; print('โ Xelytics installed')"
# Check version
python -c "import xelytics; print(xelytics.__version__)"
# Test CLI
xelytics --version
Documentation
Full documentation is available in the docs/ folder:
| Topic | Location |
|---|---|
| Installation | docs/installation.md |
| Quick Start | docs/quickstart.md |
| Statistical Analysis | docs/guides/01_basic_analysis.md |
| Time Series | docs/guides/02_time_series.md |
| Clustering | docs/guides/03_clustering.md |
| Performance | docs/guides/04_performance.md |
| Connectors | docs/guides/05_connectors.md |
| Export & Reports | docs/guides/06_export_reports.md |
| Custom Pipelines | docs/guides/07_custom_pipelines.md |
| CLI Guide | docs/guides/08_cli.md |
| API Reference | docs/api/ |
| Examples | examples/ |
| Migration Guide | docs/migration/v01_to_v02.md |
| API Contract | API_CONTRACT.md |
| Comprehensive Docs | COMPREHENSIVE_DOCUMENTATION.md |
Development
Setup Development Environment
# Clone repository
git clone https://github.com/xelytics/xelytics-core.git
cd xelytics-core
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install dev dependencies
pip install -e ".[dev,advanced,connectors,export]"
Running Tests
# All tests
pytest tests/ -v
# Specific test file
pytest tests/test_clustering.py -v
# Tests matching pattern
pytest tests/ -k "test_kmeans" -v
# With coverage report
pytest tests/ --cov=xelytics --cov-report=html
# Only unit tests (exclude slow integration tests)
pytest tests/ -m "not integration" -v
# Only fast tests
pytest tests/ -m "not slow" -v
Code Formatting & Linting
# Format code with Black
black xelytics/ tests/ examples/
# Check formatting
black --check xelytics/ tests/
# Lint with Ruff
ruff check xelytics/ tests/ --fix
# Type checking with mypy
mypy xelytics/
Building & Publishing
# Build package
pip install build
python -m build
# Publish to PyPI (requires credentials)
pip install twine
python -m twine upload dist/*
Testing & Quality Assurance
Test Coverage: 85%+ (307 tests)
Test Categories:
| Category | Count | Status |
|---|---|---|
| Unit Tests | 200+ | ✅ Passing |
| Integration Tests | 50+ | ✅ Passing |
| Performance Tests | 20+ | ✅ Passing |
| Backward Compatibility Tests | 8 | ✅ Passing (v0.1.0 code works in v0.2.0) |
| Example Scripts | 5 | ✅ Working |
Key Test Suites:
- ✅ test_core.py: Data ingestion, profiling, feature detection
- ✅ test_clustering.py: K-Means, DBSCAN, Hierarchical
- ✅ test_timeseries_advanced.py: Decomposition, forecasting, anomalies
- ✅ test_stats.py: Statistical tests, effect sizes, assumptions
- ✅ test_connectors_integration.py: Database connectivity
- ✅ test_export.py: HTML, PDF, PowerPoint, notebook export
- ✅ test_caching.py: File and Redis caching
- ✅ test_v02_backward_compatibility.py: v0.1.0 compatibility
Run Full Test Suite:
# Quick run (excludes slow tests)
pytest tests/ -m "not slow" --tb=short
# Full run (includes slow + integration)
pytest tests/ -v --tb=short
# With coverage
pytest tests/ --cov=xelytics --cov-report=term-missing
Architecture
System Design
┌─────────────────────────────────┐
│        Public API Layer         │
│   analyze() / AnalysisConfig    │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│      Data Ingestion Layer       │
│  Connectors, DataFrames, Files  │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│         Processing Core         │
│    Type Detection, Sampling     │
│  Feature Detection, Profiling   │
└───────────────┬─────────────────┘
                │
      ┌─────────┼──────────────┐
      │         │              │
┌─────▼────┐ ┌──▼─────────┐ ┌──▼────────┐
│  Stats   │ │ TimeSeries │ │Clustering │
│  Engine  │ │   Engine   │ │  Engine   │
└─────┬────┘ └──┬─────────┘ └──┬────────┘
      │         │              │
      └─────────┼──────────────┘
                │
     ┌──────────▼──────────┐
     │   Visualization &   │
     │  Insight Generator  │
     └──────────┬──────────┘
                │
     ┌──────────▼──────────┐
     │    Export Layer     │
     │  HTML/PDF/PPTX/etc  │
     └─────────────────────┘
Module Breakdown
xelytics-core/
├── xelytics/
│   ├── __init__.py              # Public API
│   ├── engine.py                # Main analyze() function
│   ├── exceptions.py            # Exception hierarchy
│   │
│   ├── core/                    # Data pipeline
│   │   ├── ingestion.py         # Type detection, validation
│   │   ├── profiler.py          # Column statistics
│   │   ├── features.py          # Feature detection
│   │   └── chunked.py           # Large dataset processing
│   │
│   ├── stats/                   # Statistical analysis
│   │   ├── engine.py            # Test selection & execution
│   │   ├── planner.py           # Analysis planning
│   │   └── ...
│   │
│   ├── timeseries/              # Time series (v0.2.0)
│   │   ├── detector.py          # Series detection
│   │   ├── decomposition.py     # Trend/seasonal separation
│   │   ├── forecasting.py       # ARIMA/ExpSmoothing
│   │   ├── anomaly.py           # Anomaly detection
│   │   └── change_points.py     # Change point detection
│   │
│   ├── clustering/              # Clustering (v0.2.0)
│   │   ├── kmeans.py            # K-Means
│   │   ├── dbscan.py            # DBSCAN
│   │   ├── hierarchical.py      # Hierarchical clustering
│   │   └── profiler.py          # Cluster profiling
│   │
│   ├── connectors/              # Data sources (v0.2.0)
│   │   ├── postgres.py          # PostgreSQL
│   │   ├── mysql.py             # MySQL/MariaDB
│   │   ├── database.py          # Base SQL class
│   │   ├── s3.py                # AWS S3
│   │   ├── cloud.py             # Azure/GCS
│   │   └── ...
│   │
│   ├── export/                  # Report generation (v0.2.0)
│   │   ├── html.py              # HTML reports
│   │   ├── pdf.py               # PDF export
│   │   ├── pptx.py              # PowerPoint slides
│   │   ├── notebook.py          # Jupyter notebooks
│   │   └── ...
│   │
│   ├── cache/                   # Caching (v0.2.0)
│   │   ├── base.py              # Cache interface
│   │   ├── file.py              # File-based cache
│   │   └── redis.py             # Redis cache
│   │
│   ├── pipeline/                # Custom pipelines (v0.2.0)
│   │   ├── __init__.py          # Pipeline class
│   │   └── steps.py             # Pre-built steps
│   │
│   ├── llm/                     # LLM integration
│   │   ├── openai.py            # OpenAI provider
│   │   ├── groq.py              # Groq provider
│   │   └── base.py              # Provider interface
│   │
│   ├── viz/                     # Visualizations
│   │   ├── generator.py         # Plotly spec generation
│   │   └── themes.py            # Color schemes
│   │
│   ├── insights/                # Insight generation
│   │   ├── rules.py             # Rule-based insights
│   │   └── templates.py         # Insight templates
│   │
│   ├── schemas/                 # Type definitions
│   │   ├── config.py            # AnalysisConfig
│   │   └── outputs.py           # AnalysisResult & schemas
│   │
│   └── cli/                     # Command-line interface
│       └── main.py              # CLI entry point
│
├── tests/                       # 300+ tests
│   ├── test_core.py
│   ├── test_clustering.py
│   ├── test_timeseries_*.py
│   ├── test_connectors_integration.py
│   ├── test_export.py
│   └── ...
│
├── examples/                    # Example scripts
│   ├── quickstart.py
│   ├── forecasting_demo.py
│   └── ...
│
├── docs/                        # Full documentation
│   ├── guides/                  # Step-by-step guides
│   ├── api/                     # API reference
│   └── examples/                # Example notebooks
│
└── pyproject.toml               # Dependencies & config
API Classes & Functions
Core Classes
# Main entry point
from xelytics import analyze, AnalysisConfig, AnalysisResult
# Configuration
config = AnalysisConfig(...)
# Run analysis
result: AnalysisResult = analyze(df, config=config)
# Access results
result.summary # DatasetSummary
result.statistics # List[StatisticalTestResult]
result.visualizations # List[VisualizationSpec]
result.insights # List[Insight]
result.time_series_analysis # List[TimeSeriesResult]
result.clusters # List[ClusterResult]
result.metadata # RunMetadata
Data Source Connectors
from xelytics.connectors import connect_to_source
connector = connect_to_source(source_type="postgresql", ...)
df = connector.query("SELECT * FROM table")
Export Functions
from xelytics.export import (
    HTMLReportGenerator,
    generate_pdf_report,
    generate_pptx_report,
    generate_notebook,
)
Caching
from xelytics.cache import FileCache, RedisCache, get_cache, clear_cache
cache = get_cache("file", cache_dir="./cache")
Time Series
from xelytics.timeseries import (
    analyze_time_series,
    decompose_time_series,
    forecast_time_series,
    detect_anomalies,
    detect_change_points,
)
Clustering
from xelytics.clustering import (
    analyze_clusters,
    cluster_kmeans,
    cluster_dbscan,
    cluster_hierarchical,
    profile_clusters,
)
Contributing
We welcome contributions! Here's how you can help:
Reporting Issues
- Check existing issues
- Create a new issue with:
  - A descriptive title
  - Steps to reproduce
  - Expected vs. actual behavior
  - Environment info (Python version, OS, xelytics version)
Submitting Changes
- Fork the repository
- Create a branch: git checkout -b feature/my-feature
- Make changes and add tests
- Format code: black xelytics/ tests/
- Run tests: pytest tests/
- Commit: git commit -am 'Add my feature'
- Push: git push origin feature/my-feature
- Create a pull request
Code Standards
- Style: Black formatting, 100-char line length
- Types: Type hints for all functions
- Tests: Each feature needs tests (85%+ coverage target)
- Docs: Docstrings for all public functions
Changelog
v0.2.0-alpha.1 (February 2026) – Current
Phases Completed:
- ✅ Phase 1: Foundation & backward compatibility
- ✅ Phase 2: Time series analysis
- ✅ Phase 3: Clustering & profiling
Key Features Added:
- Time series: detection, decomposition, forecasting, anomalies, change points
- Clustering: K-Means, DBSCAN, Hierarchical, profiling
- Connectors: PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, S3, Azure, GCS
- Export: HTML, PDF, PowerPoint, Jupyter notebooks
- Performance: Parallel execution, caching, sampling, chunked processing
- CLI and custom pipelines
v0.1.0 → v0.2.0 Compatibility: ✅ 100% backward compatible; all v0.1.0 code runs unchanged.
See CHANGELOG.md for full history and API_CONTRACT.md for versioning policy.
Learning Resources
- API Documentation: See docs/api/
- Quick Start: docs/quickstart.md
- Example Scripts: examples/
- GitHub Discussions: Ask questions in GitHub Discussions
- Issues: Report bugs in GitHub Issues
Support
| Channel | Purpose |
|---|---|
| Documentation | How-to guides, API reference, examples |
| GitHub Discussions | Q&A, feature ideas, best practices |
| GitHub Issues | Bug reports, feature requests |
| Email | contact@xelytics.io |
License
MIT License – see LICENSE for details.
Copyright (c) 2026 Xelytics Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
Acknowledgments
Built with ❤️ using:
- pandas – Data manipulation
- scikit-learn – Machine learning
- statsmodels – Statistical modeling
- plotly – Interactive visualizations
- pingouin – Statistical functions
Project Status
| Component | v0.1.0 | v0.2.0 | Status |
|---|---|---|---|
| Core Analytics | Beta | Beta | ✅ Stable |
| Time Series | ❌ | Beta | ✅ Working |
| Clustering | ❌ | Beta | ✅ Working |
| Connectors | ❌ | Beta | ✅ Working |
| Export | ❌ | Beta | ✅ Working |
| CLI | ❌ | Beta | ✅ Working |
Next Milestones:
- v0.2.1: Bug fixes, performance improvements
- v0.3.0: Advanced forecasting (Prophet), deep learning integration
- v1.0.0: API stabilization, user feedback incorporation
⭐ Star this repository if you find it useful!
Questions? Open an issue or start a discussion.