Xelytics-Core: Complete Analytics Engine
Enterprise-grade pure analytics engine for automated statistical analysis, time series forecasting, clustering, and insight generation.
Status: Phases 1–3 complete ✅ | Foundation, Time Series Analysis, and Clustering fully implemented and tested.
What It Does
Xelytics-Core is a zero-configuration analytics engine that analyzes your data and produces professional insights, statistical tests, interactive visualizations, and predictions, all with a single function call.
One-line analysis:
from xelytics import analyze
import pandas as pd
df = pd.read_csv("data.csv")
result = analyze(df) # That's it!
for insight in result.insights:
    print(f"{insight.title}: {insight.description}")
Output includes:
- ✅ 50+ statistical tests (parametric & non-parametric)
- ✅ Time series decomposition & forecasting (ARIMA, Exponential Smoothing)
- ✅ Anomaly detection & change point detection
- ✅ Clustering analysis (K-Means, DBSCAN, Hierarchical)
- ✅ Interactive Plotly visualizations
- ✅ Human-readable insights (with optional LLM narration)
- ✅ Professional HTML, PDF, PowerPoint, and Jupyter reports
Core Principles
| Principle | Meaning |
|---|---|
| Zero Configuration | Works out-of-the-box with sensible defaults; optional parameters for advanced use |
| Pure Analytics | No HTTP, no databases, no authentication; just data in, results out |
| Type-Safe | All inputs and outputs are typed dataclasses with IDE autocomplete |
| Deterministic | Identical inputs always produce identical outputs |
| Backward Compatible | v0.1.0 code runs unchanged in v0.2.0+ |
| Extensible | Custom pipelines, connectors, exporters, and LLM providers |
| Production-Ready | Parallel execution, result caching, error handling, comprehensive testing |
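The Deterministic guarantee is easy to sanity-check from user code. A minimal sketch using only the documented analyze() and to_dict() calls (data.csv is a placeholder file):

import pandas as pd

from xelytics import analyze

df = pd.read_csv("data.csv")

# Identical inputs should serialize to identical results
assert analyze(df).to_dict() == analyze(df).to_dict()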
Quick Start (5 Minutes)
Installation
# Minimal install
pip install -e .
# With all features
pip install -e ".[advanced,connectors,export,llm,dev]"
Basic Analysis
from xelytics import analyze
import pandas as pd
# Load your data
df = pd.read_csv("sales.csv")
# Run comprehensive analysis in one line
result = analyze(df)
# Explore results
print(f"Rows: {result.summary.row_count}")
print(f"Tests executed: {result.metadata.tests_executed}")
# View key findings
for insight in result.insights[:3]:
    print(f"  • {insight.title}")
# Export as JSON
import json
with open("analysis.json", "w") as f:
json.dump(result.to_dict(), f, indent=2)
Output:
Rows: 1000
Tests executed: 47
  • Significant correlation detected: revenue vs. marketing_spend
  • Outliers detected in customer_age column
  • Data shows strong seasonality
Comprehensive Features & Usage
1️⃣ Statistical Analysis
Automatically runs relevant statistical tests based on data types and distributions.
Basic Usage
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
    significance_level=0.05,
    enable_llm_insights=False,
    max_visualizations=15,
)
result = analyze(df, config=config)

# View statistical tests
for test in result.statistics:
    print(f"{test.test_name}:")
    print(f"  p-value: {test.p_value:.4f}")
    print(f"  Significant: {test.significant}")
    print(f"  Effect size: {test.effect_size.value:.3f}")
Advanced: Custom Analysis Plan
# Define which columns to analyze
config = AnalysisConfig(
    include_columns=["age", "income", "purchase_frequency"],
    exclude_columns=["customer_id", "timestamp"],
    categorical_max_categories=50,  # Skip columns with >50 unique values
)
result = analyze(df, config=config)
Statistics Covered:
- ✅ Descriptive: mean, median, variance, skewness, kurtosis
- ✅ t-tests, ANOVA, Welch's test, Mann-Whitney U, Kruskal-Wallis
- ✅ Correlation: Pearson, Spearman, Kendall's tau
- ✅ Chi-square tests for categorical associations
- ✅ Effect sizes: Cohen's d, Cramér's V, eta-squared
- ✅ Assumption checks: normality (Shapiro-Wilk), homogeneity of variance (Levene)
2️⃣ Time Series Analysis (NEW in v0.2.0)
Complete time series toolkit: detection, decomposition, forecasting, anomalies.
Time Series Detection
from xelytics import analyze, AnalysisConfig
# Option 1: Auto-detect time series columns
config = AnalysisConfig(enable_time_series=True)
result = analyze(df, config=config)
# Option 2: Specify datetime column
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
)
result = analyze(df, config=config)

# Check which columns were detected as time series
for ts in result.time_series_analysis:
    print(f"{ts.column_name}:")
    print(f"  Type: {ts.series_type.value}")
    print(f"  Frequency: {ts.frequency}")
    print(f"  Has trend: {ts.has_trend}")
    print(f"  Has seasonality: {ts.has_seasonality}")
    if ts.has_seasonality:
        print(f"  Seasonal period: {ts.seasonal_period}")
Time Series Decomposition
# Automatically decompose into trend, seasonal, residual
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    decomposition_method="additive",  # or "multiplicative", "stl"
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.decomposition:
        print(f"{ts.column_name} decomposition:")
        print(f"  Trend strength: {ts.decomposition.trend_strength:.3f}")
        print(f"  Seasonal strength: {ts.decomposition.seasonal_strength:.3f}")
Forecasting
# ARIMA and Exponential Smoothing forecasting
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    forecast_periods=30,  # Forecast next 30 periods
    forecast_methods=["arima", "exponential_smoothing"],
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.forecasts:
        print(f"\n{ts.column_name} - Next 30 periods forecast:")
        for forecast in ts.forecasts[:5]:  # Show first 5
            print(f"  Period {forecast.period}: {forecast.value:.2f} "
                  f"(95% CI: {forecast.lower_bound:.2f}-{forecast.upper_bound:.2f})")
Anomaly Detection
# Multiple detection methods: Z-score, IQR, MAD, Isolation Forest
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,  # 95th percentile threshold
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.anomalies:
        print(f"\n{ts.column_name} - Anomalies detected:")
        for anomaly in ts.anomalies[:3]:
            print(f"  Index {anomaly.index}: {anomaly.value:.2f} "
                  f"(severity: {anomaly.severity}, confidence: {anomaly.confidence:.2f})")
Change Point Detection
# Detect structural breaks (CUSUM algorithm)
config = AnalysisConfig(
    enable_time_series=True,
    datetime_column="date",
    detect_change_points=True,
    change_point_sensitivity=0.05,
)
result = analyze(df, config=config)

for ts in result.time_series_analysis:
    if ts.change_points:
        print(f"\n{ts.column_name} - Change points:")
        for cp in ts.change_points:
            print(f"  At index {cp.index}: magnitude={cp.magnitude:.2f}, "
                  f"confidence={cp.confidence:.2f}")
3️⃣ Clustering & Segmentation (NEW in v0.2.0)
Unsupervised learning for customer segmentation, market clustering, etc.
Basic Clustering
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="auto",  # auto, kmeans, dbscan, hierarchical
    max_clusters=8,
    exclude_columns=["customer_id", "name"],
)
result = analyze(df, config=config)

# View clusters
print(f"Algorithm used: {result.clusters[0].algorithm}")
for cluster in result.clusters:
    print(f"\nCluster {cluster.cluster_id}:")
    print(f"  Size: {cluster.size} members ({cluster.size / result.summary.row_count * 100:.1f}%)")
    print(f"  Silhouette score: {cluster.silhouette_score:.3f}")
    print(f"  Profile: {cluster.profile}")
K-Means (with Automatic K Selection)
# K-Means tries multiple K values and picks the best
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=10,
    k_selection_method="elbow",  # elbow, silhouette, gap_statistic
)
result = analyze(df, config=config)

# View metrics for each K
for cluster in result.clusters:
    print(f"K={cluster.algorithm_params['n_clusters']}: "
          f"silhouette={cluster.silhouette_score:.3f}")
DBSCAN (Density-Based)
# DBSCAN finds natural clusters and noise points
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="dbscan",
    dbscan_eps=0.5,  # Auto-estimated if not provided
    dbscan_min_samples=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    noise_label = "Noise" if cluster.cluster_id == -1 else f"Cluster {cluster.cluster_id}"
    print(f"{noise_label}: {cluster.size} points")
Hierarchical Clustering
# Produces dendrograms and tree-based clusters
config = AnalysisConfig(
    enable_clustering=True,
    clustering_algorithm="hierarchical",
    hierarchical_linkage="ward",  # ward, complete, average, single
    max_clusters=5,
)
result = analyze(df, config=config)

for cluster in result.clusters:
    print(f"Cluster {cluster.cluster_id}: {cluster.size} members")
4️⃣ Data Connectors (NEW in v0.2.0)
Analyze data directly from databases and cloud storage; no manual data export needed.
PostgreSQL
import os

from xelytics import analyze
from xelytics.connectors import connect_to_source

connector = connect_to_source(
    source_type="postgresql",
    host="db.example.com",
    database="analytics",
    user="analyst",
    password=os.getenv("DB_PASSWORD"),
    port=5432,
)
try:
    connector.connect()
    df = connector.query("""
        SELECT customer_id, age, income, purchase_count, lifetime_value
        FROM customers
        WHERE signup_year >= 2023
    """)
finally:
    connector.disconnect()

result = analyze(df)
MySQL / MariaDB
connector = connect_to_source(
    source_type="mysql",
    host="db.example.com",
    database="analytics",
    user="analyst",
    password=os.getenv("DB_PASSWORD"),
)
df = connector.query("SELECT * FROM sales_data WHERE year = 2025")
result = analyze(df)
SQLite
connector = connect_to_source(
    source_type="sqlite",
    database="/path/to/analytics.db",
)
df = connector.query("SELECT * FROM daily_metrics")
result = analyze(df)
BigQuery
connector = connect_to_source(
    source_type="bigquery",
    project_id="my-project",
    credentials_path="/path/to/service-account.json",
)
df = connector.query("""
    SELECT * FROM `my-project.dataset.events`
    WHERE event_date >= '2025-01-01'
    LIMIT 100000
""")
result = analyze(df)
Snowflake
connector = connect_to_source(
    source_type="snowflake",
    account="xy12345",
    warehouse="COMPUTE",
    database="ANALYTICS",
    schema="PUBLIC",
    user=os.getenv("SNOWFLAKE_USER"),
    password=os.getenv("SNOWFLAKE_PASSWORD"),
)
df = connector.query("SELECT * FROM CUSTOMER_DATA")
result = analyze(df)
S3 / Cloud Storage
# Amazon S3
connector = connect_to_source(
    source_type="s3",
    bucket="my-analytics-bucket",
    key="data/sales.parquet",
    aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
)
df = connector.query()  # Returns a DataFrame
result = analyze(df)

# Azure Blob Storage
connector = connect_to_source(
    source_type="azure_blob",
    container_name="data",
    blob_name="sales.csv",
    connection_string=os.getenv("AZURE_CONN_STRING"),
)
df = connector.query()
result = analyze(df)

# Google Cloud Storage
connector = connect_to_source(
    source_type="gcs",
    bucket="my-bucket",
    key="data/sales.csv",
    credentials_path="/path/to/gcp-key.json",
)
df = connector.query()
result = analyze(df)
5️⃣ Report Generation (NEW in v0.2.0)
Generate professional, interactive reports in multiple formats.
HTML Report
import os

from xelytics import analyze
from xelytics.export import HTMLReportGenerator

result = analyze(df)

generator = HTMLReportGenerator(
    theme="light",  # light or dark
    logo_text="ACME Corp",
    company_name="ACME Analytics",
)
html = generator.generate(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    include_raw_data=False,  # Don't embed the full dataset
)

with open("report.html", "w") as f:
    f.write(html)

# Open in browser (os.startfile is Windows-only; use webbrowser.open elsewhere)
os.startfile("report.html")
PDF Report
from xelytics.export import generate_pdf_report
pdf_bytes = generate_pdf_report(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    orientation="portrait",  # or "landscape"
)

with open("report.pdf", "wb") as f:
    f.write(pdf_bytes)
PowerPoint Presentation
from xelytics.export import generate_pptx_report
pptx = generate_pptx_report(
    result,
    title="Q1 2025 Sales Analysis",
    author="Data Science Team",
    theme="office",  # office, modern, minimal
    include_speaker_notes=True,
)
pptx.save("report.pptx")
Jupyter Notebook
import json

from xelytics.export import generate_notebook

notebook = generate_notebook(
    result,
    title="Q1 2025 Sales Analysis",
    include_code_cells=True,
    include_raw_result=True,
)

with open("analysis.ipynb", "w") as f:
    json.dump(notebook, f)
JSON Export
import json

# For programmatic access or storage
with open("analysis.json", "w") as f:
    json.dump(result.to_dict(), f, indent=2)

# Later, reconstruct from JSON
from xelytics.schemas.outputs import AnalysisResult

with open("analysis.json") as f:
    data = json.load(f)
result = AnalysisResult(**data)
6️⃣ Custom Pipelines (NEW in v0.2.0)
Pre-process data with custom steps before analysis.
from xelytics import analyze, AnalysisConfig
from xelytics.pipeline import Pipeline, correlation_analysis, normalize, pca, remove_outliers

# Build a custom pipeline
pipeline = Pipeline([
    remove_outliers(method="iqr", threshold=1.5),
    normalize(method="minmax"),
    pca(n_components=10),
    correlation_analysis(threshold=0.7),
])

# Apply before analysis
df_processed = pipeline.fit_transform(df)
result = analyze(df_processed)

# Or use in AnalysisConfig
config = AnalysisConfig(
    run_custom_pipeline=True,
    custom_pipeline=pipeline,
)
result = analyze(df, config=config)
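Custom steps can sit alongside the built-ins. A minimal sketch, assuming Pipeline accepts any object exposing a fit_transform(df) method, the way the pre-built steps are used above (the exact step interface is documented in docs/guides/07_custom_pipelines.md; DropConstantColumns is a hypothetical example):

import pandas as pd

class DropConstantColumns:
    # Hypothetical custom step: drop columns with a single unique value,
    # assuming steps only need to implement fit_transform(df) -> df
    def fit_transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.loc[:, df.nunique(dropna=False) > 1]

pipeline = Pipeline([
    DropConstantColumns(),
    normalize(method="minmax"),
])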
7️⃣ Caching (NEW in v0.2.0)
Speed up repeated analyses on the same data.
File-Based Cache
from xelytics import analyze, AnalysisConfig
from xelytics.cache import FileCache

cache = FileCache(cache_dir="./cache")
config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)

# First run: takes the full time
result1 = analyze(df, config=config)

# Subsequent runs on the same data: near-instant
result2 = analyze(df, config=config)  # Retrieved from cache!
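To confirm the cache is working, time the cold and warm calls with the standard library:

import time

t0 = time.perf_counter()
analyze(df, config=config)  # cold: runs the full analysis
t1 = time.perf_counter()
analyze(df, config=config)  # warm: served from the file cache
t2 = time.perf_counter()
print(f"cold: {t1 - t0:.2f}s, warm: {t2 - t1:.2f}s")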
Redis Cache (Distributed)
from xelytics.cache import RedisCache
cache = RedisCache(host="localhost", port=6379, db=0, ttl=3600)
config = AnalysisConfig(
    enable_caching=True,
    cache_backend=cache,
)
result = analyze(df, config=config)
Clear Cache
from xelytics.cache import clear_cache
# Clear all caches
clear_cache(pattern="*")
# Clear specific patterns
clear_cache(pattern="stats:*") # Only clear stats caches
8️⃣ CLI (Command-Line Interface)
Analyze without writing Python code.
# Basic analysis - outputs JSON
xelytics analyze data.csv
# Save to file
xelytics analyze data.csv --output results.json
# Set parameters
xelytics analyze data.csv \
  --format=json \
  --alpha 0.01 \
  --no-llm \
  --max-visualizations 20 \
  --datetime-column "date"

# Time series analysis
xelytics analyze data.csv \
  --enable-time-series \
  --datetime-column "date" \
  --forecast-periods 30

# Clustering
xelytics analyze data.csv \
  --enable-clustering \
  --clustering-algorithm kmeans \
  --max-clusters 5
# Show version
xelytics --version
# Help
xelytics --help
9️⃣ LLM Integration (Optional)
Enhance insights with AI narration.
import os

from xelytics import analyze, AnalysisConfig

config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",  # openai, groq, azure, or local
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)
result = analyze(df, config=config)

# Insights now include AI-generated descriptions
for insight in result.insights:
    print(insight.title)
    print(f"  {insight.narrative}")  # AI-generated explanation
Multiple LLM Providers
# OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# Groq (fast inference for open-weight models)
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="groq",
    llm_model="mixtral-8x7b",
    llm_api_key=os.getenv("GROQ_API_KEY"),
)

# Azure OpenAI
config = AnalysisConfig(
    enable_llm_insights=True,
    llm_provider="azure",
    llm_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    llm_api_key=os.getenv("AZURE_OPENAI_KEY"),
)
Large Dataset Support
Analyze datasets with millions of rows.
from xelytics import analyze, AnalysisConfig
# Auto-sample when the data exceeds 1M rows
config = AnalysisConfig(
    sampling_strategy="auto",
    max_rows=1_000_000,
    parallel_execution=True,
    max_workers=4,
)

# Or force stratified sampling to a fixed size
config = AnalysisConfig(
    sampling_strategy="stratified",
    sample_size=100_000,
    parallel_execution=True,
    max_workers=4,
)

result = analyze(df, config=config)
Chunked Processing for Very Large Files
from xelytics.engine import analyze_large_dataset

# Process a 10M-row file without loading it all into memory
result = analyze_large_dataset(
    source="huge_sales_data.csv",
    chunksize=50_000,
    sample_size=100_000,  # Take a sample for the full analysis
    config=AnalysisConfig(),
)
Configuration Reference
from xelytics import AnalysisConfig
config = AnalysisConfig(
    # General
    significance_level=0.05,
    mode="automated",                 # automated or semi-automated

    # Columns
    include_columns=None,             # [list] Include only these columns
    exclude_columns=None,             # [list] Exclude these columns
    datetime_column=None,             # [str] Column name for time series

    # Time Series
    enable_time_series=False,
    decomposition_method="additive",  # additive, multiplicative, stl
    forecast_periods=0,
    forecast_methods=["arima", "exponential_smoothing"],
    anomaly_detection_method="isolation_forest",
    anomaly_sensitivity=0.95,
    detect_change_points=False,

    # Clustering
    enable_clustering=False,
    clustering_algorithm="auto",      # auto, kmeans, dbscan, hierarchical
    max_clusters=10,
    k_selection_method="elbow",

    # Performance
    parallel_execution=True,
    max_workers=4,
    sampling_strategy="auto",
    max_rows=1_000_000,

    # Caching
    enable_caching=False,
    cache_backend=None,

    # Reporting
    max_visualizations=15,
    run_custom_pipeline=False,
    custom_pipeline=None,

    # LLM
    enable_llm_insights=False,
    llm_provider="openai",
    llm_model="gpt-4",
    llm_api_key=None,

    # Other
    random_seed=42,
    verbose=True,
)
End-to-End Workflow Example
Complete analysis pipeline from data to report:
#!/usr/bin/env python3
"""Complete analysis workflow."""
import json
import os
from datetime import datetime

from xelytics import analyze, AnalysisConfig
from xelytics.connectors import connect_to_source
from xelytics.export import HTMLReportGenerator, generate_pdf_report

# 1. LOAD DATA
print("Loading data...")
connector = connect_to_source(
    source_type="postgresql",
    host="db.example.com",
    database="sales",
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
)
try:
    connector.connect()
    df = connector.query("""
        SELECT
            order_id, customer_id, order_date,
            product_category, quantity, unit_price, total_amount,
            customer_age, customer_region, is_returning_customer
        FROM orders
        WHERE order_date >= '2024-01-01'
    """)
    print(f"Loaded {len(df):,} rows")
finally:
    connector.disconnect()

# 2. CONFIGURE ANALYSIS
print("\nConfiguring analysis...")
config = AnalysisConfig(
    significance_level=0.05,
    # Time series
    enable_time_series=True,
    datetime_column="order_date",
    forecast_periods=30,
    # Clustering
    enable_clustering=True,
    clustering_algorithm="kmeans",
    max_clusters=5,
    # Performance
    parallel_execution=True,
    max_workers=4,
    # Cache for later
    enable_caching=True,
    # Reporting
    max_visualizations=20,
    enable_llm_insights=True,
    llm_provider="openai",
    llm_api_key=os.getenv("OPENAI_API_KEY"),
)

# 3. RUN ANALYSIS
print("\nRunning analysis...")
result = analyze(df, config=config)

# 4. EXPLORE RESULTS
print(f"\nAnalysis complete in {result.metadata.execution_time_ms}ms")
print(f"  • Tests: {result.metadata.tests_executed}")
print(f"  • Visualizations: {len(result.visualizations)}")
print(f"  • Insights: {len(result.insights)}")
print(f"  • Time series analyzed: {len(result.time_series_analysis)}")
print(f"  • Clusters: {len(result.clusters)}")

print("\nKey Insights:")
for i, insight in enumerate(result.insights[:5], 1):
    print(f"  {i}. {insight.title}")
    if hasattr(insight, "narrative"):
        print(f"     {insight.narrative[:100]}...")

# 5. GENERATE REPORTS
print("\nGenerating reports...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
os.makedirs("reports", exist_ok=True)

# HTML Report
html_generator = HTMLReportGenerator(
    theme="light",
    logo_text="Sales Analytics",
    company_name="ACME Corp",
)
html = html_generator.generate(
    result,
    title="Sales Analysis Report",
    author="Data Science Team",
)
html_path = f"reports/sales_analysis_{timestamp}.html"
with open(html_path, "w") as f:
    f.write(html)
print(f"  HTML: {html_path}")

# PDF Report
pdf_bytes = generate_pdf_report(
    result,
    title="Sales Analysis Report",
    author="Data Science Team",
)
pdf_path = f"reports/sales_analysis_{timestamp}.pdf"
with open(pdf_path, "wb") as f:
    f.write(pdf_bytes)
print(f"  PDF: {pdf_path}")

# JSON Export
json_path = f"reports/sales_analysis_{timestamp}.json"
with open(json_path, "w") as f:
    json.dump(result.to_dict(), f, indent=2)
print(f"  JSON: {json_path}")

print("\nAnalysis complete!")
print(f"Reports saved to: {os.path.abspath('reports')}")
Output:
Loading data...
Loaded 150,432 rows

Configuring analysis...

Running analysis...

Analysis complete in 3421ms
  • Tests: 47
  • Visualizations: 18
  • Insights: 12
  • Time series analyzed: 2
  • Clusters: 5

Key Insights:
  1. Significant correlation detected: total_amount vs. customer_age
  2. Strong seasonality in Q4 sales
  3. Customer segmentation: 5 distinct groups identified
  4. Outliers detected in unit_price column
  5. Increasing trend in repeat customer rate

Generating reports...
  HTML: reports/sales_analysis_20250307_143021.html
  PDF: reports/sales_analysis_20250307_143021.pdf
  JSON: reports/sales_analysis_20250307_143021.json

Analysis complete!
Reports saved to: /home/user/reports
Performance & Scaling
| Dataset Size | Processing Time | Max Parallel Tasks |
|---|---|---|
| 10K rows | 1–2 seconds | 3 |
| 100K rows | 5–10 seconds | 4 |
| 1M rows | 30–60 seconds | 4 |
| 10M rows | 3–5 minutes | 4 (chunked) |
| 100M rows | 10–30 minutes | 4 (chunked + sampled) |
Optimization Strategies:
- ✅ Automatic sampling for datasets > 1M rows
- ✅ Parallel execution (4 workers by default)
- ✅ Result caching (file or Redis)
- ✅ Progress callbacks for long-running analyses (see the sketch below)
- ✅ Memory-aware warnings (logs a warning above 1GB)
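The progress-callback hook is not listed in the configuration reference above, so the exact parameter is an assumption. A minimal sketch, assuming AnalysisConfig accepts a progress_callback argument (check docs/guides/04_performance.md for the actual name and signature):

from xelytics import analyze, AnalysisConfig

def on_progress(fraction: float, stage: str) -> None:
    # Hypothetical callback signature: completion fraction plus stage label
    print(f"[{fraction:6.1%}] {stage}")

config = AnalysisConfig(progress_callback=on_progress)  # hypothetical parameter
result = analyze(df, config=config)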
Feature Comparison
| Feature | v0.1.0 | v0.2.0 |
|---|---|---|
| Statistical Analysis | ✅ | ✅ |
| Automated test selection | ✅ | ✅ |
| Effect size calculation | ✅ | ✅ |
| Assumption checking | ✅ | ✅ |
| Time Series (NEW) | ❌ | ✅ |
| Detection & decomposition | ❌ | ✅ |
| ARIMA & ES forecasting | ❌ | ✅ |
| Anomaly detection | ❌ | ✅ |
| Change point detection | ❌ | ✅ |
| Clustering (NEW) | ❌ | ✅ |
| K-Means | ❌ | ✅ |
| DBSCAN | ❌ | ✅ |
| Hierarchical | ❌ | ✅ |
| Cluster profiling | ❌ | ✅ |
| Performance (NEW) | ❌ | ✅ |
| Parallel execution | ❌ | ✅ |
| Result caching | ❌ | ✅ |
| Sampling strategies | ❌ | ✅ |
| Chunked processing | ❌ | ✅ |
| Connectors (NEW) | ❌ | ✅ |
| PostgreSQL | ❌ | ✅ |
| MySQL/MariaDB | ❌ | ✅ |
| SQLite | ❌ | ✅ |
| BigQuery | ❌ | ✅ |
| Snowflake | ❌ | ✅ |
| S3/Azure/GCS | ❌ | ✅ |
| Export (NEW) | ❌ | ✅ |
| HTML reports | ❌ | ✅ |
| PDF export | ❌ | ✅ |
| PowerPoint slides | ❌ | ✅ |
| Jupyter notebooks | ❌ | ✅ |
| JSON export | ❌ | ✅ |
| Other Features | | |
| Data profiling | ✅ | ✅ |
| Rule-based insights | ✅ | ✅ |
| LLM narration | ✅ | ✅ |
| Custom pipelines | ❌ | ✅ |
| Progress callbacks | ❌ | ✅ |
| CLI interface | ❌ | ✅ |
| Backward compatible | ❌ | ✅ |
Installation & Setup
System Requirements
- Python: 3.9, 3.10, 3.11, 3.12
- OS: Linux, macOS, Windows
- RAM: 2GB minimum; 8GB+ recommended for large datasets
Basic Installation
# Minimal (core features only)
pip install -e .
# Development
pip install -e ".[dev]"
# Production (all features)
pip install -e ".[advanced,connectors,export,llm]"
# Everything (including dev tools)
pip install -e ".[advanced,connectors,export,llm,dev]"
Verify Installation
python -c "from xelytics import analyze; print('โ Xelytics installed')"
# Check version
python -c "import xelytics; print(xelytics.__version__)"
# Test CLI
xelytics --version
Documentation
Full documentation is available in the docs/ folder:
| Topic | Location |
|---|---|
| Installation | docs/installation.md |
| Quick Start | docs/quickstart.md |
| Statistical Analysis | docs/guides/01_basic_analysis.md |
| Time Series | docs/guides/02_time_series.md |
| Clustering | docs/guides/03_clustering.md |
| Performance | docs/guides/04_performance.md |
| Connectors | docs/guides/05_connectors.md |
| Export & Reports | docs/guides/06_export_reports.md |
| Custom Pipelines | docs/guides/07_custom_pipelines.md |
| CLI Guide | docs/guides/08_cli.md |
| API Reference | docs/api/ |
| Examples | examples/ |
| Migration Guide | docs/migration/v01_to_v02.md |
| API Contract | API_CONTRACT.md |
| Comprehensive Docs | COMPREHENSIVE_DOCUMENTATION.md |
Development
Setup Development Environment
# Clone repository
git clone https://github.com/xelytics/xelytics-core.git
cd xelytics-core
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install dev dependencies
pip install -e ".[dev,advanced,connectors,export]"
Running Tests
# All tests
pytest tests/ -v
# Specific test file
pytest tests/test_clustering.py -v
# Tests matching pattern
pytest tests/ -k "test_kmeans" -v
# With coverage report
pytest tests/ --cov=xelytics --cov-report=html
# Only unit tests (exclude slow integration tests)
pytest tests/ -m "not integration" -v
# Only fast tests
pytest tests/ -m "not slow" -v
Code Formatting & Linting
# Format code with Black
black xelytics/ tests/ examples/
# Check formatting
black --check xelytics/ tests/
# Lint with Ruff
ruff check xelytics/ tests/ --fix
# Type checking with mypy
mypy xelytics/
Building & Publishing
# Build package
pip install build
python -m build
# Publish to PyPI (requires credentials)
pip install twine
python -m twine upload dist/*
Testing & Quality Assurance
Test Coverage: 85%+ (307 tests)
Test Categories:
| Category | Count | Status |
|---|---|---|
| Unit Tests | 200+ | ✅ Passing |
| Integration Tests | 50+ | ✅ Passing |
| Performance Tests | 20+ | ✅ Passing |
| Backward Compatibility Tests | 8 | ✅ Passing (v0.1.0 code works in v0.2.0) |
| Example Scripts | 5 | ✅ Working |
Key Test Suites:
- ✅ test_core.py: Data ingestion, profiling, feature detection
- ✅ test_clustering.py: K-Means, DBSCAN, Hierarchical
- ✅ test_timeseries_advanced.py: Decomposition, forecasting, anomalies
- ✅ test_stats.py: Statistical tests, effect sizes, assumptions
- ✅ test_connectors_integration.py: Database connectivity
- ✅ test_export.py: HTML, PDF, PowerPoint, notebook export
- ✅ test_caching.py: File and Redis caching
- ✅ test_v02_backward_compatibility.py: v0.1.0 compatibility
Run Full Test Suite:
# Quick run (excludes slow tests)
pytest tests/ -m "not slow" --tb=short
# Full run (includes slow + integration)
pytest tests/ -v --tb=short
# With coverage
pytest tests/ --cov=xelytics --cov-report=term-missing
Architecture
System Design
┌─────────────────────────────────┐
│        Public API Layer         │
│   analyze() / AnalysisConfig    │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│      Data Ingestion Layer       │
│  Connectors, DataFrames, Files  │
└───────────────┬─────────────────┘
                │
┌───────────────▼─────────────────┐
│         Processing Core         │
│    Type Detection, Sampling     │
│  Feature Detection, Profiling   │
└───────────────┬─────────────────┘
                │
      ┌─────────┼──────────────┐
      │         │              │
┌─────▼────┐ ┌──▼─────────┐ ┌──▼────────┐
│  Stats   │ │ TimeSeries │ │Clustering │
│  Engine  │ │   Engine   │ │  Engine   │
└─────┬────┘ └──┬─────────┘ └──┬────────┘
      │         │              │
      └─────────┼──────────────┘
                │
     ┌──────────▼──────────┐
     │   Visualization &   │
     │  Insight Generator  │
     └──────────┬──────────┘
                │
     ┌──────────▼──────────┐
     │    Export Layer     │
     │  HTML/PDF/PPTX/etc  │
     └─────────────────────┘
Module Breakdown
xelytics-core/
├── xelytics/
│   ├── __init__.py              # Public API
│   ├── engine.py                # Main analyze() function
│   ├── exceptions.py            # Exception hierarchy
│   │
│   ├── core/                    # Data pipeline
│   │   ├── ingestion.py         # Type detection, validation
│   │   ├── profiler.py          # Column statistics
│   │   ├── features.py          # Feature detection
│   │   └── chunked.py           # Large dataset processing
│   │
│   ├── stats/                   # Statistical analysis
│   │   ├── engine.py            # Test selection & execution
│   │   ├── planner.py           # Analysis planning
│   │   └── ...
│   │
│   ├── timeseries/              # Time series (v0.2.0)
│   │   ├── detector.py          # Series detection
│   │   ├── decomposition.py     # Trend/seasonal separation
│   │   ├── forecasting.py       # ARIMA/ExpSmoothing
│   │   ├── anomaly.py           # Anomaly detection
│   │   └── change_points.py     # Change point detection
│   │
│   ├── clustering/              # Clustering (v0.2.0)
│   │   ├── kmeans.py            # K-Means
│   │   ├── dbscan.py            # DBSCAN
│   │   ├── hierarchical.py      # Hierarchical clustering
│   │   └── profiler.py          # Cluster profiling
│   │
│   ├── connectors/              # Data sources (v0.2.0)
│   │   ├── postgres.py          # PostgreSQL
│   │   ├── mysql.py             # MySQL/MariaDB
│   │   ├── database.py          # Base SQL class
│   │   ├── s3.py                # AWS S3
│   │   ├── cloud.py             # Azure/GCS
│   │   └── ...
│   │
│   ├── export/                  # Report generation (v0.2.0)
│   │   ├── html.py              # HTML reports
│   │   ├── pdf.py               # PDF export
│   │   ├── pptx.py              # PowerPoint slides
│   │   ├── notebook.py          # Jupyter notebooks
│   │   └── ...
│   │
│   ├── cache/                   # Caching (v0.2.0)
│   │   ├── base.py              # Cache interface
│   │   ├── file.py              # File-based cache
│   │   └── redis.py             # Redis cache
│   │
│   ├── pipeline/                # Custom pipelines (v0.2.0)
│   │   ├── __init__.py          # Pipeline class
│   │   └── steps.py             # Pre-built steps
│   │
│   ├── llm/                     # LLM integration
│   │   ├── openai.py            # OpenAI provider
│   │   ├── groq.py              # Groq provider
│   │   └── base.py              # Provider interface
│   │
│   ├── viz/                     # Visualizations
│   │   ├── generator.py         # Plotly spec generation
│   │   └── themes.py            # Color schemes
│   │
│   ├── insights/                # Insight generation
│   │   ├── rules.py             # Rule-based insights
│   │   └── templates.py         # Insight templates
│   │
│   ├── schemas/                 # Type definitions
│   │   ├── config.py            # AnalysisConfig
│   │   └── outputs.py           # AnalysisResult & schemas
│   │
│   └── cli/                     # Command-line interface
│       └── main.py              # CLI entry point
│
├── tests/                       # 300+ tests
│   ├── test_core.py
│   ├── test_clustering.py
│   ├── test_timeseries_*.py
│   ├── test_connectors_integration.py
│   ├── test_export.py
│   └── ...
│
├── examples/                    # Example scripts
│   ├── quickstart.py
│   ├── forecasting_demo.py
│   └── ...
│
├── docs/                        # Full documentation
│   ├── guides/                  # Step-by-step guides
│   ├── api/                     # API reference
│   └── examples/                # Example notebooks
│
└── pyproject.toml               # Dependencies & config
API Classes & Functions
Core Classes
# Main entry point
from xelytics import analyze, AnalysisConfig, AnalysisResult
# Configuration
config = AnalysisConfig(...)
# Run analysis
result: AnalysisResult = analyze(df, config=config)
# Access results
result.summary # DatasetSummary
result.statistics # List[StatisticalTestResult]
result.visualizations # List[VisualizationSpec]
result.insights # List[Insight]
result.time_series_analysis # List[TimeSeriesResult]
result.clusters # List[ClusterResult]
result.metadata # RunMetadata
Data Source Connectors
from xelytics.connectors import connect_to_source
connector = connect_to_source(source_type="postgresql", ...)
df = connector.query("SELECT * FROM table")
Export Functions
from xelytics.export import (
    HTMLReportGenerator,
    generate_pdf_report,
    generate_pptx_report,
    generate_notebook,
)
Caching
from xelytics.cache import FileCache, RedisCache, get_cache, clear_cache
cache = get_cache("file", cache_dir="./cache")
Time Series
from xelytics.timeseries import (
    analyze_time_series,
    decompose_time_series,
    forecast_time_series,
    detect_anomalies,
    detect_change_points,
)
Clustering
from xelytics.clustering import (
    analyze_clusters,
    cluster_kmeans,
    cluster_dbscan,
    cluster_hierarchical,
    profile_clusters,
)
Contributing
We welcome contributions! Here's how you can help:
Reporting Issues
- Check existing issues
- Create a new issue with:
  - A descriptive title
  - Steps to reproduce
  - Expected vs. actual behavior
  - Environment info (Python version, OS, xelytics version)
Submitting Changes
- Fork the repository
- Create a branch: git checkout -b feature/my-feature
- Make changes and add tests
- Format code: black xelytics/ tests/
- Run tests: pytest tests/
- Commit: git commit -am 'Add my feature'
- Push: git push origin feature/my-feature
- Create a pull request
Code Standards
- Style: Black formatting, 100-char line length
- Types: Type hints for all functions
- Tests: Each feature needs tests (85%+ coverage target)
- Docs: Docstrings for all public functions
Changelog
v0.2.0-alpha.1 (February 2026) – Current
Phases Completed:
- ✅ Phase 1: Foundation & backward compatibility
- ✅ Phase 2: Time series analysis
- ✅ Phase 3: Clustering & profiling
Key Features Added:
- Time series: detection, decomposition, forecasting, anomalies, change points
- Clustering: K-Means, DBSCAN, Hierarchical, profiling
- Connectors: PostgreSQL, MySQL, SQLite, BigQuery, Snowflake, S3, Azure, GCS
- Export: HTML, PDF, PowerPoint, Jupyter notebooks
- Performance: Parallel execution, caching, sampling, chunked processing
- CLI and custom pipelines
v0.1.0 → v0.2.0 Compatibility: ✅ 100% backward compatible; all v0.1.0 code runs unchanged.
See CHANGELOG.md for full history and API_CONTRACT.md for versioning policy.
Learning Resources
- API Documentation: See docs/api/
- Quick Start: docs/quickstart.md
- Example Scripts: examples/
- GitHub Discussions: Ask questions in GitHub Discussions
- Issues: Report bugs in GitHub Issues
Support
| Channel | Purpose |
|---|---|
| Documentation | How-to guides, API reference, examples |
| GitHub Discussions | Q&A, feature ideas, best practices |
| GitHub Issues | Bug reports, feature requests |
| Email | contact@xelytics.io |
License
MIT License – see LICENSE for details.
Copyright (c) 2026 Xelytics Team
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
Acknowledgments
Built with ❤️ using:
- pandas – Data manipulation
- scikit-learn – Machine learning
- statsmodels – Statistical modeling
- plotly – Interactive visualizations
- pingouin – Statistical functions
Project Status
| Component | v0.1.0 | v0.2.0 | Status |
|---|---|---|---|
| Core Analytics | Beta | Beta | ✅ Stable |
| Time Series | ❌ | Beta | ✅ Working |
| Clustering | ❌ | Beta | ✅ Working |
| Connectors | ❌ | Beta | ✅ Working |
| Export | ❌ | Beta | ✅ Working |
| CLI | ❌ | Beta | ✅ Working |
Next Milestones:
- v0.2.1: Bug fixes, performance improvements
- v0.3.0: Advanced forecasting (Prophet), deep learning integration
- v1.0.0: API stabilization, user feedback incorporation
⭐ Star this repository if you find it useful!
Questions? Open an issue or start a discussion.