Pure analytics engine with lazy execution, graph-based transformations, and extensible analysis
Project description
Xelytics-Core
Python package for automated analytics with a lazy, graph-aware execution engine.
Status: v0.3.0 documentation update in progress | v0.2.x APIs remain supported while the lazy graph execution model becomes the recommended path.
What It Does
Xelytics-Core is a zero-configuration analytics engine that analyzes your data and produces professional insights, statistical tests, interactive visualizations, and predictions, all with a single function call.
One-line analysis:
from xelytics import analyze
import pandas as pd
df = pd.read_csv("data.csv")
result = analyze(df) # That's it!
for insight in result.insights:
print(f"{insight.title}: {insight.description}")
Output includes:
- 50+ statistical tests (parametric & non-parametric)
- Time series decomposition & forecasting (ARIMA, Exponential Smoothing)
- Anomaly detection & change point detection
- Clustering analysis (K-Means, DBSCAN, Hierarchical)
- Interactive Plotly visualizations
- Human-readable insights (with optional LLM narration)
- Professional HTML, PDF, PowerPoint, and Jupyter reports
What's New in v0.3.0
Added in v0.3.0
v0.3.0 evolves Xelytics-Core from an eager, mostly linear DataFrame analysis pipeline into a lazy, graph-aware analytics engine. The existing analyze(df) workflow is still supported for v0.2.x compatibility, while new projects should prefer the chainable Xelytics API when they need lazy data binding, execution planning, SQL pushdown, lineage, or plugin extension points.
| Area | v0.2.x Behavior | v0.3.0 Behavior | Compatibility |
|---|---|---|---|
| Entry point | analyze(df) runs the pipeline directly | Xelytics().dataset(df).analyze().run() builds then executes a plan | analyze(df) remains supported |
| Data model | DataFrame-first, connector results usually materialized | Unified Dataset abstraction with materialized and lazy datasets | Existing DataFrame inputs still work |
| Execution | Eager pipeline with optional parallel tasks | Lazy ExecutionPlan DAG with scan, transform, analysis nodes | Eager behavior is preserved through legacy API |
| SQL sources | Query first, then analyze returned DataFrame | Filter/project nodes can be pushed into SQL where supported | Connector APIs remain available |
| Transformations | Custom pipeline steps execute before analysis | Transformations can be represented as graph nodes | Pipeline remains supported |
| Caching | Result/intermediate cache for analysis stages | Node-level cache support for transformation graph nodes | Existing cache backends remain supported |
| Metadata | Run metadata plus optional sampling/parallel fields | Adds trace, profiling, lineage, cache, and analyzer outputs | Existing result fields remain stable |
| Extensibility | Pipelines, exporters, LLM providers | Registries for analyzers, transformations, and output formats | Existing extension patterns remain valid |
Legacy API (v0.2.x compatible, still supported)
from xelytics import analyze
result = analyze(df)
Recommended v0.3.0 API
Added in v0.3.0
from xelytics import Xelytics
result = (
Xelytics()
.dataset(df)
.filter("revenue > 1000")
.analyze()
.run()
)
load_dataframe(df) is also available as an explicit DataFrame-loading name in the current implementation. Documentation uses dataset(df) for the recommended v0.3.0 abstraction.
Optional Install Extras
| Extra | Includes |
|---|---|
| advanced | advanced time series dependencies such as ruptures and pmdarima |
| connectors | database, cloud storage, Excel, and Parquet connector dependencies |
| export | PDF, PowerPoint, notebook, and static chart export dependencies |
| llm | OpenAI and Groq provider dependencies |
| large_data | Dask dataframe support |
| dev | test, lint, type-check, and formatting tools |
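To check which optional extras are usable in the current environment, a small probe like the one below is enough. This is a minimal sketch: the pmdarima, dask, and openai names come from the table above, while the connectors and export entries are representative guesses; pyproject.toml remains the source of truth for the real dependency lists.
import importlib.util
# Representative dependency per extra; "sqlalchemy" and "reportlab" are assumptions,
# the other names come from the table above.
representative_deps = {
    "advanced": "pmdarima",
    "connectors": "sqlalchemy",
    "export": "reportlab",
    "llm": "openai",
    "large_data": "dask",
}
for extra, module_name in representative_deps.items():
    available = importlib.util.find_spec(module_name) is not None
    print(f"{extra}: {'available' if available else 'missing'} ({module_name})")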
Quick Start
v0.2-Compatible API
Use this for simple one-shot DataFrame analysis.
import pandas as pd
from xelytics import AnalysisConfig, analyze
df = pd.read_csv("sales.csv")
config = AnalysisConfig(
enable_llm_insights=False,
generate_visualizations=False,
)
result = analyze(df, config=config)
print(result.summary.row_count)
print(result.metadata.tests_executed)
for insight in result.insights[:5]:
print(f"{insight.severity.value}: {insight.title}")
result.export_to("analysis.json")
Recommended v0.3.0 API
Use the chainable API when you want to bind data first, record operations, and
execute only when .run() is called.
import pandas as pd
from xelytics import AnalysisConfig, Xelytics
df = pd.read_csv("sales.csv")
result = (
Xelytics(config=AnalysisConfig(enable_llm_insights=False))
.dataset(df)
.filter("revenue > 1000")
.analyze()
.run()
)
print(result.summary.row_count)
print(result.trace.print_trace() if result.trace else "No trace")
load_dataframe(df) and from_dataset(dataset) are also available aliases for
explicit binding.
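For example, the explicit alias drops into the same chain without any other change (df as loaded in the snippet above):
from xelytics import Xelytics
# load_dataframe(df) is the explicit DataFrame-binding alias for dataset(df)
result = Xelytics().load_dataframe(df).analyze().run()
print(result.summary.row_count)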
The full runnable notebook for this release is examples/xelytics_core_v0_3_0_complete.ipynb. It uses generated data and local files only, so it can be executed without API keys or database credentials.
What Changed in v0.3.0
| Area | v0.2.x | v0.3.0 |
|---|---|---|
| Entry point | analyze(df) | analyze(df) still works; Xelytics().dataset(df).analyze().run() is recommended for lazy workflows |
| Data model | DataFrame-first | Dataset, MaterializedDataset, LazyDataset, and TransformedDataset |
| Execution | eager pipeline | ExecutionPlan, PlanNode, PlanBuilder, and DAG execution |
| Connectors | mostly materialized DataFrames | database connectors can back lazy datasets |
| SQL behavior | query then analyze | filter/project/limit plan nodes can use SQL pushdown when supported |
| Transformations | eager Pipeline preprocessing | TransformGraph, graph nodes, node cache, and lineage APIs |
| Analysis outputs | stats, visualizations, insights, time series, clustering | adds correlation, trend_anomaly, and segmentation analyzer outputs |
| Observability | logs and metadata | TraceCollector and ExecutionProfiler attached to results |
| Extensibility | pipelines/exporters/providers | registries for analyzers, transformations, and output formats |
| Compatibility | v0.2.x public API | no public v0.2.x API removed |
See MIGRATION_GUIDE_v0.2_to_v0.3.md for the full migration guide.
Implemented v0.3.0 Story Map
The v0.3.0 implementation is organized around the story set in aidlc-docs/inception/v0.3.0.
| Epic | Implemented surface | Main modules |
|---|---|---|
| Epic 1: Data Connectivity Engine | source abstraction, schema inference, lazy data binding, connector timeouts/retries/sampling hints | xelytics.dataset, xelytics.schemas.schema, xelytics.connectors, xelytics.schemas.config |
| Epic 2: Execution Engine | execution plans, lazy execution, SQL pushdown helpers, chunked planning support | xelytics.execution, xelytics.engine |
| Epic 3: Transformation Graph Engine | graph nodes, graph execution, node cache, schema hooks, lineage records | xelytics.graph |
| Epic 4: Analysis and Insight Engine | profiling, correlation, trend/anomaly, segmentation, ranked and deduplicated insights | xelytics.analyzers, xelytics.insights |
| Epic 5: Output Layer and Python API | structured JSON, optional visualizations, chainable Xelytics API, result export | xelytics.api, xelytics.schemas.outputs, xelytics.export |
| Epic 6: Observability and Debugging | execution logs, trace collection, node profiling, trace/profile serialization | xelytics.observability, xelytics.engine |
| Epic 7: Extensibility System | custom analyzer, transformation, and output-format registries | xelytics.extension |
Regression coverage for these surfaces lives in tests/test_epic1_connectivity.py
through tests/test_epic7.py, plus compatibility tests for earlier APIs.
Feature Overview
| Capability | Status |
|---|---|
| Automatic statistical test planning and execution | supported |
| Dataset summaries and column profiling | supported |
| Rule-based insights and ranked insights | supported |
| Plotly-compatible visualization specs | supported |
| Time series detection, decomposition, forecasting, anomalies, and change points | supported through xelytics.timeseries and v0.3 analyzer outputs |
| K-Means, DBSCAN, hierarchical clustering, and cluster profiling | supported |
| PostgreSQL, MySQL, SQLite, Snowflake, BigQuery, S3, Azure Blob, GCS, and file connectors | supported through optional extras |
| File and Redis caching | supported |
| Large dataset summary and sample analysis | supported through analyze_large_dataset() |
| HTML, PDF, PowerPoint, Jupyter notebook, and JSON export | supported through xelytics.export |
| CLI for CSV and Excel analysis | supported through the xelytics command |
| Optional LLM provider integrations | OpenAI and Groq dependencies available through llm extra |
1. Statistical Analysis
Control which columns are analyzed:
# Define which columns to analyze
config = AnalysisConfig(
include_columns=["age", "income", "purchase_frequency"],
exclude_columns=["customer_id", "timestamp"],
categorical_max_categories=50, # Skip columns with >50 unique values
)
result = analyze(df, config=config)
Statistics Covered:
- Descriptive: mean, median, variance, skewness, kurtosis
- t-tests, ANOVA, Welch's test, Mann-Whitney U, Kruskal-Wallis
- Correlation: Pearson, Spearman, Kendall Tau
- Chi-square tests for categorical associations
- Effect sizes: Cohen's d, Cramér's V, Eta-squared
- Assumption checks: Normality (Shapiro-Wilk), Homogeneity of variance (Levene)
2. Time Series Analysis (NEW in v0.2.0)
Complete time series toolkit: detection, decomposition, forecasting, anomalies.
Time Series Detection
from xelytics import analyze, AnalysisConfig
# Option 1: Auto-detect time series columns
config = AnalysisConfig(enable_time_series=True)
result = analyze(df, config=config)
# Option 2: Specify datetime column
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
)
result = analyze(df, config=config)
# Check which columns were detected as time series
for ts in result.time_series_analysis:
print(f"{ts.column_name}:")
print(f" Type: {ts.series_type.value}")
print(f" Frequency: {ts.frequency}")
print(f" Has trend: {ts.has_trend}")
print(f" Has seasonality: {ts.has_seasonality}")
if ts.has_seasonality:
print(f" Seasonal period: {ts.seasonal_period}")
Time Series Decomposition
# Automatically decompose into trend, seasonal, residual
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
decomposition_method="additive", # or "multiplicative", "stl"
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.decomposition:
print(f"{ts.column_name} decomposition:")
print(f" Trend strength: {ts.decomposition.trend_strength:.3f}")
print(f" Seasonal strength: {ts.decomposition.seasonal_strength:.3f}")
Forecasting
# ARIMA and Exponential Smoothing forecasting
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
forecast_periods=30, # Forecast next 30 periods
forecast_methods=["arima", "exponential_smoothing"],
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.forecasts:
print(f"\n{ts.column_name} - Next 30 periods forecast:")
for forecast in ts.forecasts[:5]: # Show first 5
print(f" Period {forecast.period}: {forecast.value:.2f} "
f"(95% CI: {forecast.lower_bound:.2f}-{forecast.upper_bound:.2f})")
Anomaly Detection
# Multiple detection methods: Z-score, IQR, MAD, Isolation Forest
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
anomaly_detection_method="isolation_forest",
anomaly_sensitivity=0.95, # 95th percentile threshold
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.anomalies:
print(f"\n{ts.column_name} - Anomalies detected:")
for anomaly in ts.anomalies[:3]:
print(f" Index {anomaly.index}: {anomaly.value:.2f} "
f"(severity: {anomaly.severity}, confidence: {anomaly.confidence:.2f})")
Change Point Detection
# Detect structural breaks (CUSUM algorithm)
config = AnalysisConfig(
enable_time_series=True,
datetime_column="date",
detect_change_points=True,
change_point_sensitivity=0.05,
)
result = analyze(df, config=config)
for ts in result.time_series_analysis:
if ts.change_points:
print(f"\n{ts.column_name} - Change points:")
for cp in ts.change_points:
print(f" At index {cp.index}: magnitude={cp.magnitude:.2f}, "
f"confidence={cp.confidence:.2f}")
3. Clustering & Segmentation (NEW in v0.2.0)
Unsupervised learning for customer segmentation, market clustering, etc.
Basic Clustering
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="auto", # auto, kmeans, dbscan, hierarchical
max_clusters=8,
exclude_columns=["customer_id", "name"],
)
result = analyze(df, config=config)
# View clusters
print(f"Algorithm used: {result.clusters[0].algorithm}")
for cluster in result.clusters:
print(f"\nCluster {cluster.cluster_id}:")
print(f" Size: {cluster.size} members ({cluster.size/result.summary.row_count*100:.1f}%)")
print(f" Silhouette score: {cluster.silhouette_score:.3f}")
print(f" Profile: {cluster.profile}")
K-Means (with Automatic K Selection)
# K-Means tries multiple K values and picks the best
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="kmeans",
max_clusters=10,
k_selection_method="elbow", # elbow, silhouette, gap_statistic
)
result = analyze(df, config=config)
# View metrics for each K
for cluster in result.clusters:
print(f"K={cluster.algorithm_params['n_clusters']}: "
f"silhouette={cluster.silhouette_score:.3f}")
DBSCAN (Density-Based)
# DBSCAN finds natural clusters and noise points
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="dbscan",
dbscan_eps=0.5, # Auto-estimated if not provided
dbscan_min_samples=5,
)
result = analyze(df, config=config)
for cluster in result.clusters:
noise_label = "Noise" if cluster.cluster_id == -1 else f"Cluster {cluster.cluster_id}"
print(f"{noise_label}: {cluster.size} points")
Hierarchical Clustering
# Produces dendrograms and tree-based clusters
config = AnalysisConfig(
enable_clustering=True,
clustering_algorithm="hierarchical",
hierarchical_linkage="ward", # ward, complete, average, single
max_clusters=5,
)
result = analyze(df, config=config)
for cluster in result.clusters:
print(f"Cluster {cluster.cluster_id}: {cluster.size} members")
4. Data Connectors (NEW in v0.2.0)
Analyze data directly from databases and cloud storage, with no manual data export needed.
File and PostgreSQL Sources
from pathlib import Path
from xelytics.connectors import connect_to_source
output_dir = Path(".cache/xelytics_readme")
output_dir.mkdir(parents=True, exist_ok=True)
csv_path = output_dir / "sales.csv"
df.to_csv(csv_path, index=False)
file_dataset = connect_to_source("file", path=str(csv_path))
print(file_dataset.to_pandas().head())
Database pattern:
from xelytics import AnalysisConfig, Xelytics
result = (
Xelytics(config=AnalysisConfig(enable_llm_insights=False))
.connect(
"postgresql",
host="localhost",
database="analytics",
user="reader",
password="secret",
query="SELECT * FROM sales",
)
.filter("revenue > 1000")
.analyze()
.run()
)
Cache APIs
| API | Purpose |
|---|---|
| Cache(backend="file", **kwargs) | Direct cache instance |
| Cache.get(key) | Read cached value |
| Cache.set(key, value, ttl=None) | Store cached value |
| Cache.delete(key) | Delete key |
| Cache.clear(pattern=None) | Clear backend |
| Cache.cached(ttl=None) | Decorator for function caching |
| get_cache(backend, **kwargs) | Create/get global cache |
| clear_cache(pattern=None) | Clear global cache |
| NodeCache.get(node_id, input_dfs, func) | Read transform-node output |
| NodeCache.set(node_id, input_dfs, func, result) | Store transform-node output |
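A minimal sketch of the generic cache interface from the table above. The Cache import path and the cache_dir keyword are assumptions (only FileCache, RedisCache, and clear_cache imports appear elsewhere on this page); the method names follow the table.
from xelytics.cache import Cache  # import path assumed; methods below follow the table
cache = Cache(backend="file", cache_dir=".cache/xelytics_demo")  # cache_dir kwarg is an assumption
cache.set("profile:sales", {"rows": 150_432}, ttl=3600)  # store with a one-hour TTL
print(cache.get("profile:sales"))                        # read it back
@cache.cached(ttl=600)                                   # decorator form for function caching
def expensive_summary(path: str) -> dict:
    return {"path": path, "rows": 150_432}               # placeholder for real work
print(expensive_summary("sales.csv"))
cache.clear()                                            # clear the whole backend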
BigQuery
from xelytics.connectors import connect_to_source
connector = connect_to_source(
source_type="bigquery",
project_id="my-project",
credentials_path="/path/to/service-account.json",
)
df = connector.query("""
SELECT * FROM `my-project.dataset.events`
WHERE event_date >= '2025-01-01'
LIMIT 100000
""")
result = analyze(df)
Snowflake
import os
from xelytics.connectors import connect_to_source
connector = connect_to_source(
source_type="snowflake",
account="xy12345",
warehouse="COMPUTE",
database="ANALYTICS",
schema="PUBLIC",
user=os.getenv("SNOWFLAKE_USER"),
password=os.getenv("SNOWFLAKE_PASSWORD"),
)
df = connector.query("SELECT * FROM CUSTOMER_DATA")
result = analyze(df)
S3 / Cloud Storage
import os
from xelytics.connectors import connect_to_source
# Amazon S3
connector = connect_to_source(
source_type="s3",
bucket="my-analytics-bucket",
key="data/sales.parquet",
aws_access_key_id=os.getenv("AWS_ACCESS_KEY"),
aws_secret_access_key=os.getenv("AWS_SECRET_KEY"),
)
df = connector.query() # Returns DataFrame
result = analyze(df)
# Azure Blob Storage
connector = connect_to_source(
source_type="azure_blob",
container_name="data",
blob_name="sales.csv",
connection_string=os.getenv("AZURE_CONN_STRING"),
)
df = connector.query()
result = analyze(df)
# Google Cloud Storage
connector = connect_to_source(
source_type="gcs",
bucket="my-bucket",
key="data/sales.csv",
credentials_path="/path/to/gcp-key.json",
)
df = connector.query()
result = analyze(df)
5. Report Generation (NEW in v0.2.0)
Generate professional, interactive reports in multiple formats.
HTML Report
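A short sketch of generating an HTML report from an existing AnalysisResult. The HTMLReportGenerator constructor and generate() arguments mirror the usage example later on this page; the xelytics.export import path is assumed from the module layout.
from xelytics.export import HTMLReportGenerator  # import path assumed from the module layout
# result is an AnalysisResult from analyze(df) or Xelytics().dataset(df).analyze().run()
generator = HTMLReportGenerator(theme="light", company_name="ACME Corp")
html = generator.generate(result, title="Sales Analysis Report", author="Data Science Team")
with open("sales_report.html", "w") as f:
    f.write(html)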
A related preprocessing example using the eager Pipeline API:
from xelytics.pipeline import Pipeline, correlation_analysis, normalize, pca, remove_outliers
pipeline = Pipeline(name="demo")
pipeline.add_step(
remove_outliers,
name="remove_outliers",
inputs=["df"],
outputs=["df"],
columns=["revenue"],
method="iqr",
threshold=3.0,
)
pipeline.add_step(
normalize,
name="normalize",
inputs=["df"],
outputs=["normalized"],
columns=["revenue", "cost"],
method="minmax",
)
context = pipeline.execute({"df": df})
print(context["normalized"].head())
print(pca(df[["revenue", "cost"]], n_components=2).head())
print(correlation_analysis(df[["revenue", "cost", "profit"]]))
Transformation Graph, Lineage, Trace, and Profiling
from xelytics.dataset import MaterializedDataset
from xelytics.graph.graph import TransformGraph
from xelytics.graph.lineage import LineageTracker
from xelytics.graph.node import DataSourceNode, TransformNode
from xelytics.observability.profiler import ExecutionProfiler
from xelytics.observability.trace import TraceCollector, TraceEntry
graph = TransformGraph()
graph.add_node(DataSourceNode(id="source", dataset=MaterializedDataset(df)))
graph.add_node(
TransformNode(
id="filter",
name="filter",
func=lambda frame: frame.query("revenue > 1000"),
inputs=["source"],
)
)
graph.add_edge("source", "filter")
graph.validate()
graph_df = graph.run()
lineage = LineageTracker()
lineage.record_execution("filter", {"source": "hash-a"}, "hash-b", 12.5)
print(lineage.get_record("filter"))
lineage.clear()
trace = TraceCollector()
trace.add(TraceEntry(step_name="demo", row_count=len(graph_df)))
print(trace.print_trace())
profiler = ExecutionProfiler()
profiler.start("node")
profiler.stop("node", operation="demo", rows_fetched=len(graph_df))
print(profiler.print_profile())
JSON Export
import json
# For programmatic access or storage
with open("analysis.json", "w") as f:
json.dump(result.to_dict(), f, indent=2)
# Later, reconstruct from JSON
from xelytics.schemas.outputs import AnalysisResult
with open("analysis.json") as f:
data = json.load(f)
result = AnalysisResult(**data)
6. Custom Pipelines (NEW in v0.2.0)
Pre-process data with custom steps before analysis.
from xelytics.pipeline import Pipeline, normalize, pca, remove_outliers, correlation_analysis
from xelytics import AnalysisConfig, analyze
# Build a custom pipeline
pipeline = Pipeline([
remove_outliers(method="iqr", threshold=1.5),
normalize(method="minmax"),
pca(n_components=10),
correlation_analysis(threshold=0.7),
])
# Apply before analysis
df_processed = pipeline.fit_transform(df)
result = analyze(df_processed)
# Or use in AnalysisConfig
config = AnalysisConfig(
run_custom_pipeline=True,
custom_pipeline=pipeline,
)
result = analyze(df, config=config)
7. Caching (NEW in v0.2.0)
Speed up repeated analyses on the same data.
File-Based Cache
from xelytics import analyze, AnalysisConfig
from xelytics.cache import FileCache
cache = FileCache(cache_dir="./cache")
config = AnalysisConfig(
enable_caching=True,
cache_backend=cache,
)
# First run: takes full time
result1 = analyze(df, config=config)
# Subsequent runs on same data: instant
result2 = analyze(df, config=config) # Retrieved from cache!
Redis Cache (Distributed)
from xelytics.cache import RedisCache
cache = RedisCache(host="localhost", port=6379, db=0, ttl=3600)
config = AnalysisConfig(
enable_caching=True,
cache_backend=cache,
)
result = analyze(df, config=config)
Clear Cache
from xelytics.cache import clear_cache
# Clear all caches
clear_cache(pattern="*")
# Clear specific patterns
clear_cache(pattern="stats:*") # Only clear stats caches
8. CLI (Command-Line Interface)
Analyze without writing Python code.
# Basic analysis - outputs JSON
xelytics analyze data.csv
# Save to file
xelytics analyze data.csv --output results.json
# Set parameters
xelytics analyze data.csv \
--format=json \
--alpha 0.01 \
--no-llm \
--max-visualizations 20 \
--datetime-column "date"
# Time series analysis
xelytics analyze data.csv \
--enable-time-series \
--datetime-column "date" \
--forecast-periods 30
# Clustering
xelytics analyze data.csv \
--enable-clustering \
--clustering-algorithm kmeans \
--max-clusters 5
# Show version
xelytics --version
# Help
xelytics --help
9. LLM Integration (Optional)
Enhance insights with AI narration.
import os
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="openai", # openai, groq, or local
llm_model="gpt-4",
llm_api_key=os.getenv("OPENAI_API_KEY"),
)
result = analyze(df, config=config)
# Insights now include AI-generated descriptions
for insight in result.insights:
print(f"{insight.title}")
print(f"  {insight.narrative}") # AI-generated explanation
Multiple LLM Providers
# OpenAI
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="openai",
llm_model="gpt-4",
llm_api_key=os.getenv("OPENAI_API_KEY"),
)
# Groq (fast, open source)
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="groq",
llm_model="mixtral-8x7b",
llm_api_key=os.getenv("GROQ_API_KEY"),
)
# Azure OpenAI
config = AnalysisConfig(
enable_llm_insights=True,
llm_provider="azure",
llm_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
llm_api_key=os.getenv("AZURE_OPENAI_KEY"),
)
Extension Registries and Custom Output
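The registries live in xelytics.extension (interfaces.py defines Analyzer, CustomTransform, and OutputFormat; registry.py provides register_* decorators). The snippet below is only a hedged sketch of what registering a custom analyzer might look like; the decorator name and the analyze() hook are assumptions, so see docs/guides/10_extensibility.md for the actual interfaces.
# Hedged sketch only: the decorator and hook names are assumptions based on the
# xelytics.extension module layout, not a confirmed API.
from xelytics.extension import registry
@registry.register_analyzer("row_count_check")      # assumed register_* decorator
class RowCountAnalyzer:
    """Toy analyzer that flags suspiciously small inputs."""
    def analyze(self, df):                           # assumed analyzer hook
        return {"rows": len(df), "too_small": len(df) < 100}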
Sampling and Parallel Execution
from xelytics import analyze, AnalysisConfig
config = AnalysisConfig(
# Auto-sample if > 1M rows
sampling_strategy="auto",
max_rows=1_000_000,
# Or force sampling instead:
# sampling_strategy="stratified",
# sample_size=100_000,
# Parallel execution
parallel_execution=True,
max_workers=4,
)
result = analyze(df, config=config)
Chunked Processing for Very Large Files
from xelytics import AnalysisConfig
from xelytics.engine import analyze_large_dataset
# Process 10M row file without loading into memory
result = analyze_large_dataset(
source="huge_sales_data.csv",
chunksize=50_000,
sample_size=100_000, # Take a sample for full analysis
config=AnalysisConfig(),
)
Configuration Reference
from xelytics import AnalysisConfig
config = AnalysisConfig(
# General
significance_level=0.05,
mode="automated", # automated or semi-automated
# Columns
include_columns=None, # [list] Include only these columns
exclude_columns=None, # [list] Exclude these columns
datetime_column=None, # [str] Column name for time series
# Time Series
enable_time_series=False,
decomposition_method="additive", # additive, multiplicative, stl
forecast_periods=0,
forecast_methods=["arima", "exponential_smoothing"],
anomaly_detection_method="isolation_forest",
anomaly_sensitivity=0.95,
detect_change_points=False,
# Clustering
enable_clustering=False,
clustering_algorithm="auto", # auto, kmeans, dbscan, hierarchical
max_clusters=10,
k_selection_method="elbow",
# Performance
parallel_execution=True,
max_workers=4,
sampling_strategy="auto",
max_rows=1_000_000,
# Caching
enable_caching=False,
cache_backend=None,
# Reporting
max_visualizations=15,
run_custom_pipeline=False,
custom_pipeline=None,
# LLM
enable_llm_insights=False,
llm_provider="openai",
llm_model="gpt-4",
llm_api_key=None,
# Other
random_seed=42,
verbose=True,
)
Usage Examples
Configure Analysis
import os
from datetime import datetime
from xelytics import AnalysisConfig, analyze
from xelytics.export import HTMLReportGenerator, generate_pdf_report  # import path inferred from the xelytics.export module
config = AnalysisConfig(
significance_level=0.01,
enable_time_series=True,
datetime_column="date",
forecast_periods=14,
enable_clustering=True,
clustering_algorithm="kmeans",
max_clusters=5,
parallel_execution=True,
enable_caching=True,
# Reporting
max_visualizations=20,
# LLM narration
enable_llm_insights=True,
llm_provider="openai",
llm_api_key=os.getenv("OPENAI_API_KEY"),
)
result = analyze(df, config=config)
# 4. EXPLORE RESULTS
print(f"\nAnalysis complete in {result.metadata.execution_time_ms}ms")
print(f"  • Tests: {result.metadata.tests_executed}")
print(f"  • Visualizations: {len(result.visualizations)}")
print(f"  • Insights: {len(result.insights)}")
print(f"  • Time series: {len(result.time_series_analysis)}")
print(f"  • Clusters: {len(result.clusters)}")
print("\nKey Insights:")
for i, insight in enumerate(result.insights[:5], 1):
print(f" {i}. {insight.title}")
if hasattr(insight, 'narrative'):
print(f" {insight.narrative[:100]}...")
# 5. GENERATE REPORTS
print("\nGenerating reports...")
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
# HTML Report
html_generator = HTMLReportGenerator(
theme="light",
logo_text="Sales Analytics",
company_name="ACME Corp"
)
html = html_generator.generate(
result,
title="Sales Analysis Report",
author="Data Science Team"
)
html_path = f"reports/sales_analysis_{timestamp}.html"
os.makedirs("reports", exist_ok=True)
with open(html_path, "w") as f:
f.write(html)
print(f"  HTML: {html_path}")
# PDF Report
pdf_bytes = generate_pdf_report(
result,
title="Sales Analysis Report",
author="Data Science Team"
)
pdf_path = f"reports/sales_analysis_{timestamp}.pdf"
with open(pdf_path, "wb") as f:
f.write(pdf_bytes)
print(f"  PDF: {pdf_path}")
# JSON Export
json_path = f"reports/sales_analysis_{timestamp}.json"
import json
with open(json_path, "w") as f:
json.dump(result.to_dict(), f, indent=2)
print(f"  JSON: {json_path}")
print("\nAnalysis complete!")
print(f"Reports saved to: {os.path.abspath('reports')}")
Output:
Loading data...
Loaded 150,432 rows
Configuring analysis...
Running analysis...
Analysis complete in 3421ms
  • Tests: 47
  • Visualizations: 18
  • Insights: 12
  • Time series: 2
  • Clusters: 5
Key Insights:
1. Significant correlation detected: total_amount vs. customer_age
2. Strong seasonality in Q4 sales
3. Customer segmentation: 5 distinct groups identified
4. Outliers detected in unit_price column
5. Increasing trend in repeat customer rate
Generating reports...
  HTML: reports/sales_analysis_20250307_143021.html
  PDF: reports/sales_analysis_20250307_143021.pdf
  JSON: reports/sales_analysis_20250307_143021.json
Analysis complete!
Reports saved to: /home/user/reports
Performance & Scaling
| Dataset Size | Processing Time | Max Parallel Tasks |
|---|---|---|
| 10K rows | 1-2 seconds | 3 |
| 100K rows | 5-10 seconds | 4 |
| 1M rows | 30-60 seconds | 4 |
| 10M rows | 3-5 minutes | 4 (chunked) |
| 100M rows | 10-30 minutes | 4 (chunked + sampled) |
Optimization Strategies:
- Automatic sampling for datasets > 1M rows
- Parallel execution (4 workers by default)
- Result caching (file or Redis)
- Progress callbacks for long-running analyses
- Memory-aware warnings (logs a warning above 1GB)
Feature Comparison
| Feature | v0.1.0 | v0.2.0 |
|---|---|---|
| Statistical Analysis | Yes | Yes |
| Automated test selection | Yes | Yes |
| Effect size calculation | Yes | Yes |
| Assumption checking | Yes | Yes |
| Time Series (NEW) | No | Yes |
| Detection & decomposition | No | Yes |
| ARIMA & ES forecasting | No | Yes |
| Anomaly detection | No | Yes |
| Change point detection | No | Yes |
| Clustering (NEW) | No | Yes |
| K-Means | No | Yes |
| DBSCAN | No | Yes |
| Hierarchical | No | Yes |
| Cluster profiling | No | Yes |
| Performance (NEW) | No | Yes |
| Parallel execution | No | Yes |
| Result caching | No | Yes |
| Sampling strategies | No | Yes |
| Chunked processing | No | Yes |
| Connectors (NEW) | No | Yes |
| PostgreSQL | No | Yes |
| MySQL/MariaDB | No | Yes |
| SQLite | No | Yes |
| BigQuery | No | Yes |
| Snowflake | No | Yes |
| S3/Azure/GCS | No | Yes |
| Export (NEW) | No | Yes |
| HTML reports | No | Yes |
| PDF export | No | Yes |
| PowerPoint slides | No | Yes |
| Jupyter notebooks | No | Yes |
| JSON export | No | Yes |
| Other Features | | |
| Data profiling | Yes | Yes |
| Rule-based insights | Yes | Yes |
| LLM narration | Yes | Yes |
| Custom pipelines | No | Yes |
| Progress callbacks | No | Yes |
| CLI interface | Yes | Yes |
| Backward compatible | n/a | Yes |
Installation & Setup
System Requirements
- Python: 3.9, 3.10, 3.11, 3.12
- OS: Linux, macOS, Windows
- RAM: 2GB minimum; 8GB+ recommended for large datasets
Basic Installation
# Minimal (core features only)
pip install -e .
# Development
pip install -e ".[dev]"
# Production (all features)
pip install -e ".[advanced,connectors,export,llm]"
# Everything (including dev tools)
pip install -e ".[advanced,connectors,export,llm,dev]"
Verify Installation
python -c "from xelytics import analyze; print('Xelytics installed')"
# Check version
python -c "import xelytics; print(xelytics.__version__)"
# Test CLI
xelytics --version
Documentation
Full documentation is available in the docs/ folder:
| Topic | Location |
|---|---|
| Installation | docs/installation.md |
| Quick Start | docs/quickstart.md |
| Statistical Analysis | docs/guides/01_basic_analysis.md |
| Time Series | docs/guides/02_time_series.md |
| Clustering | docs/guides/03_clustering.md |
| Performance | docs/guides/04_performance.md |
| Connectors | docs/guides/05_connectors.md |
| Export & Reports | docs/guides/06_export_reports.md |
| Custom Pipelines | docs/guides/07_custom_pipelines.md |
| CLI Guide | docs/guides/08_cli.md |
| Observability | docs/guides/09_observability.md |
| Extensibility | docs/guides/10_extensibility.md |
| API Reference | docs/api/ |
| Examples | examples/ |
| Migration Guide | docs/migration/v01_to_v02.md |
| v0.2 → v0.3 Migration | MIGRATION_GUIDE_v0.2_to_v0.3.md |
| Architecture | ARCHITECTURE.md |
| API Contract | API_CONTRACT.md |
| Comprehensive Docs | COMPREHENSIVE_DOCUMENTATION.md |
Development
Setup Development Environment
# Clone repository
git clone https://github.com/xelytics/xelytics-core.git
cd xelytics-core
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # or .venv\Scripts\activate on Windows
# Install dev dependencies
pip install -e ".[dev,advanced,connectors,export]"
Running Tests
# All tests
pytest tests/ -v
# Specific test file
pytest tests/test_clustering.py -v
# Tests matching pattern
pytest tests/ -k "test_kmeans" -v
# With coverage report
pytest tests/ --cov=xelytics --cov-report=html
# Only unit tests (exclude slow integration tests)
pytest tests/ -m "not integration" -v
# Only fast tests
pytest tests/ -m "not slow" -v
Code Formatting & Linting
# Format code with Black
black xelytics/ tests/ examples/
# Check formatting
black --check xelytics/ tests/
# Lint with Ruff
ruff check xelytics/ tests/ --fix
# Type checking with mypy
mypy xelytics/
Build & Publish
# Build package
pip install build
python -m build
# Publish to PyPI (requires credentials)
pip install twine
python -m twine upload dist/*
Testing & Quality Assurance
Test Coverage: 85%+ (307 tests)
Test Categories:
| Category | Count | Status |
|---|---|---|
| Unit Tests | 200+ | Passing |
| Integration Tests | 50+ | Passing |
| Performance Tests | 20+ | Passing |
| Backward Compatibility Tests | 8 | Passing (v0.1.0 code works in v0.2.0) |
| Example Scripts | 5 | Working |
Key Test Suites:
- test_core.py - Data ingestion, profiling, feature detection
- test_clustering.py - K-Means, DBSCAN, Hierarchical
- test_timeseries_advanced.py - Decomposition, forecasting, anomalies
- test_stats.py - Statistical tests, effect sizes, assumptions
- test_connectors_integration.py - Database connectivity
- test_export.py - HTML, PDF, PowerPoint, notebook export
- test_caching.py - File and Redis caching
- test_v02_backward_compatibility.py - v0.1.0 compatibility
Run Full Test Suite:
# Quick run (excludes slow tests)
pytest tests/ -m "not slow" --tb=short
# Full run (includes slow + integration)
pytest tests/ -v --tb=short
# With coverage
pytest tests/ --cov=xelytics --cov-report=term-missing
Architecture Evolution (v0.2.x → v0.3.0)
Added in v0.3.0
The v0.2.x architecture remains valid for simple DataFrame workflows: ingest data, detect schema/features, profile columns, run analysis modules, generate visualizations and insights, then export the result. v0.3.0 adds a planning layer in front of that pipeline rather than replacing it outright.
v0.2.x eager flow:
DataFrame -> ingestion -> profiling -> stats/time series/clustering -> insights -> exports
v0.3.0 lazy flow:
Dataset -> ExecutionPlan -> TransformGraph nodes -> executor -> analysis -> trace/profile/result
| Layer | v0.2.x | v0.3.0 |
|---|---|---|
| Public API | analyze(df) | analyze(df) plus Xelytics().dataset(...).analyze().run() |
| Data source | DataFrame or connector-loaded DataFrame | Dataset, MaterializedDataset, LazyDataset, connector-backed sources |
| Pipeline shape | Mostly linear, eager execution | DAG of plan nodes and transform nodes |
| Optimization | Parallel tasks, sampling, result cache | Execution planning, SQL pushdown, chunk-aware execution hooks, node cache |
| Metadata | RunMetadata | RunMetadata plus trace/profiling/lineage-capable metadata |
| Extensibility | Pipeline steps, exporters, LLM providers | Analyzer, transformation, and output registries |
Compatibility guarantee: the v0.3.0 executor still materializes into the established AnalysisResult schema after planning. Existing code that reads summary, statistics, visualizations, insights, metadata, time_series_analysis, or clusters can continue to do so.
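Concretely, code written against the eager result keeps working against a lazily planned run. A minimal sketch (tiny in-memory DataFrame, default config):
import pandas as pd
from xelytics import Xelytics, analyze
df = pd.DataFrame({"revenue": [1200, 800, 1500, 950], "cost": [300, 200, 450, 280]})
eager = analyze(df)                                   # v0.2.x-style eager path
lazy = Xelytics().dataset(df).analyze().run()         # v0.3.0 planned path
# Both paths materialize the same AnalysisResult schema
for result in (eager, lazy):
    print(result.summary.row_count, len(result.insights), len(result.visualizations))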
Architecture
System Design
+----------------------------------+
|         Public API Layer         |
|    analyze() / AnalysisConfig    |
+----------------+-----------------+
                 |
+----------------v-----------------+
|       Data Ingestion Layer       |
|  Connectors, DataFrames, Files   |
+----------------+-----------------+
                 |
+----------------v-----------------+
|         Processing Core          |
|     Type Detection, Sampling     |
|   Feature Detection, Profiling   |
+----------------+-----------------+
                 |
      +----------+-----------+
      |          |           |
+-----v----+ +---v-------+ +-v---------+
|  Stats   | | TimeSeries| | Clustering|
|  Engine  | |  Engine   | |  Engine   |
+-----+----+ +---+-------+ +-+---------+
      |          |           |
      +----------+-----------+
                 |
      +----------v-----------+
      |   Visualization &    |
      |  Insight Generator   |
      +----------+-----------+
                 |
      +----------v-----------+
      |     Export Layer     |
      |  HTML/PDF/PPTX/etc   |
      +----------------------+
Module Breakdown
xelytics-core/
├── xelytics/
│   ├── __init__.py              # Public API
│   ├── engine.py                # Main analyze() function
│   ├── api.py                   # Chainable Xelytics API (v0.3.0)
│   ├── dataset.py               # Dataset abstraction: materialized/lazy/transformed (v0.3.0)
│   ├── exceptions.py            # Exception hierarchy
│   │
│   ├── core/                    # Data pipeline
│   │   ├── ingestion.py         # Type detection, validation
│   │   ├── profiler.py          # Column statistics
│   │   ├── features.py          # Feature detection
│   │   └── chunked.py           # Large dataset processing
│   │
│   ├── stats/                   # Statistical analysis
│   │   ├── engine.py            # Test selection & execution
│   │   ├── planner.py           # Analysis planning
│   │   └── ...
│   │
│   ├── timeseries/              # Time series (v0.2.0)
│   │   ├── detector.py          # Series detection
│   │   ├── decomposition.py     # Trend/seasonal separation
│   │   ├── forecasting.py       # ARIMA/ExpSmoothing
│   │   ├── anomaly.py           # Anomaly detection
│   │   └── change_points.py     # Change point detection
│   │
│   ├── clustering/              # Clustering (v0.2.0)
│   │   ├── kmeans.py            # K-Means
│   │   ├── dbscan.py            # DBSCAN
│   │   ├── hierarchical.py      # Hierarchical clustering
│   │   └── profiler.py          # Cluster profiling
│   │
│   ├── connectors/              # Data sources (v0.2.0)
│   │   ├── postgres.py          # PostgreSQL
│   │   ├── mysql.py             # MySQL/MariaDB
│   │   ├── database.py          # Base SQL class
│   │   ├── s3.py                # AWS S3
│   │   ├── cloud.py             # Azure/GCS
│   │   └── ...
│   │
│   ├── export/                  # Report generation (v0.2.0)
│   │   ├── html.py              # HTML reports
│   │   ├── pdf.py               # PDF export
│   │   ├── pptx.py              # PowerPoint slides
│   │   ├── notebook.py          # Jupyter notebooks
│   │   └── ...
│   │
│   ├── cache/                   # Caching (v0.2.0)
│   │   ├── base.py              # Cache interface
│   │   ├── file.py              # File-based cache
│   │   └── redis.py             # Redis cache
│   │
│   ├── pipeline/                # Custom pipelines (v0.2.0)
│   │   ├── __init__.py          # Pipeline class
│   │   └── steps.py             # Pre-built steps
│   │
│   ├── execution/               # Lazy execution planning (v0.3.0)
│   │   ├── plan.py              # ExecutionPlan and PlanNode
│   │   ├── builder.py           # PlanBuilder
│   │   ├── executor.py          # DAG executor with tracing/profiling
│   │   └── pushdown.py          # SQL pushdown helpers
│   │
│   ├── graph/                   # Transformation DAG (v0.3.0)
│   │   ├── graph.py             # TransformGraph
│   │   ├── node.py              # DataSourceNode, TransformNode, SinkNode
│   │   ├── cache.py             # NodeCache
│   │   └── lineage.py           # LineageTracker
│   │
│   ├── analyzers/               # Modular analyzers (v0.3.0)
│   │   ├── profiling.py         # ProfilingAnalyzer
│   │   ├── correlation.py       # CorrelationAnalyzer
│   │   ├── trend_anomaly.py     # TrendAnomalyAnalyzer
│   │   └── segmentation.py      # SegmentationAnalyzer
│   │
│   ├── observability/           # Tracing and profiling (v0.3.0)
│   │   ├── trace.py             # TraceCollector
│   │   └── profiler.py          # ExecutionProfiler
│   │
│   ├── extension/               # Plugin registries (v0.3.0)
│   │   ├── interfaces.py        # Analyzer, CustomTransform, OutputFormat
│   │   └── registry.py          # register_* decorators
│   │
│   ├── llm/                     # LLM integration
│   │   ├── openai.py            # OpenAI provider
│   │   ├── groq.py              # Groq provider
│   │   └── base.py              # Provider interface
│   │
│   ├── viz/                     # Visualizations
│   │   ├── generator.py         # Plotly spec generation
│   │   └── themes.py            # Color schemes
│   │
│   ├── insights/                # Insight generation
│   │   ├── rules.py             # Rule-based insights
│   │   └── templates.py         # Insight templates
│   │
│   ├── schemas/                 # Type definitions
│   │   ├── config.py            # AnalysisConfig
│   │   └── outputs.py           # AnalysisResult & schemas
│   │
│   └── cli/                     # Command-line interface
│       └── main.py              # CLI entry point
│
├── tests/                       # 300+ tests
│   ├── test_core.py
│   ├── test_clustering.py
│   ├── test_timeseries_*.py
│   ├── test_connectors_integration.py
│   ├── test_export.py
│   └── ...
│
├── examples/                    # Example scripts
│   ├── quickstart.py
│   ├── forecasting_demo.py
│   └── ...
│
├── docs/                        # Full documentation
│   ├── guides/                  # Step-by-step guides
│   ├── api/                     # API reference
│   └── examples/                # Example notebooks
│
└── pyproject.toml               # Dependencies & config
API Classes & Functions
Core Classes
Start with the legacy entry point for simple, eager analysis:
from xelytics import analyze
result = analyze(df)
Adopt this when you need lazy data binding, plan inspection, graph transforms, observability, or extension registries:
from xelytics import Xelytics
result = Xelytics().dataset(df).analyze().run()
Migration notes:
- analyze(df), AnalysisConfig, AnalysisResult, connectors, cache backends, exporters, pipelines, time series modules, and clustering modules remain supported.
- v0.3.0 adds optional result fields: correlation, trend_anomaly, segmentation, trace, and profiling (see the sketch after this list).
- The v0.2.x eager Pipeline remains supported for preprocessing.
- Prefer Dataset.transform() or TransformGraph for transformations that need lineage, node caching, or plan visibility.
- Prefer Xelytics or build_plan() for new lazy and graph-aware workflows.
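A short sketch of probing the optional v0.3.0 fields named above; df is any pandas DataFrame, and only the field names (not their inner structure) are taken from this page:
from xelytics import Xelytics
result = Xelytics().dataset(df).analyze().run()
# Optional v0.3.0 result fields; expected to be absent or empty when an analyzer did not run
for field in ("correlation", "trend_anomaly", "segmentation", "trace", "profiling"):
    value = getattr(result, field, None)
    print(f"{field}: {'present' if value else 'not produced'}")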
Documentation
| Document | Purpose |
|---|---|
| examples/xelytics_core_v0_3_0_complete.ipynb | complete executable v0.3.0 notebook |
| docs/quickstart.md | copy-paste examples |
| docs/index.md | documentation index and feature matrix |
| docs/api/analyze.md | analyze(), Xelytics, and large-dataset API |
| docs/api/config.md | all AnalysisConfig fields |
| docs/api/result_schema.md | result dataclasses and serialization |
| docs/api/execution.md | Dataset, ExecutionPlan, TransformGraph, observability |
| docs/api/extensions.md | extension registry APIs |
| docs/guides/05_connectors.md | database and cloud source usage |
| docs/guides/06_export_reports.md | report export formats |
| MIGRATION_GUIDE_v0.2_to_v0.3.md | v0.2.x to v0.3.0 migration |
| ARCHITECTURE.md | package architecture |
| CHANGELOG.md | release history |
Development
pip install -e ".[dev]"
pytest tests/
Focused v0.3.0 verification:
pytest tests/test_epic1_connectivity.py tests/test_epic2.py tests/test_epic3.py
pytest tests/test_epic4.py tests/test_epic5.py tests/test_epic6.py tests/test_epic7.py
The package supports Python 3.9 through 3.12.
Project Status
Xelytics-Core is beta software. v0.3.0 is compatibility-first: older v0.2.x DataFrame workflows remain valid while the package moves toward the lazy, graph-aware engine model. See CHANGELOG.md and API_CONTRACT.md for versioning and compatibility policy.
License
MIT, as declared in pyproject.toml.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file xelytics_core-0.3.0.tar.gz.
File metadata
- Download URL: xelytics_core-0.3.0.tar.gz
- Upload date:
- Size: 224.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 944b9f600cc534f092af6890318398ca50616bde0329f52689b759ef982a4163 |
| MD5 | 285e64fd22cc9019b3e8005daca895a1 |
| BLAKE2b-256 | 6de154c790c8256e43977cbcbdfa55c1c718c2ffccf4ae22aa6b95819e01f671 |
File details
Details for the file xelytics_core-0.3.0-py3-none-any.whl.
File metadata
- Download URL: xelytics_core-0.3.0-py3-none-any.whl
- Upload date:
- Size: 182.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a3ce0871de74aa1253fa610b3b6524192a03da1884546de9f73e3bff164472a |
| MD5 | 1cb676474b7a3d667d77b923d1f9a7dc |
| BLAKE2b-256 | d259a0fb611e1fdeb1efc5c29fcca474c2c70b3bbe2b444e1771c3ce56b738cb |