
File to Analysis — Automatically perform descriptive statistical analysis and visualization from any data source


f2a — File to Analysis

One line of code → Full statistical analysis + interactive HTML report. 24+ file formats, HuggingFace datasets, 6 languages, 20+ analysis modules, 50+ visualizations.


[Screenshots: f2a Overview Report; f2a Clustering Analysis]

Generated from f2a.analyze("lerobot/roboturk"), a single line of code.


Live Sample Report

📊 View Sample Report (lerobot/roboturk) ← GitHub Pages (recommended)

A fully self-contained interactive HTML report generated from the lerobot/roboturk dataset.

Alternative: Download raw HTML and open in your browser.


Installation

pip install f2a

For advanced analyses (UMAP, ADF tests):

pip install "f2a[advanced]"    # quotes keep shells such as zsh from expanding the brackets

Quick Start

import f2a

# ── Local files ──────────────────────────────────────
report = f2a.analyze("data/sales.csv")
report.show()                    # Print summary to console
report.to_html("output/")       # Save interactive HTML report

# ── HuggingFace datasets ────────────────────────────
report = f2a.analyze("https://huggingface.co/datasets/imdb")
report = f2a.analyze("hf://imdb")
report = f2a.analyze("imdb")    # dataset identifier (e.g. org/dataset) auto-detected

# ── Access results ───────────────────────────────────
report.stats.summary             # Descriptive statistics (DataFrame)
report.stats.correlation_matrix  # Correlation matrix
report.stats.advanced_stats      # Advanced analysis results
report.schema.columns            # Column type information
report.to_dict()                 # Everything as a dictionary
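The three HuggingFace forms in the Quick Start imply some dispatch on the source string. The sketch below shows one way such a dispatch could work. `classify_source` and its rules are hypothetical and written only for illustration; they are not f2a's actual loader logic, which may differ (for example, this sketch would misread a dataset name containing a dot as a file).

```python
import re
from pathlib import PurePosixPath

# Hypothetical helper for illustration; f2a's real detection rules may differ.
HF_URL_PREFIX = "https://huggingface.co/datasets/"
DATASET_ID = re.compile(r"^[\w.-]+(/[\w.-]+)?$")


def classify_source(source: str) -> tuple[str, str]:
    """Return (kind, identifier), where kind is 'huggingface' or 'file'."""
    if source.startswith("hf://"):
        return "huggingface", source[len("hf://"):]
    if source.startswith(HF_URL_PREFIX):
        return "huggingface", source[len(HF_URL_PREFIX):].strip("/")
    # Anything with a file extension is treated as a local path.
    if PurePosixPath(source).suffix:
        return "file", source
    # A bare name or org/dataset pair is treated as a dataset identifier.
    if DATASET_ID.match(source):
        return "huggingface", source
    return "file", source


for s in ("data/sales.csv", "hf://imdb", "imdb", "lerobot/roboturk",
          "https://huggingface.co/datasets/imdb"):
    print(s, "->", classify_source(s))
```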

Example: Analyzing a HuggingFace Dataset

import f2a

report = f2a.analyze("https://huggingface.co/datasets/lerobot/roboturk")
# Output:
# shape: (187507, 11) | subsets: 1
#   default/train: (187507, 11)

report.show()
# Output:
# ╔══════════════════════════════════════════════════════════╗
# ║  f2a Analysis Report — lerobot/roboturk                 ║
# ╠══════════════════════════════════════════════════════════╣
# ║  Rows: 187,507  ·  Columns: 11                         ║
# ║  Numeric: 9  ·  Categorical: 0  ·  Text: 0             ║
# ║  Datetime: 0  ·  Boolean: 0                             ║
# ╚══════════════════════════════════════════════════════════╝

# Save interactive HTML report (2.5 MB self-contained file)
path = report.to_html("output/")
print(path)
# → output/lerobot_roboturk_20260317_090024_report.html
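The saved path above looks like the dataset identifier with "/" replaced by "_", followed by a date and time stamp and a "_report.html" suffix. That reading is an assumption drawn from this single example. `report_filename` below is a hypothetical helper that reproduces the pattern, not a function from f2a.

```python
from datetime import datetime


def report_filename(dataset_id: str, when: datetime) -> str:
    """Build '<safe id>_<YYYYMMDD>_<HHMMSS>_report.html' (assumed pattern)."""
    safe_id = dataset_id.replace("/", "_")  # make the identifier path-safe
    return f"{safe_id}_{when:%Y%m%d_%H%M%S}_report.html"


print(report_filename("lerobot/roboturk", datetime(2026, 3, 17, 9, 0, 24)))
# → lerobot_roboturk_20260317_090024_report.html
```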


# Access statistics programmatically
report.stats.summary
#          timestamp   episode_index  frame_index  ...
# count   187507.00      187507.00    187507.00   ...
# mean         ...           ...          ...     ...
# std          ...           ...          ...     ...

report.stats.correlation_matrix
#                   timestamp  episode_index  frame_index  ...
# timestamp          1.000000      0.978193     0.054412  ...
# episode_index      0.978193      1.000000    -0.003887  ...

# Advanced analysis results
report.stats.advanced_stats.keys()
# dict_keys(['advanced_distribution', 'advanced_correlation', 'clustering',
#            'dimreduction', 'feature_insights', 'advanced_anomaly', ...])
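One way to act on a correlation matrix like the excerpt above is to list the column pairs whose absolute correlation crosses a threshold. The sketch below uses only the three values visible in that excerpt and a cutoff of 0.9. `high_correlation_pairs` is a hypothetical helper, and whether f2a's own high-correlation warning uses `>=` or `>` is not stated in this document.

```python
# Pairwise values copied from the correlation matrix excerpt above.
pairs = {
    ("timestamp", "episode_index"): 0.978193,
    ("timestamp", "frame_index"): 0.054412,
    ("episode_index", "frame_index"): -0.003887,
}


def high_correlation_pairs(pairs, threshold=0.9):
    """Return (col_a, col_b, r) for pairs with |r| at or above the threshold."""
    return [(a, b, r) for (a, b), r in pairs.items() if abs(r) >= threshold]


for a, b, r in high_correlation_pairs(pairs):
    print(f"{a} ~ {b}: r = {r:.3f}")
```

With these values only the timestamp / episode_index pair is reported.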

Multi-Subset HuggingFace Datasets

Datasets with multiple configs and splits are automatically discovered and analyzed.

import f2a

report = f2a.analyze("FINAL-Bench/ALL-Bench-Leaderboard")
print(f"Total: {report.shape[0]} rows across {len(report.subsets)} subsets")

for s in report.subsets:
    print(f"  {s.subset}/{s.split}: {s.shape}")

# Load specific subset
report = f2a.analyze("FINAL-Bench/ALL-Bench-Leaderboard", config="agent", split="train")

The HTML report generates tabbed navigation — each subset/split gets its own analysis page.
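To make the relationship between per-subset shapes, the total row count, and the tab labels concrete, here is a small stand-alone sketch. `SubsetInfo` is a hypothetical stand-in for f2a's per-subset object, and the shapes are made-up numbers, not measurements of any real dataset.

```python
from dataclasses import dataclass


@dataclass
class SubsetInfo:  # hypothetical stand-in for f2a's per-subset report
    subset: str
    split: str
    shape: tuple[int, int]


# Made-up shapes, purely for illustration.
subsets = [
    SubsetInfo("agent", "train", (120, 8)),
    SubsetInfo("agent", "test", (30, 8)),
    SubsetInfo("vision", "train", (200, 8)),
]

# Total rows is the sum over subsets; each subset/split pair becomes one tab.
total_rows = sum(s.shape[0] for s in subsets)
tab_labels = [f"{s.subset}/{s.split}" for s in subsets]

print(total_rows)
print(tab_labels)
```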


HTML Report Features

report.to_html() generates a single self-contained HTML file (no external dependencies) with:

📑 Two-Depth Tab Navigation

[Subset/Split Tabs]
  └── [Basic] | [Advanced]
        ├── Basic: 13 analysis sections
        └── Advanced: 10 advanced analysis sections

🎯 Interactive Elements

| Feature | Description |
|---|---|
| Metric Tooltips | Hover any table header to see a detailed explanation of the metric |
| Method Info Modals | Click the ⓘ button on each section to see a detailed beginner-friendly explanation |
| Image Zoom Modal | Click any chart to view full-size with zoom/pan/drag support |
| Draggable Tables | Wide tables support horizontal drag-scrolling with sticky first column |
| 6-Language i18n | English, Korean, Chinese, Japanese, German, French (switch in the header) |
| Dark/Light Theme | Automatic system preference detection + manual toggle |
| Responsive Layout | Works on desktop, tablet, and mobile |

📖 Beginner-Friendly Descriptions

Every section and every metric includes:

  • Detailed modal descriptions with HTML formatting, examples, and analogies
  • Beginner tips (초심자 팁 / Anfänger-Tipp / Conseil débutant / 初心者向けヒント / 初学者提示)
  • Interpretation guidance — what does this number actually mean?
  • All descriptions are fully translated into 6 languages (not machine-translated placeholders)

Analysis Modules

Basic Analysis (13 sections)

| Section | Key Metrics |
|---|---|
| Overview | Row/column count, type distribution, memory usage |
| Data Quality | Completeness, uniqueness, consistency, validity (0–100%) |
| Preprocessing | Applied steps, before/after comparison |
| Descriptive Statistics | Mean, median, std, SE, CV, MAD, min/max, quartiles, IQR, skewness, kurtosis |
| Distribution Analysis | Shapiro-Wilk, D'Agostino, KS, Anderson-Darling normality tests |
| Correlation Analysis | Pearson, Spearman, Kendall matrices, Cramér's V, VIF |
| Missing Data | Per-column missing ratio, row distribution, pattern analysis |
| Outlier Detection | IQR method, Z-score method, per-column outlier stats |
| Categorical Analysis | Frequency, entropy, normalized entropy, chi-square independence |
| Feature Importance | Variance ranking, mean absolute correlation, mutual information |
| PCA | Explained variance, scree plot, loadings heatmap, biplot |
| Duplicates | Exact duplicate rows, column-wise uniqueness |
| Warnings | High correlation, high missing ratio, constant columns |
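The two outlier rules named in the table (IQR and Z-score) can be sketched with the standard library alone. This is a generic textbook version, not f2a's implementation: the 1.5 × IQR fence, the inclusive quartile method, and the sample standard deviation are all assumptions, and the data is invented.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [v for v in values if abs(v - mean) / std > threshold]


# Invented data with one obvious extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
print(iqr_outliers(data))
print(zscore_outliers(data))
```

Both rules flag only the value 95 here. On skewed or very small samples the two methods can disagree, which is one reason to look at both.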

Advanced Analysis (10 sections)

| Section | Techniques |
|---|---|
| Advanced Distribution | Best-fit distribution selection (7 candidates), power transform analysis, Jarque-Bera test, ECDF, KDE bandwidth optimization |
| Advanced Correlation | Partial correlation, mutual information matrix, bootstrap confidence intervals, correlation network graph |
| Clustering | K-Means (elbow method), DBSCAN, hierarchical clustering (dendrogram), cluster profiling |
| Dimensionality Reduction | t-SNE, UMAP (optional), Factor Analysis |
| Feature Insights | Interaction detection, monotonic relationships, optimal binning, cardinality analysis, data leakage detection |
| Anomaly Detection | Isolation Forest, Local Outlier Factor (LOF), Mahalanobis distance, ensemble consensus |
| Statistical Tests | Levene, Kruskal-Wallis, Mann-Whitney U, chi-square goodness-of-fit, Grubbs test, ADF stationarity |
| Insight Engine | Auto-generated prioritized natural-language insights |
| Cross Analysis | Outlier × cluster intersection, Simpson's paradox detection |
| ML Readiness | Multi-dimensional ML-readiness scoring, encoding recommendations, data type suggestions |

Visualizations (50+)

| Category | Charts |
|---|---|
| Distribution | Histogram + KDE, boxplots, violin plots, Q-Q plots |
| Correlation | Heatmap (Pearson/Spearman/Kendall), partial correlation heatmap, MI heatmap, bootstrap CI plot, network graph |
| Missing | Missing matrix, bar chart, heatmap |
| Outlier | Box plots with outlier markers, scatter plots |
| Categorical | Bar charts, frequency tables |
| PCA | Scree plot, cumulative variance, loadings heatmap, biplot |
| Clustering | Elbow curve, silhouette plot, cluster scatter, dendrogram, cluster profiles |
| Advanced Distribution | ECDF, power transform comparison, KDE bandwidth grid |
| Dimensionality Reduction | t-SNE scatter, Factor Analysis loadings |
| Anomaly | Isolation Forest scores, LOF scores, Mahalanobis distances, consensus heatmap |
| Quality | Radar chart (4 dimensions), per-column quality bars |
| Insights | Insight summary cards, cross-analysis Venn diagrams |

All charts are inline base64 PNG — no external image files needed.
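Inlining an image as base64 generally means placing the encoded bytes in a `data:` URI inside the `<img>` tag. The sketch below shows that general technique; `png_to_img_tag` is a hypothetical helper rather than f2a's report generator, and the 8-byte PNG signature stands in for a real chart.

```python
import base64
import html


def png_to_img_tag(png_bytes: bytes, alt: str) -> str:
    """Embed PNG bytes in an <img> tag using a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    safe_alt = html.escape(alt, quote=True)
    return f'<img alt="{safe_alt}" src="data:image/png;base64,{encoded}">'


# The 8-byte PNG file signature stands in for a real chart image here.
fake_png = b"\x89PNG\r\n\x1a\n"
tag = png_to_img_tag(fake_png, "histogram")
print(tag)
```

Because the image travels inside the HTML, the file can be moved or emailed on its own. The trade-off is that base64 encoding makes each image roughly a third larger.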


Supported Formats (24+)

| Category | Formats |
|---|---|
| Delimited | .csv .tsv .txt .dat .tab .fwf |
| JSON | .json .jsonl .ndjson |
| Spreadsheet | .xlsx .xls .xlsm .xlsb |
| OpenDocument | .ods |
| Columnar | .parquet .pq .feather .ftr .arrow .ipc .orc |
| HDF5 | .hdf .hdf5 .h5 |
| Statistical | .dta (Stata) .sas7bdat .xpt (SAS) .sav .zsav (SPSS) |
| Database | .sqlite .sqlite3 .db .duckdb |
| Pickle | .pkl .pickle |
| Markup | .xml .html .htm |
| HuggingFace | hf:// URL, full URL, or org/dataset pattern |
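A format table like this maps naturally onto a lookup keyed by file extension. The sketch below covers only a subset of the extensions listed. `FORMAT_CATEGORY` and `format_category` are hypothetical names chosen for illustration, not part of f2a's API.

```python
from pathlib import Path

# Subset of the table above, keyed by lowercase extension.
FORMAT_CATEGORY = {
    ".csv": "Delimited", ".tsv": "Delimited",
    ".json": "JSON", ".jsonl": "JSON",
    ".xlsx": "Spreadsheet", ".ods": "OpenDocument",
    ".parquet": "Columnar", ".feather": "Columnar",
    ".h5": "HDF5", ".dta": "Statistical",
    ".sqlite": "Database", ".pkl": "Pickle", ".xml": "Markup",
}


def format_category(path: str) -> str:
    """Look up the format category; raise ValueError for unknown extensions."""
    ext = Path(path).suffix.lower()
    try:
        return FORMAT_CATEGORY[ext]
    except KeyError:
        raise ValueError(f"Unsupported extension: {ext!r}") from None


print(format_category("data/sales.csv"))
print(format_category("exports/Survey.DTA"))
```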

Configuration

import f2a
from f2a import AnalysisConfig

# ── Preset configs ───────────────────────────────────
config = AnalysisConfig.fast()        # Skip PCA, feature importance, advanced
config = AnalysisConfig.minimal()     # Descriptive + missing only
config = AnalysisConfig.basic_only()  # All basic on, all advanced off

# ── Custom config ────────────────────────────────────
config = AnalysisConfig(
    advanced=True,
    clustering=True,
    advanced_anomaly=True,
    statistical_tests=True,
    insight_engine=True,
    cross_analysis=True,
    ml_readiness=True,
    outlier_method="zscore",          # "iqr" (default) or "zscore"
    outlier_threshold=3.0,            # Z-score cutoff
    correlation_threshold=0.9,        # High-correlation warning threshold
    pca_max_components=10,
    max_cluster_k=10,                 # Max K for K-Means elbow search
    tsne_perplexity=30.0,
    bootstrap_iterations=1000,
    max_sample_for_advanced=5000,     # Subsample for expensive analyses
)

report = f2a.analyze("data.csv", config=config)
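The `max_sample_for_advanced` option suggests that expensive analyses run on a capped number of rows. A common way to do that is a seeded random sample, sketched below. How f2a actually selects the rows (random, stratified, or first-N) is not stated here, so treat `subsample` as a hypothetical helper.

```python
import random


def subsample(rows, max_rows=5000, seed=42):
    """Return rows unchanged if small enough, else a reproducible random sample."""
    if len(rows) <= max_rows:
        return rows
    rng = random.Random(seed)  # fixed seed keeps repeated runs comparable
    return rng.sample(rows, max_rows)


rows = list(range(187_507))  # same row count as the roboturk example
sample = subsample(rows)
print(len(sample))
```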

Config Options

| Option | Default | Description |
|---|---|---|
| descriptive | True | Basic descriptive statistics |
| distribution | True | Distribution & normality tests |
| correlation | True | Correlation matrices |
| outlier | True | Outlier detection |
| categorical | True | Categorical variable analysis |
| feature_importance | True | Feature importance ranking |
| pca | True | PCA analysis |
| duplicates | True | Duplicate detection |
| quality_score | True | Data quality scoring |
| advanced | True | Master toggle for all advanced analyses |
| advanced_distribution | True | Best-fit distribution, ECDF, power transform |
| advanced_correlation | True | Partial correlation, MI matrix, bootstrap CI |
| clustering | True | K-Means, DBSCAN, hierarchical |
| advanced_dimreduction | True | t-SNE, UMAP, Factor Analysis |
| feature_insights | True | Interaction & leakage detection |
| advanced_anomaly | True | Isolation Forest, LOF, Mahalanobis |
| statistical_tests | True | Levene, Kruskal-Wallis, Grubbs, ADF |
| insight_engine | True | Auto-generated insights |
| cross_analysis | True | Cross-dimensional analysis |
| column_role | True | Column role detection |
| ml_readiness | True | ML readiness scoring |
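The `advanced` option is described as a master toggle, which usually means an advanced analysis runs only when both its own flag and the master flag are on. The miniature below illustrates that gating. `ToggleConfig` is a hypothetical stand-in with just a few of the options, and the exact semantics in `AnalysisConfig` are assumed rather than confirmed.

```python
from dataclasses import dataclass


@dataclass
class ToggleConfig:  # hypothetical miniature of AnalysisConfig
    descriptive: bool = True
    pca: bool = True
    advanced: bool = True          # master toggle
    clustering: bool = True
    advanced_anomaly: bool = True

    # Options gated by the master toggle (plain class attribute, not a field).
    ADVANCED_OPTIONS = ("clustering", "advanced_anomaly")

    def is_enabled(self, name: str) -> bool:
        """An advanced option is active only if the master toggle is on too."""
        enabled = getattr(self, name)
        if name in self.ADVANCED_OPTIONS:
            return self.advanced and enabled
        return enabled


cfg = ToggleConfig(advanced=False)
print(cfg.is_enabled("clustering"))  # gated off by the master toggle
print(cfg.is_enabled("pca"))         # basic option, unaffected
```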

API Reference

f2a.analyze(source, **kwargs) → AnalysisReport

| Parameter | Type | Description |
|---|---|---|
| source | str | File path, URL, or HuggingFace dataset identifier |
| config | AnalysisConfig | Analysis configuration (optional) |
| config | str | HuggingFace dataset config/subset name (optional) |
| split | str | HuggingFace dataset split name (optional) |

AnalysisReport

| Attribute / Method | Type | Description |
|---|---|---|
| .shape | tuple[int, int] | (total_rows, columns) |
| .schema | SchemaInfo | Column types and metadata |
| .stats | StatsResult | All statistical results |
| .stats.summary | DataFrame | Descriptive statistics table |
| .stats.correlation_matrix | DataFrame | Correlation matrix |
| .stats.advanced_stats | dict | Advanced analysis results |
| .subsets | list[SubsetReport] | Per-subset results (multi-subset HF datasets) |
| .warnings | list[str] | Analysis warnings |
| .show() | | Print summary to console |
| .to_html(output_dir) | Path | Save interactive HTML report |
| .to_dict() | dict | Export all results as dictionary |

Project Structure

f2a/
├── __init__.py              # Public API: analyze(), AnalysisConfig
├── _version.py
├── core/
│   ├── analyzer.py          # Main analysis orchestrator
│   ├── config.py            # AnalysisConfig dataclass
│   ├── loader.py            # 24+ format data loader
│   ├── preprocessor.py      # Data preprocessing pipeline
│   └── schema.py            # Schema inference
├── stats/                   # 21 analysis modules
│   ├── descriptive.py       # Mean, median, std, quartiles, etc.
│   ├── distribution.py      # Normality tests, skew/kurtosis
│   ├── correlation.py       # Pearson, Spearman, Kendall, VIF
│   ├── missing.py           # Missing data analysis
│   ├── outlier.py           # IQR / Z-score outlier detection
│   ├── categorical.py       # Frequency, entropy, chi-square
│   ├── feature_importance.py
│   ├── pca_analysis.py
│   ├── duplicates.py
│   ├── quality.py           # 4-dimension quality scoring
│   ├── advanced_distribution.py
│   ├── advanced_correlation.py
│   ├── advanced_anomaly.py  # Isolation Forest, LOF, Mahalanobis
│   ├── advanced_dimreduction.py  # t-SNE, UMAP, Factor Analysis
│   ├── clustering.py        # K-Means, DBSCAN, hierarchical
│   ├── feature_insights.py  # Interaction, leakage detection
│   ├── statistical_tests.py # Levene, KW, Mann-Whitney, ADF
│   ├── insight_engine.py    # Auto insight generation
│   ├── cross_analysis.py    # Cross-dimensional analysis
│   ├── column_role.py       # Column role inference
│   └── ml_readiness.py      # ML readiness scoring
├── viz/                     # 16 visualization modules
│   ├── plots.py             # Base plot utilities
│   ├── theme.py             # Consistent theming
│   ├── dist_plots.py
│   ├── corr_plots.py
│   ├── missing_plots.py
│   ├── outlier_plots.py
│   ├── categorical_plots.py
│   ├── pca_plots.py
│   ├── quality_plots.py
│   ├── cluster_plots.py
│   ├── advanced_dist_plots.py
│   ├── advanced_corr_plots.py
│   ├── advanced_anomaly_plots.py
│   ├── dimreduction_plots.py
│   ├── insight_plots.py
│   └── cross_plots.py
├── report/
│   ├── generator.py         # HTML report generator
│   └── i18n.py              # 6-language translations
└── utils/
    ├── exceptions.py
    ├── logging.py
    ├── type_inference.py
    └── validators.py

Internationalization (i18n)

The HTML report supports 6 languages with a language selector in the header:

| Language | Code | Description Quality |
|---|---|---|
| 🇺🇸 English | en | Full detailed descriptions with beginner tips |
| 🇰🇷 Korean | ko | Full detailed descriptions with 초심자 팁 |
| 🇨🇳 Chinese | zh | Full detailed descriptions with 初学者提示 |
| 🇯🇵 Japanese | ja | Full detailed descriptions with 初心者向けヒント |
| 🇩🇪 German | de | Full detailed descriptions with Anfänger-Tipp |
| 🇫🇷 French | fr | Full detailed descriptions with Conseil débutant |

Each language includes:

  • ~120 metric tooltip translations — hover any table header
  • ~50 section modal descriptions — click the ⓘ button on each section
  • All UI labels, buttons, and messages
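A translation table with an English fallback is a common way to structure this kind of i18n. The sketch below uses the beginner-tip labels quoted in this section. The key name `beginner_tip`, the `translate` helper, and the fallback behaviour are assumptions made for illustration, not the contents of f2a's `i18n.py`.

```python
# Labels taken from the language table above; the key name is hypothetical.
TRANSLATIONS = {
    "en": {"beginner_tip": "Beginner tip"},
    "ko": {"beginner_tip": "초심자 팁"},
    "zh": {"beginner_tip": "初学者提示"},
    "ja": {"beginner_tip": "初心者向けヒント"},
    "de": {"beginner_tip": "Anfänger-Tipp"},
    "fr": {"beginner_tip": "Conseil débutant"},
}


def translate(key: str, lang: str, default_lang: str = "en") -> str:
    """Look up key in lang, fall back to the default language, then to the key."""
    table = TRANSLATIONS.get(lang, TRANSLATIONS[default_lang])
    return table.get(key, TRANSLATIONS[default_lang].get(key, key))


print(translate("beginner_tip", "de"))
print(translate("beginner_tip", "es"))  # unsupported language falls back to English
```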

Requirements

  • Python ≥ 3.10
  • Core: pandas, numpy, matplotlib, seaborn, scipy, scikit-learn
  • Formats: datasets (HuggingFace), openpyxl, pyarrow, pyreadstat, tables, odfpy, lxml, duckdb
  • UI: rich, jinja2
  • Optional: networkx, umap-learn, statsmodels (install with pip install f2a[advanced])

Development

# Clone and install
git clone https://github.com/CocoRoF/f2a.git
cd f2a
pip install -e ".[dev]"

# Run tests (88 tests)
pytest git_action/tests/ -q

# Lint
ruff check f2a/

License

MIT License — See LICENSE for details.
