File to Analysis — Automatically perform descriptive statistical analysis and visualization from any data source

f2a — File to Analysis

One line of code → Full statistical analysis + interactive HTML report. 24+ file formats, HuggingFace datasets, 6 languages, 20+ analysis modules, 50+ visualizations.

[Screenshots: f2a Overview Report, f2a Clustering Analysis]

Generated from f2a.analyze("lerobot/roboturk") — a single line of code.

Live Sample Report

📊 View Sample Report (lerobot/roboturk) ← GitHub Pages (recommended)

A fully self-contained interactive HTML report generated from the lerobot/roboturk dataset.

Alternative: Download raw HTML and open in your browser.

Installation

pip install f2a

For advanced analyses (UMAP, ADF tests):

pip install "f2a[advanced]"

Quick Start

import f2a

# ── Local files ──────────────────────────────────────
report = f2a.analyze("data/sales.csv")
report.show()                    # Print summary to console
report.to_html("output/")       # Save interactive HTML report

# ── HuggingFace datasets ────────────────────────────
report = f2a.analyze("https://huggingface.co/datasets/imdb")
report = f2a.analyze("hf://imdb")
report = f2a.analyze("stanfordnlp/imdb")    # org/dataset pattern auto-detected

# ── Access results ───────────────────────────────────
report.stats.summary             # Descriptive statistics (DataFrame)
report.stats.correlation_matrix  # Correlation matrix
report.stats.advanced_stats      # Advanced analysis results
report.schema.columns            # Column type information
report.to_dict()                 # Everything as a dictionary

Example: Analyzing a HuggingFace Dataset

import f2a

report = f2a.analyze("https://huggingface.co/datasets/lerobot/roboturk")

shape: (187507, 11) | subsets: 1
  default/train: (187507, 11)

report.show()

╔══════════════════════════════════════════════════════════╗
║  f2a Analysis Report — lerobot/roboturk                 ║
╠══════════════════════════════════════════════════════════╣
║  Rows: 187,507  ·  Columns: 11                         ║
║  Numeric: 9  ·  Categorical: 0  ·  Text: 0             ║
║  Datetime: 0  ·  Boolean: 0                             ║
╚══════════════════════════════════════════════════════════╝

# Save interactive HTML report (2.5 MB self-contained file)
path = report.to_html("output/")
print(path)
# → output/lerobot_roboturk_20260317_090024_report.html

📊 View this report live

# Access statistics programmatically
report.stats.summary
#          timestamp   episode_index  frame_index  ...
# count   187507.00      187507.00    187507.00   ...
# mean         ...           ...          ...     ...
# std          ...           ...          ...     ...

report.stats.correlation_matrix
#                   timestamp  episode_index  frame_index  ...
# timestamp          1.000000      0.978193     0.054412  ...
# episode_index      0.978193      1.000000    -0.003887  ...

# Advanced analysis results
report.stats.advanced_stats.keys()
# dict_keys(['advanced_distribution', 'advanced_correlation', 'clustering',
#            'dimreduction', 'feature_insights', 'advanced_anomaly', ...])

Multi-Subset HuggingFace Datasets

Datasets with multiple configs and splits are automatically discovered and analyzed.

import f2a

report = f2a.analyze("FINAL-Bench/ALL-Bench-Leaderboard")
print(f"Total: {report.shape[0]} rows across {len(report.subsets)} subsets")

for s in report.subsets:
    print(f"  {s.subset}/{s.split}: {s.shape}")

# Load specific subset
report = f2a.analyze("FINAL-Bench/ALL-Bench-Leaderboard", config="agent", split="train")

The HTML report generates tabbed navigation — each subset/split gets its own analysis page.

HTML Report Features

report.to_html() generates a single self-contained HTML file (no external dependencies) with:

📑 Two-Depth Tab Navigation

[Subset/Split Tabs]
  └── [Basic] | [Advanced]
        ├── Basic: 13 analysis sections
        └── Advanced: 10 advanced analysis sections

🎯 Interactive Elements

Feature	Description
Metric Tooltips	Hover any table header to see a detailed explanation of the metric
Method Info Modals	Click the ⓘ button on each section to see a detailed beginner-friendly explanation
Image Zoom Modal	Click any chart to view full-size with zoom/pan/drag support
Draggable Tables	Wide tables support horizontal drag-scrolling with sticky first column
6-Language i18n	English, Korean, Chinese, Japanese, German, French — switch in the header
Dark/Light Theme	Automatic system preference detection + manual toggle
Responsive Layout	Works on desktop, tablet, and mobile

📖 Beginner-Friendly Descriptions

Every section and every metric includes:

- Detailed modal descriptions with HTML formatting, examples, and analogies
- Beginner tips (초심자 팁 / Anfänger-Tipp / Conseil débutant / 初心者向けヒント / 初学者提示)
- Interpretation guidance: what does this number actually mean?
- All descriptions are fully translated into 6 languages (not machine-translated placeholders)

Analysis Modules

Basic Analysis (13 sections)

Section	Key Metrics
Overview	Row/column count, type distribution, memory usage
Data Quality	Completeness, uniqueness, consistency, validity (0–100%)
Preprocessing	Applied steps, before/after comparison
Descriptive Statistics	Mean, median, std, SE, CV, MAD, min/max, quartiles, IQR, skewness, kurtosis
Distribution Analysis	Shapiro-Wilk, D'Agostino, KS, Anderson-Darling normality tests
Correlation Analysis	Pearson, Spearman, Kendall matrices, Cramér's V, VIF
Missing Data	Per-column missing ratio, row distribution, pattern analysis
Outlier Detection	IQR method, Z-score method, per-column outlier stats
Categorical Analysis	Frequency, entropy, normalized entropy, chi-square independence
Feature Importance	Variance ranking, mean absolute correlation, mutual information
PCA	Explained variance, scree plot, loadings heatmap, biplot
Duplicates	Exact duplicate rows, column-wise uniqueness
Warnings	High correlation, high missing ratio, constant columns
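
To make the Descriptive Statistics row concrete, here is a minimal standard-library sketch of a few of those metrics. It is not f2a's implementation: the `describe` helper is hypothetical, "MAD" is assumed to mean median absolute deviation, and the quartile method (`inclusive`) is an assumption as well.

```python
import math
import statistics


def describe(values):
    """Compute a handful of descriptive metrics for one numeric column."""
    n = len(values)
    mean = statistics.fmean(values)
    median = statistics.median(values)
    std = statistics.stdev(values)                 # sample standard deviation (n - 1)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "median": median,
        "std": std,
        "se": std / math.sqrt(n),                  # standard error of the mean
        "cv": std / mean if mean else math.nan,    # coefficient of variation
        "mad": statistics.median(abs(v - median) for v in values),
        "iqr": q3 - q1,                            # interquartile range
    }


stats = describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
for name, value in stats.items():
    print(f"{name:>6}: {value:.4f}")
```

Other libraries may use a different quartile interpolation or a population (n) denominator for the standard deviation, so small differences from a report's numbers would not be surprising.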

Advanced Analysis (10 sections)

Section	Techniques
Advanced Distribution	Best-fit distribution selection (7 candidates), power transform analysis, Jarque-Bera test, ECDF, KDE bandwidth optimization
Advanced Correlation	Partial correlation, mutual information matrix, bootstrap confidence intervals, correlation network graph
Clustering	K-Means (elbow method), DBSCAN, hierarchical clustering (dendrogram), cluster profiling
Dimensionality Reduction	t-SNE, UMAP (optional), Factor Analysis
Feature Insights	Interaction detection, monotonic relationships, optimal binning, cardinality analysis, data leakage detection
Anomaly Detection	Isolation Forest, Local Outlier Factor (LOF), Mahalanobis distance, ensemble consensus
Statistical Tests	Levene, Kruskal-Wallis, Mann-Whitney U, chi-square goodness-of-fit, Grubbs test, ADF stationarity
Insight Engine	Auto-generated prioritized natural-language insights
Cross Analysis	Outlier × cluster intersection, Simpson's paradox detection
ML Readiness	Multi-dimensional ML-readiness scoring, encoding recommendations, data type suggestions

Visualizations (50+)

Category	Charts
Distribution	Histogram + KDE, boxplots, violin plots, Q-Q plots
Correlation	Heatmap (Pearson/Spearman/Kendall), partial correlation heatmap, MI heatmap, bootstrap CI plot, network graph
Missing	Missing matrix, bar chart, heatmap
Outlier	Box plots with outlier markers, scatter plots
Categorical	Bar charts, frequency tables
PCA	Scree plot, cumulative variance, loadings heatmap, biplot
Clustering	Elbow curve, silhouette plot, cluster scatter, dendrogram, cluster profiles
Advanced Distribution	ECDF, power transform comparison, KDE bandwidth grid
Dimensionality Reduction	t-SNE scatter, Factor Analysis loadings
Anomaly	Isolation Forest scores, LOF scores, Mahalanobis distances, consensus heatmap
Quality	Radar chart (4 dimensions), per-column quality bars
Insights	Insight summary cards, cross-analysis Venn diagrams

All charts are inline base64 PNG — no external image files needed.
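
For readers unfamiliar with inline images, the following sketch shows the general technique of embedding PNG bytes in HTML as a base64 data URI. It is only an illustration: `tiny_png` and `as_img_tag` are hypothetical helpers, and the 1x1 pixel image stands in for a rendered chart.

```python
import base64
import html
import struct
import zlib


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, then CRC-32 over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png() -> bytes:
    """Return a valid 1x1 red PNG, used here as a placeholder for a chart."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)   # 1x1, 8-bit RGB, no interlace
    raw = b"\x00" + bytes([255, 0, 0])                      # filter byte + one RGB pixel
    return (b"\x89PNG\r\n\x1a\n"
            + _chunk(b"IHDR", ihdr)
            + _chunk(b"IDAT", zlib.compress(raw))
            + _chunk(b"IEND", b""))


def as_img_tag(png: bytes, alt: str) -> str:
    """Embed PNG bytes directly in an <img> tag, so no separate file is needed."""
    encoded = base64.b64encode(png).decode("ascii")
    return f'<img alt="{html.escape(alt, quote=True)}" src="data:image/png;base64,{encoded}">'


tag = as_img_tag(tiny_png(), "demo chart")
print(tag[:60] + "...")
```

Base64 inflates the payload by roughly a third, which is the usual trade-off for a report that can be shared as one file.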

Supported Formats (24+)

Category	Formats
Delimited	`.csv` `.tsv` `.txt` `.dat` `.tab` `.fwf`
JSON	`.json` `.jsonl` `.ndjson`
Spreadsheet	`.xlsx` `.xls` `.xlsm` `.xlsb`
OpenDocument	`.ods`
Columnar	`.parquet` `.pq` `.feather` `.ftr` `.arrow` `.ipc` `.orc`
HDF5	`.hdf` `.hdf5` `.h5`
Statistical	`.dta` (Stata) `.sas7bdat` `.xpt` (SAS) `.sav` `.zsav` (SPSS)
Database	`.sqlite` `.sqlite3` `.db` `.duckdb`
Pickle	`.pkl` `.pickle`
Markup	`.xml` `.html` `.htm`
HuggingFace	`hf://` URL, full URL, or `org/dataset` pattern
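
The HuggingFace row lists three ways to name a dataset, next to ordinary file extensions. As a rough illustration of how a source string could be told apart, here is a sketch with a hypothetical `classify_source` helper. It is not f2a's loader: the regular expression, the order of the checks, and the small extension subset are all assumptions.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed shape of an "org/dataset" identifier, e.g. "lerobot/roboturk".
HF_REPO = re.compile(r"^[\w.-]+/[\w.-]+$")
# A small subset of the extensions from the table above, for illustration only.
FILE_SUFFIXES = {".csv", ".tsv", ".json", ".jsonl", ".parquet", ".xlsx", ".sqlite"}


def classify_source(source: str) -> str:
    """Return 'huggingface', 'url', 'file', or 'unknown' for a source string."""
    if source.startswith("hf://"):
        return "huggingface"
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        if parsed.netloc == "huggingface.co" and parsed.path.startswith("/datasets/"):
            return "huggingface"
        return "url"
    # Check the extension before the org/dataset pattern: "data/sales.csv"
    # would otherwise also look like an "org/dataset" identifier.
    if PurePosixPath(source).suffix.lower() in FILE_SUFFIXES:
        return "file"
    if HF_REPO.match(source):
        return "huggingface"
    return "unknown"


for candidate in ("data/sales.csv", "hf://imdb", "lerobot/roboturk"):
    print(candidate, "->", classify_source(candidate))
```

The ordering matters because a relative file path and an `org/dataset` identifier can look alike; passing an explicit `hf://` prefix avoids the ambiguity.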

Configuration

import f2a
from f2a import AnalysisConfig

# ── Preset configs ───────────────────────────────────
config = AnalysisConfig.fast()        # Skip PCA, feature importance, advanced
config = AnalysisConfig.minimal()     # Descriptive + missing only
config = AnalysisConfig.basic_only()  # All basic on, all advanced off

# ── Custom config ────────────────────────────────────
config = AnalysisConfig(
    advanced=True,
    clustering=True,
    advanced_anomaly=True,
    statistical_tests=True,
    insight_engine=True,
    cross_analysis=True,
    ml_readiness=True,
    outlier_method="zscore",          # "iqr" (default) or "zscore"
    outlier_threshold=3.0,            # Z-score cutoff
    correlation_threshold=0.9,        # High-correlation warning threshold
    pca_max_components=10,
    max_cluster_k=10,                 # Max K for K-Means elbow search
    tsne_perplexity=30.0,
    bootstrap_iterations=1000,
    max_sample_for_advanced=5000,     # Subsample for expensive analyses
)

report = f2a.analyze("data.csv", config=config)
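
To show what the two `outlier_method` choices mean in practice, the sketch below implements the textbook IQR fence rule and the Z-score rule with the standard library. It is an illustration, not f2a's code: the 1.5 × IQR multiplier, the population standard deviation, and the helper names are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)      # population standard deviation
    if std == 0:
        return []                        # a constant column has no outliers
    return [v for v in values if abs(v - mean) / std > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
print("IQR method:    ", iqr_outliers(data))
print("Z-score method:", zscore_outliers(data))
```

The Z-score rule depends on the mean and standard deviation, which the outlier itself inflates, so in small samples a cutoff of 3.0 can be hard to exceed; the IQR rule is less affected because quartiles are robust to extreme values.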

Config Options

Option	Default	Description
`descriptive`	`True`	Basic descriptive statistics
`distribution`	`True`	Distribution & normality tests
`correlation`	`True`	Correlation matrices
`outlier`	`True`	Outlier detection
`categorical`	`True`	Categorical variable analysis
`feature_importance`	`True`	Feature importance ranking
`pca`	`True`	PCA analysis
`duplicates`	`True`	Duplicate detection
`quality_score`	`True`	Data quality scoring
`advanced`	`True`	Master toggle for all advanced analyses
`advanced_distribution`	`True`	Best-fit distribution, ECDF, power transform
`advanced_correlation`	`True`	Partial correlation, MI matrix, bootstrap CI
`clustering`	`True`	K-Means, DBSCAN, hierarchical
`advanced_dimreduction`	`True`	t-SNE, UMAP, Factor Analysis
`feature_insights`	`True`	Interaction & leakage detection
`advanced_anomaly`	`True`	Isolation Forest, LOF, Mahalanobis
`statistical_tests`	`True`	Levene, Kruskal-Wallis, Grubbs, ADF
`insight_engine`	`True`	Auto-generated insights
`cross_analysis`	`True`	Cross-dimensional analysis
`column_role`	`True`	Column role detection
`ml_readiness`	`True`	ML readiness scoring
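
The `advanced` option is described as a master toggle. One plausible reading is that an advanced analysis runs only when both the master toggle and its own flag are on. The sketch below models that reading with a hypothetical `SketchConfig` dataclass; it is not `AnalysisConfig`, and how f2a actually combines the flags is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config with a master switch (not f2a's class)."""
    descriptive: bool = True
    correlation: bool = True
    advanced: bool = True             # master toggle for the advanced group
    clustering: bool = True
    advanced_anomaly: bool = True

    # Names of the flags gated by the master toggle (class attribute, not a field).
    _ADVANCED = ("clustering", "advanced_anomaly")

    def is_enabled(self, name: str) -> bool:
        """An advanced analysis needs both its own flag and the master toggle."""
        flag = getattr(self, name)
        if name in self._ADVANCED:
            return self.advanced and flag
        return flag


config = SketchConfig(advanced=False)
print(config.is_enabled("descriptive"))   # basic analyses are unaffected
print(config.is_enabled("clustering"))    # gated off by the master toggle
```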

API Reference

`f2a.analyze(source, **kwargs) → AnalysisReport`

Parameter	Type	Description
`source`	`str`	File path, URL, or HuggingFace dataset identifier
`config`	`AnalysisConfig`	Analysis configuration (optional)
`config`	`str`	HuggingFace dataset config/subset name (optional)
`split`	`str`	HuggingFace dataset split name (optional)

`AnalysisReport`

Attribute / Method	Type	Description
`.shape`	`tuple[int, int]`	`(total_rows, columns)`
`.schema`	`SchemaInfo`	Column types and metadata
`.stats`	`StatsResult`	All statistical results
`.stats.summary`	`DataFrame`	Descriptive statistics table
`.stats.correlation_matrix`	`DataFrame`	Correlation matrix
`.stats.advanced_stats`	`dict`	Advanced analysis results
`.subsets`	`list[SubsetReport]`	Per-subset results (multi-subset HF datasets)
`.warnings`	`list[str]`	Analysis warnings
`.show()`	—	Print summary to console
`.to_html(output_dir)`	`Path`	Save interactive HTML report
`.to_dict()`	`dict`	Export all results as dictionary

Project Structure

f2a/
├── __init__.py              # Public API: analyze(), AnalysisConfig
├── _version.py
├── core/
│   ├── analyzer.py          # Main analysis orchestrator
│   ├── config.py            # AnalysisConfig dataclass
│   ├── loader.py            # 24+ format data loader
│   ├── preprocessor.py      # Data preprocessing pipeline
│   └── schema.py            # Schema inference
├── stats/                   # 21 analysis modules
│   ├── descriptive.py       # Mean, median, std, quartiles, etc.
│   ├── distribution.py      # Normality tests, skew/kurtosis
│   ├── correlation.py       # Pearson, Spearman, Kendall, VIF
│   ├── missing.py           # Missing data analysis
│   ├── outlier.py           # IQR / Z-score outlier detection
│   ├── categorical.py       # Frequency, entropy, chi-square
│   ├── feature_importance.py
│   ├── pca_analysis.py
│   ├── duplicates.py
│   ├── quality.py           # 4-dimension quality scoring
│   ├── advanced_distribution.py
│   ├── advanced_correlation.py
│   ├── advanced_anomaly.py  # Isolation Forest, LOF, Mahalanobis
│   ├── advanced_dimreduction.py  # t-SNE, UMAP, Factor Analysis
│   ├── clustering.py        # K-Means, DBSCAN, hierarchical
│   ├── feature_insights.py  # Interaction, leakage detection
│   ├── statistical_tests.py # Levene, KW, Mann-Whitney, ADF
│   ├── insight_engine.py    # Auto insight generation
│   ├── cross_analysis.py    # Cross-dimensional analysis
│   ├── column_role.py       # Column role inference
│   └── ml_readiness.py      # ML readiness scoring
├── viz/                     # 16 visualization modules
│   ├── plots.py             # Base plot utilities
│   ├── theme.py             # Consistent theming
│   ├── dist_plots.py
│   ├── corr_plots.py
│   ├── missing_plots.py
│   ├── outlier_plots.py
│   ├── categorical_plots.py
│   ├── pca_plots.py
│   ├── quality_plots.py
│   ├── cluster_plots.py
│   ├── advanced_dist_plots.py
│   ├── advanced_corr_plots.py
│   ├── advanced_anomaly_plots.py
│   ├── dimreduction_plots.py
│   ├── insight_plots.py
│   └── cross_plots.py
├── report/
│   ├── generator.py         # HTML report generator
│   └── i18n.py              # 6-language translations
└── utils/
    ├── exceptions.py
    ├── logging.py
    ├── type_inference.py
    └── validators.py

Internationalization (i18n)

The HTML report supports 6 languages with a language selector in the header:

Language	Code	Description Quality
🇺🇸 English	`en`	Full detailed descriptions with beginner tips
🇰🇷 Korean	`ko`	Full detailed descriptions with 초심자 팁
🇨🇳 Chinese	`zh`	Full detailed descriptions with 初学者提示
🇯🇵 Japanese	`ja`	Full detailed descriptions with 初心者向けヒント
🇩🇪 German	`de`	Full detailed descriptions with Anfänger-Tipp
🇫🇷 French	`fr`	Full detailed descriptions with Conseil débutant

Each language includes:

- ~120 metric tooltip translations (hover any table header)
- ~50 section modal descriptions (click the ⓘ button on each section)
- All UI labels, buttons, and messages
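
A per-language tooltip table is commonly implemented as a nested dictionary keyed by language code and metric name. The sketch below shows that general pattern with a fallback to English. It is purely illustrative: the `TRANSLATIONS` table, the `tooltip` helper, and the fallback behavior are assumptions and are not taken from f2a's `i18n.py`.

```python
# Hypothetical translation table: language code -> metric key -> tooltip text.
TRANSLATIONS = {
    "en": {"mean": "Arithmetic mean of the column", "std": "Standard deviation"},
    "de": {"mean": "Arithmetisches Mittel der Spalte"},
}


def tooltip(metric: str, lang: str, default_lang: str = "en") -> str:
    """Look up a tooltip, falling back to the default language, then to the key."""
    for code in (lang, default_lang):
        text = TRANSLATIONS.get(code, {}).get(metric)
        if text is not None:
            return text
    return metric


print(tooltip("mean", "de"))   # translated entry exists
print(tooltip("std", "de"))    # missing in "de", falls back to English
```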

Requirements

- Python ≥ 3.10
- Core: pandas, numpy, matplotlib, seaborn, scipy, scikit-learn
- Formats: datasets (HuggingFace), openpyxl, pyarrow, pyreadstat, tables, odfpy, lxml, duckdb
- UI: rich, jinja2
- Optional: networkx, umap-learn, statsmodels (install with pip install "f2a[advanced]")

Development

# Clone and install
git clone https://github.com/CocoRoF/f2a.git
cd f2a
pip install -e ".[dev]"

# Run tests (88 tests)
pytest git_action/tests/ -q

# Lint
ruff check f2a/

License

MIT License — See LICENSE for details.

Project details

Maintainers: CocoRoF

Release history

Version	Released
1.1.0	Mar 18, 2026
1.0.3	Mar 17, 2026
0.2.0 (this version)	Mar 17, 2026
0.1.4	Mar 16, 2026
0.1.3	Mar 16, 2026
0.1.2	Mar 16, 2026
0.1.1	Mar 13, 2026
0.1.0	Mar 13, 2026

Download files

Source Distribution

f2a-0.2.0.tar.gz (2.4 MB)

Uploaded Mar 17, 2026 Source

Built Distribution

f2a-0.2.0-py3-none-any.whl (246.3 kB)

Uploaded Mar 17, 2026 Python 3

File details

Details for the file f2a-0.2.0.tar.gz.

File metadata

Download URL: f2a-0.2.0.tar.gz
Upload date: Mar 17, 2026
Size: 2.4 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for f2a-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`f32fb0169ae586779afb4648825ea68c408c5e4d1ba9f684977bab010debbc5c`
MD5	`73c5109f482fccb742d5c2f47ae826e4`
BLAKE2b-256	`0c0bcbd0e1fd416eddde32181975b184e3611656a8bef0b8f805d77b0b265345`
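
A downloaded archive can be checked against the SHA256 digest listed above with Python's `hashlib`. The sketch below is a generic example; the helper names are hypothetical, and the file path in the usage comment assumes the archive was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with a published hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming f2a-0.2.0.tar.gz is in the current directory:
# ok = matches_published_hash(
#     Path("f2a-0.2.0.tar.gz"),
#     "f32fb0169ae586779afb4648825ea68c408c5e4d1ba9f684977bab010debbc5c",
# )
```

pip can enforce the same check at install time through hash-pinned requirements files (`--require-hashes`).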

Provenance

The following attestation bundles were made for f2a-0.2.0.tar.gz:

Publisher: publish.yml on CocoRoF/f2a

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: f2a-0.2.0.tar.gz
- Subject digest: f32fb0169ae586779afb4648825ea68c408c5e4d1ba9f684977bab010debbc5c
- Sigstore transparency entry: 1113471732
- Sigstore integration time: Mar 17, 2026
Source repository:
- Permalink: CocoRoF/f2a@7a3f4e9c8bccb0a18696964098d8f9067b37bef8
- Branch / Tag: refs/heads/deploy
- Owner: https://github.com/CocoRoF
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7a3f4e9c8bccb0a18696964098d8f9067b37bef8
- Trigger Event: push

File details

Details for the file f2a-0.2.0-py3-none-any.whl.

File metadata

Download URL: f2a-0.2.0-py3-none-any.whl
Upload date: Mar 17, 2026
Size: 246.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for f2a-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`afec9417cb017656858c42907ae2799ca27be19467b419e4f206f6b731ff560d`
MD5	`6adbec4ea56f22c3d7b575d60912f7a6`
BLAKE2b-256	`d8004e8e080d3f10b51d949ca9b9b8747ee4d6154b0d62e9cc50845f8e592dd9`

Provenance

The following attestation bundles were made for f2a-0.2.0-py3-none-any.whl:

Publisher: publish.yml on CocoRoF/f2a

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: f2a-0.2.0-py3-none-any.whl
- Subject digest: afec9417cb017656858c42907ae2799ca27be19467b419e4f206f6b731ff560d
- Sigstore transparency entry: 1113471741
- Sigstore integration time: Mar 17, 2026
Source repository:
- Permalink: CocoRoF/f2a@7a3f4e9c8bccb0a18696964098d8f9067b37bef8
- Branch / Tag: refs/heads/deploy
- Owner: https://github.com/CocoRoF
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7a3f4e9c8bccb0a18696964098d8f9067b37bef8
- Trigger Event: push
