FactoMineR-compatible multivariate exploratory data analysis for Python

FactoMinePy

⚠️ Experimental — use with caution. This is an independent Python port of the R package FactoMineR. It is not affiliated with or endorsed by the authors of FactoMineR. The port is still pre-release; APIs may change and some option-level features differ from R (see the status table and known-limitations below). Every analytic method is parity-checked against live R, but for production work or published research treat results as preliminary and cross-check against the original R package.

A from-primitives reimplementation in pure NumPy/SciPy/Pandas of the R package FactoMineR for multivariate exploratory data analysis (PCA, CA, MCA, HCPC, dimdesc/catdes/condes).

This package is not a wrapper around R; every method is reimplemented from the published FactoMineR documentation and R source, then validated numerically against R FactoMineR (currently 2.14 on CRAN) via a checked-in fixture harness. R FactoMineR remains the canonical reference implementation; this port aims for byte-identical fixture output and column-by-column schema parity, but is not a drop-in replacement.

Status

Dev release (0.3.0.dev0). Every analytically meaningful R FactoMineR 2.14 method is now live and parity-verified against live R: PCA, CA, MCA, FAMD, the MFA family (MFA / HMFA / DMFA), GPA, HCPC, CaGalt, the regression family (LinearModel / AovSum / RegBest), textual, the predict.* family, reconst, estim_ncp, the dimdesc / catdes / condes / descfreq descriptors, the svd_triplet / tab_disjonctif utilities, and matplotlib + plotly plotting backends. The deterministic methods are numerically parity-verified; GPA is rotation-invariant-verified (R's GPA is stochastic); the plotting backends are structurally verified (plus vertex-exact ellipses). Remaining gaps are at the option level (noted per row below), not whole methods. The supported-methods table below is the source of truth for exactly what works and at what parity bar.

| FactoMineR method | Python equivalent | Live R-parity verified | Notes |
| --- | --- | --- | --- |
| PCA | factominer.PCA | yes | active + supplementary individuals, quanti.sup, quali.sup |
| CA | factominer.CA | yes | symmetric biplot, supplementary rows/columns |
| MCA | factominer.MCA | yes | indicator + Burt methods (both parity-verified); active + supplementary variables (quanti_sup correlations, quali_sup category barycenters with v.test/eta²). Burt is not yet combined with quali_sup |
| HCPC | factominer.HCPC | yes | hierarchical clustering on PCA/CA/MCA, k-means consolidation |
| dimdesc | factominer.dimdesc | yes | quantitative + categorical description per axis |
| catdes | factominer.catdes | yes | Cla/Mod, Mod/Cla, Global, hypergeometric v-test; quanti_var Eta²; per-level quanti with sd in category / Overall sd / n |
| condes | factominer.condes | yes | correlation tests for a continuous target |
| descfreq | factominer.descfreq | yes | describe the rows of a frequency table by their over/under-represented columns (hypergeometric test); the CA analogue of catdes |
| predict.PCA / .MCA / .FAMD / .MFA | factominer.predict | yes | project new (held-out) individuals onto a fitted model: coord, cos2, dist. Parity-verified vs live R for all four model types |
| reconst | factominer.reconst | yes | low-rank reconstruction of the original table from a fitted PCA or CA result (reconst(res, ncp)). MFA reconstruction (all-quanti groups only) not yet exposed |
| estim_ncp | factominer.estim_ncp | yes | estimate the number of PCA dimensions by GCV or the smoothing criterion |
| plot.PCA / .CA / .MCA / .HCPC | factominer.plot.plot() | structural + ellipse | matplotlib backend; factor maps, biplot, scree, contributions, dendrogram, habillage. Confidence/concentration ellipses (coord.ellipse) are vertex-parity-verified against R |
| FAMD | factominer.FAMD | yes | mixed quantitative + qualitative data; active variables + supplementary variables (sup_var: sup-quanti correlations, sup-quali barycenters with v.test/eta², var.coord.sup summary). Supplementary individuals (ind_sup) not yet supported |
| MFA | factominer.MFA | yes | Multiple Factor Analysis: groups of variables (types s/c/n), each normalized by its first eigenvalue. Parity-verified: eig, ind (incl. partial coords coord.partiel), quanti.var, quali.var, the group block (coord/contrib/cos2/dist2/correlation + Lg/RV), partial.axes, and inertia.ratio. Active groups, uniform row weights; supplementary groups and frequency/mixed (f/m) groups are not yet supported |
| HMFA | factominer.HMFA | yes | Hierarchical MFA: nested groups via H (per-level group counts), each level adding a 1/λ₁ normalization. Parity-verified: eig, ind, quanti.var, quali.var, group.coord (one matrix per hierarchy level), and group.canonical. Active groups (types s/c/n), uniform row weights |
| DMFA | factominer.DMFA | yes | Dual MFA: studies how the variable cloud varies across the levels of a grouping factor (num_fact). Parity-verified: eig, ind, var, quanti.sup, the group block (coord/coord.n/cos2, the v_sᵀ Cov_j v_s / λ_s trace), and the per-group cor.dim.gr / var.partiel diagnostics. Supplementary qualitatives not yet supported |
| GPA | factominer.GPA | ⚠️ rotation-invariant | Generalized Procrustes Analysis, including unequal-width configurations. RV / RVs / simi and the PANOVA per-object/per-config sum-of-squares tables are parity-verified exactly; consensus / Xfin (and correlations) match R up to a global rotation/reflection (R's GPA is stochastic). Missing values not yet supported |
| CaGalt | factominer.CaGalt | yes | Correspondence Analysis on Generalised Aggregated Lexical Tables: relates a frequency table Y to contextual covariates X. Parity-verified for type="s"/"c" (quantitative covariates): eig, ind, freq, quanti.var (coord/cor/cos2). type="n" (qualitative covariates, needs a row-weighted MCA) and the bootstrap confidence ellipses are not yet supported |
| LinearModel / AovSum | factominer.LinearModel / factominer.AovSum | yes | linear model with contr.sum (sum-to-zero) contrasts: the Type-III/II ANOVA table (Ftest) and the per-level coefficient table (Ttest), plus r.squared/sigma/fstatistic/aic/bic. Stepwise selection (aic/bic) not yet implemented |
| RegBest | factominer.RegBest | yes | best-subset linear regression: the lowest-RSS subset of each size, with selection by "r2" / "Cp" / "adjr2". Predictors must be numeric |
| textual | factominer.textual | yes | tokenize a free-text column into a document × word contingency table (cont_table) + a word-frequency summary (nb_words); feeds CA / descfreq |
| svd.triplet / tab.disjonctif | factominer.svd_triplet / factominer.tab_disjonctif | yes | the row/column-weighted SVD primitive and the disjunctive (one-hot) coder, exposed as standalone utilities |
| Plotly backend | factominer.plot.plot(..., backend="plotly") | structural | mirrors the matplotlib surface (ind/var/biplot/scree/contrib, CA/MCA maps, HCPC factor map + dendrogram); shares the _data geometry layer. Needs pip install 'factominer[plotly]' |

Every analytic FactoMineR method in scope is now live and parity-verified; no methods remain stubbed. Remaining gaps are at the option level (noted per row above) rather than whole methods — see ROADMAP.md.
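To make the two utility rows in the table more concrete, here is a minimal NumPy/pandas sketch of what a disjunctive coder and a row/column-weighted SVD compute. The function names `disjunctive_table` and `weighted_svd` are hypothetical and written for illustration only; they are not the signatures of factominer.tab_disjonctif or factominer.svd_triplet, which may differ.

```python
import numpy as np
import pandas as pd


def disjunctive_table(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot (complete disjunctive) coding: one 0/1 column per category level."""
    return pd.get_dummies(df, prefix_sep="=", dtype=float)


def weighted_svd(X, row_w, col_w, ncp=None):
    """Generalized SVD of X under row weights row_w and column weights col_w.

    Returns (s, U, V) such that X = U @ diag(s) @ V.T, with
    U.T @ diag(row_w) @ U = I and V.T @ diag(col_w) @ V = I.
    """
    X = np.asarray(X, dtype=float)
    row_w = np.asarray(row_w, dtype=float)
    col_w = np.asarray(col_w, dtype=float)
    # Fold the square roots of both weight vectors into X, then run a plain SVD.
    Xw = X * np.sqrt(row_w)[:, None] * np.sqrt(col_w)[None, :]
    U0, s, V0t = np.linalg.svd(Xw, full_matrices=False)
    ncp = len(s) if ncp is None else min(ncp, len(s))
    # Undo the scaling so the vectors are orthonormal in the weighted metrics.
    U = U0[:, :ncp] / np.sqrt(row_w)[:, None]
    V = V0t[:ncp].T / np.sqrt(col_w)[:, None]
    return s[:ncp], U, V


# Toy data, for illustration only.
toy = pd.DataFrame({"a": ["x", "y", "x"], "b": ["u", "u", "v"]})
Z = disjunctive_table(toy)  # 3 rows, 4 indicator columns; each row sums to 2

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
row_w = np.full(6, 1 / 6)            # uniform row weights
col_w = np.array([1.0, 2.0, 0.5])    # arbitrary column weights
s, U, V = weighted_svd(X, row_w, col_w)
```

Folding the weights into the matrix before the decomposition keeps the sketch to a single call to `np.linalg.svd`; the squared singular values then play the role of eigenvalues in the weighted analysis.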

Install

pip install factominer
# matplotlib backend ships by default; for the optional plotly backend:
pip install 'factominer[plotly]'

Quickstart

from factominer import PCA, HCPC, dimdesc
from factominer.datasets import load_decathlon

decathlon = load_decathlon()
res = PCA(decathlon, scale_unit=True, ncp=5,
          quanti_sup=["Rank", "Points"],
          quali_sup=["Competition"])

print(res.summary())
print(res.eig)             # eigenvalue table (DataFrame)
print(res.ind.coord)       # individual coordinates
print(res.var.contrib)     # variable contributions

# Describe each axis
desc = dimdesc(res, axes=[0, 1])
print(desc[0]["quanti"])

# Cluster on the principal components
clust = HCPC(res, nb_clust=3)
print(clust.data_clust.head())

# Plot
import matplotlib.pyplot as plt
from factominer.plot import plot
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
plot(res, choix="ind", habillage="Competition", ax=ax[0])
plot(res, choix="var", ax=ax[1])
plt.show()

Migrating from R

See docs/migrating-from-r.md for a side-by-side cheat sheet (R call → Python call → result attribute mapping → semantic differences).

The most important semantic differences:

  1. Argument names use snake_case. scale.unit=TRUE → scale_unit=True, quanti.sup=11:12 → quanti_sup=[10, 11] (column names like "Rank" work too).
  2. Indices are 0-based. ind.sup=1:3 (R) → ind_sup=[0, 1, 2] (Python).
  3. Sign convention. SVD is sign-ambiguous; we apply a deterministic rule (first absolute-max coordinate of each axis is positive). Coordinates may differ from R by a sign; the interpretation (clusters, distances, contributions) is identical. See factominer._sign.
  4. Result objects. res$eig (R) → res.eig (Python). res$var$coord → res.var.coord. All result tables are pandas.DataFrame.
  5. Plotting is explicit. graph=TRUE does not exist; you call factominer.plot.plot(res, ...) yourself. No magic on print(res).
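The sign rule in item 3 can be sketched as follows. This is an illustrative reading of "first absolute-max coordinate of each axis is positive", not the code in factominer._sign; the function name `fix_axis_signs` is hypothetical, and which table the package applies the rule to is an assumption here.

```python
import numpy as np


def fix_axis_signs(coords):
    """Flip each axis (column) so its first absolute-max entry is positive.

    Returns the flipped coordinates and the vector of signs that was applied.
    """
    coords = np.asarray(coords, dtype=float)
    # np.argmax returns the first index of the maximum, so ties go to the first row.
    idx = np.argmax(np.abs(coords), axis=0)
    signs = np.sign(coords[idx, np.arange(coords.shape[1])])
    signs[signs == 0] = 1.0  # an all-zero axis is left untouched
    return coords * signs, signs


coords = np.array([[1.0, -3.0],
                   [-2.0, 3.0]])
fixed, signs = fix_axis_signs(coords)
# Axis 0: largest |value| is -2 (row 1), so the axis is flipped.
# Axis 1: |-3| and |3| tie; the first one (row 0) is negative, so the axis is flipped.
```

Whatever rule is used, the same sign vector has to be applied to every table that shares the axis (individuals, variables, supplementary elements), otherwise the biplot geometry would no longer be consistent.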

Numerical fidelity

For every live method, the package ships parity tests that assert column-by-column equivalence against R FactoMineR 2.14 (current CRAN) within tight tolerances:

  • Eigenvalues to 1e-10 absolute
  • Coordinates / cos² / correlations / eta² to 1e-9 after sign alignment (active blocks; supplementary blocks to 1e-7)
  • Contributions to 1e-8
  • v-tests to 1e-6
  • p-values to 1e-5 relative
  • GPA: RV / RVs / simi to 1e-6; consensus / Xfin matched as rotation-invariant inter-object distances
  • HCPC partitions to ARI ≥ 0.999 (k-means consolidation can swap a couple of individuals)
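A check of the kind listed above might look like the sketch below: align the sign of each axis against the R reference, then compare with absolute tolerances. The helper names `align_signs` and `assert_parity` are hypothetical, and the arrays are synthetic; the package's real test harness may be organised differently.

```python
import numpy as np


def align_signs(py, r):
    """Flip each column of py whose dot product with the same column of r is negative."""
    py = np.asarray(py, dtype=float)
    r = np.asarray(r, dtype=float)
    flips = np.where(np.sum(py * r, axis=0) < 0, -1.0, 1.0)
    return py * flips


def assert_parity(py_eig, r_eig, py_coord, r_coord):
    """Raise AssertionError unless eigenvalues and coordinates agree within tolerance."""
    # Eigenvalues: 1e-10 absolute, no relative slack.
    np.testing.assert_allclose(py_eig, r_eig, rtol=0, atol=1e-10)
    # Coordinates: 1e-9 absolute after per-axis sign alignment.
    np.testing.assert_allclose(align_signs(py_coord, r_coord), r_coord,
                               rtol=0, atol=1e-9)


# Synthetic stand-ins for a Python result and an R fixture (not real outputs).
rng = np.random.default_rng(1)
r_coord = rng.normal(size=(5, 2))
py_coord = r_coord * np.array([-1.0, 1.0]) + 1e-12  # axis 0 flipped, tiny noise
r_eig = np.array([2.5, 1.5])
py_eig = r_eig + 1e-13

assert_parity(py_eig, r_eig, py_coord, r_coord)  # passes silently
```

Setting `rtol=0` makes the comparison purely absolute, which matches how the tolerances in the list are stated.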

Fixtures are JSON dumps of R FactoMineR results, generated by tools/refresh_r_fixtures.R and committed under tests/fixtures/r_outputs/. The Python tests load them without needing R at test time. Every fixture in the repo is byte-identical to what live R FactoMineR 2.14 emits on a Linux GitHub runner with R 4.6.0 (verified by the rpy2-parity CI job, which is triggerable on-demand via workflow_dispatch and runs on a weekly cron).

To regenerate fixtures locally (requires R + FactoMineR + jsonlite):

Rscript tools/refresh_r_fixtures.R
pytest -q
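For orientation, reading one table out of a JSON fixture without R could look like the sketch below. The fixture layout (a dict of tables with "index", "columns" and "data" keys), the file name, and the helper `load_fixture_table` are all assumptions made for this example; the actual files under tests/fixtures/r_outputs/ may be structured differently.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_fixture_table(path, key):
    """Read one table from a JSON fixture stored in a split layout (assumed)."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    table = payload[key]
    return pd.DataFrame(table["data"], index=table["index"],
                        columns=table["columns"])


# Toy fixture with made-up numbers, written to a temporary directory.
toy_fixture = {
    "eig": {
        "index": ["comp 1", "comp 2"],
        "columns": ["eigenvalue", "percentage of variance"],
        "data": [[3.0, 75.0], [1.0, 25.0]],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "pca_demo.json"  # hypothetical file name
    path.write_text(json.dumps(toy_fixture), encoding="utf-8")
    eig = load_fixture_table(path, "eig")
```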

Known limitations / use with caution

This port targets the most common FactoMineR API surface and is rigorously validated on the bundled datasets, but the following caveats apply:

  • Complete data only — no missing-value handling. R's iterative imputation / NA-as-category paths (PCA / CA / MCA / GPA missing values) are not implemented; pass complete data.
  • Remaining gaps are at the option level, not whole methods: FAMD supplementary individuals (ind_sup; sup_var is supported); MCA method="Burt" combined with quali_sup; MFA reconstruction via reconst; CaGalt qualitative covariates (type="n") and its bootstrap confidence ellipses; LinearModel Type-II SS and AIC/BIC stepwise selection, and meansComp (which would need an emmeans/multcompView port); simule (stochastic) and write.infile (text I/O). These are documented per row in the status table.
  • GPA parity is rotation-invariant, and the port is deterministic. R's GPA is stochastic (random multi-start + random rank-deficient basis completion), so its consensus / Xfin / PANOVA are reproducible only up to a global rotation/reflection and the converged optimum — an inherent gauge freedom of Procrustes analysis (R's GPA is not even reproducible run-to-run with set.seed). The port implements the deterministic single-start core; RV / RVs / simi (from the raw configurations) match R exactly, consensus / Xfin match R's inter-object distances, and PANOVA matches at a stochastic tolerance. Unequal-width configurations are supported; missing values are not.
  • Parity is empirical, not exhaustive. Every analytic method is checked column-by-column against freshly-generated live R FactoMineR 2.14 output (via a CI rpy2 job) on the bundled fixtures. Plots are verified structurally, not pixel-by-pixel.
  • Sign of axes is arbitrary. SVD is sign-ambiguous; we apply a deterministic rule that may give the opposite sign from R on a given axis. Distances, clusters, contributions, and cos² are sign-invariant; coordinates may need a flip to align visually with R output.
  • HCPC partitions can differ by one or two individuals. K-means consolidation is sensitive to initialization; the adjusted Rand index against R is ≥ 0.999 on the decathlon test fixture but not exactly 1.0.
  • Plot parity is structural, not pixel-exact. Both backends are verified to produce the expected traces/artists and the R-faithful coord.ellipse geometry, but not pixel-identical images. The plotly backend mirrors the matplotlib surface and shares the same data layer.
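Two of the caveats above rely on comparisons that ignore an arbitrary gauge: inter-object distances for GPA (unchanged by rotation, reflection, or translation) and the adjusted Rand index for HCPC (unchanged by relabelling clusters). The sketch below illustrates both with synthetic data; the function names are hypothetical and this is not the package's test code.

```python
from math import comb

import numpy as np
from scipy.spatial.distance import pdist


def same_up_to_rotation(A, B, atol=1e-8):
    """True when two configurations have the same pairwise inter-object distances."""
    return np.allclose(pdist(np.asarray(A, dtype=float)),
                       pdist(np.asarray(B, dtype=float)), atol=atol)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same individuals."""
    _, ia = np.unique(np.asarray(labels_a), return_inverse=True)
    _, ib = np.unique(np.asarray(labels_b), return_inverse=True)
    ia, ib = ia.ravel(), ib.ravel()
    # Contingency table: individuals counted by (cluster in a, cluster in b).
    table = np.zeros((ia.max() + 1, ib.max() + 1), dtype=int)
    np.add.at(table, (ia, ib), 1)
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(ia), 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# A configuration and a rotated + reflected copy of it.
rng = np.random.default_rng(2)
A = rng.normal(size=(5, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
B = (A @ rot.T) * np.array([1.0, -1.0])

rotation_ok = same_up_to_rotation(A, B)                        # True
ari_relabelled = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # 1.0
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])     # -0.5
```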

For production analyses, journal submissions, or any use where reproducibility against R FactoMineR is load-bearing, cross-check results against the original R package.

Datasets

Bundled datasets under factominer.datasets:

| Loader | Source | Use case |
| --- | --- | --- |
| load_decathlon() | IAAF 2004 Athens Olympic + Décastar 2004, re-derived from public results | PCA, dimdesc, HCPC |
| load_children() | FactoMineR's children (children's worries by socio-educational category) | CA |
| load_tea() | FactoMineR's tea (300-person tea-consumption survey) | MCA, catdes |
| load_poison() | FactoMineR's poison (food-poisoning outbreak survey) | FAMD, mixed quantitative + categorical |

See factominer/datasets/data/PROVENANCE.md for each dataset's origin and licensing notes.

Contributing

See CONTRIBUTING.md for dev setup, parity-bar expectations, and the PR / issue workflow. Bug reports and feature requests are welcome — please use the issue templates so we have the reproducer / R-side context up front. For security issues, see SECURITY.md and email hello@aigora.com rather than filing a public issue.

Citing

If you use FactoMinePy in published work, please cite both this package and the original R FactoMineR (Lê, Josse, Husson, J. Stat. Softw. 2008, doi:10.18637/jss.v025.i01). A CITATION.cff is included for tools that consume it automatically.

License

MIT for code. Bundled datasets carry their original licensing — see factominer/datasets/data/PROVENANCE.md. The package does not redistribute R FactoMineR source (GPL); everything is reimplemented from the published documentation and validated against R outputs.

Acknowledgments

  • The R FactoMineR package by Sébastien Lê, Julie Josse, François Husson (and many contributors) defines the API surface this package targets.
  • factoextra for the visualization patterns that the matplotlib backend reproduces.
  • scientisttools and prince for prior Python ports that informed the API shape.
