Calculate critical bandwidth for bimodal distributions in kernel density estimation
Project description
pola — Critical Bandwidth for Bimodal Distributions
pola is the package name for this release. critband is the next rename we are preparing for after this final maintenance pass, including the public GitHub page.
Release target: v0.1.2 on GitHub and PyPI under pola.
pola is a Python package for detecting whether a distribution is meaningfully bimodal using the critical bandwidth method in kernel density estimation (KDE).
It finds the smallest bandwidth where a KDE transitions from bimodal to unimodal — a well-established statistical test for modality.
Quick Start
pip install pola
import numpy as np
from pola import critical_bandwidth
# Generate bimodal data
x = np.concatenate([np.random.normal(-2, 0.5, 200),
np.random.normal( 2, 0.5, 200)])
h_crit, success = critical_bandwidth(x)
print(f"Critical bandwidth: {h_crit:.4f}")
print(f"Converged: {success}")
# Output: Critical bandwidth: ~0.97
# Converged: True
Features
Core Algorithm
| Function | Description |
|---|---|
silverman_bandwidth(x) |
Silverman's rule-of-thumb bandwidth: 1.06 · min(σ, IQR/1.34) · n^(-1/5) |
gaussian_kde(x, grid, h, use_fft=None) |
KDE via direct O(n·g) or FFT O(g·log(g)) computation; auto-selects FFT for n > 5000 |
critical_bandwidth(x, k=2, return_ci=False) |
Critical bandwidth for k-mode detection; auto/binary/brent methods; optionally returns bootstrap CI |
find_modes(x, h) |
Detect all KDE modes at bandwidth h; returns ModeResult with per-mode position, height, width, prominence |
find_trough(x, h) |
Find valley between the two highest KDE peaks; returns x-coordinate or None |
detect_components(x) |
Decompose bimodal distribution into two Gaussian components; returns BimodalDecomposition |
bimodality_strength(x) |
Comprehensive bimodality assessment; returns interpretable strength label + score in BimodalityStrength |
silverman_test(x) |
Silverman's bootstrap test for bimodality; returns SilvermanTestResult dataclass |
excess_mass(x) |
Müller & Sawitzki excess mass test for multimodality; detects any number of modes, returns ExcessMassResult |
The critical_bandwidth function uses automatic method selection: Brent's method for well-separated data (converges in fewer iterations), binary search for weak/small-n cases. All three methods (auto, binary, brent) produce consistent results. Hartigan's dip test is included as a complementary unimodality check, not as a core differentiator of the package.
Multi-Format Data Loading
Read data from 9 file formats without learning separate tools:
from pola.io import read_data
# Auto-detect — just point at any supported file
x = read_data("measurements.csv")
x = read_data("survey.xlsx", sheet="Sheet1")
x = read_data("results.pdf", column="Score")
y = read_data("report.docx", column=0, return_all=False)
| Format | Extension | Library |
|---|---|---|
| CSV | .csv |
Built-in csv (zero dependencies) |
| TSV / TXT | .tsv, .txt |
Built-in (zero dependencies) |
| JSON | .json |
Built-in json (zero dependencies) |
| Markdown (tables) | .md |
Built-in parser (zero dependencies) |
| HTML (tables) | .html, .htm |
Built-in html.parser (zero dependencies) |
| Excel | .xlsx |
openpyxl |
| Excel (legacy) | .xls |
xlrd |
| Word (tables) | .docx |
python-docx |
| PDF (tables) | .pdf |
pdfplumber |
All dependencies are declared in pyproject.toml — pip install pola installs everything automatically. No system-level tools required.
Column / Sheet Selection
from pola.io import read_data, read_buffer
# By sheet name (Excel, PDF, Markdown multi-section)
x = read_data("data.xlsx", sheet="Measurements")
# By index
x = read_data("data.xlsx", sheet=0)
# By column name
x = read_data("data.csv", column="Value")
# By column index
x = read_data("data.csv", column=1)
# Get all numerical columns as a dict
cols = read_data("data.csv", return_all=True)
# cols = {"x": array([...]), "y": array([...])}
Buffer API
Works with in-memory file buffers for pipeline and service integration:
from pola.io import read_buffer
# From a BytesIO object
buf = io.BytesIO(uploaded_file.read())
x = read_buffer(buf, filename="data.csv")
Methodology
Critical Bandwidth
The critical bandwidth is the smallest bandwidth $h_{\text{crit}}$ at which the kernel density estimate $\hat{f}(x; h)$ transitions from being bimodal to unimodal. When $h < h_{\text{crit}}$ the estimate has two modes; for $h \ge h_{\text{crit}}$ it is unimodal. A large critical bandwidth relative to the data scale indicates strong bimodality — a wide smoothing window is needed to merge the two modes.
Silverman's Rule of Thumb
Silverman (1986) proposed a reference bandwidth for Gaussian KDE:
$$h = 1.06 \cdot \min(\sigma, \text{IQR} / 1.34) \cdot n^{-1/5}$$
This is used as the baseline for the critical bandwidth search: the solver searches for $h_{\text{crit}}$ in the interval $[h/20, 10h]$, which captures the transition for typical bimodal distributions.
Critical Bandwidth Solver
The critical bandwidth is found via binary search on KDE mode counts (for any k). The solver:
- Automatically computes bounds from Silverman's rule: $[h_{\text{silverman}}/20, 10 \cdot h_{\text{silverman}}]$
- Uses
method="auto"(default): Brent's method for large, well-separated data (fewer iterations); binary search for weak/small-n cases - For k > 2,
critical_bandwidth(x, k=3)finds the bandwidth where trimodality disappears, etc. - Optionally returns a bootstrap confidence interval via
return_ci=True
The dip_ratio (via _trough_ratio) is available as a descriptive bimodality strength measure but is not used as the optimizer objective. Hartigan's dip test is included as a complementary unimodality check, not as a core differentiator of the package.
Excess Mass Test
The excess_mass function implements the Müller & Sawitzki (1991) test for multimodality. It measures the amount of probability mass above threshold levels in a KDE, and uses bootstrap calibration to estimate the number of modes — unlike the Silverman test, it can detect any number of modes in a single call.
Related Work
- Silverman (1981) — The critical bandwidth test for multimodality in kernel density estimates
- Hartigan & Hartigan (1985) — The Dip Test of Unimodality, a complementary non-parametric approach
- Müller & Sawitzki (1991) — Excess mass estimates and tests for multimodality
- Hall & York (2001) — On the calibration of Silverman's test for multimodality, refinements to the original method
References
- Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
- Hartigan, J.A. & Hartigan, P.M. (1985). The Dip Test of Unimodality. The Annals of Statistics, 13(1), 70-84.
- Müller, D.W. & Sawitzki, G. (1991). Excess Mass Estimates and Tests for Multimodality. JASA, 86(415), 738-746.
- Cheng, M.-Y. & Hall, P. (1998). Calibrating the excess mass and dip tests of modality. Journal of the Royal Statistical Society: Series B, 60(3), 579-589.
Installation
# From PyPI
pip install pola
# Or using uv
uv add pola
# Development
git clone https://github.com/ryZhangHason/Polarization-CBW.git
cd Polarization-CBW
uv sync
uv run python -m pytest tests/ -v
Requirements
- Python >= 3.10
- Runtime: numpy, scipy, openpyxl, xlrd, python-docx, pdfplumber
- Test only (not needed at runtime): pytest, pytest-cov, xlwt, reportlab
- Visualization (optional): matplotlib >= 3.5
Why pola?
- Zero-config modality testing: One function call tells you whether your distribution is bimodal, with a well-defined threshold. Or use
bimodality_strength()for an interpretable strength score, orexcess_mass()to estimate how many modes your data has. - k-mode detection:
critical_bandwidth(x, k=3)detects the bandwidth where trimodality disappears — generalize to any k. - Any-file input: 9 formats from a single API. CSV, Excel, PDF, Word, JSON, HTML, Markdown — just point
read_dataat the file and go. - One-command install:
pip install polainstalls everything. No system packages, no manual steps, no Tesseract OCR. - Pure Python dependencies: All 4 additional libraries (openpyxl, xlrd, python-docx, pdfplumber) are pure Python — no compiled extensions.
- Built for production and research: Works in CLI scripts and Jupyter notebooks equally well.
Example: Bimodality Test
import numpy as np
from pola import critical_bandwidth, bimodality_strength, excess_mass
# Unimodal data — the function tells you it's already unimodal
unimodal = np.random.normal(0, 1, 500)
h_crit, success = critical_bandwidth(unimodal)
print(f"Unimodal: h_crit={h_crit:.4f}, converged={success}")
# → success may be False (already unimodal at minimum bandwidth)
# Bimodal data — full pipeline
bimodal = np.concatenate([
np.random.normal(-3, 0.5, 300),
np.random.normal( 3, 0.5, 300)
])
h_crit, success = critical_bandwidth(bimodal)
print(f"Bimodal: h_crit={h_crit:.4f}, converged={success}")
# Interpretable strength assessment
s = bimodality_strength(bimodal)
print(f"Strength: {s.strength} (score={s.strength_score:.2f})")
# Excess mass test — detects number of modes
e = excess_mass(bimodal, n_boot=199)
print(f"Estimated modes: {e.n_modes_estimated}")
# Trimodal data with k-mode detection
trimodal = np.concatenate([
np.random.normal(-4, 0.3, 150),
np.random.normal( 0, 0.3, 150),
np.random.normal( 4, 0.3, 150),
])
h_k3, ok = critical_bandwidth(trimodal, k=3)
print(f"Trimodal (k=3): h_crit={h_k3:.4f}")
Visualization Examples
Generate 3 matplotlib figures that illustrate the core algorithms:
uv sync --group viz
uv run python examples/bandwidth_demo.py
The demo uses well-separated bimodal data from the benchmark suite and produces:
kde_bandwidth_sweep.png— 2×2 subplots showing KDE at 4 bandwidths (too small, Silverman, critical, too large), with peaks and troughs marked.critical_bandwidth_transition.png— Three overlaid KDE curves: bimodal (h_crit × 0.85), critical (h_crit), and unimodal (h_crit × 1.15), with the trough position annotated.component_decomposition.png— Histogram, KDE, detected Gaussian components (scaled by weight), and the trough-based separation point.
matplotlib is an optional dependency — the core package does not require it.
Jupyter Notebook Demo
An interactive Jupyter Notebook is available for hands-on exploration:
uv sync --group viz --group jupyter
uv run jupyter notebook examples/bandwidth_demo.ipynb
The notebook contains 10 cells that walk through the same three visualizations with explanatory markdown, and includes an extension exercise that compares dip ratios across all 7 benchmark cases.
Performance Benchmarking
Measure execution time of core functions across data sizes:
uv run python examples/benchmark_performance.py
Customize the number of runs and data sizes:
uv run python examples/benchmark_performance.py --runs 10 --sizes 100 500 2000
uv run python examples/benchmark_performance.py --output results.csv
Phase 3 — R Package Comparison
Compares pola against R's multimode, diptest, and ks packages across 12 benchmark cases.
Feature Comparison
| Feature | pola | multimode | diptest |
|---|---|---|---|
| Critical bandwidth | ✅ ($k \ge 2$) | ✅ ($k = 2$ only) | ❌ |
| k-mode detection | ✅ (any $k$) | ❌ | ❌ |
| Bimodality strength | ✅ (interpretable) | ❌ | ❌ |
| Excess mass test | ✅ | ✅ | ❌ |
| Silverman's bootstrap test | ✅ | ✅ | ❌ |
| Hartigan's dip test | ✅ | ❌ | ✅ |
| Component decomposition | ✅ | ❌ | ❌ |
| 9-format I/O | ✅ | ❌ | ❌ |
| Dependencies | Pure Python | R + compiled | R + compiled |
Quantitative Results ($h_{\text{crit}}$ Accuracy)
| Case | $n$ | pola $h_{\text{crit}}$ | R modetest $p$ |
R dip.test $p$ |
Agreement |
|---|---|---|---|---|---|
| Well-separated | 400 | 1.8650 | 0.000 | 0.000 | ✅ Both detect bimodality |
| Moderate separation | 500 | 1.0964 | 0.000 | 0.000 | ✅ Both detect bimodality |
| Barely separated | 600 | 0.2791 | 0.251 | 0.709 | ✅ pola flags weak bimodality, R agrees (n.s.) |
| Unequal variance | 400 | 1.7849 | 0.000 | 0.000 | ✅ Both detect bimodality |
| Unequal weights | 500 | 1.2591 | 0.000 | 0.000 | ✅ pola correct; R yields spurious 772 modes |
| Extreme separation | 400 | 4.6987 | 0.000 | 0.000 | ✅ Both detect bimodality |
| Trimodal | 450 | 1.3824 | 0.000 | 0.000 | ✅ pola finds $k=3$; R detects multimodality |
| Skewed bimodal | 500 | 1.1417 | 0.000 | 0.000 | ✅ Both detect bimodality |
| Heavy-tailed bimodal | 400 | 2.7109 | 0.000 | 0.000 | ✅ Both detect bimodality |
| Near unimodal | 600 | 0.4186 | 0.055 | 0.285 | ✅ pola flags weak; R agrees (n.s.) |
| Small sample bimodal | 60 | 1.8608 | 0.000 | 0.000 | ✅ Both detect bimodality (small $n$) |
| Overlapping variances | 500 | 0.4598 | 0.045 | 0.322 | ✅ pola flags weak; R agrees (n.s.) |
pola achieves <0.5% mean absolute relative error vs high-precision reference values across all 12 cases.
Performance Comparison (median runtime per case)
| Operation | pola | R (multimode/diptest) |
Comparison |
|---|---|---|---|
critical_bandwidth |
0.04–0.79 s | 0.82–1.58 s | 3–10× difference |
find_modes (mode counting) |
<0.01 s | 0.03–0.05 s | Comparable |
dip_test |
~0.002 s | ~0.002 s | Comparable |
The runtime difference reflects pola's pure-Python numerical stack (NumPy/SciPy) and adaptive grid sizing, compared to R's compiled package dispatch overhead with bootstrap calibration (modetest(B = 199)).
License
Apache-2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pola-0.1.2.tar.gz.
File metadata
- Download URL: pola-0.1.2.tar.gz
- Upload date:
- Size: 88.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9cc6c385f79739d5dfd2982eb6697916ed81d43dfe711953474de2d1a001c752
|
|
| MD5 |
c877a94c6bf03e66dfa1bf43860f3e86
|
|
| BLAKE2b-256 |
01c75950cf66923eca7bfe3161ef8f5d1b48e8bf7ac4c8509b13558db21db23f
|
File details
Details for the file pola-0.1.2-py3-none-any.whl.
File metadata
- Download URL: pola-0.1.2-py3-none-any.whl
- Upload date:
- Size: 72.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a29ae5b73394463d07750e2c8b865de5c79ece46bee915e612938d1de5e84e0f
|
|
| MD5 |
61d4e982017f7fc079dd794b7586ee7f
|
|
| BLAKE2b-256 |
02aae537deef76776cb7c7babe4dedc2e691c3e10b2041c6069da90f73d31cab
|