Probabilistic models over linguistic typology source data (WALS, Grambank, ...) with pluggable count-to-probability estimators.
typola
- Live demo (web UI): https://thorwhalen-typola.hf.space/
- PyPI: https://pypi.org/project/typola/
The core idea is separation of concerns:
- Data prep — acquire and canonicalize. Raw CLDF datasets → pandas DataFrames, via one generic loader that works for WALS, Grambank, APiCS, and any other CLDF StructureDataset.
- Count-to-probability estimators — first-class, pluggable, comparable. MLE, Laplace, Jeffreys, Dirichlet-Multinomial, empirical-Bayes, mixtures — pick one, configure it, or plug in your own. Same API.
- Probabilistic models — Marginal P(parameter | condition) and Conditional P(target | given), built from counts + an estimator.
- Query / drill-down — one entry point (query) plus a few small utilities (compare_estimators, cross_validate_estimators, rank_associations, compare_conditions) for actually interrogating the model.
You can use any layer on its own. The prep layer is just pandas — no probabilistic-model imports required.
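To make the count-to-probability step concrete, here is a minimal NumPy sketch of additive smoothing, the family that MLE, Laplace and Jeffreys belong to. It does not use typola; the function name `additive_smoothing` and the counts are made up for illustration, and typola's own estimators may differ in detail.

```python
import numpy as np


def additive_smoothing(counts, alpha=0.0):
    """Turn a vector of category counts into probabilities.

    alpha=0 is the maximum-likelihood estimate, alpha=1 is Laplace
    smoothing, and alpha=0.5 corresponds to the Jeffreys prior.
    """
    counts = np.asarray(counts, dtype=float)
    smoothed = counts + alpha
    return smoothed / smoothed.sum()


# Hypothetical counts for a four-valued feature; the last value is unobserved.
counts = [6, 3, 1, 0]

mle = additive_smoothing(counts, alpha=0.0)       # unobserved value gets probability 0
laplace = additive_smoothing(counts, alpha=1.0)   # one pseudo-count per value
jeffreys = additive_smoothing(counts, alpha=0.5)  # half a pseudo-count per value

for label, probs in [("mle", mle), ("laplace", laplace), ("jeffreys", jeffreys)]:
    print(label, np.round(probs, 4))
```

The only thing that changes between the three is the pseudo-count, which is why it is natural to treat them as interchangeable objects behind one API.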
Install
pip install -e .
# with optional bayesian extras (ba + spyn):
pip install -e '.[bayesian]'
# with web UI backend:
pip install -e '.[web]'
Web UI
A React probability-console frontend (zodal + shadcn) ships in webapp/:
# 1. Start the API (from repo root):
python -m webapp.api.main
# 2. Start the UI:
cd webapp/ui && npm install && npm run dev
# open http://127.0.0.1:5173
See webapp/README.md for details.
60-second tour
from typola import load, query, estimators
from typola.query import compare_estimators, cross_validate_estimators, rank_associations
# 1. Load a typology. Downloaded & cached on first call.
wals = load("wals")
# Typology(name='wals', n_languages=3573, n_parameters=192, n_codes=1143, n_values=76475)
# 2. P(Order of Subject, Object and Verb) globally (WALS feature 81A), with Jeffreys smoothing.
d = query(wals, target="81A", estimator=estimators.jeffreys())
d.top_k(4)
# name count probability
# 81A-1 SOV 564 0.409206
# 81A-2 SVO 488 0.354114
# 81A-7 No dominant 189 0.137369
# 81A-3 VSO 95 0.069228
# 3. Condition on language metadata.
query(wals, target="81A", condition={"Family": "Niger-Congo"}).top_k(3)
# name count probability
# 81A-2 SVO 277 0.911
# 81A-7 No dominant 20 0.066
# 81A-1 SOV 4 0.013
# 4. Full conditional P(target | given) — a CPT.
cpt = query(wals, target="83A", given="81A", estimator=estimators.laplace(0.5))
cpt.as_matrix() # DataFrame, rows sum to 1
cpt.p_given("81A-2") # row distribution of 83A when basic word order (81A) is SVO
cpt.mutual_information() # bits
# 5. Compare estimators on the same question.
compare_estimators(
    wals, target="81A",
    condition={"Family": "Austronesian"},
    estimators=[
        estimators.mle(),
        estimators.jeffreys(),
        estimators.empirical_bayes(wals.counts("81A").values, strength=20),
    ],
)
# 6. Actually test which estimator is best — cross-validated log-likelihood.
cross_validate_estimators(
    wals, target="81A",
    estimators=[
        estimators.mle(),
        estimators.laplace(0.1),
        estimators.laplace(0.5),
        estimators.laplace(1.0),
        estimators.empirical_bayes(wals.counts("81A").values, strength=20),
    ],
    n_folds=5, random_state=0,
    condition={"Family": "Austronesian"},
)
# log_likelihood perplexity
# laplace(alpha=1.0) -44.2886 3.7548
# laplace(alpha=0.5) -44.3577 3.7648
# laplace(alpha=0.1) -44.8365 3.8244
# empirical_bayes(global_counts=..., strength=20.0) -45.3420 3.8896
# mle() -52.9648 5.2109
# 7. Drill down: which parameters are most informative about basic word order (81A)?
rank_associations(wals, target="81A", top_k=5, estimator=estimators.laplace(0.5))
# parameter_id parameter_name mutual_information n_languages
# 0 83A Order of Object and Verb 1.06 1368
# 1 84A Order of Object, Oblique, and Verb 0.99 486
# 2 97A Rel. between OV and AdjN 0.97 1190
# 3 95A Rel. between OV and AdpN 0.96 1039
# 4 96A Rel. between OV and RelN 0.91 807
Run python misc/example_bakeoff.py for the same flow in full.
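For readers who want to see what steps 4 and 7 compute, the following is a rough NumPy sketch of a conditional probability table and of mutual information in bits. It is not typola's implementation: the helper names and the 2×2 counts are hypothetical, and the mutual information here is the plain plug-in estimate from raw counts, whereas typola may compute it from smoothed probabilities.

```python
import numpy as np


def conditional_table(joint_counts, alpha=0.5):
    """P(target | given) from a joint count table.

    Rows index values of `given`, columns index values of `target`.
    Each row is smoothed additively and normalized, so rows sum to 1.
    """
    smoothed = np.asarray(joint_counts, dtype=float) + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)


def mutual_information_bits(joint_counts):
    """Plug-in estimate of I(given; target) in bits, from raw joint counts."""
    joint = np.asarray(joint_counts, dtype=float)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of `given`, shape (rows, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of `target`, shape (1, cols)
    nonzero = p_xy > 0                     # 0 * log(0) is taken to be 0
    ratio = p_xy[nonzero] / (p_x @ p_y)[nonzero]
    return float((p_xy[nonzero] * np.log2(ratio)).sum())


# Hypothetical joint counts: two values of `given` by two values of `target`.
associated = [[40, 10], [10, 40]]
independent = [[10, 20], [20, 40]]

cpt = conditional_table(associated, alpha=0.5)
mi_associated = mutual_information_bits(associated)
mi_independent = mutual_information_bits(independent)

print(cpt)
print(mi_associated, mi_independent)
```

Ranking parameters by this quantity is one way to surface the strongest associations; a table whose rows are proportional to each other, like `independent`, scores zero.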
Architecture
typola
├── sources/ ← source catalog (WALS, Grambank, ...) + downloader
├── prep/ ← CLDF → Typology → dol stores
├── estimators/ ← count → probability: MLE, Laplace, Jeffreys, Dirichlet, ...
├── models/ ← Distribution, Marginal, Conditional
└── query/ ← query(), compare_estimators(), cross_validate_estimators(), rank_associations(), compare_conditions()
Each layer only depends on the ones above it in the list — so you can use the prep layer without any probabilistic code, or you can use estimators on counts from any other source (not just typola).
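As an illustration of treating estimators as plain functions over counts from any source, the sketch below scores additive smoothing at several strengths by K-fold held-out log-likelihood. Everything in it is an assumption made for the example: the helper names, the toy labels, the probability floor used to keep MLE finite, and the perplexity definition (exponential of the negative mean held-out log-likelihood). typola's cross_validate_estimators may differ on each of these points.

```python
import numpy as np


def smooth(counts, alpha):
    """Additive smoothing; alpha=0 is plain MLE."""
    smoothed = counts + alpha
    return smoothed / smoothed.sum()


def cv_log_likelihood(labels, n_categories, alpha, n_folds=5, seed=0, floor=1e-12):
    """Sum of held-out log-likelihoods (natural log) over K shuffled folds."""
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(labels))
    folds = np.array_split(order, n_folds)
    total = 0.0
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        train_counts = np.bincount(labels[train_idx], minlength=n_categories).astype(float)
        # Floor the probabilities so an unseen category costs a large but
        # finite penalty instead of log(0). This is an assumption of the sketch.
        probs = np.maximum(smooth(train_counts, alpha), floor)
        total += float(np.log(probs[labels[test_idx]]).sum())
    return total


# Toy data: 50 observations over 4 categories, one of them seen only once.
labels = np.repeat([0, 1, 2, 3], [30, 15, 4, 1])

for alpha in (0.0, 0.1, 0.5, 1.0):
    ll = cv_log_likelihood(labels, n_categories=4, alpha=alpha)
    perplexity = np.exp(-ll / len(labels))
    print(f"alpha={alpha}: log-likelihood={ll:.3f}, perplexity={perplexity:.3f}")
```

The rare category is what separates the estimators: whenever its single observation lands in the held-out fold, the unsmoothed estimate has assigned it zero probability.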
Data sources
Currently registered:
| Name | License | Source |
|---|---|---|
| wals | CC BY-NC 4.0 | https://wals.info/ — Dryer & Haspelmath 2013 |
| grambank | CC BY 4.0 | https://grambank.clld.org/ — Skirgård et al. 2023 |
Register more with:
from typola.sources import register_source, SourceSpec
register_source(SourceSpec(
name="apics",
url="https://github.com/cldf-datasets/apics/archive/refs/heads/master.zip",
citation="...",
license="CC-BY-4.0",
archive_type="zip",
strip_components=1,
))
The loader handles any CLDF StructureDataset that provides languages.csv, parameters.csv, codes.csv, values.csv.
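To give a feel for what those tables look like once they are in pandas, here is a small self-contained sketch that joins a values table to a codes table and derives per-code counts. It uses the default CLDF column names (ID, Language_ID, Parameter_ID, Code_ID, Name) and a handful of made-up rows with hypothetical language IDs; the column names in typola's own DataFrames are not documented here and may differ.

```python
import io

import pandas as pd

# Toy stand-ins for values.csv and codes.csv (made-up rows, real WALS code IDs).
values_csv = io.StringIO(
    "ID,Language_ID,Parameter_ID,Code_ID\n"
    "v1,lang_a,81A,81A-1\n"
    "v2,lang_a,83A,83A-1\n"
    "v3,lang_b,81A,81A-2\n"
    "v4,lang_b,83A,83A-2\n"
    "v5,lang_c,81A,81A-1\n"
)
codes_csv = io.StringIO(
    "ID,Parameter_ID,Name\n"
    "81A-1,81A,SOV\n"
    "81A-2,81A,SVO\n"
    "83A-1,83A,OV\n"
    "83A-2,83A,VO\n"
)

values = pd.read_csv(values_csv)
codes = pd.read_csv(codes_csv)

# Attach the human-readable code name to every observation.
code_names = codes[["ID", "Name"]].rename(columns={"ID": "Code_ID"})
labelled = values.merge(code_names, on="Code_ID", how="left")

# One row per language, one column per parameter; missing observations are NaN.
wide = labelled.pivot(index="Language_ID", columns="Parameter_ID", values="Name")

# Counts per code for a single parameter: the input an estimator would consume.
counts_81a = labelled.loc[labelled["Parameter_ID"] == "81A", "Name"].value_counts()

print(wide)
print(counts_81a)
```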
Custom estimators
Subclass Estimator or just provide any callable with a .name attribute:
from dataclasses import dataclass, field

import numpy as np

from typola import estimators
from typola.estimators import Estimator

@dataclass(frozen=True, repr=False)
class _HaldaneMix(Estimator):
    name: str = "haldane_mix"
    params: dict = field(default_factory=lambda: {"alpha": 0.01})

    def _estimate(self, counts):
        # Near-Haldane prior: add a tiny pseudo-count to every category.
        a = self.params["alpha"]
        smoothed = counts + a
        return smoothed / smoothed.sum()

estimators_under_test = [_HaldaneMix(), estimators.jeffreys(), ...]
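The second route, a plain callable with a .name attribute, might look like the sketch below. The class `UniformMix` is hypothetical and does not import typola. It assumes the callable receives an array of counts and returns an array of probabilities of the same length, mirroring the `_estimate(self, counts)` signature above; check the Estimator base class for the exact contract.

```python
import numpy as np


class UniformMix:
    """Hypothetical estimator: blend the MLE with a uniform distribution.

    `weight` is the share of probability mass given to the uniform part.
    """

    def __init__(self, weight=0.1):
        self.weight = weight
        self.name = f"uniform_mix(weight={weight})"

    def __call__(self, counts):
        counts = np.asarray(counts, dtype=float)
        uniform = np.full(counts.size, 1.0 / counts.size)
        total = counts.sum()
        if total == 0:
            # No observations at all: fall back to the uniform distribution.
            return uniform
        return (1.0 - self.weight) * counts / total + self.weight * uniform


estimator = UniformMix(weight=0.1)
probs = estimator([6, 3, 1, 0])
print(estimator.name, probs)
```

Because the uniform part always contributes some mass, no category ends up with probability zero, which keeps held-out log-likelihoods finite.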
Citing data sources
Every Typology carries a .citation string. Cite it in any downstream output.
- WALS requires attribution (CC BY-NC 4.0), no commercial use.
- Grambank is CC BY 4.0.
License
MIT.