
qwals

Compute linguistic distance between languages using WALS-style feature data.

qwals calculates distance at runtime from raw CSV data. No precomputed distance matrix is required.

Citation

If you use this package in research, please cite the following paper:

Eronen, J., Ptaszynski, M., Wicherkiewicz, T., Borges, R., Janic, K., Liu, Z., Mahmud, T., & Masui, F. (2026). Language Models Are Polyglots: Language Similarity Predicts Cross-Lingual Transfer Learning Performance. Machine Learning and Knowledge Extraction, 8(3), 65. https://doi.org/10.3390/make8030065

@article{eronen2026language,
  title={Language Models Are Polyglots: Language Similarity Predicts Cross-Lingual Transfer Learning Performance},
  author={Eronen, Juuso and Ptaszynski, Michal and Wicherkiewicz, Tomasz and Borges, Robert and Janic, Katarzyna and Liu, Zhenzhen and Mahmud, Tanjim and Masui, Fumito},
  journal={Machine Learning and Knowledge Extraction},
  volume={8},
  number={3},
  pages={65},
  year={2026},
  publisher={MDPI}
}

Features

  • Runtime distance calculation, all hot paths in vectorised NumPy
  • Two methods: ordinal and onehot
  • Uses WALS_feature_order.csv for ordered feature values
  • Cleans commas and whitespace before matching values
  • Returns optional per-feature details and per-distance coverage
  • Pairwise distance matrices, plus one-vs-many and "N nearest neighbours" helpers
  • Per-feature weights for emphasising phonology, morphology, syntax, etc.
  • Task-specific feature presets from the qWALS paper (Appendix A) — dep, ner, sentiment, abusive
  • Leave-one-feature-out optimiser — replicates the paper's −0.99 DEP correlation
  • explain_distance() — per-feature contribution breakdown ("transparent vs lang2vec")
  • Bootstrap confidence intervals for distances on sparsely-attested pairs
  • suggest_transfer_source() — rank source-language candidates for cross-lingual transfer
  • Heatmap and dendrogram plotting (matplotlib + scipy, optional via [viz] extra)
  • Feature-set introspection (features_for, shared_features, coverage_for)
  • Disk-cached precomputed matrices — second-and-later inits in ~7 ms
  • Accepts language names, ISO 639-1 codes (pl, en), and WALS Language_ID codes (pol, eng) interchangeably — case-insensitive
  • Built-in CLI: python -m qwals compare pl en (or qwals compare pl en after install)

Installation

Minimal install (NumPy only — needed for distance, feature_distance, similarity):

pip install .

With pandas (needed for pairwise_matrix, save_pairwise_matrix, and distance(..., return_details=True)):

pip install ".[reports]"

Development install:

pip install -e ".[dev]"

Required input files

wals-data.csv

Must contain these columns:

Language_name
Parameter_name
Value

WALS_feature_order.csv

Must contain these columns:

NAME
VALUES IN ORDER

Example:

NAME,VALUES IN ORDER
Consonant-Vowel Ratio,"Low,Moderately low,Average,Moderately high,High"
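For illustration, here is one way such a row could be parsed into a value → index map (a sketch using the standard csv module, not the library's actual loader):

```python
# Sketch: parse a WALS_feature_order.csv row into a value -> index map.
# The quoted "VALUES IN ORDER" field is handled by the csv module.
import csv
import io

raw = ('NAME,VALUES IN ORDER\n'
       'Consonant-Vowel Ratio,"Low,Moderately low,Average,Moderately high,High"\n')

order = {}
for rec in csv.DictReader(io.StringIO(raw)):
    values = [v.strip() for v in rec["VALUES IN ORDER"].split(",")]
    order[rec["NAME"]] = {v: i for i, v in enumerate(values)}

print(order["Consonant-Vowel Ratio"]["High"])   # 4 (last of 5 ordered values)
```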

Quick usage

from qwals import QwalsCalculator

calc = QwalsCalculator(
    "wals-data.csv",
    "WALS_feature_order.csv",
)

d = calc.distance("English", "Japanese", method="ordinal")
print(d)

Methods

Ordinal

The ordinal method uses the value order from VALUES IN ORDER.

Formula:

abs(index1 - index2) / (number_of_values - 1)
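A worked example using the Consonant-Vowel Ratio ordering shown earlier (plain Python, no library calls):

```python
# Ordinal distance for one feature: normalised gap between value indices.
values = ["Low", "Moderately low", "Average", "Moderately high", "High"]
i1 = values.index("Low")               # 0
i2 = values.index("Moderately high")   # 3
d = abs(i1 - i2) / (len(values) - 1)
print(d)   # 3 / 4 = 0.75
```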

Onehot

The onehot method checks only for exact value matches:

same value      -> 0
different value -> 1

Details output

result = calc.distance(
    "English",
    "Japanese",
    method="ordinal",
    return_details=True,
)

print(result["distance"])
print(result["features_used"])
print(result["details"])

Folder-based loading

calc = QwalsCalculator.from_folder("data/")

N nearest neighbours

nearest(language, n=10, method="ordinal", min_shared=None) returns the n languages closest to the target as a list of (name, distance) tuples, sorted ascending. It runs as a single vectorised one-vs-all pass — typically a couple of milliseconds on the full WALS dataset.

calc.nearest("Polish", n=5)
# [('Lithuanian',     0.131),
#  ('Russian',        0.138),
#  ('Latvian',        0.146),
#  ('Greek (Modern)', 0.163),
#  ('Albanian',       0.173)]

Why min_shared matters

WALS is unevenly populated — many languages have only a handful of features documented. Without a filter, "nearest" tends to surface those sparse languages first because their few features happen to match the target by chance, giving a misleading distance of 0.0. To prevent that, nearest applies a default min_shared=50 (exposed as QwalsCalculator.NEAREST_MIN_SHARED) — only languages sharing at least 50 features with the target are considered. The threshold is on the unweighted count of overlapping features, so changing per-feature weights doesn't silently shift it.

# Default: well-attested neighbours only.
calc.nearest("Polish", n=5)

# Looser — include languages with as few as 20 shared features.
calc.nearest("Polish", n=5, min_shared=20)

# Disable the filter entirely (the pre-0.7 behaviour).
# Expect lots of zero-distance noise from languages with 1–3 features.
calc.nearest("Polish", n=5, min_shared=0)

# Stricter — only well-documented matches.
calc.nearest("Polish", n=5, min_shared=80)

If the target itself is sparsely attested (say only 5 features documented), the default threshold often returns an empty list. That's intentional: without enough features to compare, no "nearest" claim is defensible. Lower min_shared if you want to see candidates anyway.

calc.nearest("Japanese", n=3, method="onehot")
calc.nearest("pl",       n=3)             # alias input is fine
calc.nearest("Polish",   n=3, include_self=True)   # keeps the (Polish, 0.0) entry

One-vs-many distance

distance_to_many(language, others=None, method="ordinal", min_shared=0) returns a {language: distance} dict. With others=None (the default) it computes the distance from the target to every other language; pass an explicit list to restrict it. Unlike nearest, it does not filter by min_shared by default — pass it explicitly when you want to drop sparsely-attested entries from the dict.

calc.distance_to_many("Polish", others=["English", "German", "Russian"])
# {'English': 0.222, 'German': 0.241, 'Russian': 0.084}

# Drop languages with fewer than 20 shared features:
calc.distance_to_many("Polish", min_shared=20)

# Sortable, plottable as a pandas Series:
s = calc.distance_to_many("Polish", as_series=True)
s.sort_values().head(10)

This is faster than calling distance in a loop — both share the same vectorised kernel as nearest.

Per-feature weights

By default every feature contributes equally to the distance. Pass a weights dict to the constructor (or use set_weight / set_weights / reset_weights at runtime) to emphasise some features over others. Weights must be non-negative and finite; a weight of 0 excludes the feature from the distance entirely.

calc = QwalsCalculator.from_folder(
    "data/",
    weights={
        "Order of Subject and Verb":  3.0,   # syntax matters more
        "Order of Object and Verb":   3.0,
        "Number of Genders":          0.5,   # morphology matters less
    },
)
calc.distance("Polish", "English")

# Adjust at runtime
calc.set_weight("Tone", 0.0)         # ignore tone entirely
calc.set_weights({"Velar Nasal": 2.0, "Front Rounded Vowels": 2.0})
calc.weights                          # -> {feat: weight} of all non-1.0 features
calc.reset_weights()                  # back to all-1.0

Weights are honoured by distance, pairwise_matrix, nearest, and distance_to_many. The weighted distance formula is

sum(w[f] * mask[f] * |o1[f] - o2[f]| / (n[f] - 1))   /   sum(w[f] * mask[f])

(for the ordinal method; onehot is the same with 1{v1[f] != v2[f]} in place of the scaled ordinal gap).
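The weighted formula can be written out in a few lines of NumPy (a standalone sketch with made-up indices, not the library's internal kernel):

```python
import numpy as np

# Hypothetical data for three shared features.
o1   = np.array([0, 2, 1])        # value indices, language 1
o2   = np.array([3, 2, 0])        # value indices, language 2
n    = np.array([5, 4, 3])        # number of ordered values per feature
w    = np.array([3.0, 1.0, 0.5])  # per-feature weights
mask = np.array([1.0, 1.0, 1.0])  # 1 where both languages attest the feature

num = np.sum(w * mask * np.abs(o1 - o2) / (n - 1))
den = np.sum(w * mask)
print(num / den)   # weighted ordinal distance in [0, 1]
```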

distance(..., return_details=True) includes a Weight column in the per-feature DataFrame whenever any weight differs from 1.0.

Task-specific feature presets

The qWALS paper (Eronen et al., 2026, Appendix A) reports four leave-one-feature-out-optimised feature subsets, one per cross-lingual NLP task. They're shipped as named presets:

Preset      # features   Paper Pearson ρ vs transfer F1
dep         75           −0.99 (DEP)
ner         63           −0.81 (NER)
abusive     53           −0.82 (abusive language)
sentiment   21           −0.80 (sentiment analysis)

Apply with use_features:

calc = QwalsCalculator.from_folder("data/")
calc.use_features("dep")               # restrict to the 75 DEP features
calc.distance("Polish", "English")     # task-tuned distance
calc.nearest("Polish", n=5)            # → Ukrainian, German, Lithuanian, Russian, Danish
calc.reset_features()                  # back to all 192 features

When a preset is active, the nearest min_shared default is auto-scaled in proportion to the active feature count (e.g. ~20 for DEP rather than 50), so you don't get an empty result by surprise.
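The proportional scaling works out as follows (arithmetic matching the stated example; the library's exact rounding rule is an assumption):

```python
n_active, n_total, default_min_shared = 75, 192, 50   # DEP preset
scaled = round(default_min_shared * n_active / n_total)
print(scaled)   # 20
```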

You can also pass an explicit list:

calc.use_features(["Order of Subject and Verb", "Number of Genders", "Tone"])

The full preset lists are available as a Python dict:

from qwals import TASK_FEATURES, TASKS
TASKS                           # ('abusive', 'sentiment', 'ner', 'dep')
TASK_FEATURES["dep"][:3]        # first three DEP features

Leave-one-feature-out (LOFO) optimisation

Replicates §4.5 of the qWALS paper: greedily drop one feature at a time until correlation with a target signal stops improving. Use this to distil a custom feature set for any new task or transfer-score table:

target_scores = {
    ("Polish", "English"):  0.84,    # Polish→English transfer F1
    ("Polish", "German"):   0.75,
    ("Polish", "Russian"):  0.91,
    # … all language pairs you have scored …
}
result = calc.optimize_features(target_scores, method="ordinal", correlation="pearson")
print(result["pearson"], result["n_features"])    # e.g. -0.991, 75
calc.use_features(result["features"])             # apply the optimised set

Returns {features, pearson, spearman, n_features, n_dropped, history}, where history is the per-step trace useful for plotting the dropping curve. Key kwargs:

  • correlation: "pearson" (default) or "spearman"
  • direction: "minimise" (low distance ↔ high transfer; default) or "maximise" (positively-correlated targets like expert dissimilarity)
  • max_drops / min_features: stop conditions
  • verbose=True: print one line per drop

The procedure is greedy and O(F · L²) per step (≈100 ms for the eight study languages × 192 features in the paper).

Suggesting a transfer source

suggest_transfer_source(target, task=None) ties everything above together for the paper's headline use case — picking a source language for cross-lingual transfer:

calc.suggest_transfer_source("Polish", task="dep", n=5)
# [{'language': 'Ukrainian',  'distance': 0.05, 'n_shared': 20, 'coverage': 0.27, 'confidence': 0.49},
#  {'language': 'German',     'distance': 0.05, 'n_shared': 37, 'coverage': 0.49, 'confidence': 0.66},
#  {'language': 'Lithuanian', 'distance': 0.07, 'n_shared': 36, 'coverage': 0.48, 'confidence': 0.65},
#  {'language': 'Russian',    'distance': 0.07, 'n_shared': 41, 'coverage': 0.55, 'confidence': 0.69},
#  {'language': 'Danish',     'distance': 0.07, 'n_shared': 26, 'coverage': 0.35, 'confidence': 0.55}]

Each entry exposes both the raw distance and n_shared so you can weigh "close but sparsely-attested" vs "well-attested but mid-range" yourself. The confidence score is a deliberately simple combination — (1 − distance) · √coverage — and is provided as a convenience, not as a substitute for inspecting the underlying numbers (the paper showed that more aggressive multiplicative reliability penalties can mislead).
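Checking the heuristic against the first row of the output above:

```python
import math

# Ukrainian entry: distance 0.05, coverage 0.27.
confidence = (1 - 0.05) * math.sqrt(0.27)
print(round(confidence, 2))   # 0.49
```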

task=None works on the currently-active feature set; pass an explicit candidates=[...] list to restrict the pool. The active mask is restored when the call returns, so suggest_transfer_source never mutates your calculator's state.

Per-feature explainability

explain_distance(L1, L2, top_k=10) shows exactly which typological features are driving a given distance — the "transparent vs lang2vec" selling point of qWALS in plain action:

calc.explain_distance("Polish", "English", top_k=5)
#                              Feature   per_feature_distance   Weight   contribution
#                  Number of Genders                      1.0      1.0          0.012
#                  Order of Adjective and Noun            1.0      1.0          0.012
#                  …

Returns a pandas.DataFrame sorted by contribution (descending), where contribution is the share of the final weighted distance that the feature accounts for; contributions sum to 1.0 across the table when top_k=None.
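The contribution column follows directly from the weighted formula: each feature's weighted per-feature distance divided by the weighted total (a sketch with invented numbers, not the library's internals):

```python
import numpy as np

w  = np.array([1.0, 1.0, 2.0])     # weights
pf = np.array([1.0, 0.5, 0.25])    # per-feature distances
contribution = (w * pf) / np.sum(w * pf)
print(contribution)   # shares summing to 1.0
```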

Coverage and reliability

Every distance in qwals can return its own coverage information so you can weigh per-pair reliability:

calc.distance("Polish", "English", return_coverage=True)
# {'distance': 0.222, 'n_shared': 82, 'coverage': 0.43, 'n_total_features': 192, ...}

calc.coverage_for("Polish", "Chuj")
# {'n_shared': 12, 'lang1_total': 83, 'lang2_total': 29, 'coverage': 0.06, ...}

For sparsely-attested pairs, get a bootstrap confidence interval on the distance (resamples the shared feature index with replacement):

calc.distance_ci("Polish", "Chuj", n_bootstrap=2000, rng=42)
# {'distance': 0.20, 'ci_low': 0.07, 'ci_high': 0.35, 'std': 0.07, 'n_shared': 12, ...}

The CI for a richly-attested pair like Polish↔Russian (≈80 shared features) will be tight; the Polish↔Chuj CI above is much wider — exactly the sparsity-driven uncertainty the qWALS paper §4.4 calls out.
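The resampling scheme itself is the textbook percentile bootstrap, sketched generically here with random stand-in values rather than real WALS data:

```python
import numpy as np

rng = np.random.default_rng(42)
per_feature = rng.uniform(0, 1, size=12)   # stand-in per-feature distances

# Resample the feature index with replacement, recompute the mean each time.
boots = np.array([
    rng.choice(per_feature, size=per_feature.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boots, [2.5, 97.5])
print(round(float(ci_low), 2), round(float(ci_high), 2))
```

With only 12 "shared features" the interval is wide; it tightens as the resampled pool grows, which is exactly the Polish↔Russian vs Polish↔Chuj contrast described above.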

To re-rank candidates in nearest by an upper-CI criterion (the paper's sparsity-aware ranking suggestion), pass sort_by="upper_ci":

calc.nearest("Polish", n=5, sort_by="upper_ci", n_bootstrap=200, rng=42)
# [(name, distance, n_shared, ci_high), ...]

Feature-set introspection

features_for(language) returns the list of features the language has a value for (in the canonical calc.features order). shared_features(l1, l2) returns the intersection — useful for filtering before a pairwise call, or for diagnosing why two languages have a surprisingly small distance.

len(calc.features_for("Polish"))                # e.g. 83 / 192
calc.shared_features("Polish", "English")[:5]
# ['Absence of Common Consonants',
#  'Alignment of Verbal Person Marking',
#  'Asymmetrical Case-Marking',
#  'Coding of Nominal Plurality',
#  'Comitatives and Instrumentals']

Plotting (matplotlib + scipy)

Install the [viz] extra to enable:

pip install "qwals[viz]"

Then:

import matplotlib.pyplot as plt

# Heatmap — reproduces the per-task panels from Figure 2 of the paper.
calc.plot_heatmap(["pl", "en", "de", "hr", "ru", "ja", "ko"])

# Dendrogram — recovers Slavic / Germanic / Koreano-Japonic groupings.
calc.plot_dendrogram(["pl", "en", "de", "hr", "ru", "ja", "ko"], linkage_method="average")
plt.show()

Both functions accept a pre-existing Axes, so you can compose multi-panel figures (e.g. side-by-side qWALS distance matrix and transfer-score heatmap, exactly as in the paper):

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
calc.plot_heatmap(langs, ax=axes[0], title="qWALS")
calc.plot_dendrogram(langs, ax=axes[1], title="Hierarchical clustering")

Command-line interface

After pip install (or running python -m qwals from a folder containing the WALS CSVs):

# distance between two languages
qwals compare pl en
qwals compare Polish Japanese --method onehot

# N nearest neighbours (default min_shared=50)
qwals nearest Polish --n 5
qwals nearest pl --method onehot
qwals nearest Polish --n 5 --min-shared 20    # looser filter
qwals nearest Polish --n 5 --min-shared 0     # no filter (noisy)

# pairwise matrix to stdout (CSV) or a file
qwals pairwise --out distances.csv
qwals pairwise Polish English German Russian       # restrict to a subset

# feature introspection
qwals shared pl en --limit 20
qwals features Polish

# point at a different folder
qwals --data ~/data/wals compare pl en
qwals --no-cache compare pl en      # bypass the on-disk cache

python -m qwals is identical to the qwals entry point. Run with --help (or <subcommand> --help) for the full option list.

Language identifiers

Any of the following forms is accepted wherever a language argument is expected — exact name, case-insensitive name, ISO 639-1 code (two-letter), or WALS Language_ID (three-letter, as present in the data CSV).

calc.distance("Polish", "English")   # exact names
calc.distance("pl", "en")             # ISO 639-1
calc.distance("pol", "eng")           # WALS Language_ID
calc.distance("polish", "ENGLISH")    # case-insensitive
calc.distance("pl", "English")        # mixed forms

pairwise_matrix accepts the same forms and always uses canonical WALS names in the resulting index and columns, so output is stable regardless of input style.

You can register custom aliases at runtime:

calc.add_alias("Lehia", "Polish")     # archaic name → modern
calc.resolve_language("Lehia")        # -> "Polish"
calc.aliases_for("Polish")            # -> ['lehia', 'pl', 'pol', 'polish']

The bundled ISO 639-1 table prefers WALS-friendly names where they differ from the standard form (e.g. el → "Greek (Modern)", zh → "Mandarin", hr/sr → "Serbian-Croatian"). Aliases that don't correspond to any language in your loaded data are silently dropped.

Notes

  • Only shared features between two languages are used.
  • Features are weighted equally by default; see Per-feature weights to change that.
  • Values are normalised by removing commas and collapsing repeated whitespace before matching.
  • Missing feature orders are inferred from wals-data.csv by default.
  • To disable inferred orders:
calc = QwalsCalculator(
    "wals-data.csv",
    "WALS_feature_order.csv",
    infer_missing_orders=False,
)
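The normalisation note above could look like the following (an assumption about the exact rule, inferred from the description, not the library's code):

```python
import re

def normalise(value: str) -> str:
    # Remove commas, collapse runs of whitespace, trim the ends.
    return re.sub(r"\s+", " ", value.replace(",", "")).strip()

print(normalise("  Moderately,   high "))   # "Moderately high"
```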

Performance

Two int16 matrices of shape (n_languages, n_features) are built once at construction time, then cached to disk so subsequent inits are ~5–10 ms. All hot paths are pure NumPy with a tiled inner loop. On the bundled WALS data (2,673 languages × 192 features):

Operation                          0.1 (pandas)    0.3 (numpy)   0.4 (this)   Speedup vs 0.1
init (cold, no cache)              ~150 ms         ~190 ms       ~180 ms      ~1×
init (warm, disk-cached)           n/a             n/a           ~7 ms        ~20× vs cold
distance (ordinal, single pair)    22 ms           7 µs          6 µs         ~3,600×
distance (onehot, single pair)     24 ms           4 µs          3 µs         ~7,000×
pairwise_matrix (30 langs)         18 s            <1 ms         0.2 ms       ~90,000×
pairwise_matrix (full, ordinal)    ~39 h (proj.)   1.9 s         1.1 s        ~130,000×
pairwise_matrix (full, onehot)     ~39 h (proj.)   1.0 s         0.7 s        ~200,000×

Memory for the precomputed matrices halved from 4.1 MiB to 2.05 MiB (int32 → int16). Alias lookup is O(1) — a single extra dict.get.

Disk cache

By default, after the first build the precomputed matrices are written to ~/.cache/qwals/<hash>.npz. The hash is keyed on the resolved CSV paths, their mtime+size, the package version, and the infer_missing_orders / inferred_order_method options — so editing the data CSV or upgrading the package transparently invalidates the cache.

QwalsCalculator.from_folder(".", cache=True)         # default
QwalsCalculator.from_folder(".", cache=False)        # never write
QwalsCalculator.from_folder(".", cache="my.npz")     # custom path

Override the default location with the QWALS_CACHE_DIR env var.
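The invalidation rule can be pictured as a hash over exactly those ingredients (an illustrative recipe only; the function name and the real key layout are assumptions):

```python
import hashlib
import os

def cache_key(csv_paths, version="0.8.2", infer_missing_orders=True):
    # Key on resolved paths, mtime+size, package version, and options,
    # so editing a CSV or upgrading the package yields a new key.
    h = hashlib.sha256()
    for p in csv_paths:
        st = os.stat(p)
        h.update(f"{os.path.abspath(p)}:{st.st_mtime_ns}:{st.st_size}".encode())
    h.update(f"{version}:{infer_missing_orders}".encode())
    return h.hexdigest()[:16]
```

Any change to a file's contents (and hence its size or mtime) or to the options produces a different key, so a stale .npz is simply never looked up again.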

License

MIT
