Standalone phonological feature systems for historical linguistics
distfeat
distfeat is a standalone Python package for manipulating phonological
features.
It provides:
- bundled phonological feature datasets
- pluggable feature systems
- feature geometry and distance functions
- query and analysis helpers for graphemes and feature sets
distfeat is dependency-free at runtime and is the standalone home for the
feature subsystem extracted from alteruphono.
The canonical modern API is built around native representations:
- use get_representation(...) when you want the system's native feature model
- use matches(...) and segment_distance(...) for system-native comparison
- treat get_features(...), partial_match(...), and sound_distance(...) as convenience helpers for categorical systems
Installation
Install from PyPI:
pip install distfeat
Requires Python 3.12+.
Development install:
git clone https://github.com/tresoldi/distfeat.git
cd distfeat
uv venv
uv pip install -e ".[dev]"
Run checks in the project environment:
uv run ruff check .
uv run mypy src
uv run pytest -q
uv run python scripts/verify_examples.py
Core Concepts
The package is organized around:
- a bundled FeatureDataset
- a lazy default registry plus explicit Registry instances
- built-in systems:
  - ipa
  - tresoldi
  - distinctive
  - pbase-hc
  - pbase-jfh
  - pbase-spe
  - pbase-uftc
The package does not define a Sound object. It works directly with graphemes,
feature bundles, native multi-state feature tables, scalar dimensions, and
matrices.
Quick Start
import distfeat
# Built-in systems
print(distfeat.list_systems())
# ['ipa', 'tresoldi', 'distinctive', 'pbase-hc', 'pbase-jfh', 'pbase-spe', 'pbase-uftc']
# Basic grapheme lookup
print(distfeat.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})
# Predefined sound classes
print(distfeat.get_class_features("V"))
# frozenset({'vowel'})
# Direct grapheme distance
print(distfeat.distance("a", "e"))
Working With Systems
You can use the lazy default registry through top-level helpers, or you can work with a specific system object.
import distfeat
ipa = distfeat.get_system("ipa")
tresoldi = distfeat.get_system("tresoldi")
distinctive = distfeat.get_system("distinctive")
pbase = distfeat.get_system("pbase-hc")
print(ipa.grapheme_to_features("a"))
print(tresoldi.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))
Exact reverse lookup is available when a native representation maps directly to
a known grapheme. For categorical systems this is usually a frozenset[str];
for valued systems it can be a dict[str, FeatureState | str] or
ValuedFeatures.
ipa = distfeat.get_system("ipa")
grapheme = ipa.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'
Feature Queries
Find Graphemes Matching a Feature Set
Use features_to_graphemes(...) to retrieve all graphemes satisfying a
feature query.
By default, matching is partial and uses the semantics of the selected system.
import distfeat
# All vowels in the default system
vowels = distfeat.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])
# Voiceless consonants
voiceless_consonants = distfeat.features_to_graphemes(
    frozenset({"consonant", "-voiced"})
)
print(voiceless_consonants[:10])
You can also force exact matching:
import distfeat
ipa = distfeat.get_system("ipa")
features = ipa.grapheme_to_features("a")
print(distfeat.features_to_graphemes(features, exact=True))
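The difference between partial and exact matching can be sketched with plain sets. This is a minimal illustration for a categorical system, assuming that a partial match means "every queried feature is present" and an exact match means "the bundles are identical". The INVENTORY table and find_graphemes helper are hypothetical and do not reproduce distfeat's own matching semantics.

```python
# Toy inventory; the feature labels are illustrative, not distfeat's bundled data.
INVENTORY: dict[str, frozenset[str]] = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
    "t": frozenset({"consonant", "voiceless", "alveolar", "stop"}),
    "a": frozenset({"vowel", "open", "front", "unrounded"}),
}


def find_graphemes(query: frozenset[str], exact: bool = False) -> list[str]:
    """Return the graphemes whose feature bundle matches the query.

    Partial matching: the query must be a subset of the bundle.
    Exact matching: the query must equal the bundle.
    """
    if exact:
        return sorted(g for g, feats in INVENTORY.items() if feats == query)
    return sorted(g for g, feats in INVENTORY.items() if query <= feats)


print(find_graphemes(frozenset({"consonant", "stop"})))
# ['b', 'p', 't']
print(find_graphemes(frozenset({"consonant", "stop"}), exact=True))
# []
print(find_graphemes(INVENTORY["b"], exact=True))
# ['b']
```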
Native Multi-State Systems
distfeat also supports systems whose native representation is a named
feature-value table instead of a categorical set. The bundled P-base-derived
systems expose multi-state values such as +, -, n, ., o, and x
through FeatureState.
import distfeat
rep = distfeat.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE
matches = distfeat.features_to_graphemes({"syllabic": "+"}, system="pbase-hc")
print(matches[:10])
The bundled P-base table is intentionally described as derived rather than
verbatim. The source data contains duplicate IPA rows, including rows with
conflicting values in a small number of columns. distfeat merges duplicate
rows conservatively:
- identical duplicate rows collapse into one row
- if duplicate rows disagree, only the conflicting cells are downgraded to . (FeatureState.DOT)
This preserves a single usable row per grapheme without inventing new positive or negative values where the source disagrees.
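The merge rule can be sketched in a few lines. This is an illustration under stated assumptions, not the package's actual build script: rows are modeled as (grapheme, feature-to-symbol dict) pairs, the string "." stands in for FeatureState.DOT, and merge_rows is a hypothetical name.

```python
from collections import defaultdict

DOT = "."  # stands in for FeatureState.DOT in this sketch


def merge_rows(rows: list[tuple[str, dict[str, str]]]) -> dict[str, dict[str, str]]:
    """Merge duplicate grapheme rows, downgrading only the conflicting cells.

    Assumes every row for a grapheme carries the same set of columns.
    """
    grouped: dict[str, list[dict[str, str]]] = defaultdict(list)
    for grapheme, values in rows:
        grouped[grapheme].append(values)

    merged: dict[str, dict[str, str]] = {}
    for grapheme, duplicates in grouped.items():
        first = duplicates[0]
        merged[grapheme] = {
            # Keep the value when all duplicates agree; otherwise fall back to DOT.
            feature: value if all(d.get(feature) == value for d in duplicates) else DOT
            for feature, value in first.items()
        }
    return merged


rows = [
    ("t", {"voice": "-", "continuant": "-"}),
    ("t", {"voice": "-", "continuant": "-"}),  # identical duplicate
    ("x", {"voice": "-", "continuant": "+"}),
    ("x", {"voice": "+", "continuant": "+"}),  # conflicts on "voice" only
]
merged = merge_rows(rows)
print(merged["t"])
# {'voice': '-', 'continuant': '-'}
print(merged["x"])
# {'voice': '.', 'continuant': '+'}
```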
Derive Shared Class Features
Use derive_class_features(...) to compute the strict shared feature
intersection of a set of graphemes.
import distfeat
print(distfeat.derive_class_features(["t", "d"]))
# frozenset({'consonant', 'alveolar', 'stop', ...})
print(distfeat.derive_class_features(["t", "d", "s"]))
# fewer shared features than the pair above
For multi-state systems, the result is a dictionary of shared feature states:
import distfeat
print(distfeat.derive_class_features(["t", "d"], system="pbase-hc"))
# {'consonantal': <FeatureState.POSITIVE: '+'>, ...}
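Conceptually, a strict shared intersection keeps only what every member of the class agrees on. The sketch below shows this for both representation styles using toy data. The FEATURES and VALUES tables and both helper functions are hypothetical and are not taken from distfeat.

```python
from functools import reduce

# Toy categorical bundles; illustrative labels only.
FEATURES: dict[str, frozenset[str]] = {
    "t": frozenset({"consonant", "voiceless", "alveolar", "stop"}),
    "d": frozenset({"consonant", "voiced", "alveolar", "stop"}),
    "s": frozenset({"consonant", "voiceless", "alveolar", "fricative"}),
}

# Toy multi-state rows using symbolic values.
VALUES: dict[str, dict[str, str]] = {
    "t": {"consonantal": "+", "voice": "-"},
    "d": {"consonantal": "+", "voice": "+"},
}


def shared_features(graphemes: list[str]) -> frozenset[str]:
    """Intersect the categorical bundles of all graphemes."""
    return reduce(frozenset.intersection, (FEATURES[g] for g in graphemes))


def shared_values(graphemes: list[str]) -> dict[str, str]:
    """Keep only the feature/value pairs on which every grapheme agrees."""
    first, *rest = (VALUES[g] for g in graphemes)
    return {f: v for f, v in first.items() if all(r.get(f) == v for r in rest)}


print(sorted(shared_features(["t", "d"])))
# ['alveolar', 'consonant', 'stop']
print(sorted(shared_features(["t", "d", "s"])))
# ['alveolar', 'consonant']
print(shared_values(["t", "d"]))
# {'consonantal': '+'}
```

Adding a grapheme to the class can only shrink the result, which is why the three-member class above shares fewer features than the pair.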
Minimal Distinguishing Matrices
Use minimal_matrix(...) to compute the smallest feature set needed to
distinguish a given list of graphemes.
import distfeat
matrix = distfeat.minimal_matrix(["t", "d"], system="ipa")
print(matrix.columns)
print(matrix.rows)
For ipa and tresoldi, the matrix is categorical and boolean. For
distinctive, it uses scalar dimensions. For P-base-derived systems, it uses
native multi-state values.
import distfeat
matrix = distfeat.minimal_matrix(["t", "d", "s"], system="ipa")
print(distfeat.tabulate_matrix(matrix))
Example plain-text output:
grapheme | continuant | voiced
---------+------------+-------
t        | False      | False
d        | False      | True
s        | True       | False
Markdown output is also supported:
print(distfeat.tabulate_matrix(matrix, format="markdown"))
P-base-derived systems render symbolic state values directly:
import distfeat
matrix = distfeat.minimal_matrix(["t", "d"], system="pbase-hc")
print(distfeat.tabulate_matrix(matrix))
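One way to picture the "smallest distinguishing feature set" idea is a brute-force search over feature subsets of increasing size. The sketch below does this for toy categorical data. It is not how distfeat implements minimal_matrix(...), and the FEATURES table and minimal_columns helper are hypothetical.

```python
from itertools import combinations

# Toy categorical bundles; illustrative labels only.
FEATURES: dict[str, frozenset[str]] = {
    "t": frozenset({"consonant", "voiceless", "alveolar", "stop"}),
    "d": frozenset({"consonant", "voiced", "alveolar", "stop"}),
    "s": frozenset({"consonant", "voiceless", "alveolar", "fricative"}),
}


def minimal_columns(graphemes: list[str]) -> tuple[str, ...]:
    """Return a smallest set of features that gives every grapheme a unique row."""
    candidates = sorted(set().union(*(FEATURES[g] for g in graphemes)))
    for size in range(len(candidates) + 1):
        for columns in combinations(candidates, size):
            # One boolean row per grapheme, restricted to the chosen columns.
            rows = {tuple(c in FEATURES[g] for c in columns) for g in graphemes}
            if len(rows) == len(graphemes):
                return columns
    raise ValueError("graphemes cannot be distinguished by these features")


print(minimal_columns(["t", "d"]))
# ('voiced',)
print(minimal_columns(["t", "d", "s"]))
# ('fricative', 'voiced')
```

The search is exponential in the number of candidate features, so it is only meant to show the idea on small inputs. When several subsets of the same size work, this sketch returns the first one in alphabetical order.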
Distinctive Scalars
The distinctive system also exposes scalar representations.
from distfeat import DistinctiveFeatureSystem, load_builtin_dataset
system = DistinctiveFeatureSystem(dataset=load_builtin_dataset())
print(system.grapheme_to_scalars("a"))
print(system.features_to_scalars(system.grapheme_to_features("a")))
print(system.scalars_to_features({"voice": 1.0, "labial": 1.0}))
Distance
System-Based Distance
The default distance(...) helper resolves graphemes through the selected
system and uses that system's native distance.
import distfeat
print(distfeat.distance("a", "e"))
print(distfeat.distance("a", "u"))
print(distfeat.distance("p", "b"))
print(distfeat.distance("t", "d", system="pbase-hc"))
Precomputed Distance Matrices
You can also supply a precomputed nested dictionary.
import distfeat
precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}
print(distfeat.distance("a", "e", precomputed=precomputed))
print(distfeat.distance("b", "p", precomputed=precomputed))
If a requested pair is missing from the precomputed matrix, the function raises
KeyError.
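The lookup behavior can be pictured as a nested-dictionary access that fails loudly. The sketch below is a hypothetical helper, not distfeat's code. In particular, trying the reversed pair before raising is an assumption of this sketch, and this section does not state whether the library does the same.

```python
def lookup_distance(a: str, b: str, precomputed: dict[str, dict[str, float]]) -> float:
    """Look up a distance in a nested dict, raising KeyError if the pair is absent."""
    if a in precomputed and b in precomputed[a]:
        return precomputed[a][b]
    # Assumption: the matrix is treated as symmetric, so try the reversed pair.
    if b in precomputed and a in precomputed[b]:
        return precomputed[b][a]
    raise KeyError((a, b))


precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}
print(lookup_distance("a", "e", precomputed))
# 1.5
print(lookup_distance("b", "p", precomputed))
# 0.5
try:
    lookup_distance("a", "p", precomputed)
except KeyError as missing:
    print("missing pair:", missing)
```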
Custom Datasets
Load From a Directory
from distfeat import create_registry, load_dataset
dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("ipa")
print(system.grapheme_to_features("k"))
Expected files in my_feature_data/:
- sounds.tsv
- classes.tsv
- features.tsv
Bundled P-base-Derived Data
distfeat bundles a derived segment table based on the P-base distribution.
The bundled systems are:
- pbase-hc
- pbase-jfh
- pbase-spe
- pbase-uftc
These systems use the same registry and analysis APIs as the categorical and scalar systems, but operate on native multi-state feature values.
The P-base-derived data is bundled separately from the MIT-licensed code and
retains its own attribution and license notice in src/distfeat/data/pbase/.
Build From In-Memory Rows
from distfeat import create_registry, dataset_from_rows
from distfeat.systems.ipa import IPAFeatureSystem
dataset = dataset_from_rows(
    sounds={"a": "open front vowel", "p": "voiceless bilabial consonant stop"},
    classes={"V": ("vowel", "vowel", ["a"])},
    features=[("open", "height"), ("front", "centrality"), ("stop", "manner")],
)
registry = create_registry(dataset=dataset, register_builtin=False)
registry.register("ipa", IPAFeatureSystem(dataset))
print(registry.get_system("ipa").grapheme_to_features("a"))
Explicit Registries
Use explicit registries when you want isolated state instead of the default global registry.
from distfeat import create_registry, load_builtin_dataset
registry = create_registry(dataset=load_builtin_dataset())
registry.set_default("tresoldi")
print(registry.get_system().name)
print(registry.list_systems())
What The Package Does Not Do
The current package intentionally does not provide:
- a legacy DistFeat facade class
- the old binary/tristate feature-table interface
- grapheme2features(..., t_values=False) style +/-/0 rendering
- vector output modes for feature tables or matrices
- a command-line interface
- ML-based distance training
The current public API is built around categorical feature bundles, native
multi-state feature tables, scalar dimensions for the distinctive system,
and analysis helpers over those representations.
Documentation
- docs/index.md for the package overview
- docs/api.md for the public API
- docs/datasets.md for dataset loading
- docs/systems.md for built-in systems
- docs/recipes.md for task-oriented workflows
- docs/development.md for implementation constraints
Relationship to alteruphono
alteruphono should be treated as a consumer of distfeat, not the owner of
the feature subsystem.