
Standalone phonological feature systems for historical linguistics

distfeat

distfeat is a standalone Python package for manipulating phonological features.

It provides:

  • bundled phonological feature datasets
  • pluggable feature systems
  • feature geometry and distance functions
  • query and analysis helpers for graphemes and feature sets

distfeat has no runtime dependencies. It is the standalone home of the feature subsystem extracted from alteruphono.

The canonical modern API is built around native representations:

  • use get_representation(...) when you want the system's native feature model
  • use matches(...) and segment_distance(...) for system-native comparison
  • treat get_features(...), partial_match(...), and sound_distance(...) as convenience helpers for categorical systems

Installation

Install from PyPI:

pip install distfeat

Requires Python 3.12+.

Development install:

git clone https://github.com/tresoldi/distfeat.git
cd distfeat
uv venv
uv pip install -e ".[dev]"

Run checks in the project environment:

uv run ruff check .
uv run mypy src
uv run pytest -q
uv run python scripts/verify_examples.py

Core Concepts

The package is organized around:

  • a bundled FeatureDataset
  • a lazy default registry plus explicit Registry instances
  • built-in systems:
    • ipa
    • tresoldi
    • distinctive
    • pbase-hc
    • pbase-jfh
    • pbase-spe
    • pbase-uftc

The package does not define a Sound object. It works directly with graphemes, feature bundles, native multi-state feature tables, scalar dimensions, and matrices.

Quick Start

import distfeat

# Built-in systems
print(distfeat.list_systems())
# ['ipa', 'tresoldi', 'distinctive', 'pbase-hc', 'pbase-jfh', 'pbase-spe', 'pbase-uftc']

# Basic grapheme lookup
print(distfeat.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})

# Predefined sound classes
print(distfeat.get_class_features("V"))
# frozenset({'vowel'})

# Direct grapheme distance
print(distfeat.distance("a", "e"))

Working With Systems

You can use the lazy default registry through top-level helpers, or you can work with a specific system object.

import distfeat

ipa = distfeat.get_system("ipa")
tresoldi = distfeat.get_system("tresoldi")
distinctive = distfeat.get_system("distinctive")
pbase = distfeat.get_system("pbase-hc")

print(ipa.grapheme_to_features("a"))
print(tresoldi.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))

Exact reverse lookup is available when a native representation maps directly to a known grapheme. For categorical systems this is usually a frozenset[str]; for valued systems it can be a dict[str, FeatureState | str] or ValuedFeatures.

ipa = distfeat.get_system("ipa")

grapheme = ipa.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'
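
To make the idea of exact reverse lookup concrete, here is a minimal standalone sketch that inverts a grapheme-to-features table. It does not call distfeat; the toy inventory and the helper name build_reverse_index are assumptions made for illustration, and the library's own lookup may be implemented differently.

```python
# Toy inventory; the bundles for "p" and "b" mirror the examples above,
# but this table is illustrative and is not distfeat's bundled data.
INVENTORY: dict[str, frozenset[str]] = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
}


def build_reverse_index(
    inventory: dict[str, frozenset[str]],
) -> dict[frozenset[str], str]:
    """Map each feature bundle back to the first grapheme that carries it."""
    index: dict[frozenset[str], str] = {}
    for grapheme, features in inventory.items():
        # Hypothetical tie-break: keep the first grapheme seen for a bundle.
        index.setdefault(features, grapheme)
    return index


index = build_reverse_index(INVENTORY)
print(index[frozenset({"consonant", "voiced", "bilabial", "stop"})])
# b

# A bundle that is only a subset of a known one has no exact match.
print(index.get(frozenset({"consonant", "stop"})))
# None
```

Because frozenset is hashable, the whole bundle can serve as a dictionary key, which is what makes the lookup exact rather than partial.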

Feature Queries

Find Graphemes Matching a Feature Set

Use features_to_graphemes(...) to retrieve all graphemes satisfying a feature query.

By default, matching is partial and uses the semantics of the selected system.

import distfeat

# All vowels in the default system
vowels = distfeat.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])

# Voiceless consonants
voiceless_consonants = distfeat.features_to_graphemes(
    frozenset({"consonant", "-voiced"})
)
print(voiceless_consonants[:10])

You can also force exact matching:

import distfeat

ipa = distfeat.get_system("ipa")
features = ipa.grapheme_to_features("a")
print(distfeat.features_to_graphemes(features, exact=True))
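
The difference between partial and exact matching can be sketched without the library. The version below assumes the simplest reading for a categorical system: partial matching is a subset test and exact matching is set equality. The inventory and the function find_graphemes are hypothetical, and the sketch does not model system-specific semantics such as the "-voiced" query shown earlier.

```python
# Toy inventory for illustration only; not distfeat's bundled data.
INVENTORY: dict[str, frozenset[str]] = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
    "a": frozenset({"vowel", "open", "front"}),
}


def find_graphemes(
    query: frozenset[str],
    inventory: dict[str, frozenset[str]] = INVENTORY,
    exact: bool = False,
) -> list[str]:
    """Return graphemes whose bundle contains the query (or equals it if exact)."""
    if exact:
        return sorted(g for g, feats in inventory.items() if feats == query)
    # Partial match: every queried feature must be present in the bundle.
    return sorted(g for g, feats in inventory.items() if query <= feats)


print(find_graphemes(frozenset({"consonant", "stop"})))
# ['b', 'p']
print(find_graphemes(frozenset({"consonant", "stop"}), exact=True))
# []
print(find_graphemes(INVENTORY["b"], exact=True))
# ['b']
```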

Native Multi-State Systems

distfeat also supports systems whose native representation is a named feature-value table instead of a categorical set. The bundled P-base-derived systems expose multi-state values such as "+", "-", "n", ".", "o", and "x" through FeatureState.

import distfeat

rep = distfeat.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE

matches = distfeat.features_to_graphemes({"syllabic": "+"}, system="pbase-hc")
print(matches[:10])

The bundled P-base table is intentionally described as derived rather than verbatim. The source data contains duplicate IPA rows, including rows with conflicting values in a small number of columns. distfeat merges duplicate rows conservatively:

  • identical duplicate rows collapse into one row
  • if duplicate rows disagree, only the conflicting cells are downgraded to "." (FeatureState.DOT)

This preserves a single usable row per grapheme without inventing new positive or negative values where the source disagrees.
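
The two merge rules above can be sketched in a few lines. This is a standalone illustration, not the code distfeat uses to build its table: the function name merge_duplicate_rows, the plain-string cell values, and the sample rows are all assumptions.

```python
DOT = "."  # stands in for the FeatureState.DOT value mentioned above

Row = dict[str, str]


def merge_duplicate_rows(rows: list[tuple[str, Row]]) -> dict[str, Row]:
    """Merge rows that share a grapheme; cells that disagree become DOT."""
    merged: dict[str, Row] = {}
    for grapheme, row in rows:
        if grapheme not in merged:
            merged[grapheme] = dict(row)
            continue
        current = merged[grapheme]
        for feature, value in row.items():
            if feature in current and current[feature] != value:
                # Conflict: downgrade this cell only, leave the rest alone.
                current[feature] = DOT
            else:
                current.setdefault(feature, value)
    return merged


# Hypothetical source rows: "t" is duplicated identically,
# "d" is duplicated with a conflict in one column.
rows = [
    ("t", {"consonantal": "+", "syllabic": "-"}),
    ("t", {"consonantal": "+", "syllabic": "-"}),
    ("d", {"consonantal": "+", "syllabic": "-"}),
    ("d", {"consonantal": "+", "syllabic": "+"}),
]

print(merge_duplicate_rows(rows))
# {'t': {'consonantal': '+', 'syllabic': '-'}, 'd': {'consonantal': '+', 'syllabic': '.'}}
```

Once a cell has been downgraded it stays downgraded, so a third duplicate row cannot reintroduce a positive or negative value.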

Derive Shared Class Features

Use derive_class_features(...) to compute the strict shared feature intersection of a set of graphemes.

import distfeat

print(distfeat.derive_class_features(["t", "d"]))
# frozenset({'consonant', 'alveolar', 'stop', ...})

print(distfeat.derive_class_features(["t", "d", "s"]))
# fewer shared features than the pair above

For multi-state systems, the result is a dictionary of shared feature states:

import distfeat

print(distfeat.derive_class_features(["t", "d"], system="pbase-hc"))
# {'consonantal': <FeatureState.POSITIVE: '+'>, ...}
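
As a rough picture of what a strict shared intersection means in both cases, the sketch below computes it over toy data. The helper names and the feature bundles are assumptions for illustration; the values distfeat returns depend on its bundled datasets.

```python
def shared_categorical(bundles: list[frozenset[str]]) -> frozenset[str]:
    """Features present in every bundle."""
    if not bundles:
        return frozenset()
    return frozenset.intersection(*bundles)


def shared_valued(tables: list[dict[str, str]]) -> dict[str, str]:
    """Feature/value pairs that are identical in every table."""
    if not tables:
        return {}
    first, *rest = tables
    return {
        feature: value
        for feature, value in first.items()
        if all(table.get(feature) == value for table in rest)
    }


# Toy categorical bundles (illustrative only).
t = frozenset({"consonant", "voiceless", "alveolar", "stop"})
d = frozenset({"consonant", "voiced", "alveolar", "stop"})
s = frozenset({"consonant", "voiceless", "alveolar", "fricative"})

print(sorted(shared_categorical([t, d])))
# ['alveolar', 'consonant', 'stop']
print(sorted(shared_categorical([t, d, s])))
# ['alveolar', 'consonant']

# Toy multi-state tables (illustrative only).
print(shared_valued([
    {"consonantal": "+", "voice": "-"},
    {"consonantal": "+", "voice": "+"},
]))
# {'consonantal': '+'}
```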

Minimal Distinguishing Matrices

Use minimal_matrix(...) to compute the smallest feature set needed to distinguish a given list of graphemes.

import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="ipa")
print(matrix.columns)
print(matrix.rows)

For ipa and tresoldi, the matrix is categorical and boolean. For distinctive, it uses scalar dimensions. For P-base-derived systems, it uses native multi-state values.

import distfeat

matrix = distfeat.minimal_matrix(["t", "d", "s"], system="ipa")
print(distfeat.tabulate_matrix(matrix))

Example plain-text output:

grapheme | continuant | voiced
---------+------------+-------
t        | False      | False
d        | False      | True
s        | True       | False
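
For intuition about what "smallest feature set" means here, the following brute-force sketch searches feature subsets in order of increasing size until every grapheme has a distinct boolean signature. It is a standalone illustration under stated assumptions: the toy bundles and the function name are hypothetical, and the search strategy distfeat actually uses is not described in this document.

```python
from itertools import combinations

# Toy bundles chosen to mirror the table above; not distfeat's data.
TOY: dict[str, frozenset[str]] = {
    "t": frozenset({"consonant", "alveolar", "stop"}),
    "d": frozenset({"consonant", "alveolar", "stop", "voiced"}),
    "s": frozenset({"consonant", "alveolar", "continuant"}),
}


def minimal_distinguishing_features(
    bundles: dict[str, frozenset[str]],
) -> tuple[str, ...]:
    """Smallest feature subset giving each grapheme a unique signature."""
    features = sorted(set().union(*bundles.values()))
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            signatures = {
                tuple(f in feats for f in subset) for feats in bundles.values()
            }
            if len(signatures) == len(bundles):
                return subset
    raise ValueError("some graphemes have identical feature bundles")


columns = minimal_distinguishing_features(TOY)
print(columns)
# ('continuant', 'voiced')
for grapheme, feats in TOY.items():
    print(grapheme, [f in feats for f in columns])
# t [False, False]
# d [False, True]
# s [True, False]
```

The search is exponential in the number of features, which is acceptable for a handful of graphemes but would need pruning for large inventories.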

Markdown output is also supported:

print(distfeat.tabulate_matrix(matrix, format="markdown"))

P-base-derived systems render symbolic state values directly:

import distfeat

matrix = distfeat.minimal_matrix(["t", "d"], system="pbase-hc")
print(distfeat.tabulate_matrix(matrix))

Distinctive Scalars

The distinctive system also exposes scalar representations.

from distfeat import DistinctiveFeatureSystem, load_builtin_dataset

system = DistinctiveFeatureSystem(dataset=load_builtin_dataset())

print(system.grapheme_to_scalars("a"))
print(system.features_to_scalars(system.grapheme_to_features("a")))
print(system.scalars_to_features({"voice": 1.0, "labial": 1.0}))

Distance

System-Based Distance

The default distance(...) helper resolves graphemes through the selected system and uses that system's native distance.

import distfeat

print(distfeat.distance("a", "e"))
print(distfeat.distance("a", "u"))
print(distfeat.distance("p", "b"))
print(distfeat.distance("t", "d", system="pbase-hc"))

Precomputed Distance Matrices

You can also supply a precomputed nested dictionary.

import distfeat

precomputed = {
    "a": {"e": 1.5, "u": 2.0},
    "p": {"b": 0.5},
}

print(distfeat.distance("a", "e", precomputed=precomputed))
print(distfeat.distance("b", "p", precomputed=precomputed))

If a requested pair is missing from the precomputed matrix, the function raises KeyError.
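
This document does not state whether a reversed pair such as ("b", "p") is resolved from an entry stored under "p". One way to sidestep the question is to fill both directions before passing the matrix in. The helper below is a hypothetical convenience, not part of distfeat.

```python
def symmetrize(
    pairs: dict[tuple[str, str], float],
) -> dict[str, dict[str, float]]:
    """Build a nested distance dictionary with both (a, b) and (b, a) filled."""
    matrix: dict[str, dict[str, float]] = {}
    for (left, right), value in pairs.items():
        matrix.setdefault(left, {})[right] = value
        matrix.setdefault(right, {})[left] = value
    return matrix


precomputed = symmetrize({
    ("a", "e"): 1.5,
    ("a", "u"): 2.0,
    ("p", "b"): 0.5,
})

print(precomputed["p"]["b"], precomputed["b"]["p"])
# 0.5 0.5

# Pairs that were never supplied are still absent, so a plain
# dictionary lookup raises KeyError for them.
try:
    precomputed["a"]["p"]
except KeyError as error:
    print("missing pair:", error)
```

The resulting dictionary has the same nested shape as the precomputed argument shown above.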

Custom Datasets

Load From a Directory

from distfeat import create_registry, load_dataset

dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("ipa")

print(system.grapheme_to_features("k"))

Expected files in my_feature_data/:

  • sounds.tsv
  • classes.tsv
  • features.tsv
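
Before calling load_dataset, it can help to confirm that the directory contains the three files listed above. The check below uses only the standard library; the function name missing_dataset_files is hypothetical, and it says nothing about the column layout inside each file, which this document does not specify.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("sounds.tsv", "classes.tsv", "features.tsv")


def missing_dataset_files(directory: str) -> list[str]:
    """Return the expected dataset files that are absent from the directory."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_dataset_files("my_feature_data")
if missing:
    print("missing files:", ", ".join(missing))
else:
    print("all expected files are present")
```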

Bundled P-base-Derived Data

distfeat bundles a derived segment table based on the P-base distribution. The bundled systems are:

  • pbase-hc
  • pbase-jfh
  • pbase-spe
  • pbase-uftc

These systems use the same registry and analysis APIs as the categorical and scalar systems, but operate on native multi-state feature values.

The P-base-derived data is bundled separately from the MIT-licensed code and retains its own attribution and license notice in src/distfeat/data/pbase/.

Build From In-Memory Rows

from distfeat import create_registry, dataset_from_rows
from distfeat.systems.ipa import IPAFeatureSystem

dataset = dataset_from_rows(
    sounds={"a": "open front vowel", "p": "voiceless bilabial consonant stop"},
    classes={"V": ("vowel", "vowel", ["a"])},
    features=[("open", "height"), ("front", "centrality"), ("stop", "manner")],
)

registry = create_registry(dataset=dataset, register_builtin=False)
registry.register("ipa", IPAFeatureSystem(dataset))

print(registry.get_system("ipa").grapheme_to_features("a"))

Explicit Registries

Use explicit registries when you want isolated state instead of the default global registry.

from distfeat import create_registry, load_builtin_dataset

registry = create_registry(dataset=load_builtin_dataset())
registry.set_default("tresoldi")

print(registry.get_system().name)
print(registry.list_systems())

What The Package Does Not Do

The current package intentionally does not provide:

  • a legacy DistFeat facade class
  • the old binary/tristate feature-table interface
  • grapheme2features(..., t_values=False) style +/-/0 rendering
  • vector output modes for feature tables or matrices
  • a command-line interface
  • ML-based distance training

The current public API is built around categorical feature bundles, native multi-state feature tables, scalar dimensions for the distinctive system, and analysis helpers over those representations.

Documentation

Relationship to alteruphono

alteruphono should be treated as a consumer of distfeat, not the owner of the feature subsystem.
