Standalone phonological feature systems for historical linguistics

merkmal

merkmal is a standalone Python package for manipulating phonological features. It has no runtime dependencies and requires Python 3.12 or later.

It provides:

  • bundled phonological feature datasets
  • pluggable feature systems (9 built-in)
  • feature geometry and distance functions (Clements & Hume 1995)
  • tonal geometry (Yip/Bao)
  • query and analysis helpers for graphemes and feature sets
  • UPA transcription support

Installation

Install from PyPI:

pip install merkmal

Development install:

git clone https://github.com/tresoldi/merkmal.git
cd merkmal
pip install -e ".[dev]"

Run checks:

ruff check .
mypy src
pytest -q

Quick start

import merkmal

# Built-in systems
print(merkmal.list_systems())
# ['descriptive', 'broad', 'distinctive', 'pbase-hc', 'pbase-jfh',
#  'pbase-spe', 'pbase-uftc', 'phoible', 'classfeat']

# Basic grapheme lookup
print(merkmal.get_features("p"))
# frozenset({'consonant', 'voiceless', 'bilabial', 'stop'})

# Predefined sound classes
print(merkmal.get_class_features("V"))
# frozenset({'vowel'})

# Distance
print(merkmal.distance("a", "e"))
print(merkmal.distance("p", "b", system="classfeat"))

Systems

System                      | Type        | Features                   | Distance
----------------------------+-------------+----------------------------+------------------
descriptive                 | categorical | articulatory               | geometry-weighted
broad                       | categorical | simplified                 | geometry-weighted
distinctive                 | privative   | Clements & Hume            | geometry-weighted
pbase-hc, -jfh, -spe, -uftc | multi-state | 4 theoretical families     | geometry-weighted
phoible                     | binary      | 37 features                | Hamming
classfeat                   | hybrid      | sound classes + continuous | trained weights

All systems implement the same FeatureSystem protocol. Distances, queries, matrices, and natural class derivation work across all of them.
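As a rough illustration of the Hamming distance listed for the binary system, the sketch below counts the features on which two segments disagree. It does not use merkmal: the feature names, the toy values, and the `hamming` helper are assumptions made for the example, and the README does not say whether the library normalizes the count.

```python
# Toy binary feature vectors. The feature names and values are illustrative
# assumptions, not the bundled phoible data.
FEATURES = ("syllabic", "consonantal", "voice", "continuant")

SEGMENTS = {
    "p": (False, True, False, False),
    "b": (False, True, True, False),
    "s": (False, True, False, True),
    "a": (True, False, True, True),
}


def hamming(left, right):
    """Count the features on which two segments disagree."""
    return sum(x != y for x, y in zip(SEGMENTS[left], SEGMENTS[right]))


print(hamming("p", "b"))  # 1 (only "voice" differs)
print(hamming("p", "a"))  # 4 (every feature differs)
```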

Working with systems

You can use the lazy default registry through top-level helpers, or work with a specific system object.

import merkmal

descriptive = merkmal.get_system("descriptive")
distinctive = merkmal.get_system("distinctive")
pbase = merkmal.get_system("pbase-hc")

print(descriptive.grapheme_to_features("a"))
print(distinctive.grapheme_to_features("a"))
print(pbase.grapheme_to_representation("a"))

Exact reverse lookup is available when a native representation maps directly to a known grapheme.

descriptive = merkmal.get_system("descriptive")

grapheme = descriptive.features_to_grapheme(
    frozenset({"consonant", "voiced", "bilabial", "stop"})
)
print(grapheme)
# 'b'

Feature queries

Use features_to_graphemes(...) to find all graphemes matching a feature set. Matching is partial by default.

import merkmal

vowels = merkmal.features_to_graphemes(frozenset({"vowel"}))
print(vowels[:10])

# Exact matching
features = merkmal.get_features("a")
print(merkmal.features_to_graphemes(features, exact=True))
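The difference between partial and exact matching can be sketched without the library. The example assumes that partial matching means "the query is a subset of the grapheme's features"; that reading, the `match` helper, and the place feature given to "t" are assumptions, while the sets for "p" and "b" follow the examples above.

```python
# Toy inventory. "p" and "b" follow the feature sets shown in this README;
# the "alveolar" place for "t" is an assumption made for the example.
INVENTORY = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "b": frozenset({"consonant", "voiced", "bilabial", "stop"}),
    "t": frozenset({"consonant", "voiceless", "alveolar", "stop"}),
}


def match(query, exact=False):
    """Return graphemes whose features contain (or, if exact, equal) the query."""
    if exact:
        return sorted(g for g, feats in INVENTORY.items() if feats == query)
    return sorted(g for g, feats in INVENTORY.items() if query <= feats)


query = frozenset({"voiceless", "stop"})
print(match(query))                        # ['p', 't']
print(match(query, exact=True))            # []
print(match(INVENTORY["b"], exact=True))   # ['b']
```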

Natural classes and matrices

import merkmal

# Shared features of a segment set
print(merkmal.derive_class_features(["p", "t", "k"]))
# frozenset({'consonant', 'voiceless', 'stop'})

# Minimal distinguishing matrix
matrix = merkmal.minimal_matrix(["t", "d", "s"])
print(merkmal.tabulate_matrix(matrix))
# grapheme | continuant | voiced
# ---------+------------+-------
# t        | False      | False
# d        | False      | True
# s        | True       | False
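Deriving the shared features of a segment set amounts to a set intersection. The sketch below shows that idea on a toy inventory; the `shared_features` helper and the place features are assumptions for the example, not the library's implementation.

```python
# Toy feature sets. The place features are assumptions made for the example.
TOY = {
    "p": frozenset({"consonant", "voiceless", "bilabial", "stop"}),
    "t": frozenset({"consonant", "voiceless", "alveolar", "stop"}),
    "k": frozenset({"consonant", "voiceless", "velar", "stop"}),
}


def shared_features(graphemes):
    """Intersect the feature sets of all given graphemes."""
    sets = [TOY[g] for g in graphemes]
    return frozenset.intersection(*sets)


# Same members as frozenset({'consonant', 'voiceless', 'stop'})
print(sorted(shared_features(["p", "t", "k"])))
# ['consonant', 'stop', 'voiceless']
```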

Distance

import merkmal

print(merkmal.distance("a", "e"))
print(merkmal.distance("a", "u"))
print(merkmal.distance("p", "b"))
print(merkmal.distance("t", "d", system="pbase-hc"))

You can also supply precomputed distances as a nested dictionary keyed by grapheme:

precomputed = {"a": {"e": 1.5, "u": 2.0}, "p": {"b": 0.5}}
print(merkmal.distance("a", "e", precomputed=precomputed))
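One way to build such a nested dictionary is to loop over every ordered pair of graphemes with any pairwise function. The sketch below uses a stand-in metric (the size of the symmetric difference of two toy feature sets) so that it runs without the library; `build_table` and `toy_distance` are hypothetical names, and in practice the pairwise function could be `merkmal.distance`.

```python
# Toy feature sets, used only by the stand-in metric below.
TOY_FEATURES = {
    "p": {"consonant", "voiceless", "bilabial", "stop"},
    "b": {"consonant", "voiced", "bilabial", "stop"},
    "a": {"vowel"},
}


def toy_distance(left, right):
    """Stand-in metric: number of features found in only one of the segments."""
    return len(TOY_FEATURES[left] ^ TOY_FEATURES[right])


def build_table(graphemes, dist):
    """Build {a: {b: dist(a, b)}} for every ordered pair of distinct graphemes."""
    return {a: {b: dist(a, b) for b in graphemes if b != a} for a in graphemes}


table = build_table(["p", "b", "a"], toy_distance)
print(table["p"]["b"])  # 2 ("voiceless" and "voiced")
print(table["a"]["p"])  # 5
```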

Multi-state systems (P-base)

P-base-derived systems expose multi-state values (+, -, n, ., o, x) through FeatureState.

import merkmal

rep = merkmal.get_representation("a", system="pbase-hc")
print(rep.values["syllabic"])
# FeatureState.POSITIVE

The bundled P-base table is derived, not verbatim. Duplicate rows with conflicting values have the conflicting cells downgraded to . (FeatureState.DOT). The P-base data retains its own attribution and license notice in src/merkmal/data/pbase/.
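The downgrade rule for duplicate rows can be sketched as follows. This is a minimal illustration of the idea, not the code used to derive the bundled table: the `merge_rows` helper, the row layout, and the sample values are assumptions.

```python
DOT = "."  # stands in for FeatureState.DOT in this sketch


def merge_rows(rows):
    """Merge duplicate segment rows; conflicting cells are downgraded to DOT."""
    merged = {}
    for segment, values in rows:
        if segment not in merged:
            merged[segment] = dict(values)
            continue
        current = merged[segment]
        for feature, value in values.items():
            if feature in current and current[feature] != value:
                current[feature] = DOT  # the rows disagree on this cell
            else:
                current.setdefault(feature, value)
    return merged


rows = [
    ("a", {"syllabic": "+", "voice": "+"}),
    ("a", {"syllabic": "+", "voice": "-"}),  # conflicts on "voice"
]
print(merge_rows(rows))
# {'a': {'syllabic': '+', 'voice': '.'}}
```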

Custom datasets

from merkmal import create_registry, load_dataset

dataset = load_dataset(directory="my_feature_data")
registry = create_registry(dataset=dataset)
system = registry.get_system("descriptive")
print(system.grapheme_to_features("k"))

Expected files in my_feature_data/: sounds.tsv, classes.tsv, features.tsv.
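Before loading a custom dataset it can help to confirm that the three expected files are present. The helper below is a hypothetical pre-flight check built on the standard library only; it is not part of merkmal and says nothing about the required columns of each file.

```python
import tempfile
from pathlib import Path

EXPECTED_FILES = ("sounds.tsv", "classes.tsv", "features.tsv")


def missing_files(directory):
    """Return the expected file names that are absent from the directory."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demonstration on a throwaway directory that only contains sounds.tsv.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sounds.tsv").write_text("", encoding="utf-8")
    print(missing_files(tmp))  # ['classes.tsv', 'features.tsv']
```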

Cognator export

merkmal export-cognator writes a small, byte-stable bundle of TSV + JSON files that downstream consumers (in particular the cognator Go package) can read without any Python dependency on merkmal.

# single system → ./cognator_export/descriptive/
merkmal export-cognator --system=descriptive

# every built-in system → ./cognator_export/<system>/
merkmal export-cognator --all-systems --out=./cognator_export --force

The bundle contains:

  • distances.tsv — full Cartesian pairwise distances, normalized to [0.0, 1.0] via d' = clip(d_raw / d_max_raw, 0, 1).
  • classes.tsv — sound-class reduction (only for systems that expose one, e.g. classfeat).
  • prosody.tsv — per-grapheme role tag (C, R, V, G, T, S, X).
  • fallback.tsv — optional grapheme-normalization table for out-of-inventory inputs (initially empty, populated over time).
  • manifest.json — merkmal version, export date, grapheme count, and SHA-256 hashes of every file in the bundle.
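The normalization formula for distances.tsv can be read as the sketch below. It assumes that d_max_raw is the largest raw distance among the exported pairs and that an all-zero table maps to zeros; both points, along with the `normalize` helper itself, are assumptions rather than documented behaviour.

```python
def normalize(raw):
    """Apply d' = clip(d_raw / d_max_raw, 0, 1) to a {(a, b): d_raw} mapping."""
    d_max = max(raw.values())
    if d_max == 0:
        # Assumption: with no positive distance, every pair maps to 0.0.
        return {pair: 0.0 for pair in raw}
    return {pair: min(max(d / d_max, 0.0), 1.0) for pair, d in raw.items()}


raw = {("a", "e"): 1.5, ("a", "u"): 2.0, ("p", "b"): 0.5}
for pair, value in normalize(raw).items():
    print(pair, "%.6f" % value)
# ('a', 'e') 0.750000
# ('a', 'u') 1.000000
# ('p', 'b') 0.250000
```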

All text files are UTF-8 with NFC-normalized graphemes, LF line endings, and deterministic row ordering. Floats use fixed %.6f formatting. Pin SOURCE_DATE_EPOCH to produce byte-identical bundles across runs.
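The conventions above (NFC graphemes, LF line endings, sorted rows, %.6f floats) are what make the output byte-stable and therefore hashable. The sketch below applies them to a hypothetical three-column layout; the column order, the `serialize` helper, and the `sha256_hex` helper are assumptions, not the actual export code.

```python
import hashlib
import unicodedata


def serialize(rows):
    """Render (grapheme, grapheme, distance) rows as deterministic TSV bytes."""
    normalized = sorted(
        (unicodedata.normalize("NFC", a), unicodedata.normalize("NFC", b), d)
        for a, b, d in rows
    )
    lines = ["%s\t%s\t%.6f" % row for row in normalized]
    return ("\n".join(lines) + "\n").encode("utf-8")  # LF endings, UTF-8


def sha256_hex(payload):
    """Hex digest of the kind a manifest could record for a file."""
    return hashlib.sha256(payload).hexdigest()


first = serialize([("p", "b", 0.5), ("a", "e", 0.75)])
second = serialize([("a", "e", 0.75), ("p", "b", 0.5)])
print(first == second)                           # True: input order is irrelevant
print(sha256_hex(first) == sha256_hex(second))   # True
```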

The same capability is available as a library function:

import merkmal

merkmal.export_cognator("descriptive", "./cognator_export/descriptive")
merkmal.export_all_systems("./cognator_export")

Documentation

See the tutorials for worked examples covering phonological features, typology, historical linguistics, cognate detection, and UPA transcription.

License

MIT. See LICENSE.
