Indic well-formedness checks backed by Nisaba Brahmic FAR releases

nisaba-tools

nisaba-tools provides a small Python API for Nisaba normalization and transliteration FARs:

Brahmic visual_norm, reading_norm, iso, fixed, natural_translit romanization, deromanization, IPA transcription, and wellformed
English letter spellout for selected Brahmic and Arabic-script languages
Abjad/alphabet visual_norm, reading_norm, and reversible_roman

This project is not affiliated with Nisaba. It is a convenience wrapper around a useful upstream project whose Bazel-centric build and packaging are harder to consume directly from a small Python package.

Please do not hold the Nisaba maintainers responsible for breakages in this wrapper, its packaging, or these convenience release assets.

This wrapper exists because Nisaba exposes useful functionality that was not readily available elsewhere in a small Python package, especially its visual normalization, reading normalization, and well-formedness checks.

By default, it uses byte-mode FAR assets from releases published in ramSeraph/nisaba.

Default assets include:

The default API expects byte-mode FARs (*.far), not UTF-8-mode FARs (*_utf8.far)
Brahmic per-script or per-language visual_norm.*.far assets such as visual_norm.Deva.far or visual_norm.Beng.bn.far
Brahmic per-script or per-language reading_norm.*.far assets where Nisaba publishes them, such as reading_norm.Beng.far or reading_norm.Deva.hi.far
Abjad per-language visual_norm.Arab.<lang>.far assets such as visual_norm.Arab.ur.far or visual_norm.Arab.fa.far
Abjad per-language reading_norm.Arab.<lang>.far assets such as reading_norm.Arab.ur.far or reading_norm.Arab.fa.far
combined reversible_roman.far
combined iso.far
combined fixed.far
per-language natural-translit romanization FARs such as hi_iso_nat.far, hi_iso_psac.far, and hi_iso_psaf.far
per-language natural-translit IPA FARs such as hi_iso_ipa.far
per-language natural-translit deromanization FARs such as hi_deva.far, hi_iso.far, ta_taml.far, and ta_iso.far
combined en_spellout.far
wellformed.far
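
The naming pattern of the per-script and per-language normalization assets can be illustrated with a small parser. This is only a sketch: `NormAsset` and `parse_norm_asset` are hypothetical names that are not part of nisaba-tools, and the pattern is inferred from the example file names listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NormAsset:
    kind: str                # "visual_norm" or "reading_norm"
    script: Optional[str]    # e.g. "Deva"; None for a combined asset
    language: Optional[str]  # e.g. "bn"; None for a per-script asset


def parse_norm_asset(filename: str) -> NormAsset:
    """Split names like 'visual_norm.Beng.bn.far' into their parts."""
    parts = filename.split(".")
    if len(parts) < 2 or parts[-1] != "far":
        raise ValueError(f"not a FAR file name: {filename!r}")
    kind, *tags = parts[:-1]
    # UTF-8-mode names such as 'visual_norm_utf8.far' are rejected here,
    # mirroring the byte-mode default described above.
    if kind not in {"visual_norm", "reading_norm"} or len(tags) > 2:
        raise ValueError(f"unrecognized asset name: {filename!r}")
    script = tags[0] if tags else None
    language = tags[1] if len(tags) == 2 else None
    return NormAsset(kind, script, language)


print(parse_norm_asset("visual_norm.Beng.bn.far"))
print(parse_norm_asset("reading_norm.Deva.hi.far"))
```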

Requirements

Python 3.13
rustfst-python

nisaba-tools currently depends on rustfst-python, and the practical Python version requirement comes from that package rather than from Nisaba itself. Upstream currently declares requires-python = ">=3.13,<3.14" and publishes only Python 3.13 wheels for some Linux/macOS targets, with no source distribution, so installing this package requires Python 3.13. See the upstream issue: garvys-org/rustfst#301. This package uses rustfst rather than the openfst/pynini stack partly because packaging and installation are also much harder to rely on there.

If upstream packaging improves, this requirement can likely be relaxed later.
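
A script that depends on this package may want to fail early on an unsupported interpreter. The following is a minimal sketch of such a guard for the `>=3.13,<3.14` range quoted above; `python_is_supported` is a hypothetical helper, not something nisaba-tools provides.

```python
import sys


def python_is_supported(version: tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True when (major, minor) satisfies >=3.13,<3.14."""
    return (3, 13) <= version < (3, 14)


if not python_is_supported():
    print("Warning: this interpreter is outside the >=3.13,<3.14 range")
```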

Install

uv python install 3.13
uv venv --python 3.13
uv sync --python 3.13

If you prefer not to activate the virtual environment, you can pin the version per command with uv run:

uv run --python 3.13 python -c "from nisaba_tools import visual_normalize; print(visual_normalize('क़', language='hi'))"

Development checks

uv run --python 3.13 --extra dev ruff check .
uv run --python 3.13 --extra dev ruff format --check .
uv run --python 3.13 --extra dev python -m pytest

FAR caching

Downloaded FAR assets are cached on disk by default and reused across processes. Every public transliterator/normalizer accepts:

disk_cache=True (default) to use the OS cache directory
disk_cache=<path> to use a specific persistent cache directory
disk_cache=False to use a per-process temporary cache directory

The default persistent cache location is:

macOS: ~/Library/Caches/nisaba-tools
Windows: %LOCALAPPDATA%\nisaba-tools
Linux/other Unix: $XDG_CACHE_HOME/nisaba-tools or ~/.cache/nisaba-tools
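
The three disk_cache modes and the per-OS defaults above could be resolved roughly as follows. This is a sketch under stated assumptions, not the package's actual code: `default_cache_dir` and `resolve_cache_dir` are hypothetical names, and the Windows fallback used when LOCALAPPDATA is unset is a guess.

```python
import os
import sys
import tempfile
from pathlib import Path
from typing import Mapping, Optional, Union


def default_cache_dir(platform: str, env: Mapping[str, str], home: Path) -> Path:
    """Return the documented default cache location for a platform."""
    if platform == "darwin":
        return home / "Library" / "Caches" / "nisaba-tools"
    if platform == "win32":
        # Assumption: fall back to ~/AppData/Local when LOCALAPPDATA is unset.
        local = env.get("LOCALAPPDATA")
        base = Path(local) if local else home / "AppData" / "Local"
        return base / "nisaba-tools"
    # Linux and other Unix: honor XDG_CACHE_HOME, else use ~/.cache.
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else home / ".cache"
    return base / "nisaba-tools"


def resolve_cache_dir(
    disk_cache: Union[bool, str, Path],
    platform: str = sys.platform,
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Path:
    """Map disk_cache=True / <path> / False to a concrete directory."""
    home = home if home is not None else Path.home()
    if disk_cache is True:
        return default_cache_dir(platform, env, home)
    if disk_cache is False:
        # Temporary directory for this process only.
        return Path(tempfile.mkdtemp(prefix="nisaba-tools-"))
    return Path(disk_cache)


print(resolve_cache_dir(True))
```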

Cache downloads are written to a unique temporary file in the chosen cache directory and then atomically moved into place, so concurrent processes can share the same persistent cache safely even if they occasionally duplicate a download.
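
The write-then-rename pattern described above can be sketched with the standard library alone. `atomic_write` is a hypothetical helper, not the package's implementation; it relies on `os.replace`, which swaps the file into place in one step when source and target are on the same filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(cache_dir: Path, name: str, data: bytes) -> Path:
    """Write data to cache_dir/name via a unique temp file plus rename."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    # The temp file lives in the same directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp_name, target)
    except BaseException:
        # Do not leave a partial temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = atomic_write(Path(tmp), "wellformed.far", b"placeholder bytes")
    print(path.name, path.read_bytes())
```

If two processes download the same asset at once, each writes its own temp file and the later rename simply overwrites the earlier one, so readers see a complete file in either case.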

Brahmic

These APIs use canonical language or script tags such as hi, ta, or und-Deva. Script guessing is best-effort: it can infer a supported script like DEVA or BENG, but it cannot distinguish Assamese from Bengali automatically, so pass language="as" or language="bn" when that matters.
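
To see why script guessing cannot separate Assamese from Bengali, consider a toy guesser based on Unicode block ranges. This is an illustration only: `guess_script` is a hypothetical function, it covers just a few scripts, and it is not how nisaba-tools detects scripts.

```python
from collections import Counter
from typing import Optional

# Unicode block ranges for a handful of Brahmic scripts.
_BLOCKS = {
    "Deva": (0x0900, 0x097F),
    "Beng": (0x0980, 0x09FF),
    "Taml": (0x0B80, 0x0BFF),
    "Telu": (0x0C00, 0x0C7F),
    "Mlym": (0x0D00, 0x0D7F),
}


def guess_script(text: str) -> Optional[str]:
    """Return the most frequent known script in text, or None."""
    counts: Counter = Counter()
    for ch in text:
        code_point = ord(ch)
        for script, (low, high) in _BLOCKS.items():
            if low <= code_point <= high:
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Bengali and Assamese text share one Unicode block, so both resolve to
# the same script and the language has to be supplied by the caller.
print(guess_script("বাংলা"))
print(guess_script("অসমীয়া"))
print(guess_script("हिन्दी"))
```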

Visual normalization

visual_normalize(...) is the explicit source-side normalization API. Upstream visual_norm includes NFC internally and then applies broader script-specific visual-normalization rewrites.
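
The NFC step alone can be observed with Python's standard `unicodedata` module. The sketch below shows only Unicode NFC, not the broader script-specific rewrites that visual_norm adds on top; it uses the two encodings of the Devanagari letter QA.

```python
import unicodedata

precomposed = "\u0958"     # DEVANAGARI LETTER QA as a single code point
sequence = "\u0915\u093C"  # DEVANAGARI LETTER KA followed by SIGN NUKTA

nfc_precomposed = unicodedata.normalize("NFC", precomposed)
nfc_sequence = unicodedata.normalize("NFC", sequence)

# U+0958 is a composition exclusion, so NFC yields KA + NUKTA for both
# spellings and the two inputs compare equal afterwards.
print([f"U+{ord(ch):04X}" for ch in nfc_precomposed])
print(nfc_precomposed == nfc_sequence)
```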

By default, the package prefers smaller standalone visual_norm.*.far assets instead of the larger combined visual_norm.far when Nisaba publishes them.

from nisaba_tools import visual_normalize

normalized = visual_normalize("क़", language="hi")

from nisaba_tools import VisualNormalizer

result = VisualNormalizer().normalize("क़", language="hi")

print(result.normalized_text)
print(result.resolved_language)

Well-formedness

is_wellformed(...) is currently Brahmic-only. It applies visual_norm automatically before checking well-formedness, and the default release surface is the combined byte-mode wellformed.far.

from nisaba_tools import is_wellformed

ok = is_wellformed("क़", language="hi")

from nisaba_tools import WellFormednessChecker

result = WellFormednessChecker().check("क़", language="hi")

print(result.is_wellformed)
print(result.resolved_language)

Reading normalization

reading_normalize(...) applies visual_norm before reading_norm by default, matching the intended pipeline even though the upstream reading_norm FARs do not currently compose that preprocessing step themselves.
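
The ordering described above, with an opt-out flag for the first stage, can be sketched as a two-stage pipeline. The stage functions here are placeholders (whitespace trimming and lowercasing) chosen only so the example runs without any FAR assets; `make_reading_pipeline` is a hypothetical name.

```python
from typing import Callable

Stage = Callable[[str], str]


def make_reading_pipeline(visual: Stage, reading: Stage) -> Callable[..., str]:
    """Compose visual -> reading, with the visual stage optional per call."""

    def normalize(text: str, apply_visual_norm: bool = True) -> str:
        if apply_visual_norm:
            text = visual(text)
        return reading(text)

    return normalize


# Placeholder stages standing in for the real FAR-backed rewrites.
pipeline = make_reading_pipeline(visual=str.strip, reading=str.lower)

print(repr(pipeline("  ABC  ")))
print(repr(pipeline("  ABC  ", apply_visual_norm=False)))
```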

By default, the package prefers smaller standalone reading_norm.*.far assets when Nisaba publishes them. For Brahmic, the current default releases cover Bengali script, Malayalam, Lepcha, and Hindi-in-Devanagari.

from nisaba_tools import reading_normalize

reading = reading_normalize("क़", language="hi")
raw_reading = reading_normalize("क़", language="hi", apply_visual_norm=False)

from nisaba_tools import ReadingNormalizer

normalizer = ReadingNormalizer()
reading = normalizer.normalize("क़", language="hi")
raw_reading = normalizer.normalize("क़", language="hi", apply_visual_norm=False)

print(reading.normalized_text)
print(raw_reading.normalized_text)

ISO transliteration

to_iso(...) and from_iso(...) are currently Brahmic-only. to_iso(...) applies visual_norm before FROM_* by default because upstream iso.far already includes NFC but not the broader script-specific visual_norm rewrites in the native-to-ISO direction. That means to_iso(...) output is still NFC-normalized even if you disable the explicit visual_norm prepass. from_iso(...) requires an explicit target language or script.

The default release surface is the combined byte-mode iso.far.

from nisaba_tools import from_iso, to_iso

iso_text = to_iso("क़", language="hi")
native_text = from_iso(iso_text, language="hi")
raw_iso = to_iso("क़", language="hi", apply_visual_norm=False)

from nisaba_tools import IsoTransliterator

transliterator = IsoTransliterator()
to_iso_result = transliterator.transliterate_to_iso("क़", language="hi")
from_iso_result = transliterator.transliterate_from_iso(
    to_iso_result.output_text or "",
    language="hi",
)

print(to_iso_result.output_text)
print(from_iso_result.output_text)

Brahmic script-to-script transliteration

brahmic_transliterate(...) is currently Brahmic-only. It composes to_iso(...) -> from_iso(...), so it also applies source-side visual_norm automatically.

If you enable apply_reading_norm=True, pass an explicit source_language= when language-specific source rules matter, such as Hindi reading_norm.

from nisaba_tools import brahmic_transliterate

telugu_text = brahmic_transliterate(
    "अन्त",
    source_language="hi",
    target_language="te",
    apply_reading_norm=True,
)

from nisaba_tools import BrahmicTransliterator

result = BrahmicTransliterator().transliterate(
    "अन्त",
    source_language="hi",
    target_language="te",
    apply_reading_norm=True,
)

print(result.output_text)
print(result.iso_text)

Fixed transliteration

fixed_transliterate(...) is currently Brahmic-only. It requires an explicit language or script because the Latin-script input does not identify the target Brahmic script. It also accepts a scheme= parameter. For Malayalam, it defaults to Mozhi.

The default release surface is the combined byte-mode fixed.far, and the current upstream fixed.far only contains MLYM, so default fixed-rule transliteration is currently Malayalam-only unless you pass a custom FAR.

from nisaba_tools import fixed_transliterate

fixed_text = fixed_transliterate("m", language="ml", scheme="Mozhi")

from nisaba_tools import FixedTransliterator

result = FixedTransliterator().transliterate("m", language="ml", scheme="Mozhi")

print(result.output_text)
print(result.scheme)

Natural romanization

Natural-translit romanization is a separate Brahmic romanization surface built on Nisaba's natural_translit romanization docs and grammars. Upstream published grammars are ISO-input grammars, but the convenience APIs in this package start from either native script or ISO:

natural_romanize(...) composes to_iso(...) -> natural_romanize_from_iso(...)
natural_romanize_from_iso(...) starts from Nisaba ISO text

The default release assets currently cover bn, gu, hi, kn, ml, mr, pa, ta, and te. Pass an explicit language code like hi, ml, or ta; a script-only tag like und-Deva is not enough to choose a language-specific romanization asset.

Available schemes:

nat = natural everyday romanization, the default
psac = Pan South Asian coarse-grained romanization
psaf = Pan South Asian fine-grained romanization

For example, Nisaba's docs use Hindi āṭīna to illustrate the difference: nat might look like ateen, psac like atin, and psaf like aatiin.

Nisaba ISO is a useful shared Brahmic transliteration layer, but it is not a language-agnostic promise that every downstream grammar will interpret a given ISO string the same way. The release assets are per-language byte-mode *_iso_nat.far, *_iso_psac.far, and *_iso_psaf.far files.
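
Given the file-name pattern and the language and scheme lists above, choosing an asset amounts to a small lookup. A minimal sketch, assuming the names follow `<lang>_iso_<scheme>.far` exactly; `natural_roman_asset` is a hypothetical helper, not a nisaba-tools function.

```python
SCHEMES = ("nat", "psac", "psaf")
LANGUAGES = ("bn", "gu", "hi", "kn", "ml", "mr", "pa", "ta", "te")


def natural_roman_asset(language: str, scheme: str = "nat") -> str:
    """Return the per-language FAR file name for a romanization scheme."""
    if language not in LANGUAGES:
        # Script-only tags such as "und-Deva" end up here too.
        raise ValueError(f"no natural romanization asset for {language!r}")
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme {scheme!r}; expected one of {SCHEMES}")
    return f"{language}_iso_{scheme}.far"


print(natural_roman_asset("hi"))
print(natural_roman_asset("ta", scheme="psaf"))
```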

from nisaba_tools import natural_romanize, natural_romanize_from_iso

nat_from_script = natural_romanize("अटीना", language="hi")
nat_from_iso = natural_romanize_from_iso("āṭīna", language="hi", scheme="psac")

from nisaba_tools import NaturalRomanTransliterator

transliterator = NaturalRomanTransliterator()
script_result = transliterator.transliterate("अटीना", language="hi")
iso_result = transliterator.transliterate_iso("āṭīna", language="hi", scheme="psac")

print(script_result.output_text)
print(iso_result.output_text)

IPA transcription

IPA transcription is a separate natural_translit phonological-transcription surface. Upstream published grammars are also ISO-input grammars, but the convenience APIs in this package start from either native script or ISO:

to_ipa(...) composes to_iso(...) -> to_ipa_from_iso(...)
to_ipa_from_iso(...) starts from Nisaba ISO text

The default IPA release assets cover bn, gu, hi, kn, ml, mr, pa, ta, and te. Pass an explicit language code like hi, ml, or ta; a script-only tag like und-Deva is not enough to choose a language-specific asset. The release assets are per-language byte-mode *_iso_ipa.far files with ISO_TO_IPA.

This is best thought of as Nisaba's transliteration-oriented phonological transcription layer, not a general high-coverage G2P system for every spelling or pronunciation edge case.

from nisaba_tools import to_ipa, to_ipa_from_iso

ipa_from_script = to_ipa("अटीना", language="hi")
ipa_from_iso = to_ipa_from_iso("āṭīna", language="hi")

from nisaba_tools import IpaTranscriber

transcriber = IpaTranscriber()
script_result = transcriber.transcribe("अटीना", language="hi")
iso_result = transcriber.transcribe_iso("āṭīna", language="hi")

print(script_result.output_text)
print(iso_result.output_text)

Natural deromanization

Natural-translit deromanization is the reverse surface published in upstream natural_translit deromanization. It starts from Latin-script input and currently has two published output targets:

natural_deromanize(...) for Latin text to native script
natural_deromanize_to_iso(...) for Latin text to Nisaba ISO

The default deromanization release assets only cover hi and ta. Pass an explicit language code like hi or ta; a script-only tag like und-Deva is not enough to choose a language-specific asset. The release assets are byte-mode hi_deva.far, hi_iso.far, ta_taml.far, and ta_iso.far.

Treat this as a plausible inference layer, not as a guaranteed inverse of natural_romanize(...) or to_iso(...). In particular, natural_deromanize_to_iso(...) produces an inferred ISO transliteration from Latin input, not a round-trip reconstruction of to_iso(...).

from nisaba_tools import natural_deromanize, natural_deromanize_to_iso

derom_script = natural_deromanize("namaste", language="hi")
derom_iso = natural_deromanize_to_iso("namaste", language="hi")

from nisaba_tools import NaturalDeromanizer

deromanizer = NaturalDeromanizer()
script_result = deromanizer.transliterate("namaste", language="hi")
iso_result = deromanizer.transliterate_to_iso("namaste", language="hi")

print(script_result.output_text)
print(iso_result.output_text)

Abjad

For Arabic-script normalization and reading normalization, pass an explicit language code such as ur, fa, ckb, or ar; script guessing cannot choose the right abjad rules.

Visual normalization

visual_normalize(...) is also the explicit normalization API for abjad input. Upstream visual_norm includes NFC internally there as well. By default, the package prefers smaller standalone visual_norm.Arab.*.far assets instead of the larger combined visual_norm.far when Nisaba publishes them.

from nisaba_tools import visual_normalize

urdu_visual = visual_normalize("ك", language="ur")

from nisaba_tools import VisualNormalizer

result = VisualNormalizer().normalize("ك", language="ur")

print(result.normalized_text)
print(result.resolved_language)

Reading normalization

reading_normalize(...) applies visual_norm before reading_norm by default for abjad input as well.

By default, the package prefers smaller standalone reading_norm.*.far assets when Nisaba publishes them. For abjad, the current default releases cover published Arabic-script language assets such as ur, fa, ckb, and ar.

from nisaba_tools import reading_normalize

urdu_reading = reading_normalize("ك", language="ur")

from nisaba_tools import ReadingNormalizer

result = ReadingNormalizer().normalize("ك", language="ur")

print(result.normalized_text)
print(result.resolved_language)

Reversible romanization

to_reversible_roman(...) and from_reversible_roman(...) are currently abjad/alphabet-only. The default release surface is the combined byte-mode reversible_roman.far with FROM_ARAB and TO_ARAB.

to_reversible_roman(...) can infer Arabic-script (Arab) input directly. Note that Arab is the script subtag, whereas ar is the language code for Arabic. from_reversible_roman(...) can default to und-Arab because the target script is always Arabic script.

from nisaba_tools import from_reversible_roman, to_reversible_roman

urdu_roman = to_reversible_roman("اردو، اردو!")
urdu_script = from_reversible_roman(urdu_roman)

from nisaba_tools import ReversibleRomanTransliterator

transliterator = ReversibleRomanTransliterator()
to_roman_result = transliterator.transliterate_to_roman("اردو، اردو!")
from_roman_result = transliterator.transliterate_from_roman(
    to_roman_result.output_text or ""
)

print(to_roman_result.output_text)
print(from_roman_result.output_text)

Shared helpers

English spellout

english_spellout(...) is a separate helper built from the combined byte-mode en_spellout.far. It spells out English or Latin letters as target-language letter names, which is useful for acronyms or initialisms rather than normal lexical transliteration.

The published English spellout grammar currently supports bn, gu, hi, kn, ml, mr, or, pa, sd, si, ta, te, and ur.

from nisaba_tools import english_spellout

english_letters = english_spellout("ATM", language="hi")

from nisaba_tools import EnglishSpelloutTransliterator

result = EnglishSpelloutTransliterator().transliterate("ATM", language="hi")

print(result.output_text)
print(result.resolved_language)

Support matrix

api_support() reports the languages covered by the package's default published FAR assets. Returned identifiers are canonical language or script tags such as hi, ur, or und-Deva; custom FARs can extend support beyond this default matrix.

from nisaba_tools import api_support

support = api_support()
to_ipa_support = support.support_for_api("to_ipa")

print(support.languages_for_api("to_ipa"))
print(support.languages_for_api("visual_normalize"))
print(support.apis_for_language("ta"))
print(support.apis_for_language("und-Deva"))
print(to_ipa_support.languages)
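
Conceptually, a support matrix is a mapping from API name to language tags that can be queried in both directions. The toy below uses only a subset of the coverage lists stated in this document and is not the package's data structure; `API_LANGUAGES` and both helper functions are hypothetical.

```python
# Subset of the coverage stated in this document, for illustration only.
API_LANGUAGES = {
    "to_ipa": ("bn", "gu", "hi", "kn", "ml", "mr", "pa", "ta", "te"),
    "natural_deromanize": ("hi", "ta"),
    "english_spellout": (
        "bn", "gu", "hi", "kn", "ml", "mr", "or",
        "pa", "sd", "si", "ta", "te", "ur",
    ),
}


def languages_for_api(api: str) -> tuple:
    """Languages covered by one API, or an empty tuple if unknown."""
    return API_LANGUAGES.get(api, ())


def apis_for_language(language: str) -> list:
    """APIs whose default assets cover the given language tag."""
    return sorted(api for api, langs in API_LANGUAGES.items() if language in langs)


print(languages_for_api("natural_deromanize"))
print(apis_for_language("ta"))
print(apis_for_language("ur"))
```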

Custom FAR reuse

The object APIs are also the easiest way to reuse explicit FAR paths across many calls.

from nisaba_tools import WellFormednessChecker

checker = WellFormednessChecker(
    visual_norm_far="/path/to/visual_norm.Beng.bn.far",
    wellformed_far="/path/to/wellformed.far",
)

result = checker.check("বাংলা", language="bn")

print(result.is_wellformed)
print(result.resolved_language)

Project details

Release history

0.1.2

Jun 2, 2026

This version

0.1.0

Jun 1, 2026
