WorldAlphabets

A tool to access alphabets of the world with Python and Node interfaces.

Usage

Python

Install the package:

pip install worldalphabets

To load the data in Python (omitting script uses the first script listed):

from worldalphabets import get_available_codes, get_scripts, load_alphabet

codes = get_available_codes()
print("Loaded", len(codes), "alphabets")

alphabet = load_alphabet("en")  # defaults to first script (Latn)
print("English uppercase:", alphabet.uppercase[:5])
print("English digits:", alphabet.digits)

scripts = get_scripts("mr")
print("Marathi scripts:", scripts)

alphabet_mr = load_alphabet("mr", script=scripts[0])
print("Marathi uppercase:", alphabet_mr.uppercase[:5])
print("Marathi frequency for 'a':", alphabet_mr.frequency["a"])

# Example with Arabic digits
alphabet_ar = load_alphabet("ar", "Arab")
print("Arabic digits:", alphabet_ar.digits)
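A frequency mapping like the one read above can be ranked to find the most common letters. The following is a minimal sketch, not part of the library: `top_letters` is a hypothetical helper, and `sample_frequency` is a hand-written stand-in for `alphabet.frequency` with made-up values, so the snippet runs without the package installed.

```python
from typing import Dict, List, Tuple


def top_letters(frequency: Dict[str, float], n: int = 3) -> List[Tuple[str, float]]:
    """Return the n most frequent letters, highest first (ties broken alphabetically)."""
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Stand-in for alphabet.frequency; these values are made up for illustration.
sample_frequency = {"a": 0.08, "b": 0.02, "e": 0.12, "t": 0.09}
print(top_letters(sample_frequency, 2))
```

With a real alphabet, the same helper could presumably be called as `top_letters(alphabet.frequency)`, assuming `frequency` behaves like a plain dictionary.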

Node.js

From npm

Install the package from npm:

npm install worldalphabets

Then, you can use the functions in your project:

const {
  getUppercase,
  getLowercase,
  getFrequency,
  getDigits,
  getAvailableCodes,
  getScripts,
} = require('worldalphabets');

async function main() {
  const codes = await getAvailableCodes();
  console.log('Available codes (first 5):', codes.slice(0, 5));

  const scriptsSr = await getScripts('sr');
  console.log('Serbian scripts:', scriptsSr);

  const uppercaseSr = await getUppercase('sr', scriptsSr[0]);
  console.log('Serbian uppercase:', uppercaseSr);

  const lowercaseFr = await getLowercase('fr');
  console.log('French lowercase:', lowercaseFr);

  const frequencyDe = await getFrequency('de');
  console.log('German frequency for "a":', frequencyDe['a']);

  const digitsAr = await getDigits('ar', 'Arab');
  console.log('Arabic digits:', digitsAr);
}

main();

TypeScript projects receive typings automatically via index.d.ts.

Local Usage

If you have cloned the repository, you can use the module directly:

const { getUppercase } = require('./index');

async function main() {
    const uppercaseSr = await getUppercase('sr', 'Latn');
    console.log('Serbian Latin uppercase:', uppercaseSr);
}

main();

Examples

The examples/ directory contains small scripts demonstrating the library:

  • examples/python/ holds Python snippets for printing alphabets, collecting stats, listing scripts, and more.
  • examples/node/ includes similar examples for Node.js.

Audio Samples

Audio recordings are stored under data/audio/ and named {langcode}_{engine}_{voiceid}.wav. Available voices are listed in data/audio/index.json.
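The naming scheme above can be split back into its parts. This is a rough sketch rather than library code: `AudioSample` and `parse_audio_name` are hypothetical names, the example file name is invented, and the parser assumes that only the voice id may contain underscores.

```python
from pathlib import Path
from typing import NamedTuple


class AudioSample(NamedTuple):
    langcode: str
    engine: str
    voiceid: str


def parse_audio_name(filename: str) -> AudioSample:
    """Split '{langcode}_{engine}_{voiceid}.wav' into its three parts."""
    path = Path(filename)
    if path.suffix.lower() != ".wav":
        raise ValueError(f"not a .wav file: {filename}")
    # Assumption: only the voice id may contain underscores, so split at most twice.
    parts = path.stem.split("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected audio file name: {filename}")
    return AudioSample(*parts)


# Hypothetical file name, used only to exercise the parser.
print(parse_audio_name("data/audio/en_exampleengine_voice_1.wav"))
```

If language codes or engine names can themselves contain underscores, data/audio/index.json is likely the more reliable source for this information.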

Web Interface

The Vue app under web/ compiles to a static site with npm run build. To work on the interface locally, install its dependencies and start the development server:

cd web
npm install
npm run dev

GitHub Pages publishes the contents of web/dist through a workflow that runs on every push to main.

Each language view is addressable at /<code>, allowing pages to be bookmarked directly.

Alphabet Index

This library also provides an index of all available alphabets with additional metadata.

Python

from worldalphabets import get_index_data, get_language, get_scripts

# Get the entire index
index = get_index_data()
print(f"Index contains {len(index)} languages.")

# Show available scripts for Serbian
scripts = get_scripts("sr")
print(f"Serbian scripts: {scripts}")

# Load Marathi in the Latin script
marathi_latn = get_language("mr", script="Latn")
print(f"Script: {marathi_latn['script']}")
print(f"First letters: {marathi_latn['alphabetical'][:5]}")

Node.js

const { getIndexData, getLanguage, getScripts } = require('worldalphabets');

async function main() {
  // Get the entire index
  const index = await getIndexData();
  console.log(`Index contains ${index.length} languages.`);

  // Show available scripts for Serbian
  const scripts = await getScripts('sr');
  console.log(`Serbian scripts: ${scripts}`);

  // Load Marathi in the Latin script
  const marathiLatn = await getLanguage('mr', 'Latn');
  console.log(`Script: ${marathiLatn.script}`);
  console.log(`First letters: ${marathiLatn.alphabetical.slice(0, 5)}`);
}

main();

Keyboard Layouts

Key entries expose pos (a KeyboardEvent.code when available) along with row, col, and size information.

Python

The script examples/python/keyboard_md_table.py demonstrates rendering a layout as a Markdown table. Copy the layout_to_markdown helper into your project and use it like this:

from keyboard_md_table import layout_to_markdown

print(layout_to_markdown("en-united-kingdom"))

Output:

` 1 2 3 4 5 6 7 8 9 0 - =
q w e r t y u i o p [ ]
a s d f g h j k l ; ' #
z x c v b n m , . /

Or, with the --offset flag:

` 1 2 3 4 5 6 7 8 9 0 - =
q w e r t y u i o p [ ]
a s d f g h j k l ; ' #
z x c v b n m , . /
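The row and col fields described at the start of this section are enough to lay keys out in rows like the output above. The sketch below assumes key entries are available as plain dictionaries; `keys_to_rows` is a hypothetical helper, and the `legend` field is invented for this example and is not confirmed to exist in the library's data.

```python
from collections import defaultdict
from typing import Dict, List


def keys_to_rows(keys: List[dict]) -> List[List[str]]:
    """Group key entries by row and order each row by column."""
    rows: Dict[int, List[dict]] = defaultdict(list)
    for key in keys:
        rows[key["row"]].append(key)
    return [
        [key["legend"] for key in sorted(rows[row], key=lambda k: k["col"])]
        for row in sorted(rows)
    ]


# Hand-written entries; "legend" is a hypothetical field added for this sketch.
sample_keys = [
    {"pos": "KeyW", "row": 1, "col": 1, "legend": "w"},
    {"pos": "Digit1", "row": 0, "col": 1, "legend": "1"},
    {"pos": "KeyQ", "row": 1, "col": 0, "legend": "q"},
    {"pos": "Backquote", "row": 0, "col": 0, "legend": "`"},
]
for row in keys_to_rows(sample_keys):
    print(" ".join(row))
```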

Node.js

const {
  getAvailableLayouts,
  loadKeyboard,
  getUnicode,
} = require('worldalphabets');

async function main() {
  const layouts = await getAvailableLayouts();
  console.log('Available layouts (first 5):', layouts.slice(0, 5));

  const kb = await loadKeyboard('en-us');
  // kb.keys[1] is the second entry in the keys array.
  console.log('Second key Unicode:', getUnicode(kb.keys[1], 'base'));
  console.log('Second key position:', kb.keys[1].pos, kb.keys[1].row, kb.keys[1].col);
}

main();

Supported Languages

For a detailed list of supported languages and their metadata, including available keyboard layouts, see the Alphabet Table.

Developer Guide

Older versions of this project relied on a Java repository and assorted helper scripts to scrape alphabets and estimate letter frequencies. Those utilities have been deprecated in favor of a cleaner pipeline based on Unicode CLDR and Wikidata. The remaining scripts focus on fetching language–script mappings and building alphabet JSON files directly from CLDR exemplar characters, enriching them with frequency counts from the Simia dataset or OpenSubtitles when available.

Each JSON file includes:

  • language – English language name
  • iso639_3 – ISO 639-3 code
  • iso639_1 – ISO 639-1 code when available
  • alphabetical – letters of the alphabet (uppercase when the script has case)
  • uppercase – uppercase letters
  • lowercase – lowercase letters
  • frequency – relative frequency of each lowercase letter (zero when no sample text is available)

Example JSON snippet:

{
  "language": "English",
  "iso639_3": "eng",
  "iso639_1": "en",
  "alphabetical": ["A", "B"],
  "uppercase": ["A", "B"],
  "lowercase": ["a", "b"],
  "frequency": {"a": 0.084, "b": 0.0208}
}
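To illustrate what the frequency field represents, the sketch below computes relative letter frequencies from a piece of sample text and falls back to zeros when no text is available. It is an assumption-laden illustration, not the project's pipeline (which draws on the Simia and OpenSubtitles data); `letter_frequencies` is a hypothetical helper and the rounding to four decimals is an arbitrary choice.

```python
from collections import Counter
from typing import Dict, List


def letter_frequencies(lowercase: List[str], sample_text: str) -> Dict[str, float]:
    """Relative frequency of each lowercase letter within sample_text."""
    letters = set(lowercase)
    counts = Counter(ch for ch in sample_text.lower() if ch in letters)
    total = sum(counts.values())
    if total == 0:
        # No usable sample text: every letter gets a frequency of zero.
        return {letter: 0.0 for letter in lowercase}
    return {letter: round(counts[letter] / total, 4) for letter in lowercase}


print(letter_frequencies(["a", "b"], "Abba cab"))
print(letter_frequencies(["a", "b"], ""))
```

Characters outside the alphabet (here the space and "c") are ignored, so the values for the listed letters sum to roughly one whenever any sample text is present.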

Setup

This project uses uv for dependency management. To set up the development environment:

# Install uv
pipx install uv

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -e '.[dev]'

Data Generation

Add ISO language codes

uv run scripts/add_iso_codes.py

Adds English language names and ISO 639 codes to each alphabet JSON.

Fetch language-script mappings

uv run scripts/fetch_language_scripts.py

Queries Wikidata for the scripts used by each language and writes data/language_scripts.json mapping ISO codes to ISO 15924 script codes.

Build alphabets from CLDR

Generate alphabet files from CLDR exemplar character data:

uv run scripts/build_alphabet_from_cldr.py <language> <script>

To build alphabets for every language-script pair in the mapping file:

uv run scripts/build_alphabet_from_cldr.py --manifest data/language_scripts.json

Each file is written to data/alphabets/<language>-<script>.json and combines CLDR exemplar characters with letter frequencies, preferring the Simia unigrams dataset when available and otherwise falling back to OpenSubtitles word frequencies. Locales missing from the CLDR dataset are skipped automatically.
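As a rough illustration of how the mapping file relates to the output paths, the sketch below walks a language-to-scripts mapping and prints the file each pair would produce. The JSON shape (language code mapped to a list of script codes) is an assumption about data/language_scripts.json, the entries are illustrative, and `iter_language_scripts` is a hypothetical helper rather than part of the build script.

```python
import json
from typing import Dict, Iterator, List, Tuple


def iter_language_scripts(mapping: Dict[str, List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (language, script) pairs in a stable, language-sorted order."""
    for language in sorted(mapping):
        for script in mapping[language]:
            yield language, script


# Assumed shape of data/language_scripts.json; the entries here are illustrative.
sample = json.loads('{"sr": ["Cyrl", "Latn"], "en": ["Latn"]}')
for language, script in iter_language_scripts(sample):
    print(f"data/alphabets/{language}-{script}.json")
```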

We verified the importer on English, Spanish, Russian, Arabic, Hindi, Kurdish (Latin and Arabic scripts), and Greek. The generated alphabets matched or improved on the existing data: Spanish gained accented vowels, and Arabic shed contextual forms. This CLDR-based pipeline is therefore the recommended way to refresh alphabet JSON files.

Generate translations

Populate a sample translation for each alphabet using Google Translate. The script iterates over every language and script combination, writing a hello_how_are_you field to data/alphabets/<code>-<script>.json.

GOOGLE_TRANS_KEY=<key> uv run scripts/generate_translations.py

To skip languages that already have translations:

GOOGLE_TRANS_KEY=<key> uv run scripts/generate_translations.py --skip-existing

Populate keyboard layouts

To refresh keyboard layout references after restructuring, run:

uv run src/scripts/populate_layouts.py

To skip languages that already have keyboard data:

uv run src/scripts/populate_layouts.py --skip-existing

Linting and type checking

ruff check .
mypy .

Sources

Licence Info

  • This project is licensed under the MIT License.
  • Data sourced from kalenchukov/Alphabet is licensed under the Apache 2.0 License.
  • Data sourced from Simia unigrams dataset (Data from Wiktionary) is licensed under the Creative Commons Attribution-ShareAlike License.
  • Data sourced from Wikipedia is licensed under the Creative Commons Attribution-ShareAlike License.
