WorldAlphabets
A tool to access alphabets of the world with Python and Node interfaces.
Usage
Python
Install the package:
```shell
pip install worldalphabets
```
To load the data in Python (omitting script uses the first script listed):
```python
from worldalphabets import get_available_codes, get_scripts, load_alphabet

codes = get_available_codes()
print("Loaded", len(codes), "alphabets")

alphabet = load_alphabet("en")  # defaults to first script (Latn)
print("English uppercase:", alphabet.uppercase[:5])
print("English digits:", alphabet.digits)

scripts = get_scripts("mr")
print("Marathi scripts:", scripts)
alphabet_mr = load_alphabet("mr", script=scripts[0])
print("Marathi uppercase:", alphabet_mr.uppercase[:5])
print("Marathi frequency for 'a':", alphabet_mr.frequency["a"])

# Example with Arabic digits
alphabet_ar = load_alphabet("ar", "Arab")
print("Arabic digits:", alphabet_ar.digits)
```
Node.js
From npm
Install the package from npm:
```shell
npm install worldalphabets
```
Then, you can use the functions in your project:
```javascript
const {
  getUppercase,
  getLowercase,
  getFrequency,
  getDigits,
  getAvailableCodes,
  getScripts,
} = require('worldalphabets');

async function main() {
  const codes = await getAvailableCodes();
  console.log('Available codes (first 5):', codes.slice(0, 5));

  const scriptsSr = await getScripts('sr');
  console.log('Serbian scripts:', scriptsSr);

  const uppercaseSr = await getUppercase('sr', scriptsSr[0]);
  console.log('Serbian uppercase:', uppercaseSr);

  const lowercaseFr = await getLowercase('fr');
  console.log('French lowercase:', lowercaseFr);

  const frequencyDe = await getFrequency('de');
  console.log('German frequency for "a":', frequencyDe['a']);

  const digitsAr = await getDigits('ar', 'Arab');
  console.log('Arabic digits:', digitsAr);
}

main();
```
TypeScript projects receive typings automatically via index.d.ts.
Local Usage
If you have cloned the repository, you can use the module directly:
```javascript
const { getUppercase } = require('./index');

async function main() {
  const uppercaseSr = await getUppercase('sr', 'Latn');
  console.log('Serbian Latin uppercase:', uppercaseSr);
}

main();
```
Diacritic Utilities
Both interfaces provide helpers to work with diacritic marks.
Python
```python
from worldalphabets import strip_diacritics, has_diacritics

strip_diacritics("café")  # "cafe"
has_diacritics("é")       # True
```
Node.js
```javascript
const { stripDiacritics, hasDiacritics } = require('worldalphabets');

stripDiacritics('café');  // 'cafe'
hasDiacritics('é');       // true
```
Use characters_with_diacritics/charactersWithDiacritics to extract letters
with diacritic marks from a list.
Use get_diacritic_variants/getDiacriticVariants to list base letters and
their diacritic forms for a given language.
```python
from worldalphabets import get_diacritic_variants

get_diacritic_variants("pl", "Latn")["L"]  # ["L", "Ł"]
```
```javascript
const { getDiacriticVariants } = require('worldalphabets');

getDiacriticVariants('pl').then((v) => v.L);  // ['L', 'Ł']
```
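For reference, diacritic stripping of this kind can be implemented with a Unicode NFD decomposition that drops combining marks. This is a minimal standalone sketch, not the library's actual implementation:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose (NFD) so diacritics become separate combining marks, then drop them
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def has_diacritics(text: str) -> bool:
    return strip_diacritics(text) != text

print(strip_diacritics("café"))  # cafe
print(has_diacritics("é"))       # True
```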
Language Detection
To guess possible languages for a string, use
detect_languages/detectLanguages. The detection system uses a hybrid approach:
- Word-based detection (primary): Uses Top-200 frequency lists for languages with available word frequency data
- Character-based fallback: For languages without frequency data, analyzes character sets and character frequencies from alphabet data
```python
from worldalphabets import detect_languages

# Word-based detection (languages with frequency data)
detect_languages("Hello world", candidate_langs=['en', 'de', 'fr'])
# [('en', 0.158), ('de', 0.142), ('fr', 0.139)]

# Character-based fallback (languages without frequency data)
detect_languages("Аҧсуа бызшәа", candidate_langs=['ab', 'ru', 'bg'])
# [('ab', 0.146), ('ru', 0.136), ('bg', 0.125)]  # Abkhazian detected via character analysis
```
```javascript
const { detectLanguages } = require('worldalphabets');

// Word-based detection
detectLanguages('Hello world', ['en', 'de', 'fr']).then(console.log);
// [['en', 0.158], ['de', 0.142], ['fr', 0.139]]

// Character-based fallback
detectLanguages('ⲧⲙⲛⲧⲣⲙⲛⲕⲏⲙⲉ', ['cop', 'el', 'ar']).then(console.log);
// [['cop', 0.077], ['el', 0.032], ['ar', 0.021]]  // Coptic detected via character analysis
```
The detection system automatically falls back to character-based analysis when word frequency data is unavailable, enabling detection of 331 languages instead of just the 86 with frequency data.
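The character-based fallback can be pictured as scoring each candidate language by how much of the input falls inside its alphabet. Below is a simplified standalone sketch with hand-typed alphabets; the real detector uses the package's alphabet and character-frequency data:

```python
# Hypothetical, hand-typed alphabets; the real detector loads them from data files
ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

def char_score(text: str, alphabet: set) -> float:
    # Fraction of the input's letters that belong to this alphabet
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in alphabet for ch in letters) / len(letters)

scores = {lang: char_score("привет мир", a) for lang, a in ALPHABETS.items()}
print(max(scores, key=scores.get))  # ru
```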
Examples
The examples/ directory contains small scripts demonstrating the library:
examples/python/ holds Python snippets for printing alphabets, collecting stats, listing scripts, and more. examples/node/ includes similar examples for Node.js.
Audio Samples
Audio recordings are stored under data/audio/ and named
{langcode}_{engine}_{voiceid}.wav. Available voices are listed in
data/audio/index.json.
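The naming convention can be split back into its parts with a small helper; the concrete filename below is a hypothetical example:

```python
def parse_audio_name(filename: str) -> dict:
    # "{langcode}_{engine}_{voiceid}.wav"; maxsplit=2 keeps any underscores
    # inside the voice id intact
    lang, engine, voice = filename.removesuffix(".wav").split("_", 2)
    return {"lang": lang, "engine": engine, "voice": voice}

print(parse_audio_name("en_espeak_en-gb.wav"))
```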
Web Interface
The Vue app under web/ compiles to a static site with npm run build.
To work on the interface locally, install its dependencies and start the
development server:
```shell
cd web
npm install
npm run dev
```
GitHub Pages publishes the contents of web/dist through a workflow that
runs on every push to main.
Each language view is addressable at /<code>, allowing pages to be
bookmarked directly.
Alphabet Index
This library also provides an index of all available alphabets with additional metadata.
Python
```python
from worldalphabets import get_index_data, get_language, get_scripts

# Get the entire index
index = get_index_data()
print(f"Index contains {len(index)} languages.")

# Show available scripts for Serbian
scripts = get_scripts("sr")
print(f"Serbian scripts: {scripts}")

# Load Marathi in the Latin script
marathi_latn = get_language("mr", script="Latn")
print(f"Script: {marathi_latn['script']}")
print(f"First letters: {marathi_latn['alphabetical'][:5]}")
```
Node.js
```javascript
const { getIndexData, getLanguage, getScripts } = require('worldalphabets');

async function main() {
  // Get the entire index
  const index = await getIndexData();
  console.log(`Index contains ${index.length} languages.`);

  // Show available scripts for Serbian
  const scripts = await getScripts('sr');
  console.log(`Serbian scripts: ${scripts}`);

  // Load Marathi in the Latin script
  const marathiLatn = await getLanguage('mr', 'Latn');
  console.log(`Script: ${marathiLatn.script}`);
  console.log(`First letters: ${marathiLatn.alphabetical.slice(0, 5)}`);
}

main();
```
Keyboard Layouts
Key entries expose pos (a KeyboardEvent.code when available) along with row, col, and size information.
Python
The script examples/python/keyboard_md_table.py demonstrates rendering a
layout as a Markdown table. Copy the layout_to_markdown helper into your
project and use it like this:
```python
from keyboard_md_table import layout_to_markdown

print(layout_to_markdown("en-united-kingdom"))
```
Output:

```
| ` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | - | = |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| q | w | e | r | t | y | u | i | o | p | [ | ] |   |
| a | s | d | f | g | h | j | k | l | ; | ' | # |   |
| z | x | c | v | b | n | m | , | . | / |   |   |   |
| ␠ |
```

or, with the --offset flag:

```
| ` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 | - | = |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| q | w | e | r | t | y | u | i | o | p | [ | ] |   |
| a | s | d | f | g | h | j | k | l | ; | ' | # |   |
| z | x | c | v | b | n | m | , | . | / |   |   |   |
| ␠ |
```
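The helper essentially groups keys by row and column and joins each row into a Markdown table line. A minimal sketch under the assumption that each key dict carries row, col, and a base character (the real layout data has more fields):

```python
from collections import defaultdict

def layout_to_markdown(keys):
    # keys: list of dicts with "row", "col" and a "base" character (assumed shape)
    rows = defaultdict(dict)
    for key in keys:
        rows[key["row"]][key["col"]] = key["base"]
    lines = []
    for r in sorted(rows):
        cells = [rows[r][c] for c in sorted(rows[r])]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)

demo = [
    {"row": 0, "col": 0, "base": "q"},
    {"row": 0, "col": 1, "base": "w"},
    {"row": 1, "col": 0, "base": "a"},
]
print(layout_to_markdown(demo))
```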
Node.js
```javascript
const {
  getAvailableLayouts,
  loadKeyboard,
  getUnicode,
} = require('worldalphabets');

async function main() {
  const layouts = await getAvailableLayouts();
  console.log('Available layouts (first 5):', layouts.slice(0, 5));

  const kb = await loadKeyboard('en-us');
  console.log('First key Unicode:', getUnicode(kb.keys[1], 'base'));
  console.log('First key position:', kb.keys[1].pos, kb.keys[1].row, kb.keys[1].col);
}

main();
```
Supported Languages
For a detailed list of supported languages and their metadata, including available keyboard layouts, see the Alphabet Table.
Developer Guide
Older versions of this project relied on a Java repository and assorted helper scripts to scrape alphabets and estimate letter frequencies. Those utilities have been deprecated in favor of a cleaner pipeline based on Unicode CLDR and Wikidata. The remaining scripts focus on fetching language–script mappings and building alphabet JSON files directly from CLDR exemplar characters, enriching them with frequency counts from the Simia dataset or OpenSubtitles when available.
The alphabet builder preserves the ordering from CLDR exemplar lists and places diacritic forms immediately after their base letters when the CLDR index omits them. For languages with tonal variants such as Vietnamese, common tone marks are stripped before deduplication to avoid generating separate entries for every tone combination.
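The base-letter insertion described above can be sketched with a Unicode decomposition: strip a letter to its base form, then splice it in right after that base in the index order. This is an illustrative simplification, not the builder's actual code:

```python
import unicodedata

def base_letter(ch: str) -> str:
    # First code point of the NFD decomposition is the base letter
    return unicodedata.normalize("NFD", ch)[0]

def order_with_diacritics(index_letters, extras):
    # Splice each extra letter in immediately after its base letter;
    # letters with no base in the index go to the end
    result = list(index_letters)
    for extra in extras:
        base = base_letter(extra)
        if base in result:
            result.insert(result.index(base) + 1, extra)
        else:
            result.append(extra)
    return result

print(order_with_diacritics(["a", "b", "c"], ["á", "č"]))  # ['a', 'á', 'b', 'c', 'č']
```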
Each JSON file includes:
- language – English language name
- iso639_3 – ISO 639-3 code
- iso639_1 – ISO 639-1 code when available
- alphabetical – letters of the alphabet (uppercase when the script has case)
- uppercase – uppercase letters
- lowercase – lowercase letters
- frequency – relative frequency of each lowercase letter (zero when no sample text is available)
Example JSON snippet:
```json
{
  "language": "English",
  "iso639_3": "eng",
  "iso639_1": "en",
  "alphabetical": ["A", "B"],
  "uppercase": ["A", "B"],
  "lowercase": ["a", "b"],
  "frequency": {"a": 0.084, "b": 0.0208}
}
```
Setup
This project uses uv for dependency management. To set up the development
environment:
```shell
# Install uv
pipx install uv

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -e '.[dev]'
```
Data Generation
Consolidated Pipeline (Recommended)
The WorldAlphabets project uses a unified Python-based data collection pipeline:
```shell
# Build complete dataset with all stages
uv run scripts/build_data_pipeline.py

# Build with verbose output
uv run scripts/build_data_pipeline.py --verbose

# Run specific pipeline stage
uv run scripts/build_data_pipeline.py --stage build_alphabets

# Build single language
uv run scripts/build_data_pipeline.py --language mi --script Latn
```
Pipeline Stages:
- collect_sources - Download CLDR, ISO 639-3, frequency data
- build_language_registry - Create comprehensive language database
- build_alphabets - Generate alphabet files from CLDR + fallbacks
- build_translations - Add "Hello, how are you?" translations
- build_keyboards - Generate keyboard layout files
- build_top200 - Generate Top-200 token lists for detection
- build_tts_index - Index available TTS voices
- build_audio - Generate audio files using TTS
- build_index - Create searchable indexes and metadata
- validate_data - Comprehensive data validation
For detailed pipeline documentation, see docs/DATA_PIPELINE.md.
Legacy Individual Scripts (Deprecated)
The following individual scripts are deprecated in favor of the consolidated pipeline:
Add ISO language codes:

```shell
uv run scripts/add_iso_codes.py  # Use: --stage build_language_registry
```

Fetch language-script mappings:

```shell
uv run scripts/fetch_language_scripts.py  # Use: --stage collect_sources
```

Build alphabets from CLDR:

```shell
uv run scripts/build_alphabet_from_cldr.py  # Use: --stage build_alphabets
```
Generate translations
Populate a sample translation for each alphabet using Google Translate. The
script iterates over every language and script combination, writing a
hello_how_are_you field to data/alphabets/<code>-<script>.json.
```shell
GOOGLE_TRANS_KEY=<key> uv run scripts/generate_translations.py
```

To skip languages that already have translations:

```shell
GOOGLE_TRANS_KEY=<key> uv run scripts/generate_translations.py --skip-existing
```
Populate keyboard layouts
To refresh keyboard layout references after restructuring, run:
```shell
uv run scripts/populate_layouts.py
```

To skip languages that already have keyboard data:

```shell
uv run scripts/populate_layouts.py --skip-existing
```
Linting and type checking
```shell
ruff check .
mypy .
```
Top-200 token lists
The language detection helpers rely on compact frequency lists for each language. These lists are generated with a unified 5-priority pipeline that maximizes coverage across as many languages as possible:
```shell
# Generate for all languages
uv run python scripts/build_top200_unified.py --all

# Generate for specific languages
uv run python scripts/build_top200_unified.py --langs en,ja,cy

# Generate only for missing languages
uv run python scripts/build_top200_unified.py --missing-only
```
Priority Sources (in order):
- Leipzig Corpora Collection - High-quality news/web corpora (CC-BY)
- HermitDave FrequencyWords - OpenSubtitles/Wikipedia sources (CC-BY)
- Tatoeba sentences - Sentence-based extraction (CC-BY 2.0 FR)
- Existing alphabet frequency data - Character-level fallback
- Simia unigrams - CJK character data
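At its core, each of these sources feeds a token-count step like the following simplified sketch; real sources need per-language tokenisation and cleaning:

```python
import re
from collections import Counter

def build_top200(corpus: str, n: int = 200) -> list:
    # Lowercase, pull out letter-only tokens, keep the n most frequent
    tokens = re.findall(r"[^\W\d_]+", corpus.lower())
    return [tok for tok, _ in Counter(tokens).most_common(n)]

print(build_top200("the cat sat on the mat the end", 3))
```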
The script writes results to data/freq/top200 with build reports in
BUILD_REPORT_UNIFIED.json. The unified pipeline also runs within the
consolidated data pipeline as the build_top200 stage.
Sources
Licence Info
- This project is licensed under the MIT License.
- Data sourced from kalenchukov/Alphabet is licensed under the Apache 2.0 License.
- Data sourced from Simia unigrams dataset (Data from Wiktionary) is licensed under the Creative Commons Attribution-ShareAlike License.
- Data sourced from Wikipedia is licensed under the Creative Commons Attribution-ShareAlike License.