WorldAlphabets

A tool to access alphabets of the world with Python and Node interfaces.

Usage

Python

To load the data in Python:

from worldalphabets import get_available_codes, load_alphabet

codes = get_available_codes()
print("Loaded", len(codes), "alphabets")

alphabet = load_alphabet("en")
print(alphabet.uppercase[:5])  # ['A', 'B', 'C', 'D', 'E']
print(alphabet.frequency['e'])
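
As a small illustration of working with the frequency mapping, the sketch below ranks letters by relative frequency. It assumes only that the mapping goes from lowercase letter to a number, as the example above suggests. The helper name `top_letters` and the sample values are hypothetical and are not part of the package.

```python
from typing import Mapping


def top_letters(frequency: Mapping[str, float], n: int = 3) -> list[str]:
    """Return the n most frequent letters, highest first (ties broken alphabetically)."""
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return [letter for letter, _ in ranked[:n]]


# Hypothetical sample values; with the package installed you could pass
# load_alphabet("en").frequency instead.
sample = {"a": 0.084, "b": 0.0208, "e": 0.12, "t": 0.09}
print(top_letters(sample))  # ['e', 't', 'a']
```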

Node.js

From npm

Install the package from npm:

npm install worldalphabets

Then, you can use the functions in your project:

const {
  getUppercase,
  getLowercase,
  getFrequency,
  getAvailableCodes,
} = require('worldalphabets');

async function main() {
  const codes = await getAvailableCodes();
  console.log('Available codes (first 5):', codes.slice(0, 5));

  const uppercaseEn = await getUppercase('en');
  console.log('English uppercase:', uppercaseEn);

  const lowercaseFr = await getLowercase('fr');
  console.log('French lowercase:', lowercaseFr);

  const frequencyDe = await getFrequency('de');
  console.log('German frequency for "a":', frequencyDe['a']);
}

main().catch(console.error);

TypeScript projects receive typings automatically via index.d.ts.

Local Usage

If you have cloned the repository, you can use the module directly:

const { getUppercase } = require('./index');

async function main() {
  const uppercaseEn = await getUppercase('en');
  console.log('English uppercase:', uppercaseEn);
}

main().catch(console.error);

Supported Languages

Alphabet JSON files are available for these ISO language codes (language names from langcodes):

Code	Language
af	Afrikaans
ak	Akan
am	Amharic
ar	Arabic
ast	Asturian
az	Azerbaijani
ba	Bashkir
ban	Balinese
bax	Bamun
be	Belarusian
bg	Bulgarian
bku	Buhid
bm	Bambara
bn	Bangla
bo	Tibetan
bug	Buginese
bya	Batak
ca	Catalan
ceb	Cebuano
chr	Cherokee
ckb	Central Kurdish
cop	Coptic
cs	Czech
cv	Chuvash
da	Danish
de	German
dz	Dzongkha
el	Greek
en	English
eo	Esperanto
es	Spanish
et	Estonian
eu	Basque
fa	Persian
fi	Finnish
fo	Faroese
fr	French
fur	Friulian
ga	Irish
gd	Scottish Gaelic
gez	Geez
gl	Galician
gu	Gujarati
gv	Manx
haw	Hawaiian
he	Hebrew
hi	Hindi
hnn	Hanunoo
ht	Haitian Creole
hu	Hungarian
hy	Armenian
ie	Interlingue
is	Icelandic
it	Italian
ja	Japanese
jv	Javanese
ka	Georgian
kab	Kabyle
kk	Kazakh
kl	Kalaallisut
km	Khmer
kn	Kannada
ko	Korean
ks	Kashmiri
ksh	Colognian
ku	Kurdish
ky	Kyrgyz
la	Latin
lb	Luxembourgish
lep	Lepcha
lif	Limbu
lij	Ligurian
lis	Lisu
lo	Lao
lt	Lithuanian
lv	Latvian
mg	Malagasy
mid	Mandaic
mk	Macedonian
ml	Malayalam
mn	Mongolian
mo	Romanian
my	Burmese
mzn	Mazanderani
nds	Low German
ne	Nepali
nn	Norwegian Nynorsk
no	Norwegian
nqo	N’Ko
nso	Northern Sotho
oc	Occitan
or	Odia
pl	Polish
ps	Pashto
pt	Portuguese
rej	Rejang
rm	Romansh
ro	Romanian
ru	Russian
sa	Sanskrit
sam	Samaritan Aramaic
saz	Saurashtra
sc	Sardinian
se	Northern Sami
sg	Sango
si	Sinhala
sl	Slovenian
sn	Shona
so	Somali
sr	Serbian
su	Sundanese
sv	Swedish
syr	Syriac
szl	Silesian
ta	Tamil
tbw	Tagbanwa
te	Telugu
tg	Tajik
th	Thai
ti	Tigrinya
tk	Turkmen
tl	Filipino
tn	Tswana
tr	Turkish
tt	Tatar
uk	Ukrainian
ur	Urdu
vai	Vai
vec	Venetian
wo	Wolof
zh	Chinese
zh-classical	Classical Chinese
zh-min-nan	Min Nan Chinese
zh-yue	Cantonese
zra	Kara (Korea)

Developer Guide

This project uses the kalenchukov/Alphabet Java repository as the source for alphabet data. A helper script clones the repository, scans all *Alphabet.java files, downloads a sample Wikipedia article for supported languages, and writes JSON files containing the alphabet and estimated letter frequencies. A second utility can replace those estimates with corpus frequencies from the Simia unigrams dataset.

Each JSON file includes:

alphabetical – letters of the alphabet (uppercase when the script has case)
uppercase – uppercase letters
lowercase – lowercase letters
frequency – relative frequency of each lowercase letter (zero when no sample text is available)

Example JSON snippet:

{
  "alphabetical": ["A", "B", ...],
  "uppercase": ["A", "B", ...],
  "lowercase": ["a", "b", ...],
  "frequency": {"a": 0.084, "b": 0.0208, ...}
}
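
A quick sanity check on a record of this shape might look like the sketch below. The four key names come from the list above. The specific checks (frequency keys must appear in `lowercase`, values must lie between 0 and 1) are assumptions made for illustration, and `check_alphabet_record` is a hypothetical helper, not part of the project.

```python
REQUIRED_KEYS = ("alphabetical", "uppercase", "lowercase", "frequency")


def check_alphabet_record(record: dict) -> list[str]:
    """Return a list of problems found in one alphabet record (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    lowercase = set(record["lowercase"])
    for letter, value in record["frequency"].items():
        # Assumed rule: every frequency key is one of the lowercase letters.
        if letter not in lowercase:
            problems.append(f"frequency key not in lowercase: {letter!r}")
        # Assumed rule: relative frequencies lie in the range [0, 1].
        if not 0.0 <= value <= 1.0:
            problems.append(f"frequency out of range for {letter!r}: {value}")
    return problems


record = {
    "alphabetical": ["A", "B"],
    "uppercase": ["A", "B"],
    "lowercase": ["a", "b"],
    "frequency": {"a": 0.084, "b": 0.0208},
}
print(check_alphabet_record(record))  # []
```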

Setup

This project uses uv for dependency management. To set up the development environment:

# Install uv
pipx install uv

# Create and activate a virtual environment
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -e '.[dev]'

Data Generation

Extract alphabets

uv run scripts/extract_alphabets.py

The script clones the Java project and stores JSON files for every available alphabet under data/alphabets/, named by ISO language code. If no sample text is available, frequency values default to zero and the language is recorded in data/todo_languages.csv for follow-up.
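
The frequency estimate described above can be pictured roughly as follows. This is a minimal sketch and not the script's actual code: `estimate_frequencies` is a hypothetical name, and the case-insensitive counting is an assumption. It does mirror the stated behaviour that values fall back to zero when no usable sample text exists.

```python
from collections import Counter


def estimate_frequencies(lowercase: list[str], sample_text: str) -> dict[str, float]:
    """Relative frequency of each lowercase letter in sample_text.

    Letters are counted case-insensitively via str.lower(); characters outside
    the alphabet are ignored. With no matching letters, every value is 0.0.
    """
    letters = set(lowercase)
    counts = Counter(ch for ch in sample_text.lower() if ch in letters)
    total = sum(counts.values())
    if total == 0:
        return {letter: 0.0 for letter in lowercase}
    return {letter: counts[letter] / total for letter in lowercase}


freqs = estimate_frequencies(["a", "b", "c"], "Abba cab")
print(freqs)
```

In this toy input, "a" and "b" each occur three times and "c" once, so the values are 3/7, 3/7 and 1/7.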

Update letter frequencies

uv run scripts/update_frequencies.py

This script downloads the unigrams.zip archive and rewrites each alphabet's frequency mapping using the published counts.
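
One way to turn published counts into a frequency mapping is sketched below. It assumes the dataset has already been parsed into `(token, count)` pairs; the real file layout inside unigrams.zip is not described here, so that shape, like the name `frequencies_from_counts`, is hypothetical.

```python
from collections import Counter
from typing import Iterable


def frequencies_from_counts(
    lowercase: list[str], unigrams: Iterable[tuple[str, int]]
) -> dict[str, float]:
    """Weight each letter by the count of every token it appears in."""
    letters = set(lowercase)
    totals = Counter()
    for token, count in unigrams:
        for ch in token.lower():
            if ch in letters:
                totals[ch] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No usable counts: keep the all-zero convention.
        return {letter: 0.0 for letter in lowercase}
    return {letter: totals[letter] / grand_total for letter in lowercase}


# Hypothetical counts: "ab" seen 3 times, "a" seen once.
weighted = frequencies_from_counts(["a", "b"], [("ab", 3), ("a", 1)])
print(weighted)
```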

Generate alphabets from locale data

Derive an alphabet from an ICU locale's exemplar character set:

uv run scripts/generate_alphabet_from_locale.py <code> --locale <locale>

The script writes data/alphabets/<code>.json, using the locale's standard exemplar set for the base letters and populating frequency values from the Simia unigrams dataset when available. Locales without exemplar data are skipped.

Generate alphabets from unigrams

For languages present in the Simia dataset but missing here:

uv run scripts/generate_alphabet_from_unigrams.py <code> --locale <locale> \
  --block <Unicode block>

The script writes data/alphabets/<code>.json. To list missing codes:

uv run scripts/missing_unigram_languages.py
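
Conceptually, finding missing codes is a set difference between the codes in the dataset and the JSON files already on disk. The sketch below shows that idea under stated assumptions: `missing_codes` is a hypothetical helper, the dataset codes are passed in as a plain list, and a temporary directory stands in for data/alphabets/.

```python
import tempfile
from pathlib import Path


def missing_codes(dataset_codes: list[str], alphabets_dir: Path) -> list[str]:
    """Codes in dataset_codes that have no <code>.json file in alphabets_dir."""
    existing = {path.stem for path in alphabets_dir.glob("*.json")}
    return sorted(set(dataset_codes) - existing)


# Demo against a throwaway directory containing only en.json.
with tempfile.TemporaryDirectory() as tmp:
    alphabets_dir = Path(tmp)
    (alphabets_dir / "en.json").write_text("{}", encoding="utf-8")
    result = missing_codes(["en", "fr", "de"], alphabets_dir)

print(result)  # ['de', 'fr']
```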

Generate missing alphabets

Create alphabet files for every language in the Simia unigrams dataset that does not yet have one:

uv run scripts/generate_missing_alphabets.py --limit 10

Omit --limit to process all missing languages. Each file is written under data/alphabets/ and combines ICU exemplar characters with Simia frequencies.

Linting and type checking

ruff check .
mypy .

Future work

Add sample text or unigram support for more languages.

Project details

Release history

Version	Released
0.1.0	Aug 22, 2025
0.0.33	Dec 5, 2025
0.0.32	Dec 5, 2025
0.0.31	Nov 18, 2025
0.0.30	Nov 17, 2025
0.0.29	Nov 17, 2025
0.0.28	Nov 11, 2025
0.0.27	Nov 11, 2025
0.0.26	Nov 11, 2025
0.0.25	Nov 11, 2025
0.0.24	Nov 5, 2025
0.0.23	Sep 17, 2025
0.0.22	Sep 17, 2025
0.0.19	Sep 16, 2025
0.0.18	Sep 15, 2025
0.0.16	Sep 3, 2025
0.0.15	Sep 3, 2025
0.0.14	Sep 3, 2025
0.0.13	Aug 27, 2025
0.0.12	Aug 27, 2025
0.0.11	Aug 22, 2025
0.0.7	Aug 21, 2025 (this version)
0.0.6	Aug 21, 2025

Download files

Source Distribution

worldalphabets-0.0.7.tar.gz (227.8 kB)

Uploaded Aug 21, 2025 Source

Built Distribution

worldalphabets-0.0.7-py3-none-any.whl (257.6 kB)

Uploaded Aug 21, 2025 Python 3

File details

Details for the file worldalphabets-0.0.7.tar.gz.

File metadata

Download URL: worldalphabets-0.0.7.tar.gz
Upload date: Aug 21, 2025
Size: 227.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for worldalphabets-0.0.7.tar.gz
Algorithm	Hash digest
SHA256	`5b013b9aeea7c3c33b3747b2f4664c6a95a30e7cc91190e2ffe467446b361304`
MD5	`340a87a434e0902ef9db33c2b8967971`
BLAKE2b-256	`17b9b8731e7ddb63737ff97756c1a6e68351b1b2f3ebb32cd37cf6ed0b2f437f`

Provenance

The following attestation bundles were made for worldalphabets-0.0.7.tar.gz:

Publisher: publish.yml on willwade/WorldAlphabets

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: worldalphabets-0.0.7.tar.gz
- Subject digest: 5b013b9aeea7c3c33b3747b2f4664c6a95a30e7cc91190e2ffe467446b361304
- Sigstore transparency entry: 419186317
- Sigstore integration time: Aug 21, 2025
Source repository:
- Permalink: willwade/WorldAlphabets@10d462297ebb0e9c83bd820757dee047553bb19e
- Branch / Tag: refs/tags/v0.0.7
- Owner: https://github.com/willwade
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@10d462297ebb0e9c83bd820757dee047553bb19e
- Trigger Event: release

File details

Details for the file worldalphabets-0.0.7-py3-none-any.whl.

File metadata

Download URL: worldalphabets-0.0.7-py3-none-any.whl
Upload date: Aug 21, 2025
Size: 257.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for worldalphabets-0.0.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3c5fad73390a2996a3417450c734ca0202fe8a30518447f627b7a531697cfd1d`
MD5	`b6323a3ba0f6142130aba38ccbb413b6`
BLAKE2b-256	`325adb0433c421faa9b61347d61146fe7025db99eb60be00c03e8f9ab86f5c7e`

Provenance

The following attestation bundles were made for worldalphabets-0.0.7-py3-none-any.whl:

Publisher: publish.yml on willwade/WorldAlphabets

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: worldalphabets-0.0.7-py3-none-any.whl
- Subject digest: 3c5fad73390a2996a3417450c734ca0202fe8a30518447f627b7a531697cfd1d
- Sigstore transparency entry: 419186356
- Sigstore integration time: Aug 21, 2025
Source repository:
- Permalink: willwade/WorldAlphabets@10d462297ebb0e9c83bd820757dee047553bb19e
- Branch / Tag: refs/tags/v0.0.7
- Owner: https://github.com/willwade
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@10d462297ebb0e9c83bd820757dee047553bb19e
- Trigger Event: release
