Correct Unicode text handling for every script
UniWorld
UniWorld is an open-source library, a set of language bindings, and developer tools that implement the Unicode standard's core text algorithms -- all from a single, conformance-tested Rust core. It ships as a library (Rust, Python, JavaScript/WASM, C, Go), a VS Code extension, and a PowerShell module.
uniworld.world -- Full documentation, install guides, and the complete UniWorld ecosystem.
The problem UniWorld solves
Unicode text handling is one of the most pervasive unsolved problems in everyday software. It affects everyone:
If you work in English or other Latin-script languages, you've seen emoji split apart by your cursor, combining accents orphaned by backspace, and pasted text that looks identical but doesn't match because of invisible normalization differences. Your terminal miscounts column widths when it encounters fullwidth characters. Your truncation logic cuts strings in the middle of grapheme clusters. These are Unicode problems, and they happen constantly in English-language workflows.
If you work with Arabic, Hebrew, or any right-to-left script, correct bidirectional layout is essential and routinely broken. Numbers embedded in RTL paragraphs reorder incorrectly. Cursor movement goes the wrong direction. Mixed-direction text renders as gibberish.
If you work with Thai, Lao, Khmer, or Myanmar, your text has no spaces between words. Line breaking requires dictionary-based segmentation that most tools simply don't have. Text wraps mid-word or not at all.
If you work with CJK (Chinese, Japanese, Korean), Indic scripts (Devanagari, Bengali, Tamil), or emoji, selection and editing break on complex characters. Cursors land inside ligatures, conjuncts, and ZWJ sequences. Column counts are wrong. Truncation corrupts display.
The Unicode Consortium publishes the algorithms to handle all of this correctly. Most implementations address only one or two, partially, for a subset of scripts. UniWorld implements five core standards completely and makes them available everywhere.
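Several of the failures described above can be reproduced in a few lines. The sketch below uses only Python's standard `unicodedata` module, not UniWorld's API, and the variable names are illustrative. It shows two strings that look identical but compare unequal, a single visible emoji made of five code points, and a naive slice that strips an accent from its base letter.

```python
import unicodedata

composed = "caf\u00e9"      # e-acute as one code point (NFC form)
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT (NFD form)

# The two strings render identically but compare unequal.
looks_equal_but_differs = composed != decomposed

# Normalizing both sides to the same form restores equality.
equal_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# len() counts code points, not user-perceived characters.
# man + ZWJ + woman + ZWJ + girl displays as a single family emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
code_points_in_family = len(family)

# Slicing by code point separates the accent from its base letter.
naive_truncation = decomposed[:4]

print(looks_equal_but_differs, equal_after_nfc, code_points_in_family)
```

Grapheme-aware segmentation and truncation exist to prevent exactly the last two cases.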
What UniWorld provides
| Algorithm | Standard | What it does |
|---|---|---|
| Bidirectional layout | UAX #9 | Correct visual ordering and cursor mapping for mixed LTR/RTL text |
| Line breaking | UAX #14 | Rule-based and dictionary-based break opportunities, including Thai, Lao, Khmer, Myanmar (179,081-word dictionary from ICU) |
| Text segmentation | UAX #29 | Grapheme cluster, word, and sentence boundaries for cursor movement, backspace, selection |
| Normalization | UAX #15 | NFC, NFD, NFKC, NFKD for canonical equivalence and compatibility |
| Display width | UAX #11 (East Asian Width) | True terminal column count (CJK=2, emoji=2, combining=0) |
| Safe truncation | -- | Truncate to N display columns without breaking grapheme clusters |
| Case mapping | Unicode CaseFolding | Full Unicode upper/lower/title/fold with special casing (Turkish, Lithuanian, Greek final sigma) |
| Cursor navigation | UAX #9 + #29 | Logical and visual cursor movement respecting grapheme clusters and bidi |
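To make the display-width and safe-truncation rows concrete, here is a rough approximation built on Python's standard `unicodedata` module. It is not UniWorld's implementation: the function names `approx_display_width` and `approx_truncate` are hypothetical, the width rule only covers the East Asian Width property and zero-width marks, and it does not perform full UAX #29 grapheme segmentation (emoji ZWJ sequences, for example, are not handled).

```python
import unicodedata

ZERO_WIDTH_CATEGORIES = ("Mn", "Me", "Cf")  # combining marks and format characters


def approx_display_width(text: str) -> int:
    """Rough column count: wide/fullwidth = 2, zero-width marks = 0, else 1."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def approx_truncate(text: str, max_columns: int) -> str:
    """Cut at max_columns while keeping trailing zero-width marks with their base."""
    kept = []
    used = 0
    for ch in text:
        w = approx_display_width(ch)
        if used + w > max_columns:
            break
        kept.append(ch)
        used += w
    return "".join(kept)


print(approx_display_width("Hello"))          # ASCII: one column per character
print(approx_display_width("\u65e5\u672c\u8a9e"))  # three wide CJK ideographs
print(approx_truncate("cafe\u0301!", 4))      # accent stays attached to the "e"
```

A wide character that would straddle the column limit is dropped entirely rather than split, which is why truncating three ideographs to 5 columns yields only two of them.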
Conformance
Every algorithm is tested against the official Unicode conformance test suites for UCD 17.0.0. Run `cargo test --features conformance` to execute them; the harness prints pass totals. The counts below equal the number of test lines in each UCD file. BidiTest.txt is the exception: each of its data rows expands across paragraph directions, so its case total is the figure printed by the test run.
| Test suite | Cases (rows in UCD 17.0.0 files) |
|---|---|
| Bidi (BidiTest.txt) | total printed by tests |
| Bidi character (BidiCharacterTest.txt) | 91,707 |
| Line break (LineBreakTest.txt) | 19,338 |
| Word segmentation (WordBreakTest.txt) | 1,944 |
| Grapheme segmentation (GraphemeBreakTest.txt) | 766 |
| Sentence segmentation (SentenceBreakTest.txt) | 512 |
| Normalization (NormalizationTest.txt) | Full (all 5 parts) |
Unicode 17.0 throughout (UCD 17.0.0 data files).
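For readers unfamiliar with the segmentation test files, each data line in GraphemeBreakTest.txt, WordBreakTest.txt, SentenceBreakTest.txt, and LineBreakTest.txt lists hexadecimal code points separated by `÷` (break) and `×` (no break), followed by a `#` comment. The sketch below shows one way to turn such a line into an input string and its expected segments. The function name `parse_break_test_line` is hypothetical, and this is not UniWorld's actual test harness, which is written in Rust.

```python
BREAK = "\u00f7"      # DIVISION SIGN: a boundary is expected here
NO_BREAK = "\u00d7"   # MULTIPLICATION SIGN: no boundary here


def parse_break_test_line(line: str):
    """Return (text, expected_segments), or None for blank/comment-only lines."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    segments = []
    current = ""
    for token in data.split():
        if token == BREAK:
            if current:
                segments.append(current)
                current = ""
        elif token == NO_BREAK:
            continue
        else:
            current += chr(int(token, 16))  # hex code point, e.g. "0308"
    if current:
        segments.append(current)
    return "".join(segments), segments


sample = "\u00f7 0020 \u00d7 0308 \u00f7 0020 \u00f7\t# space, diaeresis, space"
text, expected = parse_break_test_line(sample)
print(len(text), len(expected))
```

A conformance check then segments `text` with the implementation under test and compares the result with `expected`.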
Get UniWorld
Rust (core library)
cargo add uniworld
crates.io/crates/uniworld | API docs
Python
pip install uniworld
pypi.org/project/uniworld | Integration guide
JavaScript / WASM
npm install uniworld
npmjs.com/package/uniworld | Integration guide
C
cargo build --release --features cffi
cbindgen --crate uniworld --output uniworld.h
Go
cargo build --release --features cffi
cd bindings/go && go test
VS Code extension
Search "UniWorld" in the Extensions panel, or:
ext install aguywithai.uniworld
VS Code Marketplace | Extension README
Features: grapheme-aware cursor movement and delete, bidi visualization, display width, Unicode inspector, normalization commands, line break decorations, and script-aware word selection. See the full feature list.
PowerShell module
Install-Module UniWorld
PowerShell Gallery | Module README
12 cmdlets: Get-GraphemeBoundaries, Get-WordBoundaries, Get-SentenceBoundaries, Get-DisplayWidth, Limit-DisplayWidth, ConvertTo-NFC, ConvertTo-NFD, ConvertTo-NFKC, ConvertTo-NFKD, Get-BidiClasses, Get-LineBreakOpportunities, Get-UnicodeInfo. See the full cmdlet reference.
Quick start
Rust
use uniworld::{grapheme_boundaries, display_width, normalize_nfc};
let clusters = grapheme_boundaries("cafe\u{0301}"); // ["c", "a", "f", "e\u{0301}"]
let nfc = normalize_nfc("cafe\u{0301}"); // "café" (single composed e-acute, U+00E9)
let width = display_width("Hello"); // 5
Python
import uniworld
uniworld.grapheme_boundaries("cafe\u0301") # ["c", "a", "f", "e\u0301"]
uniworld.display_width("こんにちは") # 10 (CJK: 5 wide characters x 2 columns)
uniworld.normalize_nfc("cafe\u0301") # "café" (composed)
PowerShell
Import-Module UniWorld
"Hello" | Get-DisplayWidth # 5
"cafe`u{0301}" | ConvertTo-NFC # composed e-acute
Get-BidiClasses "Hello" | Format-Table # per-character bidi levels
Architecture
                      UniWorld Rust core
                   /     |      |     \      \
                  /      |      |      \      \
            Python   JS/WASM    C      Go     cdylib
            (PyO3)   (wasm-   (FFI)  (CGo)  (DLL/so/dylib)
                     bindgen)                    |
                                            C# P/Invoke
                                                 |
              VS Code extension          PowerShell module
               (WASM binding)              (native FFI)
One Rust implementation. Every binding shares the same algorithms, the same data tables, and the same conformance test results. The behavior is identical everywhere because it is the same code.
Build and test
# Core library
cargo build
cargo test
# With conformance tests (requires test data in _development/data/)
cargo test --features conformance
# C FFI (for PowerShell / C / Go)
cargo build --release --features cffi
# WASM (for VS Code / JavaScript)
wasm-pack build --release --features wasm --no-default-features
# VS Code extension
cd extensions/vscode && npm install && npm run compile
# PowerShell module
Import-Module extensions/powershell/UniWorld.psd1
Invoke-Pester -Path extensions/powershell/Tests/
Scripts covered
UniWorld correctly handles text in: Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Bengali, Gurmukhi, Tamil, Sinhala, Thai, Lao, Khmer, Myanmar, Chinese (Simplified/Traditional), Japanese (Kanji + Hiragana + Katakana), Korean (Hangul), Ethiopic, Tifinagh, Cherokee, Canadian Aboriginal Syllabics (Cree, Inuktitut, Ojibwe), and emoji (including ZWJ sequences, skin tones, and flag pairs).
See the Unicode Showcase for a comprehensive stress-test document demonstrating UniWorld across all supported scripts.
Documentation
| Document | Description |
|---|---|
| uniworld.world | Project website with full documentation and install guides |
| VS Code Extension README | Features, settings, commands, development |
| PowerShell Module README | Cmdlets, pipeline usage, architecture |
| Python integration | PyO3 binding setup and API |
| JavaScript/WASM integration | wasm-bindgen setup and API |
| C integration | C FFI API and header generation |
| Go integration | CGo wrapper setup and API |
| Unicode Showcase | Multi-script stress test and demo |
| Project specification | Full architecture, design decisions, and phase history |
Repository layout
README.md # This file
src/ # Rust core (algorithms, data tables, bindings)
tests/ # Rust integration tests
docs/ # User-facing docs (integration guides, showcase)
extensions/vscode/ # VS Code extension (TypeScript + WASM)
extensions/powershell/ # PowerShell module (cmdlets + native FFI)
bindings/go/ # Go CGo wrapper
_development/ # Dev-only: notes, scripts, working docs
_publishing/ # Publishing: marketing, site, outreach
.github/workflows/ # CI: cross-platform native library builds
Contributing
See CONTRIBUTING.md for build instructions, test procedures, and how to submit test cases or dictionary entries.
License
MIT. See LICENSE.
Unicode Character Database data is used under the Unicode License. ICU dictionary data is used under the ICU License. Both are permissive and compatible with commercial use.
UniWorld is an A Guy With AI project by Sean MacNutt, developed using HAIMU, the AI development methodology also originated by MacNutt. HAIMU (Human-AI Mutual Understandability) generated the insight that led to UniWorld -- when prompted for the largest-ROI neglected technical benefit projects an AI could conceive of, correct Unicode handling emerged as the clear winner. The library was largely built within 14 hours of project idea generation. "Move fast and fix things." Initial development funded by Grand Beta. Visit uniworld.world for the full ecosystem.