Correct Unicode text handling for every script

UniWorld

UniWorld is an open-source library, a set of language bindings, and developer tools that implement the Unicode standard's core text algorithms -- all from a single, conformance-tested Rust core. It ships as a library (Rust, Python, JavaScript/WASM, C, Go), a VS Code extension, and a PowerShell module.

uniworld.world -- Full documentation, install guides, and the complete UniWorld ecosystem.


The problem UniWorld solves

Unicode text handling is one of the most pervasive unsolved problems in everyday software. It affects everyone:

If you work in English or other Latin-script languages, you've seen emoji split apart by your cursor, combining accents orphaned by backspace, and pasted text that looks identical but doesn't match because of invisible normalization differences. Your terminal miscounts column widths when it encounters fullwidth characters. Your truncation logic cuts strings in the middle of grapheme clusters. These are Unicode problems, and they happen constantly in English-language workflows.
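The "looks identical but doesn't match" case can be reproduced with Python's standard library alone. This is a minimal sketch that uses the stdlib `unicodedata` module as a stand-in; it does not call UniWorld, and the helper name `same_text` is made up for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # e-acute as one code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)            # False: different code point sequences
print(len(composed), len(decomposed))    # 4 5
print(same_text(composed, decomposed))   # True: canonically equivalent
```

Both strings render the same on screen, which is why a plain equality check or dictionary lookup fails without any visible cause.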

If you work with Arabic, Hebrew, or any right-to-left script, correct bidirectional layout is essential and routinely broken. Numbers embedded in RTL paragraphs reorder incorrectly. Cursor movement goes the wrong direction. Mixed-direction text renders as gibberish.
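To see why embedded numbers are a common trouble spot, it helps to look at the bidirectional class each character carries, since those classes are the input to the UAX #9 algorithm. The sketch below uses the stdlib `unicodedata.bidirectional` lookup rather than UniWorld, and `bidi_classes` is a hypothetical helper name.

```python
import unicodedata


def bidi_classes(text):
    """Return (code point, bidi class) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Hebrew "shalom", then digits, then Latin letters, in logical (typing) order.
sample = "\u05e9\u05dc\u05d5\u05dd 123 abc"

for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```

The Hebrew letters report `R`, the digits `EN` (European number), the spaces `WS`, and the Latin letters `L`. A renderer that ignores these classes and lays the string out in one direction places the digits and the Latin run incorrectly.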

If you work with Thai, Lao, Khmer, or Myanmar, your text has no spaces between words. Line breaking requires dictionary-based segmentation that most tools simply don't have. Text wraps mid-word or not at all.
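The idea behind dictionary-based segmentation can be sketched with a greedy longest-match loop. This is a deliberately simplified illustration and an assumption about the general technique, not UniWorld's or ICU's actual algorithm; the three-word dictionary and the function name `segment_longest_match` are invented for the example.

```python
# Tiny illustrative dictionary; a real one holds many thousands of entries.
HELLO = "สวัสดี"
LANGUAGE = "ภาษา"
THAI = "ไทย"
DICTIONARY = {HELLO, LANGUAGE, THAI}


def segment_longest_match(text, dictionary):
    """Split text into dictionary words, always taking the longest match.

    Characters that start no dictionary word are emitted one code point
    at a time, which a production segmenter would handle more carefully.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one code point.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


text = HELLO + LANGUAGE + THAI   # no spaces between the three words
print(segment_longest_match(text, DICTIONARY))
```

Each boundary between returned words is a candidate line break opportunity. Without a dictionary there is nothing in the text itself to mark those positions.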

If you work with CJK (Chinese, Japanese, Korean), Indic scripts (Devanagari, Bengali, Tamil), or emoji, selection and editing break on complex characters. Cursors land inside ligatures, conjuncts, and ZWJ sequences. Column counts are wrong. Truncation corrupts display.
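A quick way to see why cursors and truncation go wrong is to count the storage units behind something a reader perceives as one character. The sketch below is plain Python with no UniWorld calls; `unit_counts` is a hypothetical helper.

```python
samples = {
    "family (man ZWJ woman ZWJ girl)": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
    "flag (regional indicators C + A)": "\U0001F1E8\U0001F1E6",
    "Devanagari KA + VIRAMA + SSA": "\u0915\u094D\u0937",
}


def unit_counts(text):
    """Return (code points, UTF-16 code units, UTF-8 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-16-le")) // 2,
        len(text.encode("utf-8")),
    )


for label, text in samples.items():
    print(label, unit_counts(text))
```

The family emoji is 5 code points, 8 UTF-16 code units, and 18 UTF-8 bytes, so an editor that moves or cuts by any of those units can land inside it. Slicing it with `text[:1]` keeps only the first emoji of the sequence and silently changes what is displayed.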

The Unicode Consortium publishes the algorithms to handle all of this correctly. Most implementations address only one or two, partially, for a subset of scripts. UniWorld implements five core standards completely and makes them available everywhere.

What UniWorld provides

| Algorithm | Standard | What it does |
|---|---|---|
| Bidirectional layout | UAX #9 | Correct visual ordering and cursor mapping for mixed LTR/RTL text |
| Line breaking | UAX #14 | Rule-based and dictionary-based break opportunities, including Thai, Lao, Khmer, Myanmar (179,081-word dictionary from ICU) |
| Text segmentation | UAX #29 | Grapheme cluster, word, and sentence boundaries for cursor movement, backspace, selection |
| Normalization | UAX #15 | NFC, NFD, NFKC, NFKD for canonical equivalence and compatibility |
| Display width | UAX #11 (East Asian Width) | True terminal column count (CJK=2, emoji=2, combining=0) |
| Safe truncation | -- | Truncate to N display columns without breaking grapheme clusters |
| Case mapping | Unicode CaseFolding | Full Unicode upper/lower/title/fold with special casing (Turkish, Lithuanian, Greek final sigma) |
| Cursor navigation | UAX #9 + #29 | Logical and visual cursor movement respecting grapheme clusters and bidi |
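The display width and safe truncation rows can be approximated with the standard library to show the idea. This is a rough sketch under stated assumptions, not UniWorld's implementation: it uses `unicodedata.east_asian_width` plus a zero-width rule for combining marks and format characters, it does not perform real grapheme segmentation, and the names `char_width`, `approx_display_width`, and `truncate_to_width` are hypothetical.

```python
import unicodedata


def char_width(ch):
    """Approximate terminal columns for one code point."""
    # Combining marks and format characters (such as ZWJ) take no column.
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def approx_display_width(text):
    """Sum per-code-point widths (overcounts emoji ZWJ sequences)."""
    return sum(char_width(ch) for ch in text)


def truncate_to_width(text, max_cols):
    """Keep the longest prefix that fits in max_cols columns.

    Zero-width marks stay attached to the base character before them,
    because they never push the running total over the limit.
    """
    kept = []
    used = 0
    for ch in text:
        width = char_width(ch)
        if used + width > max_cols:
            break
        kept.append(ch)
        used += width
    return "".join(kept)


print(approx_display_width("Hello"))          # 5
print(approx_display_width("cafe\u0301"))     # 4
print(approx_display_width("こんにちは"))      # 10
print(truncate_to_width("日本語", 5))          # 日本
```

The per-code-point approach is where this sketch stops being correct: an emoji ZWJ sequence is counted once per component instead of as one two-column cluster, which is the kind of case that needs cluster-aware width and truncation.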

Conformance

Every algorithm is tested against the official Unicode conformance test suites for UCD 17.0.0. To reproduce the results, run cargo test --features conformance (the test data must be present in _development/data/); the harness prints the pass totals. The counts below are the number of test lines in each file. BidiTest.txt is the exception: the harness expands each data row across paragraph directions, so its case count is the total printed by the tests.

| Test suite | Cases (rows in UCD 17.0.0 files) |
|---|---|
| Bidi (BidiTest.txt) | total printed by tests |
| Bidi character (BidiCharacterTest.txt) | 91,707 |
| Line break (LineBreakTest.txt) | 19,338 |
| Word segmentation (WordBreakTest.txt) | 1,944 |
| Grapheme segmentation (GraphemeBreakTest.txt) | 766 |
| Sentence segmentation (SentenceBreakTest.txt) | 512 |
| Normalization (NormalizationTest.txt) | Full (all 5 parts) |
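For readers who want to check a row count or inspect a case by hand, the segmentation and line break test files share a simple line format: hex code points separated by ÷ (break allowed) or × (no break), followed by a `#` comment. The parser below is a sketch based on that understanding of the UCD file layout. It is not UniWorld's test harness, the function names are hypothetical, and the sample lines are illustrative rather than copied from the 17.0.0 files.

```python
BREAK = "\u00f7"      # ÷ : a boundary is expected here
NO_BREAK = "\u00d7"   # × : no boundary is allowed here


def parse_break_test_line(line):
    """Return (text, boundary offsets in code points), or None for non-data lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    chars = []
    boundaries = []
    for token in data.split():
        if token == BREAK:
            boundaries.append(len(chars))
        elif token != NO_BREAK:
            chars.append(chr(int(token, 16)))
    return "".join(chars), boundaries


def count_cases(lines):
    """Count data rows, skipping comments and blank lines."""
    return sum(1 for line in lines if parse_break_test_line(line) is not None)


SAMPLE = """# illustrative lines in the break-test format
÷ 0020 ÷ 0020 ÷\t#  two spaces: a break between them
÷ 0020 × 0308 ÷ 0020 ÷\t#  combining diaeresis stays with its base

"""

for line in SAMPLE.splitlines():
    parsed = parse_break_test_line(line)
    if parsed is not None:
        print(repr(parsed[0]), parsed[1])

print(count_cases(SAMPLE.splitlines()))   # 2
```

A segmenter passes a row when the boundaries it reports for the decoded text equal the parsed offsets exactly.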

Unicode 17.0 throughout (UCD 17.0.0 data files).

Get UniWorld

Rust (core library)

cargo add uniworld

crates.io/crates/uniworld | API docs

Python

pip install uniworld

pypi.org/project/uniworld | Integration guide

JavaScript / WASM

npm install uniworld

npmjs.com/package/uniworld | Integration guide

C

cargo build --release --features cffi
cbindgen --crate uniworld --output uniworld.h

Integration guide

Go

cargo build --release --features cffi
cd bindings/go && go test

Integration guide

VS Code extension

Search "UniWorld" in the Extensions panel, or:

ext install aguywithai.uniworld

VS Code Marketplace | Extension README

Grapheme-aware cursor and delete, bidi visualization, display width, Unicode inspector, normalization commands, line break decorations, script-aware word selection. See the full feature list.

PowerShell module

Install-Module UniWorld

PowerShell Gallery | Module README

12 cmdlets: Get-GraphemeBoundaries, Get-WordBoundaries, Get-SentenceBoundaries, Get-DisplayWidth, Limit-DisplayWidth, ConvertTo-NFC, ConvertTo-NFD, ConvertTo-NFKC, ConvertTo-NFKD, Get-BidiClasses, Get-LineBreakOpportunities, Get-UnicodeInfo. See the full cmdlet reference.

Quick start

Rust

use uniworld::{grapheme_boundaries, display_width, normalize_nfc};

let clusters = grapheme_boundaries("cafe\u{0301}");  // ["c", "a", "f", "e\u{0301}"]
let nfc = normalize_nfc("cafe\u{0301}");              // "caf\u{e9}" (e-acute composed into U+00E9)
let width = display_width("Hello");                    // 5

Python

import uniworld

uniworld.grapheme_boundaries("cafe\u0301")   # ["c", "a", "f", "e\u0301"]
uniworld.display_width("こんにちは")          # 10 (CJK: five wide characters, 2 columns each)
uniworld.normalize_nfc("cafe\u0301")         # "caf\u00e9" (e-acute composed into U+00E9)

PowerShell

Import-Module UniWorld
"Hello" | Get-DisplayWidth                   # 5
"cafe`u{0301}" | ConvertTo-NFC              # composed e-acute
Get-BidiClasses "Hello" | Format-Table       # per-character bidi classes

Architecture

                         UniWorld Rust core
                        /    |    |    \    \
                      /      |    |     \     \
                 Python   JS/WASM  C    Go    cdylib
                (PyO3)  (wasm-   (FFI) (CGo)  (DLL/so/dylib)
                         bindgen)              |
                                        C# P/Invoke
                                              |
                    VS Code extension    PowerShell module
                    (WASM binding)       (native FFI)

One Rust implementation. Every binding shares the same algorithms, the same data tables, and the same conformance test results. The behavior is identical everywhere because it is the same code.

Build and test

# Core library
cargo build
cargo test

# With conformance tests (requires test data in _development/data/)
cargo test --features conformance

# C FFI (for PowerShell / C / Go)
cargo build --release --features cffi

# WASM (for VS Code / JavaScript)
wasm-pack build --release --features wasm --no-default-features

# VS Code extension
cd extensions/vscode && npm install && npm run compile

# PowerShell module
Import-Module extensions/powershell/UniWorld.psd1
Invoke-Pester -Path extensions/powershell/Tests/

Scripts covered

UniWorld correctly handles text in: Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari, Bengali, Gurmukhi, Tamil, Sinhala, Thai, Lao, Khmer, Myanmar, Chinese (Simplified/Traditional), Japanese (Kanji + Hiragana + Katakana), Korean (Hangul), Ethiopic, Tifinagh, Cherokee, Canadian Aboriginal Syllabics (Cree, Inuktitut, Ojibwe), and emoji (including ZWJ sequences, skin tones, and flag pairs).

See the Unicode Showcase for a comprehensive stress-test document demonstrating UniWorld across all supported scripts.

Documentation

| Document | Description |
|---|---|
| uniworld.world | Project website with full documentation and install guides |
| VS Code Extension README | Features, settings, commands, development |
| PowerShell Module README | Cmdlets, pipeline usage, architecture |
| Python integration | PyO3 binding setup and API |
| JavaScript/WASM integration | wasm-bindgen setup and API |
| C integration | C FFI API and header generation |
| Go integration | CGo wrapper setup and API |
| Unicode Showcase | Multi-script stress test and demo |
| Project specification | Full architecture, design decisions, and phase history |

Repository layout

README.md                          # This file
src/                               # Rust core (algorithms, data tables, bindings)
tests/                             # Rust integration tests
docs/                              # User-facing docs (integration guides, showcase)
extensions/vscode/                 # VS Code extension (TypeScript + WASM)
extensions/powershell/             # PowerShell module (cmdlets + native FFI)
bindings/go/                       # Go CGo wrapper
_development/                      # Dev-only: notes, scripts, working docs
_publishing/                       # Publishing: marketing, site, outreach
.github/workflows/                 # CI: cross-platform native library builds

Contributing

See CONTRIBUTING.md for build instructions, test procedures, and how to submit test cases or dictionary entries.

License

MIT. See LICENSE.

Unicode Character Database data is used under the Unicode License. ICU dictionary data is used under the ICU License. Both are permissive and compatible with commercial use.


UniWorld is an A Guy With AI project by Sean MacNutt, developed using HAIMU, the AI development methodology also originated by MacNutt. HAIMU (Human-AI Mutual Understandability) generated the insight that led to UniWorld -- when prompted for the largest-ROI neglected technical benefit projects an AI could conceive of, correct Unicode handling emerged as the clear winner. The library was largely built within 14 hours of project idea generation. "Move fast and fix things." Initial development funded by Grand Beta. Visit uniworld.world for the full ecosystem.

