Unicode to ASCII transliteration

Any-Ascii

Description

Converts Unicode text to a reasonable representation using only ASCII.

For most characters in Unicode, Any-Ascii provides an ASCII-only replacement string. Text is converted character by character, without considering context. The mappings for each language are based on popular existing romanization schemes. Symbolic characters are converted based on their meaning or appearance. All ASCII characters in the input are left unchanged; every other character is replaced with printable ASCII characters. Unknown characters are removed.
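The character-by-character behavior described above can be sketched as a plain table lookup. This is only an illustration under stated assumptions, not the library's implementation: the TOY_TABLE mapping and the toy_transliterate function are hypothetical names covering a handful of characters, whereas Any-Ascii ships its own, much larger tables.

```python
# Hypothetical toy table for illustration; Any-Ascii's real mappings are far larger.
TOY_TABLE = {
    "ά": "a",
    "α": "a",
    "ν": "n",
    "θ": "th",
    "ρ": "r",
    "ω": "o",
    "π": "p",
    "ο": "o",
    "ι": "i",
    "ß": "ss",
}


def toy_transliterate(text):
    """Convert text one character at a time, ignoring context."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            # ASCII characters are left unchanged.
            out.append(ch)
        else:
            # Known characters are replaced; unknown characters are removed.
            out.append(TOY_TABLE.get(ch, ""))
    return "".join(out)


print(toy_transliterate("άνθρωποι"))
# anthropoi
```

Because each character is looked up independently, one input character may expand to several ASCII characters (θ becomes th), and a character missing from the table simply disappears from the output.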

Examples

Representative examples for different languages comparing the Any-Ascii output to the conventional romanization.

Language | Script | Input | Output | Conventional
French | Latin | René François Lacôte | Rene Francois Lacote | Rene Francois Lacote
German | Latin | Großer Hörselberg | Grosser Horselberg | Grosser Hoerselberg
Vietnamese | Latin | Trần Hưng Đạo | Tran Hung Dao | Tran Hung Dao
Norwegian | Latin | Nærøy | Naeroy | Naroy
Ancient Greek | Greek | Φειδιππίδης | Feidippidis | Pheidippides
Modern Greek | Greek | Δημήτρης Φωτόπουλος | Dimitris Fotopoylos | Dimitris Fotopoulos
Russian | Cyrillic | Борис Николаевич Ельцин | Boris Nikolaevich El'tsin | Boris Nikolayevich Yeltsin
Arabic | Arabic | دمنهور | dmnhwr | Damanhur
Hebrew | Hebrew | אברהם הלוי פרנקל | 'vrhm hlvy frnkl | Abraham Halevi Fraenkel
Georgian | Georgian | სამტრედია | samt'redia | Samtredia
Armenian | Armenian | Աբովյան | Abovyan | Abovyan
Thai | Thai | สงขลา | sngkhla | Songkhla
Lao | Lao | ສະຫວັນນະເຂດ | sahvannaekhd | Savannakhet
Mandarin Chinese | Han | 深圳 | ShenZhen | Shenzhen
Cantonese Chinese | Han | 深水埗 | ShenShuiBu | Sham Shui Po
Korean | Hangul | 화성시 | hwaseongsi | Hwaseong-si
Korean | Han | 華城市 | HuaChengShi | Hwaseong-si
Japanese | Hiragana | さいたま | saitama | Saitama
Japanese | Han | 埼玉県 | QiYuXian | Saitama-ken
Japanese | Katakana | トヨタ | toyota | Toyota
Unified English Braille | Braille | ⠠⠎⠁⠽⠀⠭⠀⠁⠛ | ^say x ag | Say it again
Bengali | Bengali | ময়মনসিংহ | mymnsimh | Mymensingh
Gujarati | Gujarati | પોરબંદર | porbmdr | Porbandar
Hindi | Devanagari | महासमुंद | mhasmumd | Mahasamund
Kannada | Kannada | ಬೆಂಗಳೂರು | bemgluru | Bengaluru
Malayalam | Malayalam | കളമശ്ശേരി | klmsseri | Kalamassery
Punjabi | Gurmukhi | ਜਲੰਧਰ | jlmdhr | Jalandhar
Odia | Odia | ଗଜପତି | gjpti | Gajapati
Sinhala | Sinhala | රත්නපුර | rtnpur | Ratnapura
Tamil | Tamil | கன்னியாகுமரி | knniyakumri | Kanniyakumari
Telugu | Telugu | శ్రీకాకుళం | srikakulm | Srikakulam

Reasoning

Unicode is the universal character set, a global standard to support all the world's languages. It consists of 130,000+ characters used by 150 writing systems. Along with characters used in language, it also contains various technical symbols, emojis, and other symbolic characters. The String type in programming languages usually corresponds to Unicode text. Whenever text is used digitally on computers or the internet it is almost always represented using Unicode characters. Unicode characters are not stored directly but instead encoded into bytes using an encoding, typically UTF-8.

ASCII is the most compatible character set, established in 1967. It is a subset of Unicode and UTF-8 consisting of 128 characters that fit in 7 bits, in the range 0x00 to 0x7F. The printable characters are English letters, digits, and punctuation in the range 0x20 to 0x7E; the remainder are control characters. All of the characters found on a standard US keyboard correspond to printable ASCII characters.
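The ranges above can be checked directly. The following is a minimal sketch using only the Python standard library; the helper names is_ascii and is_printable_ascii are made up for this illustration and are not part of Any-Ascii.

```python
def is_ascii(text):
    """True if every character is in the ASCII range 0x00 to 0x7F."""
    return all(ord(ch) <= 0x7F for ch in text)


def is_printable_ascii(text):
    """True if every character is in the printable range 0x20 to 0x7E."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)


# ASCII text produces identical bytes whether encoded as ASCII or as UTF-8.
assert "Rene".encode("utf-8") == "Rene".encode("ascii")

# "\u00e9" is the single character e with acute accent (as in Rene with an accent).
accented = "Ren\u00e9"
print(is_ascii(accented))             # False
print(len(accented))                  # 4 characters
print(len(accented.encode("utf-8")))  # 5 bytes: the accented letter takes two

# A tab is ASCII but is a control character, so it is not printable.
print(is_ascii("a\tb"), is_printable_ascii("a\tb"))  # True False
```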

Conversion into the Latin script used by English and ASCII is called romanization.

When converting between writing systems there are multiple properties that can be preserved:

  • Meaning: Translation replaces text with an equivalent in the target language with the same meaning. This relies heavily on context and automatic translation is extremely complicated.
  • Appearance: Preserving the visual appearance of a character when converting between languages is rarely possible and requires readers to have knowledge of the source language.
  • Sound: Orthographic transcription uses the spelling and pronunciation rules of the target language to produce text that a speaker of the target language will pronounce as closely as possible to the original.
  • Spelling: Transliteration converts each letter individually using predictable rules. An unambiguous transliteration allows for reconstruction of the original text by using unique mappings for each letter. A phonetic transliteration instead uses the most phonetically accurate mappings which may result in duplicates or ambiguity.
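The difference between an unambiguous and a phonetic transliteration can be sketched with two toy tables. Both tables below are hypothetical and cover only three Greek vowels, which are all pronounced like "i" in Modern Greek; they are not Any-Ascii's actual mappings, and the helper names apply_table and invert are invented for this example.

```python
# Hypothetical toy tables covering three Greek vowels only.
UNAMBIGUOUS = {"η": "e", "ι": "i", "υ": "y"}  # a unique output for each letter
PHONETIC = {"η": "i", "ι": "i", "υ": "i"}     # closest sound, but outputs collide


def apply_table(table, text):
    """Replace each character using the table, leaving other characters as-is."""
    return "".join(table.get(ch, ch) for ch in text)


def invert(table):
    """Build the reverse table, or return None if two letters share an output."""
    reverse = {}
    for src, dst in table.items():
        if dst in reverse:
            return None
        reverse[dst] = src
    return reverse


word = "ηιυ"

# Unique mappings allow the original text to be reconstructed.
reverse = invert(UNAMBIGUOUS)
assert apply_table(reverse, apply_table(UNAMBIGUOUS, word)) == word

# Phonetic mappings collapse distinct letters, so no reverse table exists.
assert invert(PHONETIC) is None
print(apply_table(UNAMBIGUOUS, word), apply_table(PHONETIC, word))
# eiy iii
```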

Implementations

CLI

$ anyascii άνθρωποι
anthropoi

To build a native executable at rust/target/release/anyascii, run: cd rust && cargo build --release

Go

package main

import (
    "fmt"

    "github.com/hunterwb/any-ascii"
)

func main() {
    s := anyascii.Transliterate("άνθρωποι")
    // Print the result so the variable is used; Go rejects unused variables.
    fmt.Println(s)
    // anthropoi
}

Go 1.10+ compatible

Java

String s = AnyAscii.transliterate("άνθρωποι");
// anthropoi

Java 6+ compatible

Available through JitPack

Maven

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.hunterwb</groupId>
    <artifactId>any-ascii</artifactId>
    <version>0.1.3</version>
</dependency>

Gradle

repositories {
    maven { url 'https://jitpack.io' }
}
dependencies {
    implementation 'com.hunterwb:any-ascii:0.1.3'
}

Node.js

const anyAscii = require('any-ascii');

const s = anyAscii('άνθρωποι');
// anthropoi

Node.js 4.0+ compatible

Install latest release: npm install any-ascii

Install pre-release: npm install hunterwb/any-ascii

Python

from anyascii import anyascii

s = anyascii('άνθρωποι')
# anthropoi

Python 3.3+ compatible

Install latest release: pip install anyascii

Install pre-release: pip install https://github.com/hunterwb/any-ascii/archive/master.zip#subdirectory=python

Ruby

require 'any_ascii'

s = AnyAscii.transliterate('άνθρωποι')
# anthropoi

Ruby 2.0+ compatible

Install latest release: gem install any_ascii

Use pre-release:

# Gemfile
gem 'any_ascii', git: 'https://github.com/hunterwb/any-ascii', glob: 'ruby/any_ascii.gemspec'

Rust

use any_ascii::any_ascii;

let s = any_ascii("άνθρωποι");
// anthropoi

Rust 1.20+ compatible

Use latest release:

# Cargo.toml
[dependencies]
any_ascii = "0.1.3"

Use pre-release:

# Cargo.toml
[dependencies]
any_ascii = { git = "https://github.com/hunterwb/any-ascii" }

See Also

ALA-LC Romanization
BGN/PCGN Romanization
Compart: Unicode Charts
ICAO 9303: Machine Readable Passports
ISO 15919: Indic Romanization
ISO 9: Cyrillic Romanization
KNAB Romanization Systems
Sean M. Burke: Unidecode
Sean M. Burke: Unidecode, Perl Journal
Thomas T. Pedersen: Transliteration of Non-Roman Scripts
UNGEGN Romanization
Unicode CLDR: Transliteration Guidelines
Unicode Unihan Database
Unified English Braille
Wikipedia: Romanization of Arabic
Wikipedia: Romanization of Armenian
Wikipedia: Romanization of Georgian
Wikipedia: Romanization of Greek
Wikipedia: Romanization of Hebrew
Wikipedia: Romanization of Japanese
Wikipedia: Romanization of Korean
Wikipedia: Romanization of Russian
