Unicode to ASCII transliteration

Any-Ascii

Description

Converts Unicode text to a reasonable representation using only ASCII.

For most characters in Unicode, Any-Ascii provides an ASCII-only replacement string. Text is converted character by character without considering context. The mappings for each language are based on popular existing romanization schemes. Symbolic characters are converted based on their meaning or appearance. All ASCII characters in the input are left unchanged; every other character is replaced with printable ASCII characters. Unknown characters are removed.
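The character-by-character behavior described above can be illustrated with a toy sketch. This is not Any-Ascii's implementation: TOY_TABLE and toy_transliterate are hypothetical names, and the four-entry table only stands in for the library's real mappings.

```python
# Hypothetical four-entry table; a real table covers most of Unicode.
TOY_TABLE = {
    "\u00e9": "e",   # é
    "\u00df": "ss",  # ß, one character may map to several ASCII characters
    "\u03a6": "F",   # Φ
    "\u2606": "*",   # ☆, a symbol mapped by appearance
}


def toy_transliterate(text: str) -> str:
    """Convert text one character at a time, without looking at neighbors."""
    pieces = []
    for ch in text:
        if ord(ch) < 128:
            pieces.append(ch)  # ASCII input is left unchanged
        else:
            pieces.append(TOY_TABLE.get(ch, ""))  # unknown characters are removed
    return "".join(pieces)


print(toy_transliterate("Ren\u00e9 \u2606 Gro\u00df"))  # Rene * Gross
print(toy_transliterate("a\u0378b"))                    # ab
```

Because each character is looked up on its own, the same input character always produces the same output, whatever language the surrounding text is in.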

Examples

Representative examples for different languages comparing the Any-Ascii output to the conventional romanization.

Language (Script) | Input | Output | Conventional
French (Latin) | René François Lacôte | Rene Francois Lacote | Rene Francois Lacote
German (Latin) | Großer Hörselberg | Grosser Horselberg | Grosser Hoerselberg
Vietnamese (Latin) | Trần Hưng Đạo | Tran Hung Dao | Tran Hung Dao
Norwegian (Latin) | Nærøy | Naeroy | Naroy
Ancient Greek (Greek) | Φειδιππίδης | Feidippidis | Pheidippides
Modern Greek (Greek) | Δημήτρης Φωτόπουλος | Dimitris Fotopoylos | Dimitris Fotopoulos
Russian (Cyrillic) | Борис Николаевич Ельцин | Boris Nikolaevich El'tsin | Boris Nikolayevich Yeltsin
Ukrainian (Cyrillic) | Володимир Горбулін | Volodimir Gorbulin | Volodymyr Horbulin
Bulgarian (Cyrillic) | Търговище | T'rgovishche | Targovishte
Mandarin Chinese (Han) | 深圳 | ShenZhen | Shenzhen
Cantonese Chinese (Han) | 深水埗 | ShenShuiBu | Sham Shui Po
Korean (Hangul) | 화성시 | HwaSeongSi | Hwaseong-si
Korean (Han) | 華城市 | HuaChengShi | Hwaseong-si
Japanese (Hiragana) | さいたま | saitama | Saitama
Japanese (Han) | 埼玉県 | QiYuXian | Saitama-ken
Japanese (Katakana) | トヨタ | toyota | Toyota
Arabic | دمنهور | dmnhwr | Damanhur
Armenian | Աբովյան | Abovyan | Abovyan
Georgian | სამტრედია | samt'redia | Samtredia
Hebrew | אברהם הלוי פרנקל | 'vrhm hlvy frnkl | Abraham Halevi Fraenkel
Unified English Braille (Braille) | ⠠⠎⠁⠽⠀⠭⠀⠁⠛ | ^say x ag | Say it again
Bengali | ময়মনসিংহ | mymnsimh | Mymensingh
Burmese (Myanmar) | ထန်တလန် | htntln | Thantlang
Gujarati | પોરબંદર | porbmdr | Porbandar
Hindi (Devanagari) | महासमुंद | mhasmumd | Mahasamund
Kannada | ಬೆಂಗಳೂರು | bemgluru | Bengaluru
Khmer | សៀមរាប | siemrab | Siem Reap
Lao | ສະຫວັນນະເຂດ | sahvannaekhd | Savannakhet
Malayalam | കളമശ്ശേരി | klmsseri | Kalamassery
Odia | ଗଜପତି | gjpti | Gajapati
Punjabi (Gurmukhi) | ਜਲੰਧਰ | jlmdhr | Jalandhar
Sinhala | රත්නපුර | rtnpur | Ratnapura
Tamil | கன்னியாகுமரி | knniyakumri | Kanniyakumari
Telugu | శ్రీకాకుళం | srikakulm | Srikakulam
Thai | สงขลา | sngkhla | Songkhla

Symbols | Input | Output
Emojis | 😎 👑 🍎 | :sunglasses: :crown: :apple:
Misc. | ☆ ♯ ♰ ⚄ ⛌ | * # + 5 X
Letterlike | № ℳ ⅋ ⅍ | No M & A/S

Background

Unicode is the foundation for text in all modern software: it’s how all mobile phones, desktops, and other computers represent the text of every language. People are using Unicode every time they type a key on their phone or desktop computer, and every time they look at a web page or text in an application. *

Unicode is the universal character set, a global standard to support all the world's languages. It consists of 140,000+ characters from 150+ scripts. It also contains various technical symbols, emojis, and other symbolic characters. Unicode characters are encoded into bytes using an encoding, typically UTF-8.
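To make the distinction between a character and its encoded bytes concrete, the following sketch prints the code point and UTF-8 bytes of a few characters. It uses only the Python standard library; describe is a hypothetical helper name chosen for this illustration.

```python
def describe(ch: str) -> str:
    """Return the code point and the UTF-8 bytes of a single character."""
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    return f"U+{ord(ch):04X} -> {hex_bytes}"


# A, é, €, and the sunglasses emoji need 1, 2, 3, and 4 bytes respectively.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F60E"]:
    print(describe(ch))
# U+0041 -> 41
# U+00E9 -> C3 A9
# U+20AC -> E2 82 AC
# U+1F60E -> F0 9F 98 8E
```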

ASCII is the most compatible character set, established in 1967. It is a subset of Unicode and UTF-8, consisting of 128 characters encoded in 7 bits. The printable characters are English letters, digits, and punctuation, with the remaining being control characters. All of the characters found on a standard US keyboard correspond to the printable ASCII characters.
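The properties above can be checked with a few lines of standard-library Python. This sketch assumes the common convention that space (0x20) through tilde (0x7E) counts as printable, and it needs Python 3.7+ for str.isascii.

```python
# All 128 ASCII code points fit in 7 bits (0 through 127).
ascii_chars = [chr(i) for i in range(128)]
assert all(ord(c) < 2**7 for c in ascii_chars)

# Space (0x20) through tilde (0x7E) are printable; the rest are control characters.
printable = [c for c in ascii_chars if 0x20 <= ord(c) <= 0x7E]
control = [c for c in ascii_chars if c not in printable]
print(len(printable), len(control))  # 95 33

# ASCII text has identical bytes in ASCII and in UTF-8.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")

# A single non-ASCII character makes the whole string non-ASCII.
print(text.isascii(), "Ren\u00e9".isascii())  # True False
```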

... expressed only in the original non-control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control. *

A language is represented in writing using characters from a specific script. A script can be alphabetic, logographic, syllabic, or something else. Some languages use multiple scripts: Japanese uses Kanji, Hiragana, and Katakana. Some scripts are used by multiple languages: Han characters are used in Chinese, Japanese, and Korean. The script used by English and ASCII is known as the Latin script.
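As a rough illustration of scripts, the sketch below guesses a character's script from the first word of its Unicode character name. This is only a heuristic: the Python standard library does not expose the Unicode Script property, rough_script is a hypothetical helper, and Han characters show up under the name prefix CJK.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Guess a character's script from the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


# One character each from Latin, Greek, Cyrillic, Hiragana, Katakana, Han, Hangul.
for ch in "aΦБさト埼화":
    print(ch, rough_script(ch))
# a LATIN
# Φ GREEK
# Б CYRILLIC
# さ HIRAGANA
# ト KATAKANA
# 埼 CJK
# 화 HANGUL
```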

When converting text between languages there are multiple properties that can be preserved:

  • Meaning: Translation replaces text with an equivalent in the target language with the same meaning.
  • Appearance: Preserving the visual appearance of a character when converting between languages is rarely possible and requires readers to have knowledge of the source language.
  • Sound: Orthographic transcription uses the spelling and pronunciation rules of the target language to produce text that a speaker of the target language will pronounce as closely as possible to the original.
  • Spelling: Transliteration converts each letter individually using predictable rules. A reversible transliteration allows for reconstruction of the original text by using unique mappings for each letter.
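The idea of a reversible transliteration can be sketched with a tiny table. The table below is hypothetical and covers only five Cyrillic letters; it is not taken from any romanization standard or from Any-Ascii. It works in both directions only because every letter has a unique single-letter target.

```python
# Hypothetical one-to-one table for a handful of Cyrillic letters.
TO_LATIN = {"б": "b", "о": "o", "р": "r", "и": "i", "с": "s"}

# Reversibility needs unique targets: no two letters may share a mapping.
assert len(set(TO_LATIN.values())) == len(TO_LATIN)
TO_CYRILLIC = {latin: cyrillic for cyrillic, latin in TO_LATIN.items()}


def transliterate(text: str, table: dict) -> str:
    """Replace each letter independently; letters not in the table pass through."""
    return "".join(table.get(ch, ch) for ch in text)


latin = transliterate("борис", TO_LATIN)
print(latin)                              # boris
print(transliterate(latin, TO_CYRILLIC))  # борис
```

Once a table maps two source letters to the same output, or maps one letter to a multi-letter string that collides with other outputs, the original text can no longer be reconstructed this simply.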

Romanization is the conversion into the Latin script using transliteration or transcription or a mix of both. Romanization is most commonly used when representing the names of people and places.

Clear to anyone, Romanization is for foreigners. Geographical names are Romanized to help foreigners find the place they intend to go to and help them remember cities, villages and mountains they visited and climbed. But it is Koreans who make up the Roman transcription of their proper names to print on their business cards and draw up maps for international tourists. Sometimes, they write the lyrics of a Korean song in Roman letters to help foreigners join in a singing session or write part of a public address (in Korean) in Roman letters for a visiting foreign VIP. In this sense, it is for both foreigners and the local public. The Romanization system must not be a code only for the native English-speaking community here but an important tool for international communication between Korean society, foreign residents in the country and the entire external world. If any method causes much confusion because it is unable to properly reflect the original sound to the extent that different words are transcribed into the same Roman characters too frequently, it definitely is not a good system. *

Implementations

Any-Ascii is implemented in 7 different programming languages.

Go

package main

import (
    "fmt"

    anyascii "github.com/hunterwb/any-ascii"
)

func main() {
    // Transliterate returns an ASCII-only string.
    s := anyascii.Transliterate("άνθρωποι")
    fmt.Println(s) // anthropoi
}

Go 1.10+ compatible

Java

String s = AnyAscii.transliterate("άνθρωποι");
// anthropoi

Java 6+ compatible

Available through JitPack

Maven

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.hunterwb</groupId>
    <artifactId>any-ascii</artifactId>
    <version>0.1.5</version>
</dependency>

Gradle

repositories {
    maven { url 'https://jitpack.io' }
}
dependencies {
    implementation 'com.hunterwb:any-ascii:0.1.5'
}

JavaScript

Node.js

const anyAscii = require('any-ascii');

const s = anyAscii('άνθρωποι');
// anthropoi

Node.js 4.0+ compatible

Install latest release: npm install any-ascii

Install pre-release: npm install hunterwb/any-ascii

Python

from anyascii import anyascii

s = anyascii('άνθρωποι')
# anthropoi

Python 3.3+ compatible

Install latest release: pip install anyascii

Install pre-release: pip install https://github.com/hunterwb/any-ascii/archive/master.zip#subdirectory=python

Ruby

require 'any_ascii'

s = AnyAscii.transliterate('άνθρωποι')
# anthropoi

Ruby 2.0+ compatible

Install latest release: gem install any_ascii

Use pre-release:

# Gemfile
gem 'any_ascii', git: 'https://github.com/hunterwb/any-ascii', glob: 'ruby/any_ascii.gemspec'

Rust

use any_ascii::any_ascii;

let s = any_ascii("άνθρωποι");
// anthropoi

Rust 1.20+ compatible

Use latest release:

# Cargo.toml
[dependencies]
any_ascii = "0.1.5"

Use pre-release:

# Cargo.toml
[dependencies]
any_ascii = { git = "https://github.com/hunterwb/any-ascii" }

CLI

$ anyascii άνθρωποι
anthropoi

Use cd rust && cargo build --release to build a native executable to rust/target/release/anyascii

.NET

C#

using AnyAscii;

string s = "άνθρωποι".Transliterate();
// anthropoi

See Also

ALA-LC Romanization
BGN/PCGN Romanization
CC-CEDICT: Free Mandarin Chinese Dictionary
Compart: Unicode Charts
Discord: Emojis
ICAO 9303: Machine Readable Passports
ISO: Transliteration Standards
KNAB Romanization Systems
Sean M. Burke: Unidecode
Sean M. Burke: Unidecode, Perl Journal
South Korea: Revised Romanization
Thomas T. Pedersen: Transliteration of Non-Roman Scripts
UNGEGN Romanization
Unicode CLDR: Transliteration Guidelines
Unicode: Emoji
Unicode: Unihan
Unified English Braille
Wikipedia: Romanization
Wiktionary: Romanization
