Unicode to ASCII transliteration
Any-Ascii
Description
Converts Unicode text to a reasonable representation using only ASCII.
Unicode is the universal character set, a global standard to support all the world's languages.
It consists of 130,000+ characters used by 150 writing systems.
Along with characters used in language, it also contains various technical symbols, emojis, and other symbolic characters.
The String type in programming languages usually corresponds to Unicode text.
Whenever text is used digitally on computers or the internet it is almost always represented using Unicode characters.
Unicode characters are not stored directly but instead encoded into bytes using an encoding, typically UTF-8.
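The difference between characters and encoded bytes can be illustrated with a short Python sketch. It relies only on the built-in `str.encode` and `bytes.decode`; the helper name `utf8_summary` is made up for this example.

```python
def utf8_summary(text):
    """Return (code point count, UTF-8 byte count) for a string."""
    encoded = text.encode("utf-8")  # Unicode text -> bytes
    # Decoding the bytes gives back exactly the original text.
    assert encoded.decode("utf-8") == text
    return len(text), len(encoded)

print(utf8_summary("Naeroy"))            # ASCII only: one byte per character
print(utf8_summary("N\u00e6r\u00f8y"))   # Nærøy: æ and ø take two bytes each
print(utf8_summary("\u6df1\u5733"))      # 深圳: each character takes three bytes
```

Text that is already ASCII has the same length in characters and in UTF-8 bytes, which is one reason ASCII output is so widely compatible.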
ASCII is the most compatible character set, first published in 1963 and revised into essentially its current form in 1967.
It is a subset of Unicode and UTF-8 consisting of 128 characters using 7 bits in the range 0x00 - 0x7F.
The printable characters are English letters, digits, and punctuation in the range 0x20 - 0x7E, and the rest are control characters.
All of the characters found on a standard US keyboard correspond to the printable ASCII characters.
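The ranges above translate directly into code. The following is a minimal sketch; the function names `is_ascii` and `is_printable_ascii` are chosen for illustration and are not part of any library described here.

```python
def is_ascii(text):
    """True if every character is in the 7-bit range 0x00-0x7F."""
    return all(ord(ch) <= 0x7F for ch in text)

def is_printable_ascii(text):
    """True if every character is in the printable range 0x20-0x7E."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)

# Split the 128 ASCII code points into printable and control characters.
printable = [chr(cp) for cp in range(0x20, 0x7F)]
control = [chr(cp) for cp in range(0x80) if not 0x20 <= cp <= 0x7E]

print(len(printable), len(control))      # 95 33
print(is_ascii("Rene Francois"))         # True
print(is_ascii("Ren\u00e9"))             # False: é is outside ASCII
print(is_printable_ascii("tab\there"))   # False: tab is a control character
```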
Conversion into the Latin script used by English and ASCII is called romanization.
When converting between writing systems there are multiple properties that can be preserved:
- Meaning: Translation replaces text with an equivalent in the target language with the same meaning. This relies heavily on context and automatic translation is extremely complicated.
- Appearance: Preserving the visual appearance of a character when converting between languages is rarely possible and requires readers to have knowledge of the source language.
- Sound: Orthographic transcription uses the spelling and pronunciation rules of the target language to produce text that a speaker of the target language will pronounce as accurately as possible to the original.
- Spelling: Transliteration converts each letter individually using predictable rules. An unambiguous transliteration allows for reconstruction of the original text by using unique mappings for each letter. A phonetic transliteration instead uses the most phonetically accurate mappings which may result in duplicates or ambiguity.
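The difference between an unambiguous and a phonetic transliteration can be sketched with two toy tables. Both tables are invented for this example and cover only three Greek vowels that are pronounced alike in Modern Greek; they do not reproduce any published scheme, and the helper names are hypothetical.

```python
# Toy tables for three Greek vowels; illustrative only, not a real scheme.
UNAMBIGUOUS = {"\u03b7": "h", "\u03b9": "i", "\u03c5": "y"}  # η, ι, υ: distinct outputs
PHONETIC = {"\u03b7": "i", "\u03b9": "i", "\u03c5": "i"}     # η, ι, υ: all sound like "i"

def transliterate(text, table):
    """Replace each character via the table, leaving unknown ones unchanged."""
    return "".join(table.get(ch, ch) for ch in text)

def is_reversible(table):
    """A table can be inverted only if no two letters share an output."""
    return len(set(table.values())) == len(table)

def invert(table):
    """Build the reverse table; only meaningful when is_reversible(table)."""
    return {out: src for src, out in table.items()}

word = "\u03b7\u03b9\u03c5"  # ηιυ
forward = transliterate(word, UNAMBIGUOUS)
print(forward)                                              # hiy
print(transliterate(forward, invert(UNAMBIGUOUS)) == word)  # True
print(transliterate(word, PHONETIC))                        # iii
print(is_reversible(PHONETIC))                              # False
```

With the phonetic table, the output `iii` could have come from any combination of the three vowels, so the original spelling cannot be recovered.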
Any-Ascii is a transliteration: it converts text character by character without considering context. Characters used in language are converted using the most popular existing transliteration scheme for each language, with small modifications. Symbolic characters are instead converted based on their meaning or appearance.
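A context-free, character-by-character conversion amounts to a table lookup. The sketch below is not the library's implementation: the four table entries are read off the Output column of the examples in this document, while the function name `to_ascii`, the NFC normalization step, and the `"?"` marker for unmapped characters are assumptions made for illustration.

```python
import unicodedata

# Tiny illustrative table; a real table would cover far more characters.
TABLE = {
    "\u00df": "ss",  # ß
    "\u00f6": "o",   # ö
    "\u00e6": "ae",  # æ
    "\u00f8": "o",   # ø
}

def to_ascii(text, table=TABLE, unknown="?"):
    """Transliterate one character at a time, with no look-ahead or context."""
    out = []
    # Compose sequences such as "o" + combining diaeresis into single letters
    # so they match the table keys (a choice made for this sketch).
    for ch in unicodedata.normalize("NFC", text):
        if ord(ch) <= 0x7F:
            out.append(ch)                      # ASCII passes through unchanged
        else:
            out.append(table.get(ch, unknown))  # hypothetical fallback marker
    return "".join(out)

print(to_ascii("Gro\u00dfer H\u00f6rselberg"))  # Grosser Horselberg
print(to_ascii("N\u00e6r\u00f8y"))              # Naeroy
```

Because each character is looked up in isolation, the same input character always produces the same output, whatever language the surrounding text is in.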
Examples
Language | Script | Input | Any-Ascii output | Conventional spelling |
---|---|---|---|---|
French | Latin | René François Lacôte | Rene Francois Lacote | Rene Francois Lacote |
German | Latin | Großer Hörselberg | Grosser Horselberg | Grosser Hoerselberg |
Vietnamese | Latin | Trần Hưng Đạo | Tran Hung Dao | Tran Hung Dao |
Norwegian | Latin | Nærøy | Naeroy | Naroy |
Ancient Greek | Greek | Φειδιππίδης | Feidippidis | Pheidippides |
Modern Greek | Greek | Δημήτρης Φωτόπουλος | Dimitris Fotopoylos | Dimitris Fotopoulos |
Russian | Cyrillic | Борис Николаевич Ельцин | Boris Nikolaevich El'tsin | Boris Nikolayevich Yeltsin |
Hebrew | Hebrew | אברהם הלוי פרנקל | 'vrhm hlvy frnkl | Abraham Halevi Fraenkel |
Mandarin Chinese | Han | 深圳 | ShenZhen | Shenzhen |
Cantonese Chinese | Han | 深水埗 | ShenShuiBu | Sham Shui Po |
Korean | Hangul | 화성시 | hwaseongsi | Hwaseong-si |
Korean | Han | 華城市 | HuaChengShi | Hwaseong-si |
Japanese | Hiragana | さいたま | saitama | Saitama |
Japanese | Han | 埼玉県 | QiYuXian | Saitama-ken |
Japanese | Katakana | トヨタ | toyota | Toyota |
Unified English Braille | Braille | ⠠⠎⠁⠽⠀⠭⠀⠁⠛ | ^say x ag | Say it again |
Implementations
CLI
$ anyascii άνθρωποι
anthropoi
Run cd rust && cargo build --release to build a native executable at rust/target/release/anyascii
Go
package main

import (
	"fmt"

	anyascii "github.com/hunterwb/any-ascii"
)

func main() {
	s := anyascii.Transliterate("άνθρωποι")
	fmt.Println(s) // anthropoi
}
Java
String s = AnyAscii.transliterate("άνθρωποι");
// anthropoi
Java 6+ compatible
Available through JitPack
Maven
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.hunterwb</groupId>
    <artifactId>any-ascii</artifactId>
    <version>0.1.1</version>
</dependency>
Gradle
repositories {
    maven { url 'https://jitpack.io' }
}
dependencies {
    implementation 'com.hunterwb:any-ascii:0.1.1'
}
Node.js
const anyAscii = require('any-ascii');
const s = anyAscii('άνθρωποι');
// anthropoi
Node.js 4+ compatible
Install latest release: npm install any-ascii
Install pre-release: npm install hunterwb/any-ascii
Python
from anyascii import anyascii
s = anyascii('άνθρωποι')
# anthropoi
Python 3.3+ compatible
Install latest release: pip install anyascii
Install pre-release: pip install https://github.com/hunterwb/any-ascii/archive/master.zip#subdirectory=python
Ruby
require 'any_ascii'
s = AnyAscii.transliterate('άνθρωποι')
# anthropoi
Use pre-release:
# Gemfile
gem 'any_ascii', git: 'https://github.com/hunterwb/any-ascii', glob: 'ruby/any_ascii.gemspec'
Rust
use any_ascii::any_ascii;
let s = any_ascii("άνθρωποι");
// anthropoi
Use pre-release:
[dependencies]
any_ascii = { git = "https://github.com/hunterwb/any-ascii" }
See Also
ALA-LC Romanization
BGN/PCGN Romanization
Compart: Unicode Charts
ICAO 9303: Machine Readable Passports
ISO 9: Cyrillic Romanization
KNAB Romanization Systems
Sean M. Burke: Unidecode
Sean M. Burke: Unidecode, Perl Journal
Thomas T. Pedersen: Transliteration of Non-Roman Scripts
UNGEGN Romanization
Unicode CLDR: Transliteration Guidelines
Unicode Unihan Database
Unified English Braille
Wikipedia: Romanization of Arabic
Wikipedia: Romanization of Georgian
Wikipedia: Romanization of Greek
Wikipedia: Romanization of Hebrew
Wikipedia: Romanization of Russian