Unicode to ASCII transliteration

These details have not been verified by PyPI

Project links

Homepage

Project description

Any-Ascii

Description
Examples
Reasoning
Implementations
- CLI
- Go
- Java
- Node.js
- Python
- Ruby
- Rust
See Also

Description

Converts Unicode text to a reasonable representation using only ASCII.

For most characters in Unicode, Any-Ascii provides an ASCII-only replacement string. Text is converted character by character, without considering context. The mappings for each language are based on popular existing romanization schemes, and symbolic characters are converted based on their meaning or appearance. All ASCII characters in the input are left unchanged; every other character is replaced with printable ASCII characters. Unknown characters are removed.
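
The rules in the paragraph above can be sketched with a tiny lookup table. This is only an illustration, not the library's implementation: the `TABLE` contents and the `to_ascii_sketch` name are hypothetical, and the real mappings cover far more characters.

```python
# Hypothetical miniature mapping table, for illustration only.
TABLE = {
    "ß": "ss",
    "ö": "o",
    "æ": "ae",
    "ø": "o",
}


def to_ascii_sketch(text):
    """Convert text one character at a time, ignoring context."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            # ASCII characters in the input are left unchanged.
            out.append(ch)
        else:
            # Mapped characters are replaced; unknown characters are removed.
            out.append(TABLE.get(ch, ""))
    return "".join(out)


print(to_ascii_sketch("Großer Hörselberg"))  # Grosser Horselberg
print(to_ascii_sketch("Nærøy"))              # Naeroy
```

Because each character is looked up independently, the output for a character never depends on its neighbours, which keeps the conversion predictable.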

Examples

The table below gives representative examples for different languages, comparing the Any-Ascii output to the conventional romanization.

Language	Script	Input	Output	Conventional
French	Latin	René François Lacôte	Rene Francois Lacote	Rene Francois Lacote
German	Latin	Großer Hörselberg	Grosser Horselberg	Grosser Hoerselberg
Vietnamese	Latin	Trần Hưng Đạo	Tran Hung Dao	Tran Hung Dao
Norwegian	Latin	Nærøy	Naeroy	Naroy
Ancient Greek	Greek	Φειδιππίδης	Feidippidis	Pheidippides
Modern Greek	Greek	Δημήτρης Φωτόπουλος	Dimitris Fotopoylos	Dimitris Fotopoulos
Russian	Cyrillic	Борис Николаевич Ельцин	Boris Nikolaevich El'tsin	Boris Nikolayevich Yeltsin
Arabic	Arabic	دمنهور	dmnhwr	Damanhur
Hebrew	Hebrew	אברהם הלוי פרנקל	'vrhm hlvy frnkl	Abraham Halevi Fraenkel
Georgian	Georgian	სამტრედია	samt'redia	Samtredia
Armenian	Armenian	Աբովյան	Abovyan	Abovyan
Thai	Thai	สงขลา	sngkhla	Songkhla
Lao	Lao	ສະຫວັນນະເຂດ	sahvannaekhd	Savannakhet
Mandarin Chinese	Han	深圳	ShenZhen	Shenzhen
Cantonese Chinese	Han	深水埗	ShenShuiBu	Sham Shui Po
Korean	Hangul	화성시	hwaseongsi	Hwaseong-si
Korean	Han	華城市	HuaChengShi	Hwaseong-si
Japanese	Hiragana	さいたま	saitama	Saitama
Japanese	Han	埼玉県	QiYuXian	Saitama-ken
Japanese	Katakana	トヨタ	toyota	Toyota
Unified English Braille	Braille	⠠⠎⠁⠽⠀⠭⠀⠁⠛	^say x ag	Say it again
Bengali	Bengali	ময়মনসিংহ	mymnsimh	Mymensingh
Gujarati	Gujarati	પોરબંદર	porbmdr	Porbandar
Hindi	Devanagari	महासमुंद	mhasmumd	Mahasamund
Kannada	Kannada	ಬೆಂಗಳೂರು	bemgluru	Bengaluru
Malayalam	Malayalam	കളമശ്ശേരി	klmsseri	Kalamassery
Punjabi	Gurmukhi	ਜਲੰਧਰ	jlmdhr	Jalandhar
Odia	Odia	ଗଜପତି	gjpti	Gajapati
Sinhala	Sinhala	රත්නපුර	rtnpur	Ratnapura
Tamil	Tamil	கன்னியாகுமரி	knniyakumri	Kanniyakumari
Telugu	Telugu	శ్రీకాకుళం	srikakulm	Srikakulam

Reasoning

Unicode is the universal character set, a global standard to support all the world's languages. It consists of 130,000+ characters used by 150 writing systems. Along with characters used in language, it also contains various technical symbols, emojis, and other symbolic characters. The String type in programming languages usually corresponds to Unicode text. Whenever text is used digitally on computers or the internet it is almost always represented using Unicode characters. Unicode characters are not stored directly but instead encoded into bytes using an encoding, typically UTF-8.
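
The distinction between characters and their encoded bytes can be seen with Python's built-in `str.encode`. This is a small sketch; the sample word is taken from the examples table.

```python
text = "Nærøy"

# Encode the Unicode text into bytes using UTF-8.
encoded = text.encode("utf-8")

print(len(text))     # 5 characters
print(len(encoded))  # 7 bytes: "æ" and "ø" each take two bytes in UTF-8

# Decoding the bytes recovers the original text.
print(encoded.decode("utf-8") == text)  # True
```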

ASCII is the most compatible character set. It was first published in 1963 and substantially revised in 1967. It is a subset of Unicode and UTF-8 consisting of 128 characters, each representable in 7 bits, in the range 0x00 - 0x7F. The printable characters are English letters, digits, and punctuation in the range 0x20 - 0x7E; the remaining characters are control characters. All of the characters found on a standard US keyboard correspond to printable ASCII characters.
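
The ranges above translate directly into simple checks. The helper names below are hypothetical and exist only for this sketch; they are not part of Any-Ascii.

```python
def is_ascii(text):
    """True if every character falls in the 7-bit range 0x00 - 0x7F."""
    return all(ord(ch) <= 0x7F for ch in text)


def is_printable_ascii(text):
    """True if every character falls in the printable range 0x20 - 0x7E."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)


# Count printable versus control characters among the 128 ASCII code points.
printable = [cp for cp in range(0x80) if 0x20 <= cp <= 0x7E]
control = [cp for cp in range(0x80) if cp not in printable]

print(len(printable), len(control))  # 95 33
print(is_ascii("Rene"))              # True
print(is_ascii("René"))              # False
print(is_printable_ascii("a\tb"))    # False: tab is a control character
```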

Conversion into the Latin script used by English and ASCII is called romanization.

When converting between writing systems there are multiple properties that can be preserved:

Meaning: Translation replaces text with an equivalent in the target language with the same meaning. This relies heavily on context and automatic translation is extremely complicated.
Appearance: Preserving the visual appearance of a character when converting between languages is rarely possible and requires readers to have knowledge of the source language.
Sound: Orthographic transcription uses the spelling and pronunciation rules of the target language to produce text that a speaker of the target language will pronounce as closely as possible to the original.
Spelling: Transliteration converts each letter individually using predictable rules. An unambiguous transliteration allows for reconstruction of the original text by using unique mappings for each letter. A phonetic transliteration instead uses the most phonetically accurate mappings which may result in duplicates or ambiguity.
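
The difference between an unambiguous and a phonetic transliteration can be sketched with two small mappings. Both tables are hypothetical and chosen only to illustrate the idea; they are not the mappings Any-Ascii uses. Outputs are kept to a single character each so that unique outputs are enough to guarantee reversibility.

```python
# Hypothetical mappings for three Greek letters that sound alike in Modern Greek.
UNAMBIGUOUS = {"η": "h", "ι": "i", "υ": "y"}  # every letter gets a unique output
PHONETIC = {"η": "i", "ι": "i", "υ": "i"}     # closest sound, so outputs collide


def transliterate(text, mapping):
    """Replace each character using the mapping; pass others through."""
    return "".join(mapping.get(ch, ch) for ch in text)


def invert(mapping):
    """Return the reverse mapping, or None if two letters share an output."""
    inverse = {}
    for src, dst in mapping.items():
        if dst in inverse:
            return None
        inverse[dst] = src
    return inverse


word = "ηιυ"

romanized = transliterate(word, UNAMBIGUOUS)
print(romanized)                                      # hiy
print(transliterate(romanized, invert(UNAMBIGUOUS)))  # ηιυ

print(transliterate(word, PHONETIC))  # iii
print(invert(PHONETIC))               # None: the original cannot be reconstructed
```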

Implementations

CLI

$ anyascii άνθρωποι
anthropoi

To build a native executable at rust/target/release/anyascii, run: cd rust && cargo build --release

Go

package main

import (
    "fmt"

    "github.com/hunterwb/any-ascii"
)

func main() {
    s := anyascii.Transliterate("άνθρωποι")
    fmt.Println(s) // anthropoi
}

Go 1.10+ compatible

Java

String s = AnyAscii.transliterate("άνθρωποι");
// anthropoi

Java 6+ compatible

Available through JitPack

Maven

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

<dependency>
    <groupId>com.hunterwb</groupId>
    <artifactId>any-ascii</artifactId>
    <version>0.1.3</version>
</dependency>

Gradle

repositories {
    maven { url 'https://jitpack.io' }
}

dependencies {
    implementation 'com.hunterwb:any-ascii:0.1.3'
}

Node.js

const anyAscii = require('any-ascii');

const s = anyAscii('άνθρωποι');
// anthropoi

Node.js 4.0+ compatible

Install latest release: npm install any-ascii

Install pre-release: npm install hunterwb/any-ascii

Python

from anyascii import anyascii

s = anyascii('άνθρωποι')
# anthropoi

Python 3.3+ compatible

Install latest release: pip install anyascii

Install pre-release: pip install https://github.com/hunterwb/any-ascii/archive/master.zip#subdirectory=python

Ruby

require 'any_ascii'

s = AnyAscii.transliterate('άνθρωποι')
# anthropoi

Ruby 2.0+ compatible

Install latest release: gem install any_ascii

Use pre-release:

# Gemfile
gem 'any_ascii', git: 'https://github.com/hunterwb/any-ascii', glob: 'ruby/any_ascii.gemspec'

Rust

use any_ascii::any_ascii;

fn main() {
    let s = any_ascii("άνθρωποι");
    println!("{}", s); // anthropoi
}

Rust 1.20+ compatible

Use latest release:

# Cargo.toml
[dependencies]
any_ascii = "0.1.3"

Use pre-release:

# Cargo.toml
[dependencies]
any_ascii = { git = "https://github.com/hunterwb/any-ascii" }

Release history

Version	Date
0.3.2	Mar 16, 2023
0.3.1	Apr 6, 2022
0.3.0	Sep 3, 2021
0.2.0	Apr 18, 2021
0.1.7	Oct 19, 2020
0.1.6	Jul 28, 2020
0.1.5	May 2, 2020
0.1.4	Mar 20, 2020
0.1.3 (this version)	Feb 27, 2020
0.1.2	Feb 15, 2020
0.1.1	Jan 27, 2020
0.1.0	Jan 24, 2020

Download files

Source Distribution

anyascii-0.1.3.tar.gz (163.3 kB)

Uploaded Feb 27, 2020 Source

Built Distribution

anyascii-0.1.3-py3-none-any.whl (242.1 kB)

Uploaded Feb 27, 2020 Python 3

Hashes for anyascii-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`b34ff008c170b0fb867dd331d5a79754f7d9fcce39dc44782fd0429766c7a7b6`
MD5	`0cb9aa073d7dc03e7e5d9e539fae5e0c`
BLAKE2b-256	`f8e4133f78c96e2f2d6a68f9e371b3419f4a328c7187b986151cb2a97aa90cbc`

Hashes for anyascii-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`05558ba0c93ea0f15d119de8ebb5ce14a5dea310c1d0913a50c15471e369f638`
MD5	`f86e2edadf603ab0690f6ec487d188a1`
BLAKE2b-256	`b12bcfd443e1ac72663a497d663d1fa12509eb2fa992d82b28fef24f13fc021b`