Unicode to ASCII transliteration

Project description

Any-Ascii

Unicode to ASCII transliteration

Web Demo

Description
Examples
Background
Details
Implementations
- Go
- Java
- JavaScript
- Python
- Ruby
- Rust
- Shell
- .NET
Unidecode
See Also

Description

Converts Unicode text to a reasonable representation using only ASCII.

For most characters in Unicode, Any-Ascii provides an ASCII-only replacement string. Text is converted character-by-character without considering the context. The mappings for each script are based on popular existing romanization schemes. Symbolic characters are converted based on their meaning or appearance. All ASCII characters in the input are left unchanged; every other character is replaced with printable ASCII characters. Unknown characters are removed.
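The rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the library's implementation: `TABLE` and `to_ascii` are hypothetical names, and the three-entry table stands in for the much larger mapping that Any-Ascii ships.

```python
# Hypothetical three-entry table for illustration only; the real library
# covers far more characters.
TABLE = {
    "\u00e9": "e",   # é
    "\u00df": "ss",  # ß
    "\u03a6": "F",   # Φ
}


def to_ascii(text):
    """Convert text one character at a time, ignoring context."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII input is left unchanged
        else:
            out.append(TABLE.get(ch, ""))  # unknown characters are removed
    return "".join(out)


print(to_ascii("Ren\u00e9"))     # Rene
print(to_ascii("Gro\u00dfer"))   # Grosser
print(to_ascii("a\u2192b"))      # ab (the arrow is not in this toy table)
```

Because each character is looked up independently, one input character may expand to several ASCII characters (as with "ß"), and the output never depends on neighboring characters.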

Examples

Representative examples for different languages comparing the Any-Ascii output to the conventional romanization.

Language (Script)	Input	Output	Conventional
French (Latin)	René François Lacôte	Rene Francois Lacote	Rene Francois Lacote
German (Latin)	Großer Hörselberg	Grosser Horselberg	Grosser Hoerselberg
Vietnamese (Latin)	Trần Hưng Đạo	Tran Hung Dao	Tran Hung Dao
Norwegian (Latin)	Nærøy	Naeroy	Naroy
Ancient Greek (Greek)	Φειδιππίδης	Feidippidis	Pheidippides
Modern Greek (Greek)	Δημήτρης Φωτόπουλος	Dimitris Fotopoylos	Dimitris Fotopoulos
Russian (Cyrillic)	Борис Николаевич Ельцин	Boris Nikolaevich El'tsin	Boris Nikolayevich Yeltsin
Ukrainian (Cyrillic)	Володимир Горбулін	Volodimir Gorbulin	Volodymyr Horbulin
Bulgarian (Cyrillic)	Търговище	T'rgovishche	Targovishte
Mandarin Chinese (Han)	深圳	ShenZhen	Shenzhen
Cantonese Chinese (Han)	深水埗	ShenShuiBu	Sham Shui Po
Korean (Hangul)	화성시	HwaSeongSi	Hwaseong-si
Korean (Han)	華城市	HuaChengShi	Hwaseong-si
Japanese (Hiragana)	さいたま	saitama	Saitama
Japanese (Han)	埼玉県	QiYuXian	Saitama-ken
Japanese (Katakana)	トヨタ	toyota	Toyota
Amharic (Ethiopic)	ደብረ ዘይት	debre zeyt	Debre Zeyit
Tigrinya (Ethiopic)	ደቀምሓረ	dek'emhare	Dekemhare
Arabic	دمنهور	dmnhwr	Damanhur
Armenian	Աբովյան	Abovyan	Abovyan
Georgian	სამტრედია	samt'redia	Samtredia
Hebrew	אברהם הלוי פרנקל	'vrhm hlvy frnkl	Abraham Halevi Fraenkel
Manding (N'Ko)	ߞߐߣߊߞߙߌ߫	konakri	konakiri
Unified English Braille (Braille)	⠠⠎⠁⠽⠀⠭⠀⠁⠛	+say x ag	Say it again
Bengali	ময়মনসিংহ	mymnsimh	Mymensingh
Burmese (Myanmar)	ထန်တလန်	htntln	Thantlang
Gujarati	પોરબંદર	porbmdr	Porbandar
Hindi (Devanagari)	महासमुंद	mhasmumd	Mahasamund
Kannada	ಬೆಂಗಳೂರು	bemgluru	Bengaluru
Khmer	សៀមរាប	siemrab	Siem Reap
Lao	ສະຫວັນນະເຂດ	sahvannaekhd	Savannakhet
Malayalam	കളമശ്ശേരി	klmsseri	Kalamassery
Odia	ଗଜପତି	gjpti	Gajapati
Punjabi (Gurmukhi)	ਜਲੰਧਰ	jlmdhr	Jalandhar
Sinhala	රත්නපුර	rtnpur	Ratnapura
Tamil	கன்னியாகுமரி	knniyakumri	Kanniyakumari
Telugu	శ్రీకాకుళం	srikakulm	Srikakulam
Thai	สงขลา	sngkhla	Songkhla

Symbols	Input	Output
Emojis	😎 👑 🍎	`:sunglasses: :crown: :apple:`
Misc.	☆ ♯ ♰ ⚄ ⛌	* # + 5 X
Letterlike	№ ℳ ⅋ ⅍	No M & A/S

Background

Unicode is the foundation for text in all modern software: it’s how all mobile phones, desktops, and other computers represent the text of every language. People are using Unicode every time they type a key on their phone or desktop computer, and every time they look at a web page or text in an application. *

Unicode is the universal character set, a global standard to support all the world's languages. It contains 140,000+ characters used by 150+ scripts along with emojis and various symbols. It is typically encoded into bytes using UTF-8.

ASCII is the most compatible character set, established in 1967. It is a subset of Unicode and UTF-8, consisting of 128 characters that each fit in 7 bits. The printable characters are English letters, digits, and punctuation; the remainder are control characters. The characters found on a standard US keyboard correspond to the printable ASCII characters.
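As a rough illustration of these properties, the following Python sketch checks whether a string is ASCII and whether it is printable ASCII. The helper names `is_ascii` and `is_printable_ascii` are made up for this example, and the printable range is assumed to run from space (0x20) to tilde (0x7E), so the space character is counted as printable.

```python
def is_ascii(text):
    """True if every code point fits in 7 bits (0 to 127)."""
    return all(ord(ch) < 128 for ch in text)


def is_printable_ascii(text):
    """True if every character lies between space (0x20) and tilde (0x7E)."""
    return all(0x20 <= ord(ch) <= 0x7E for ch in text)


# ASCII text produces identical bytes under the ASCII and UTF-8 encodings.
print("abc".encode("ascii") == "abc".encode("utf-8"))  # True

print(is_ascii("Hello"))           # True
print(is_ascii("H\u00e9llo"))      # False: é is outside ASCII
print(is_printable_ascii("a\tb"))  # False: tab is a control character

# Count the printable characters among the 128 ASCII code points.
print(sum(1 for i in range(128) if is_printable_ascii(chr(i))))  # 95
```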

... expressed only in the original non-control ASCII range so as to be as widely compatible with as many existing tools, languages, and serialization formats as possible and avoid display issues in text editors and source control. *

A language is written using characters from a specific script. A script can be alphabetic, logographic, syllabic, or something else. Some languages use multiple scripts: Japanese uses Kanji, Hiragana, and Katakana. Some scripts are used by multiple languages: Han characters are used in Chinese, Japanese, and Korean. The script used by English and ASCII is known as the Latin script.

When converting text between languages, there are multiple properties that can be preserved:

Meaning: Translation preserves the meaning of the text.
Appearance: Preserving the visual appearance of characters when converting between scripts is rarely possible and requires readers to have knowledge of the source language.
Sound: Transcription uses the spelling and pronunciation rules of the target language to produce text that will be pronounced as closely as possible to the original.
Spelling: Transliteration converts each character individually using predictable rules. A reversible transliteration allows for reconstruction of the original text by using unique mappings for each character.
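To make the idea of a reversible transliteration concrete, here is a small Python sketch. The mapping is hypothetical and covers only four Greek letters; it is not taken from Any-Ascii's tables and is not meant to suggest that Any-Ascii's output is reversible. It only shows that when every source character has a unique replacement, the table can be inverted to recover the original.

```python
# Hypothetical mapping: each Greek letter gets its own distinct Latin letter.
FORWARD = {"\u03b1": "a", "\u03b2": "b", "\u03b3": "g", "\u03b4": "d"}  # α β γ δ
BACKWARD = {latin: greek for greek, latin in FORWARD.items()}

# Reversibility requires unique outputs: inverting must not lose entries.
assert len(BACKWARD) == len(FORWARD)


def transliterate(text, table):
    """Replace each character using the table; pass others through."""
    return "".join(table.get(ch, ch) for ch in text)


original = "\u03b2\u03b1\u03b3\u03b4\u03b1"  # βαγδα
latin = transliterate(original, FORWARD)
print(latin)                                      # bagda
print(transliterate(latin, BACKWARD) == original)  # True
```

This sketch only handles one-to-one, single-character replacements. A scheme that maps one character to several (for example "ß" to "ss") would need extra care to stay unambiguous when reversed.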

Romanization is the conversion into the Latin script using transliteration or transcription or a mix of both. Romanization is most commonly used when representing the names of people and places.

South Korea's Ministry of Culture & Tourism: Clear to anyone, Romanization is for foreigners. Geographical names are Romanized to help foreigners find the place they intend to go to and help them remember cities, villages and mountains they visited and climbed. But it is Koreans who make up the Roman transcription of their proper names to print on their business cards and draw up maps for international tourists. Sometimes, they write the lyrics of a Korean song in Roman letters to help foreigners join in a singing session or write part of a public address (in Korean) in Roman letters for a visiting foreign VIP. In this sense, it is for both foreigners and the local public. The Romanization system must not be a code only for the native English-speaking community here but an important tool for international communication between Korean society, foreign residents in the country and the entire external world. If any method causes much confusion because it is unable to properly reflect the original sound to the extent that different words are transcribed into the same Roman characters too frequently, it definitely is not a good system. *

Details

Comprehensive: Supports as many Unicode characters as possible. The benefits of providing full support even for rare or historic characters outweigh the small overhead of including them.

Simple: Easy to use, understand, and update. Able to be implemented with consistent behavior across multiple different programming languages. Has benefits for performance and data size.

Useful: Provides reasonable approximations of the spelling or pronunciation. Based on popular romanization systems in general use.

Implementations

Any-Ascii is implemented in 8 different programming languages.

Go

package main

import (
    "fmt"

    "github.com/hunterwb/any-ascii"
)

func main() {
    // Transliterate returns the ASCII-only replacement for the input.
    s := anyascii.Transliterate("άνθρωποι")
    fmt.Println(s) // anthropoi
}

Go 1.10+ compatible

Java

String s = AnyAscii.transliterate("άνθρωποι");
// anthropoi

Java 6+ compatible

Available from JitPack

JavaScript

Node.js

const anyAscii = require('any-ascii');

const s = anyAscii('άνθρωποι');
// anthropoi

Node.js 4.0+ compatible

Install latest release: npm install any-ascii

Install pre-release: npm install hunterwb/any-ascii

Python

from anyascii import anyascii

s = anyascii('άνθρωποι')
# anthropoi

Python 3.3+ compatible

Install latest release: pip install anyascii

Install pre-release: pip install https://github.com/hunterwb/any-ascii/archive/master.zip#subdirectory=python

Ruby

require 'any_ascii'

s = AnyAscii.transliterate('άνθρωποι')
# anthropoi

Ruby 2.0+ compatible

Install latest release: gem install any_ascii

Use pre-release:

# Gemfile
gem 'any_ascii', git: 'https://github.com/hunterwb/any-ascii', glob: 'ruby/any_ascii.gemspec'

Rust

use any_ascii::any_ascii;

let s = any_ascii("άνθρωποι");
// anthropoi

Rust 1.20+ compatible

Use latest release:

# Cargo.toml
[dependencies]
any_ascii = "0.1.6"

Use pre-release:

# Cargo.toml
[dependencies]
any_ascii = { git = "https://github.com/hunterwb/any-ascii" }

CLI

$ anyascii άνθρωποι
anthropoi

Use `cd rust && cargo build --release` to build a native executable at `rust/target/release/anyascii`

Shell

$ anyascii άνθρωποι
anthropoi

POSIX-compliant

Download

.NET

Install from NuGet

C#

using AnyAscii;

string s = "άνθρωποι".Transliterate();
// anthropoi

Unidecode

Any-Ascii is an alternative to (and inspired by) Unidecode and its many ports. Any-Ascii is more up-to-date and supports more than twice as many characters. Unidecode was originally written in 2001 with minor updates through 2016. It does not support any characters outside of the BMP.

Compare table.tsv and unidecode/table.tsv for a complete comparison between Any-Ascii and Unidecode. Note that the Unidecode output has been modified slightly and that unknown characters are replaced by "[?] " while they are removed by Any-Ascii.
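The BMP (Basic Multilingual Plane) covers code points U+0000 through U+FFFF, so characters above that range, such as most emojis, are the ones a BMP-only table cannot handle. The Python sketch below uses a hypothetical helper, `outside_bmp`, to list the characters in a string that fall outside the BMP. It does not call Any-Ascii or Unidecode.

```python
def outside_bmp(text):
    """Return the characters whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Φ (U+03A6) and 深 (U+6DF1) are inside the BMP; 😎 (U+1F60E) is not.
sample = "\u03a6 \u6df1 \U0001F60E"
print([hex(ord(ch)) for ch in outside_bmp(sample)])  # ['0x1f60e']
```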

Release history

- 0.3.3 (Jun 29, 2025)
- 0.3.2 (Mar 16, 2023)
- 0.3.1 (Apr 6, 2022)
- 0.3.0 (Sep 3, 2021)
- 0.2.0 (Apr 18, 2021)
- 0.1.7 (Oct 19, 2020)
- 0.1.6 (Jul 28, 2020), this version
- 0.1.5 (May 2, 2020)
- 0.1.4 (Mar 20, 2020)
- 0.1.3 (Feb 27, 2020)
- 0.1.2 (Feb 15, 2020)
- 0.1.1 (Jan 27, 2020)
- 0.1.0 (Jan 24, 2020)

Download files

Source Distribution

anyascii-0.1.6.tar.gz (190.6 kB view details)

Uploaded Jul 28, 2020 Source

Built Distribution

anyascii-0.1.6-py3-none-any.whl (280.6 kB view details)

Uploaded Jul 28, 2020 Python 3

File details

Details for the file anyascii-0.1.6.tar.gz.

File metadata

Download URL: anyascii-0.1.6.tar.gz
Upload date: Jul 28, 2020
Size: 190.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for anyascii-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`d9c828861f1d97e1b5d3522272020496b4ed450eabaab3ae6d983d8ba90c5f84`
MD5	`67cdfec132934102a55e5f492b21ffe3`
BLAKE2b-256	`1101c82c31583eda28e2d18d3dad22685349516d8a4029b22c745c4084d4cb6c`

File details

Details for the file anyascii-0.1.6-py3-none-any.whl.

File metadata

Download URL: anyascii-0.1.6-py3-none-any.whl
Upload date: Jul 28, 2020
Size: 280.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.8.2

File hashes

Hashes for anyascii-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9edcf503bdabb889a897c217f0225b374a36d58fd67d34a80fc3de7ae564af17`
MD5	`b8be83c40839b33a2c75c9ae4834260f`
BLAKE2b-256	`f1170310d6a72d27dfe05567835d087b0d99bc7f1f8a298590d4e7a6d7cf5ce0`
