A package for detecting the script (writing system) of given text.
GlotScript
- GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.
- GlotScript-Resource: provides a resource displaying the writing systems for various languages.
GlotScript Resource
What writing system is each language written in?
See the metadata folder of the repository.
GlotScript Tool
Detect the script (writing system) of text based on ISO 15924.
- Unicode version: 15.0.0
- The script codes were sourced from the Wikipedia article on ISO 15924.
- Unicode ranges were extracted from the Unicode Character Database.
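To make the idea of range-based detection concrete, here is a minimal sketch. It is not the table or the code shipped with GlotScript: `TOY_RANGES`, `toy_script_of`, and `toy_script_shares` are hypothetical names, and the table covers only a handful of ranges chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Illustrative subset only; NOT the range table used by GlotScript.
# Each entry is (first code point, last code point, ISO 15924 code),
# sorted by first code point, with inclusive bounds.
TOY_RANGES = [
    (0x0041, 0x005A, "Latn"),  # A..Z
    (0x0061, 0x007A, "Latn"),  # a..z
    (0x0391, 0x03A1, "Grek"),  # Greek capital Alpha..Rho
    (0x03B1, 0x03C9, "Grek"),  # Greek small alpha..omega
    (0x0410, 0x044F, "Cyrl"),  # Cyrillic А..я
    (0x3041, 0x3096, "Hira"),  # Hiragana letters
    (0x4E00, 0x9FFF, "Hani"),  # CJK Unified Ideographs
]
_STARTS = [start for start, _, _ in TOY_RANGES]


def toy_script_of(ch):
    """Return the script code for one character, or None if the toy table has no entry."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1  # last range starting at or before cp
    if i >= 0:
        _, end, code = TOY_RANGES[i]
        if cp <= end:
            return code
    return None


def toy_script_shares(text):
    """Return the share of each script among the characters the toy table recognises."""
    counts = Counter(code for code in map(toy_script_of, text) if code is not None)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}
```

For the sample string これは日本人です, this toy table gives a share of 0.625 for Hira and 0.375 for Hani. A real table needs every range from the Unicode Character Database, plus a policy for the special codes described below.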
Special codes
- Zinh: the Unicode script property value of characters that may be used with multiple scripts and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
- Zyyy: the Unicode script for "Common" characters.
- Zzzz: the code for an "uncoded" script.
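Downstream code sometimes wants to ignore these special codes and look only at the "real" scripts. The sketch below is one possible way to do that, assuming a dictionary of per-script shares shaped like the `details` entry in the usage output of this README. `SPECIAL_CODES` and `drop_special_codes` are hypothetical names, not part of GlotScript.

```python
# The three special codes described above.
SPECIAL_CODES = {"Zinh", "Zyyy", "Zzzz"}


def drop_special_codes(details):
    """Remove special codes from a {script: share} dict and renormalise the rest.

    Hypothetical helper, not part of GlotScript. Returns an empty dict
    when only special codes are present.
    """
    kept = {code: share for code, share in details.items() if code not in SPECIAL_CODES}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {code: share / total for code, share in kept.items()}
```

Whether renormalising is appropriate depends on the task. For text that is mostly punctuation or symbols, a large Zyyy share is itself a useful signal.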
Install from pip
pip3 install GlotScript
Install from git
pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript
Usage: Script Detection
from GlotScript import get_script_predictor
sp = get_script_predictor()
Or import the ready-made predictor directly:
from GlotScript import sp
sp('これは日本人です')
>> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
sp('This is Latin')[:1]
>> ('Latn', 1.0)
sp('මේක සිංහල')[0]
>> 'Sinh'
sp('𝄞𝄫 𒊕𒀸')
>> ('Xsux', 0.5, {'details': {'Xsux': 0.5, 'Zyyy': 0.5}, 'tie': True, 'interval': 0.0})
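The outputs above are three-element tuples: the top script, its share, and a dictionary with `details`, `tie`, and `interval`. Judging from these examples, `interval` appears to be the gap between the two highest shares, though that is an inference, not documented behaviour. A minimal sketch of how a caller might accept a prediction only when it is clear-cut follows. `confident_script` and its thresholds are hypothetical, not part of GlotScript.

```python
def confident_script(result, min_share=0.6, min_interval=0.2):
    """Return the predicted script code, or None when the prediction looks ambiguous.

    `result` is a tuple shaped like the outputs shown above:
    (code, share, {"details": {...}, "tie": bool, "interval": float}).
    The thresholds are arbitrary illustration values.
    """
    code, share, info = result
    if info["tie"] or share < min_share or info["interval"] < min_interval:
        return None
    return code


# Tuples copied from the usage examples above.
japanese = ("Hira", 0.625, {"details": {"Hira": 0.625, "Hani": 0.375}, "tie": False, "interval": 0.25})
mixed = ("Xsux", 0.5, {"details": {"Xsux": 0.5, "Zyyy": 0.5}, "tie": True, "interval": 0.0})

print(confident_script(japanese))  # Hira
print(confident_script(mixed))     # None
```

In real use the tuple would come from `sp(text)`. The literals here only keep the sketch runnable without the package installed.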
Usage: Script Separation
from GlotScript import separate_script
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
separate_script(sent)
>> {
"Latn":"Hello Salut ",
"Hebr":" שלום ",
"Arab":" سلام مرحبا",
"Hani":" 你好 ",
"Hira":" こんにちは "
}
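The dictionary returned by `separate_script` can be post-processed like any other mapping. As a small sketch, the hypothetical helper below (not part of GlotScript) counts the non-whitespace characters per script, which makes it easy to pick the dominant script of a mixed sentence.

```python
def script_char_counts(separated):
    """Count non-whitespace characters for each script in a {script: text} dict."""
    return {
        code: sum(1 for ch in text if not ch.isspace())
        for code, text in separated.items()
    }


# A subset of the output shown above, used as literal input.
sample = {
    "Latn": "Hello Salut ",
    "Hani": " 你好 ",
    "Hira": " こんにちは ",
}

counts = script_char_counts(sample)
dominant = max(counts, key=counts.get)
```

With the package installed, `sample` would be replaced by the return value of `separate_script(sent)`.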
Exploring Unicode Blocks: Related Sources
- List of Unicode characters - Wikipedia
- Lightweight Plain-Text Editor for macOS - CotEditor
- The Cygwin Terminal – terminal emulator for Cygwin, MSYS, and WSL - mintty
- ISO_15924 Wikipedia
- Unicode Character Database (Blocks) - Unicode
- Unicode Character Database (Scripts) - Unicode
- A free, web-based font editor, focusing on font design hobbyists. - Glyphr-Studio-1
- Kotlin - JetBrains
- UNIX-like reverse engineering framework and command-line toolset - radare2
- FreeOrion Game
- DOMinator - Firefox
- SHSans-derived CJK font family - glow-sans
- Unicode Subset Bitfields - Microsoft
- Stops - FAIR NLLB FB
- Gradient Boosting on Decision Trees - catboost
- Blender
- Unicode Wikipedia
Citation
If you use any part of this library in your research, please cite it using the following BibTeX entry.
@article{kargaran2023glotscript,
title = {GlotScript: A Resource and Tool for Low Resource Writing System Identification},
author = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year = 2023,
journal = {arXiv preprint arXiv:2309.13320}
}