A package for detecting the script (writing system) of given text.

GlotScript

  • GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.

  • GlotScript-Resource: provides a resource displaying the writing systems for various languages.

GlotScript Resource

What writing system is each language written in?

See the metadata folder.

GlotScript Tool

Detect the script (writing system) of text based on ISO 15924.

Special codes

  • Zinh is the Unicode script property value for "Inherited" characters: those that may be used with multiple scripts and that inherit their script from the preceding base character. In some cases, we opted to integrate parts of Zinh (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
  • Zyyy is the Unicode script property value for "Common" characters.
  • Zzzz is the ISO 15924 code for "uncoded" script, which Unicode uses for characters whose script is unknown.
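
When post-processing script proportions, it can be useful to set the special codes aside and look only at concrete scripts. The following is a minimal sketch under that assumption; `SPECIAL_CODES` and `dominant_script` are hypothetical names for illustration and are not part of GlotScript, and the input is a plain dict mapping script codes to proportions.

```python
# The three special codes listed above. Treating them as "not a concrete
# script" is an assumption of this sketch, not documented GlotScript behaviour.
SPECIAL_CODES = {"Zinh", "Zyyy", "Zzzz"}


def dominant_script(details):
    """Return the highest-scoring non-special script code, or None if none exists."""
    candidates = {code: share for code, share in details.items()
                  if code not in SPECIAL_CODES}
    if not candidates:
        return None
    # Pick the script code with the largest proportion.
    return max(candidates, key=candidates.get)


print(dominant_script({"Xsux": 0.5, "Zyyy": 0.5}))  # prints: Xsux
```

With this filter, a text that is half Cuneiform and half Common characters resolves to `Xsux` rather than being left ambiguous.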

Install from pip

pip3 install GlotScript

Install from git

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage: Script Detection

from GlotScript import get_script_predictor
sp = get_script_predictor()

Or import the predictor directly:

from GlotScript import sp
sp('これは日本人です')
>> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
sp('This is Latin')[:1]
>> ('Latn', 1.0)
sp('මේක සිංහල')[0]
>> 'Sinh'
sp('𝄞𝄫  𒊕𒀸')
>> ('Xsux', 0.5, {'details': {'Xsux': 0.5, 'Zyyy': 0.5}, 'tie': True, 'interval': 0.0})
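
The outputs above are 3-tuples of the main script, its score, and a dict with `details`, `tie`, and `interval`. In these examples `interval` equals the gap between the two highest scores, though that reading is inferred from the outputs rather than documented here. A minimal sketch of how a caller might reject uncertain predictions follows; `accept_prediction` and its `min_score` threshold are hypothetical and not part of GlotScript.

```python
def accept_prediction(result, min_score=0.6):
    """Return the predicted script code, or None when the prediction is uncertain.

    `result` is a tuple shaped like the outputs shown above:
    (script, score, {"details": {...}, "tie": bool, "interval": float}).
    `min_score` is an illustrative threshold chosen for this sketch.
    """
    script, score, info = result
    # Reject ties and low-confidence predictions.
    if info["tie"] or score < min_score:
        return None
    return script


japanese = ("Hira", 0.625,
            {"details": {"Hira": 0.625, "Hani": 0.375}, "tie": False, "interval": 0.25})
mixed = ("Xsux", 0.5,
         {"details": {"Xsux": 0.5, "Zyyy": 0.5}, "tie": True, "interval": 0.0})

print(accept_prediction(japanese))  # prints: Hira
print(accept_prediction(mixed))     # prints: None
```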

Usage: Script Separation

from GlotScript import separate_script
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
separate_script(sent)
>> {
   "Latn":"Hello Salut     ",
   "Hebr":"     שלום ",
   "Arab":"  سلام    مرحبا",
   "Hani":"   你好   ",
   "Hira":"    こんにちは  "
}
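
In the output above, each per-script string keeps the spaces of the original sentence, so values carry leading, trailing, and repeated whitespace. If only the words matter, the whitespace can be collapsed afterwards. This is a minimal sketch; `tidy_separated` is a hypothetical helper, not part of GlotScript, and it works on any dict shaped like the one shown above.

```python
def tidy_separated(parts):
    """Collapse runs of whitespace and trim both ends of each per-script string."""
    return {script: " ".join(text.split()) for script, text in parts.items()}


# A subset of the result shown above, copied by hand for illustration.
example = {"Latn": "Hello Salut     ", "Hani": "   你好   "}
print(tidy_separated(example))  # prints: {'Latn': 'Hello Salut', 'Hani': '你好'}
```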

Exploring Unicode Blocks: Related Sources

Citation

If you use any part of this library in your research, please cite it using the following BibTeX entry.

@article{kargaran2023glotscript,
title        = {GlotScript: A Resource and Tool for Low Resource Writing System Identification},
author       = {Kargaran, Amir Hossein and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year         = 2023,
journal      = {arXiv preprint arXiv:2309.13320}
}
