A package for detecting the script (writing system) of given text.

GlotScript

  • GlotScript-Resource: provides a resource displaying the writing systems for various languages.

  • GlotScript-Tool: determines the script (writing system) of input text using ISO 15924.

Resource

What writing system is each language written in?

Example:

Language          CORE   AUXILIARY
Turkish (tur)     Latn   Arab, Cyrl, Grek
Thai (tha)        Thai   Latn
Vietnamese (vie)  Latn   Hani

See metadata folder for more languages.
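
As a rough illustration of how such a table might be used, the sketch below hard-codes the three example rows and classifies a script code for a given language. The `WRITING_SYSTEMS` dictionary and the `script_status` helper are hypothetical names introduced here; they are not part of GlotScript, and the files in the metadata folder may be laid out differently.

```python
# Hypothetical in-memory copy of the three example rows above.
# This is NOT the format of the metadata folder, only an illustration.
WRITING_SYSTEMS = {
    "tur": {"core": {"Latn"}, "auxiliary": {"Arab", "Cyrl", "Grek"}},
    "tha": {"core": {"Thai"}, "auxiliary": {"Latn"}},
    "vie": {"core": {"Latn"}, "auxiliary": {"Hani"}},
}


def script_status(lang, script):
    """Classify an ISO 15924 script code for an ISO 639-3 language code."""
    entry = WRITING_SYSTEMS.get(lang)
    if entry is None:
        return "unknown language"
    if script in entry["core"]:
        return "core"
    if script in entry["auxiliary"]:
        return "auxiliary"
    return "unexpected"


print(script_status("tur", "Latn"))
print(script_status("tur", "Cyrl"))
print(script_status("tha", "Hani"))
```

A lookup like this could flag text whose detected script is neither core nor auxiliary for its labelled language.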

Tool

It's a Python library that detects the script (writing system) of text based on ISO 15924.

Special codes

  • Zinh code is the Unicode script property value of characters that may be used with multiple scripts, and that inherit their script from a preceding base character. In some cases, we opted to integrate parts of the Zinh code (e.g. ARABIC FATHATAN..ARABIC HAMZA BELOW, ARABIC LETTER SUPERSCRIPT ALEF) into a different block.
  • Zyyy code is the Unicode script value for "Common" characters, which are shared across scripts (e.g. punctuation and digits).
  • Zzzz code is the ISO 15924 code for "uncoded" script.
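
To make the Zinh examples above concrete, the following sketch lists the characters in the range ARABIC FATHATAN..ARABIC HAMZA BELOW (U+064B..U+0655) plus ARABIC LETTER SUPERSCRIPT ALEF (U+0670) using only Python's standard `unicodedata` module. It does not call GlotScript and does not show how the library reassigns these characters; it only shows that they are combining marks, which is why Unicode gives them an inherited script. The helper name `describe` is made up for this example.

```python
import unicodedata

# Code points mentioned above: U+064B..U+0655 and U+0670.
ZINH_EXAMPLES = list(range(0x064B, 0x0655 + 1)) + [0x0670]


def describe(cp):
    """Return (code point label, Unicode name, general category)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.category(ch))


rows = [describe(cp) for cp in ZINH_EXAMPLES]
for label, name, category in rows:
    # Category "Mn" means a nonspacing (combining) mark.
    print(label, name, category)
```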

Install

from pip

pip3 install GlotScript

from git

pip3 install GlotScript@git+https://github.com/cisnlp/GlotScript

Usage

Script Detection

from GlotScript import sp
sp('これは日本人です')
>> ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375}, 'tie': False, 'interval': 0.25})
sp('This is Latin')[:1]
>> ('Latn', 1.0)
sp('මේක සිංහල')[0]
>> 'Sinh'
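
The result shown above is a tuple of the top script code, its share, and a dictionary with `details`, `tie`, and `interval`. A minimal sketch of how a caller might act on that shape follows. It works on a literal copy of the first output above, so it runs without GlotScript installed. `dominant_script` and its `min_share` parameter are hypothetical names, not part of the library.

```python
def dominant_script(result, min_share=0.5):
    """Return the top script code, or None if it is tied or below min_share.

    `result` is assumed to have the (code, share, info) shape shown above.
    """
    code, share, info = result
    if info.get("tie") or share < min_share:
        return None
    return code


# Literal copy of the first example output above.
example = ('Hira', 0.625, {'details': {'Hira': 0.625, 'Hani': 0.375},
                           'tie': False, 'interval': 0.25})

print(dominant_script(example))        # Hira
print(dominant_script(example, 0.7))   # None
```

Raising `min_share` is one way to treat heavily mixed-script text as undecided rather than forcing a single label.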

Script Separation

from GlotScript import sc 
sent = "Hello Salut سلام 你好 こんにちは שלום مرحبا"
sc(sent)
>> {
   "Latn":"Hello Salut     ",
   "Hebr":"     שלום ",
   "Arab":"  سلام    مرحبا",
   "Hani":"   你好   ",
   "Hira":"    こんにちは  "
}
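
In the output above, each script's string keeps whitespace where text in other scripts was removed. If only the words are needed, a small post-processing step can collapse that padding. The sketch below assumes the dictionary shape shown above and uses a shortened literal sample rather than calling `sc`. `compact_segments` is a hypothetical helper, not part of GlotScript.

```python
def compact_segments(separated):
    """Collapse runs of whitespace in each per-script string and drop empty ones."""
    return {
        script: " ".join(text.split())
        for script, text in separated.items()
        if text.strip()
    }


# Shortened sample in the same shape as the output above.
sample = {"Latn": "Hello Salut     ", "Hani": "   你好   ", "Hira": "   "}
compact = compact_segments(sample)
print(compact)
```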

Citation

If you use any part of our resource or tool in your research, please cite it using the following BibTeX entry.

@inproceedings{kargaran-etal-2024-glotscript-resource,
    title = "{G}lot{S}cript: A Resource and Tool for Low Resource Writing System Identification",
    author = {Kargaran, Amir Hossein  and
      Yvon, Fran{\c{c}}ois  and
      Sch{\"u}tze, Hinrich},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.687",
    pages = "7774--7784"
}
