A library for text normalization for a set of languages

NORMALIZATION V4.0 - 4/2021 Ramy Eskander rnd2110@columbia.edu

Currently Supported Languages:

  • English
  • Swahili (1A)
  • Tagalog (1B)
  • Somali (1S)
  • Lithuanian (2B)
  • Bulgarian (2S)
  • Pashto (2C)
  • Farsi (3S)
  • Kazakh (3C)
  • Georgian (3B)
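
The usage notes state that the language argument is case-insensitive and accepts material codes, ISO codes and full language names. A minimal sketch of how such a lookup could work follows. The function `resolve_language` and its alias table are hypothetical, not part of the library, and cover only a few of the languages listed above, using the codes given in this description.

```python
# Hypothetical alias table for illustration; the library's real lookup may differ.
# Only aliases that appear in this description are included.
_ALIASES = {
    "Swahili": {"1a", "swa", "swahili"},
    "Tagalog": {"1b", "tagalog"},
    "Somali": {"1s", "somali"},
    "Lithuanian": {"2b", "lithuanian"},
}


def resolve_language(label):
    """Map a material code, ISO code or full name to a canonical language name."""
    key = label.strip().lower()  # case-insensitive match
    for name, aliases in _ALIASES.items():
        if key in aliases:
            return name
    raise ValueError(f"Unsupported language: {label!r}")


print(resolve_language("1A"))
print(resolve_language("swa"))
```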

Updates since V3.2

  • Added support for Kazakh
  • Better coverage of punctuation marks, numbers and symbols.
  • Numbers are checked before punctuation removal, so that numeric periods, for instance, remain intact.
  • '&' is treated as a punctuation mark (as Python built-in punctuation checks are used)
  • Isolated single quotation marks are differentiated from apostrophes.
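
The ordering described above (numbers are checked before punctuation is removed) can be illustrated with a short sketch. This is not the library's code: the number pattern and the function names are assumptions, and only ASCII punctuation from Python's `string.punctuation` (which includes '&') is handled.

```python
import re
import string

# Assumed number pattern: digit groups joined by '.' or ',' (e.g. 3.14 or 1,000.50).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def _strip_punct(segment):
    """Drop every ASCII punctuation character from a text segment."""
    return "".join(ch for ch in segment if ch not in string.punctuation)


def strip_punct_keep_numbers(text):
    """Remove punctuation, but copy numeric tokens through unchanged."""
    out = []
    pos = 0
    for match in NUMBER.finditer(text):
        out.append(_strip_punct(text[pos:match.start()]))  # text before the number
        out.append(match.group())                          # the number itself, intact
        pos = match.end()
    out.append(_strip_punct(text[pos:]))                   # trailing text
    return "".join(out)


print(strip_punct_keep_numbers("Pi is 3.14, roughly!"))
```

Because numbers are located first, the period inside 3.14 survives while the comma and exclamation mark are removed.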

Updates since V3.5

  • Added support for Georgian
  • Better coverage of punctuation marks, symbols and apostrophes.

Usage (default config):

    import normalization

    language = "Swahili"  # a material code ("1A") or ISO code ("SWA") also works
    text = "some text"
    normalized_text = normalization.process(
        language,
        text,
        letters_to_keep='',
        letters_to_remove='',
        lowercase=False,
        remove_repetitions_count=-1,
        remove_punct=False,
        remove_digits=False,
        remove_vowels=False,
        remove_diacritics=True,
        remove_spaces=False,
        remove_apostrophe=False,
        copy_through=True,
        keep_romanized_text=True,
    )

Where the arguments are as follows:

1- language:string (case-insensitive): Material codes (e.g., 1A), ISO codes (e.g., SWA) and full language names (e.g., Swahili) are all accepted inputs.
2- text:string
3- letters_to_keep:string (case-sensitive): Letters that must be kept. This overrides the removal of vowels, diacritics, non-alphabet characters and built-in language mappings -- "" means do not use this feature.
4- letters_to_remove:string (case-sensitive): Letters that must be removed -- "" means do not use this feature.
5- lowercase:boolean
6- remove_repetitions_count:int: The maximum number of allowed character repetitions (in a sequence), e.g., when set to 2, “mannner” changes to “manner” -- 0 means do not use this feature. This step runs after the built-in mapping, lowercasing and removal of extras (e.g., zero-width non-joiners) and before any other operations.
7- remove_punct:boolean: Covers both punctuation marks and symbols.
8- remove_digits:boolean
9- remove_vowels:boolean: Does not cover the short-vowel diacritics in Pashto and Farsi, and does not affect non-alphabet characters of the underlying languages.
10- remove_diacritics:boolean
11- remove_spaces:boolean
12- remove_apostrophe:boolean
13- copy_through:boolean: When set to True, none of the foreign letters gets omitted.
14- keep_romanized_text:boolean: This argument applies when the language has a non-Latin script (Bulgarian and Pashto). When set to True, none of the letters of the Romanized Bulgarian script (a-z + ŭĭui͡ei͡ai͡u), in the case of Bulgarian, or of the Romanized Pashto script (a-z), in the case of Pashto, gets omitted. When set to False, the Romanized Bulgarian letters are transliterated into the Cyrillic script, in the case of Bulgarian, and the Romanized Pashto letters are omitted (transliteration in Pashto is not supported).
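
To make the behaviour of remove_repetitions_count concrete, here is a minimal sketch of collapsing character runs with a regular expression. The function `limit_repetitions` is a hypothetical stand-in, not the library's implementation, and treating every non-positive value as "feature off" is an assumption based on the 0 rule and the -1 default shown in the usage.

```python
import re


def limit_repetitions(text, max_count):
    """Collapse any run of one character longer than max_count to max_count."""
    if max_count <= 0:
        # Assumption: non-positive values leave the text unchanged.
        return text
    # (.)\1{N,} matches a character followed by at least N more copies of itself,
    # i.e. a run of N+1 or more identical characters.
    pattern = re.compile(r"(.)\1{%d,}" % max_count, re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * max_count, text)


print(limit_repetitions("mannner", 2))
```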

Note: In the case of Pashto, text cleanup is always performed. This includes:

1- Converting some letters that are not in the original Pashto and Farsi scripts to their Pashto counterparts
2- Normalizing the orthography by resolving the cases where two or more letters are used interchangeably (in an inconsistent manner)

(1) and (2) can be overridden by using the letters_to_keep parameter.

In the case of Kazakh:

--- Arabic letters are always transliterated into Cyrillic.
--- Latin letters are transliterated when keep_romanized_text is set to False.
--- Latin homoglyphs are transliterated if the text has at least one Cyrillic character that belongs to the Kazakh script.
--- Kazakh extended Cyrillic letters are not transliterated into regular Cyrillic.
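
The Kazakh homoglyph rule can be sketched as follows. This is an illustration only: the homoglyph table is a small assumed subset (lowercase letters only), the function name is hypothetical, and the library's actual tables and conditions may differ. The sketch also replaces every matching Latin letter once the condition holds, which is a simplification.

```python
# Assumed subset of Latin letters that look like Cyrillic ones:
# a, e, o, p, c, x, y mapped to their Cyrillic look-alikes.
HOMOGLYPHS = {
    "a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440",
    "c": "\u0441", "x": "\u0445", "y": "\u0443",
}

# Lowercase Kazakh Cyrillic alphabet: the basic range U+0430..U+044F, plus
# U+0451 and the Kazakh-specific letters (U+04D9, U+0493, U+049B, U+04A3,
# U+04E9, U+04B1, U+04AF, U+04BB, U+0456).
KAZAKH_LETTERS = {chr(cp) for cp in range(0x0430, 0x0450)} | set(
    "\u0451\u04d9\u0493\u049b\u04a3\u04e9\u04b1\u04af\u04bb\u0456"
)


def fix_homoglyphs(text):
    """Replace Latin look-alikes only if the text contains a Kazakh Cyrillic letter."""
    if not any(ch.lower() in KAZAKH_LETTERS for ch in text):
        return text  # purely Latin (or other) text is left alone
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


# A Kazakh word typed with a Latin "c" as its first letter.
mixed = "c\u04d9\u043b\u0435\u043c"
print(fix_homoglyphs(mixed) == "\u0441\u04d9\u043b\u0435\u043c")
```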

Download files

  • Source distribution: scriptnorm-4.0.tar.gz (9.4 kB)
  • Built distribution: scriptnorm-4.0-py3-none-any.whl (10.1 kB, Python 3)
