# wildebeest
The wildebeest scripts investigate, repair and normalize text for a wide range of issues at the character level.
## wb-ana (or wb_analysis.py)
This script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, zero-width characters, letters/numbers/punctuation/letter-modifiers from various scripts (e.g. Latin and Cyrillic), tokens with letters from different scripts, XML tokens, tokens with certain punctuation of interest, orphan letter modifiers, and non-canonical character combinations.
## wb-norm (or wb_normalize.py)
This script automatically corrects some of the issues raised by wb-ana. The script can repair common encoding errors, normalize characters into their UTF8-canonical form, map digits and some punctuation to ASCII, delete many non-printable characters and perform other repair, normalization and cleaning steps. A few steps are specific to Pashto, Farsi, or Devanagari (Hindi etc.). Normalization steps can be activated à la carte.
## Installation

```sh
# Install from PyPI:
pip install wildebeest-nlp

# Alternatively, pip-install from the GitHub master branch:
pip install git+https://github.com/uhermjakob/wildebeest.git

# Alternatively, clone from GitHub, which might be useful for editing/development:
git clone https://github.com/uhermjakob/wildebeest.git
# or: git clone git://github.com/uhermjakob/wildebeest.git
cd wildebeest
pip install --editable .   # run from the directory containing setup.py
```
A pip-install provides the commands `wb-norm` and `wb-ana`, as well as their alternate forms `wb_normalize.py` and `wb_analysis.py`.

After a regular `git clone` (without pip-install), in order to be able to call the Python scripts `wb_normalize.py` and `wb_analysis.py`, make sure that:
- `wb_normalize.py` and `wb_analysis.py` are executable (i.e. 'x' mode bits are set),
- your `$PYTHONPATH` includes the directory in which this README file resides ("outer wildebeest"), and
- your `$PATH` includes the directory that contains `wb_normalize.py` and `wb_analysis.py` ("inner wildebeest").
## wb-norm (or wb_normalize.py)
The script repairs common encoding errors, normalizes characters into their canonical form, deletes many non-printable characters and performs other repair, normalization and cleaning steps. The script can be parameterized to include or exclude specific normalization steps (e.g. whether or not to map non-ASCII digits and punctuation to ASCII). A few steps are specific to Pashto, Farsi, or Devanagari (Hindi etc.).

### Usage
CLI to normalize a file: `wb-norm` or `wb_normalize.py`

```
usage: wb-norm [-h] [-i INPUT-FILENAME] [-o OUTPUT-FILENAME] [--lc LANGUAGE-CODE] [--skip NORM-STEPS]
               [--add NORM-STEPS] [--all] [--all-except NORM-STEPS] [--only NORM-STEPS] [-v] [--version]
# or wb_normalize.py [-h] ...

Normalizes and cleans a given text

options:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  --skip NORM-STEPS     perform all default normalization/cleaning steps except those specified in comma-separated list
                        (default normalization/cleaning steps: repair-encoding-errors,del-surrogate,del-ctrl-char,
                        del-tatweel,core-compat,pres-form,hangul,repair-combining,combining-compose,combining-decompose,
                        repair-xml,repair-url-escapes)
  --add NORM-STEPS      perform all default normalization/cleaning steps plus those specified in comma-separated list
                        (non-default normalization/cleaning steps: del-zero-width,del-arabic-diacr,del-hebrew-diacr,
                        ligatures,signs-and-symbols,cjk,width,font,small,vertical,enclosure,punct,punct-dash,punct-arabic,
                        punct-cjk,punct-greek,punct-misc-f,space,digit,arabic-char,farsi-char,pashto-char,georgian-char,
                        look-alike,repair-token)
  --all                 perform all normalization/cleaning steps, i.e. repair-encoding-errors,del-surrogate,
                        del-zero-width,del-ctrl-char,del-tatweel,del-arabic-diacr,del-hebrew-diacr,core-compat,pres-form,
                        ligatures,signs-and-symbols,cjk,width,font,small,vertical,enclosure,hangul,repair-combining,
                        combining-compose,combining-decompose,punct,punct-dash,punct-arabic,punct-cjk,punct-greek,
                        punct-misc-f,space,digit,arabic-char,farsi-char,pashto-char,georgian-char,look-alike,repair-xml,
                        repair-url-escapes,repair-token
  --all-except NORM-STEPS
                        perform all normalization/cleaning steps except those specified in comma-separated list
  --only NORM-STEPS     perform only normalization/cleaning steps specified in comma-separated list
  -v, --verbose         write change log etc. to STDERR
  --version             show program's version number and exit
```
Examples:

```sh
wb-norm -h   # for full usage info
wb-norm --version
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`   # go to the directory where wildebeest-nlp is installed
cd wildebeest/test/data
wb-norm --lc fas -i wildebeest-test.txt -o wildebeest-test-norm.txt
wb-norm --lc fas --verbose --skip del-ctrl-char,del-tatweel < wildebeest-test.txt > wildebeest-test-norm-custom.txt
wb-norm --all < wildebeest-test.txt > wildebeest-test-norm-all.txt
wb-norm --all-except del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb-norm --only del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb-norm --add del-arabic-diacr,del-hebrew-diacr --skip del-ctrl-char,del-tatweel < wildebeest-test.txt
```
Same for the alternate script name `wb_normalize.py`:

```sh
wb_normalize.py -h   # for full usage info
wb_normalize.py --version
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`
cd wildebeest/test/data
wb_normalize.py --lc fas -i wildebeest-test.txt -o wildebeest-test-norm.txt
wb_normalize.py --lc fas --verbose --skip del-ctrl-char,del-tatweel < wildebeest-test.txt > wildebeest-test-norm-custom.txt
wb_normalize.py --all < wildebeest-test.txt > wildebeest-test-norm-all.txt
wb_normalize.py --all-except del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb_normalize.py --only del-arabic-diacr,del-hebrew-diacr < wildebeest-test.txt
wb_normalize.py --add del-arabic-diacr,del-hebrew-diacr --skip del-ctrl-char,del-tatweel < wildebeest-test.txt
```
Note: For robustness regarding input files that do not fully conform to UTF8, please use -i (rather than STDIN), as it includes UTF8-encoding error handling.
### norm_clean_string (Python function call to normalize a string)
Note: When working on a clone (as opposed to a pip-install), please make sure that your $PYTHONPATH includes the directory in which this README file resides.
```python
from wildebeest.wb_normalize import Wildebeest
wb = Wildebeest()
ht = wb.build_norm_step_dict(base='ALL')  # base values: 'NONE', 'DEFAULT', 'ALL' (normalization steps)
# ht = wb.build_norm_step_dict()  # defaults: base='DEFAULT', skip=None, add=None
# ht = wb.build_norm_step_dict(base='NONE', add=['digit', 'enclosure'])  # normalize only digits (to ASCII) and enclosures
# ht = wb.build_norm_step_dict(base='DEFAULT', skip=['del-tatweel'], add=['digit', 'space'])
# ht = wb.build_norm_step_dict(base='ALL', skip=['punct-dash', 'enclosure', 'del-arabic-diacr'])
wb.load_look_alike_file()  # optional
print(wb.norm_clean_string('🄐…25km²', ht, lang_code='eng'))
print(wb.norm_clean_string('೧೯೨೩', ht, lang_code='kan'))
```
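Regarding the robustness note above: when reading files that may not be fully UTF8-conformant from Python, one approach is an explicit error handler at file-open time. A minimal sketch (`input.txt` is a placeholder filename), assuming that `surrogateescape` plus the default `del-surrogate` step is an acceptable approximation of the CLI's `-i` error handling, not necessarily identical to it:

```python
from wildebeest.wb_normalize import Wildebeest

wb = Wildebeest()
ht = wb.build_norm_step_dict()  # default steps include del-surrogate

# surrogateescape turns invalid bytes into lone surrogates instead of raising;
# the del-surrogate step can then remove them during normalization.
with open('input.txt', encoding='utf-8', errors='surrogateescape') as f:
    for line in f:
        print(wb.norm_clean_string(line.rstrip('\n'), ht, lang_code='eng'))
```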
### Normalization Steps
The script can perform a wide variety of normalization steps.
- 12 normalization steps are performed by default, including basic character repair and UTF8 encoding normalization. The default is generally suitable for applications that largely need to preserve the original text.
- Another 25 normalization steps are available through the options `--add (list of steps)`, `--all` and `--all-except (list of steps)`. The `--all` and `--all-except` settings are suitable for many NLP applications.
- Default normalization steps can be disabled with the option `--skip (list of steps)`.
- The option `--only (list of steps)` applies only the normalization steps listed (without default normalization steps, unless explicitly listed).
- The option `--all-except (list of steps)` is equivalent to `--all --skip (list of steps)`; the sketch below shows the corresponding Python API calls.
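As referenced in the list above, these CLI option combinations have plausible equivalents in the `base`/`skip`/`add` arguments of `build_norm_step_dict`, following the patterns shown in the norm_clean_string section:

```python
from wildebeest.wb_normalize import Wildebeest

wb = Wildebeest()
# --only del-arabic-diacr,del-hebrew-diacr
ht_only = wb.build_norm_step_dict(base='NONE', add=['del-arabic-diacr', 'del-hebrew-diacr'])
# --all-except punct-dash,enclosure  (i.e. --all --skip punct-dash,enclosure)
ht_all_except = wb.build_norm_step_dict(base='ALL', skip=['punct-dash', 'enclosure'])
# --add digit,space --skip del-tatweel  (default steps, plus/minus individual steps)
ht_custom = wb.build_norm_step_dict(base='DEFAULT', add=['digit', 'space'], skip=['del-tatweel'])
```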
#### List of normalization steps included by default
- `repair-encoding-errors` The script generally expects input encoded in UTF8. However, it will recognize and repair some common text encoding errors:
  - (Some) text is still encoded in Windows-1252 or Latin-1. Any byte that is not part of a well-formed UTF8 character will be interpreted as a Windows-1252 character (and mapped to UTF8). This includes printable Latin-1 characters as a subset.
  - Text in Windows-1252 was incorrectly converted to UTF8 by a Latin-1-to-UTF8 converter. This maps Windows-1252 characters \x80-\x9F to \u0080-\u009F, which is the Unicode block of C1 control characters. These C1 control characters are extremely rare, so our script interprets such C1 control characters as ill-converted Windows-1252 characters, as do many major software applications such as Google Chrome, Microsoft Outlook, GitHub (text files) and PyCharm (where they are often displayed in a slightly different form).
  - Text in Windows-1252 or Latin-1 was converted twice, using some combination of a Latin-1-to-UTF8 converter and a Windows-1252-to-UTF8 converter; or a file already in UTF8 was incorrectly subjected to another conversion (see the sketch after this list). Sample wildebeest repair:
    - Input: Donât tell your âfiancéâ â Schöne GrüÃe aus Mähren⦠â Ma sÅur trouve ça «bête». ¡Coño! â¬50 ⢠25km² ⢠½µm
    - Output: Don’t tell your “fiancé” — Schöne Grüße aus Mähren… – Ma sœur trouve ça «bête». ¡Coño! €50 • 25km² • ½µm
- `del-surrogate` deletes surrogate characters (representing non-UTF8 characters in input); alternative/backup to windows-1252
- `del-ctrl-char` deletes control characters (except tab and linefeed) and some variation selectors
- `del-tatweel` deletes Arabic tatweel (a text alignment character that increases the distance between Arabic letters)
- `core-compat` normalizes Hangul Compatibility characters to Unicode standard Hangul characters
- `pres-form` e.g. maps from presentation form (isolated, initial, medial, final) to standard form
- `hangul` combines Hangul jamos into Hangul syllables
- `repair-combining` e.g. order of nukta/vowel-sign
- `combining-compose` e.g. applies combining modifiers to the preceding character, e.g. ö (o + ̈) -> ö
- `combining-decompose` e.g. for some Indic characters, splits off nukta
- `repair-xml` e.g. repairs multi-escaped tokens such as `&quot;` or `&amp;#x200C;`
- `repair-url-escapes` e.g. repairs multi-escaped URL substrings such as Jo%25C3%25ABlle_Aubron
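As promised above, here is a minimal sketch (plain Python, not wildebeest code) of how the double-conversion mojibake arises and why it is mechanically reversible:

```python
# Illustration of the double-conversion error described under repair-encoding-errors.
s = 'Don\u2019t tell your \u201cfianc\u00e9\u201d'  # Don’t tell your “fiancé”
# UTF8 bytes misread as Latin-1 produce the garbled form shown in the sample input
# (some of the resulting C1 characters are invisible when displayed):
mojibake = s.encode('utf-8').decode('latin-1')
# The repair re-encodes as Latin-1 and decodes as UTF8, recovering the original:
assert mojibake.encode('latin-1').decode('utf-8') == s
```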
#### List of additional normalization steps included by the --all option

- `del-zero-width` deletes zero-width characters, byte order mark, directional marks, join marks
- `arabic-char` to Arabic canonical forms, e.g. maps Farsi kaf/yeh to Arabic versions
- `farsi-char` to Farsi canonical forms, e.g. maps Arabic yeh, kaf to Farsi versions
- `pashto-char` to Pashto canonical forms, e.g. maps Arabic kaf to Farsi version
- `georgian-char` to Georgian canonical forms, e.g. to standard script, maps archaic characters
- `ligatures` e.g. decomposes non-Arabic ligatures (e.g. ij, ffi, DŽ, ﬓ)
- `signs-and-symbols` e.g. maps symbols (e.g. kappa symbol) and signs (e.g. micro sign µ)
- `cjk` e.g. CJK square composites (e.g. ㋀㏾)
- `width` e.g. maps fullwidth and halfwidth characters to ASCII, e.g. Ａ to A
- `font` maps font-variation characters such as ℂ, ℹ, 𝒜 to regular characters
- `small` maps small versions of characters to normal versions, such as small ampersand ﹠ to regular &
- `vertical` maps vertical versions of punctuation characters to the normal horizontal version, such as vertical em-dash ︱ to horizontal em-dash —
- `enclosure` decomposes circled, squared and parenthesized characters, e.g. 🄐 to (A)
- `del-arabic-diacr` e.g. deletes optional Arabic diacritics such as fatha, damma, kasra
- `del-hebrew-diacr` e.g. deletes Hebrew points
- `digit` e.g. maps decimal-system digits of 54 scripts to ASCII digits
- `punct` e.g. maps ellipsis … to periods ... and two-dot lead ‥ to ..; a few math symbols ∭; ⒛ 🄆
- `punct-dash` e.g. maps various dashes, hyphens, minus signs to ASCII hyphen-minus
- `punct-arabic` e.g. Arabic exclamation mark etc. to ASCII equivalent
- `punct-cjk` e.g. Chinese ideographic full stop etc. to ASCII equivalent
- `punct-greek` e.g. Greek question mark etc. to ASCII equivalent
- `punct-misc-f` e.g. Tibetan punctuation to ASCII equivalent
- `space` e.g. maps non-zero-width spaces to normal space
- `look-alike` normalizes Latin/Cyrillic/Greek look-alike characters, e.g. Latin character A to Greek Α (capital alpha) in an otherwise Greek word
- `repair-token` e.g. splits +/-/*/digits off Arabic words; maps not-sign inside Arabic to token-separating hyphen
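For example, to apply just the `digit` step from this list through the Python API, a minimal sketch reusing the calls documented above (the expected output is the ASCII form '123'):

```python
from wildebeest.wb_normalize import Wildebeest

wb = Wildebeest()
ht = wb.build_norm_step_dict(base='NONE', add=['digit'])  # only map digits to ASCII
print(wb.norm_clean_string('\u0967\u0968\u0969', ht, lang_code='hin'))  # Devanagari १२३ -> '123'
```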
## wb-ana (or wb_analysis.py)
This script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, zero-width characters, letters/numbers/punctuation/letter-modifiers from various scripts, tokens with letters from different scripts, XML tokens, tokens with certain punctuation of interest, orphan letter modifiers, and non-canonical character combinations.
### Usage

CLI to analyze a file: `wb-ana` or `wb_analysis.py`
```
usage: wb-ana [-h] [-i INPUT-FILENAME] [--batch BATCH] [-s] [-o OUTPUT-FILENAME] [-j JSON-OUTPUT-FILENAME] [--file_id FILE_ID]
              [--lc LANGUAGE-CODE] [-v] [-pb] [-n MAX_CASES] [-x MAX_EXAMPLES] [-r REF-FILENAME] [--version]
# or wb_analysis.py [-h] ...

Analyzes a given text for a wide range of anomalies

options:
  -h, --help            show this help message and exit
  -i INPUT-FILENAME, --input INPUT-FILENAME
                        (default: STDIN)
  --batch BATCH_DIR     Directory with batch of input files (BATCH_DIR/*.txt)
  -s, --summary         single summary line per file
  -o OUTPUT-FILENAME, --output OUTPUT-FILENAME
                        (default: STDOUT)
  -j JSON-OUTPUT-FILENAME, --json JSON-OUTPUT-FILENAME
                        (default: None)
  --file_id FILE_ID
  --lc LANGUAGE-CODE    ISO 639-3, e.g. 'fas' for Persian
  -v, --verbose         write change log etc. to STDERR
  -pb, --progress_bar   Show progress bar
  -n MAX_CASES, --max_cases MAX_CASES
                        max number of cases per group
  -x MAX_EXAMPLES, --max_examples MAX_EXAMPLES
                        max number of examples per line
  -r REF-FILENAME, --ref_id_file REF-FILENAME
                        (optional file with sentence reference IDs)
  --version             show program's version number and exit
```
Examples:

```sh
wb-ana --help
echo 'Hеllο!' | wb-ana                    # 'Hеllο!' mischievously includes a Cyrillic and a Greek character
echo 'Hеllο!' | wb-norm --all | wb-ana    # different result
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`   # go to the directory where wildebeest-nlp is installed
cd wildebeest/test/data
wb-ana -i hello.txt
wb-ana -i wildebeest-test.txt -o wildebeest-test-out
wb-ana --batch phrasebook -s -o phrasebook-dir-out
wb-ana -i phrasebook/deu.txt -r phrasebook/eng.txt -o phrasebook-deu-out
wb-ana -i wildebeest-test-invalid-utf8.txt
```
Same for the alternate script name `wb_analysis.py`:

```sh
wb_analysis.py --help
echo 'Hеllο!' | wb_analysis.py
echo 'Hеllο!' | wb_normalize.py --all | wb_analysis.py
cd `pip show wildebeest-nlp | grep ^Location | cut -d ' ' -f 2`
cd wildebeest/test/data
wb_analysis.py -i hello.txt
wb_analysis.py -i wildebeest-test.txt -o wildebeest-test-out
wb_analysis.py --batch phrasebook -s -o phrasebook-dir-out
wb_analysis.py -i phrasebook/deu.txt -r phrasebook/eng.txt -o phrasebook-deu-out
wb_analysis.py -i wildebeest-test-invalid-utf8.txt
```
### wildebeest.wb_analysis.process (Python function call to analyze a string, a list of strings, or a file)
Note: When working on a clone (as opposed to a pip-install), please make sure that your $PYTHONPATH includes the directory in which this README file resides.
```python
import pprint
import sys
import wildebeest.wb_analysis as wb_ana
wb = wb_ana.process(string="Hеllο!")  # "Hеllο!" mischievously includes a Cyrillic and a Greek character
wb.pretty_print(sys.stdout)           # custom pretty-print with OVERVIEW and DETAIL sections to STDOUT
pprint.pprint(wb.analysis)            # generic pretty-print
```
```python
import wildebeest.wb_analysis as wb_ana
wb = wb_ana.process(strings=["Hеllο!", "Tschüß"])
print(wb.analysis)  # print analysis object (nested dictionary)
```
Assuming an input file `corpus.txt`, e.g. built by:

```sh
printf 'Hеllο!\nTschüß\n' > corpus.txt
```

```python
import wildebeest.wb_analysis as wb_ana
wb = wb_ana.process(in_file='corpus.txt')
print(wb.analysis)
```
```python
import wildebeest.wb_analysis as wb_ana

# Write the pretty-printed analysis to out.txt and a JSON version to out.json.
# (Renamed the file handle to json_file to avoid shadowing the json module.)
with open('out.txt', 'w') as out, open('out.json', 'w') as json_file:
    wb_ana.process(in_file='corpus.txt', pp_output=out, json_output=json_file)
```
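The file written via `json_output` can then be loaded back with the standard `json` module. A minimal sketch, assuming (as the parameter name suggests) that the file contains the analysis serialized as JSON:

```python
import json

with open('out.json') as f:
    analysis = json.load(f)  # presumably the same nested dictionary as wb.analysis
print(sorted(analysis.keys()))
```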
## wb-analysis.pl
This older Perl script searches a tokenized text for a range of potential problems, such as UTF-8 encoding violations, control characters, non-ASCII punctuation, characters from a variety of language groups, very long tokens, unsplit 's, unsplit punctuation, script mixing, and split URLs, email addresses, filenames and XML tokens. It reports the number of instances in each category and gives examples. Currently available: wildebeest_analysis.pl (Perl) v2.6 (April 28, 2021).