Skip to main content

Searches for [ab]using of Unicode glyphs

Project description

DirtyText

Searches for [ab]using of Unicode glyphs.

Installation

DirtyText package can be installed through pip :snake: :

$ pip install dirtytext

or downloaded from GitHub.

Quick tour:

Common options:

  • Read from file: -f <filename>
  • Save modified text: -s <file>
  • Text filter: --filter
  • Pipeline mode: -p

:mag_right: Looks for ZERO-WIDTH characters:

$> echo "This text‌‌‌‌‍‌‬‌‌‌‌‌‍‬‍‍ ‌‌‌‌‍‬‌contains‌‌‌‌‍‬‌‌‌‌‌‍‬‌‌‌‌‌‬‌‌‌‌‌‌‍‍‍‌‌‌‌‍‬ ‌‌‌‌‍‌‬‌‌‌‌‍‬‌zero-width‌‌‌‌‍‬‍‌ chars" | dirtytext --zero -v

will produce the following output:

Contains zero-width characters: True
JSON:    
[{"idx": 0, "char": "\ufeff", "cval": "FEFF", "infos": null}, 
{"idx": 10, "char": "\u200c", "cval": "200C", "infos": null}, 
{"idx": 11, "char": "\u200c", "cval": "200C", "infos": null}, ...]

:mag_right: Looks for CONFUSABLES characters:

$> echo "hello" | dirtytext --confusables greek -v

will produce the following output:

Contains confusables characters: True
JSON:
[{"idx": 2, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, 
{"idx": 3, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, 
{"idx": 4, "char": "o", "cval": "006F", "infos": [{"target": "03BF", "description": "GREEK SMALL LETTER OMICRON"}, 
{"target": "03C3", "description": "GREEK SMALL LETTER SIGMA"}]}]

:mag_right: Looks and filter anomalies in LATIN text:

example.txt:

It ⅽan be argueⅾ that the ⅽomputer ⅰs humanⅰty’s attempt to repⅼⅰⅽate the human brain.
This ⅰs perhaps an unattainable goal. 
However, unattainable goals often lead to outstanding accomplishment.
$> dirtytext -f example.txt --lsubs --filter -s out.txt
out.txt:

It can be argued that the computer is humanity’s attempt to replicate the human brain.
This is perhaps an unattainable goal. 
However, unattainable goals often lead to outstanding accomplishment.

UnicodeDB

The unicode data that composes dirtytext database are extracted from unicode consortium, in particular there are two database files into dirtytext/data directory:

  • categories.json: built from data extracted from here
  • confusables.json: built from data extracted from here

If dirtytext/data doesn't exist, DT downloads and build database before performing the required operations, after which you can force the database update by adding the --update option

License

Released under GPL-3.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dirtytext-1.0.0.tar.gz (111.4 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page