
DirtyText

Searches text for [ab]use of Unicode glyphs.

Installation

The DirtyText package can be installed through pip :snake: :

$ pip install dirtytext

or downloaded from GitHub.

Quick tour:

Common options:

  • Read from file: -f <filename>
  • Save modified text: -s <file>
  • Text filter: --filter
  • Pipeline mode: -p

:mag_right: Look for ZERO-WIDTH characters:

$> echo "This text‌‌‌‌‍‌‬‌‌‌‌‌‍‬‍‍ ‌‌‌‌‍‬‌contains‌‌‌‌‍‬‌‌‌‌‌‍‬‌‌‌‌‌‬‌‌‌‌‌‌‍‍‍‌‌‌‌‍‬ ‌‌‌‌‍‌‬‌‌‌‌‍‬‌zero-width‌‌‌‌‍‬‍‌ chars" | dirtytext --zero -v

will produce the following output:

Contains zero-width characters: True
JSON:    
[{"idx": 0, "char": "\ufeff", "cval": "FEFF", "infos": null}, 
{"idx": 10, "char": "\u200c", "cval": "200C", "infos": null}, 
{"idx": 11, "char": "\u200c", "cval": "200C", "infos": null}, ...]
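To make the record format concrete, here is a minimal Python sketch that flags invisible characters and emits records shaped like the output above. It is not DirtyText's own implementation: the function name `find_format_chars` is hypothetical, and the check (Unicode general category `Cf`, "format" characters) is an assumption that is somewhat broader than strictly zero-width characters.

```python
import unicodedata


def find_format_chars(text):
    """Return one record per character whose Unicode category is Cf.

    Assumption: category Cf is used as a stand-in for "zero-width";
    it covers U+200B, U+200C, U+200D, U+FEFF and similar characters.
    """
    records = []
    for idx, char in enumerate(text):
        if unicodedata.category(char) == "Cf":
            records.append({
                "idx": idx,                       # position in the input
                "char": char,                     # the character itself
                "cval": format(ord(char), "04X"),  # code point as hex
                "infos": None,                    # no extra data for this check
            })
    return records


if __name__ == "__main__":
    sample = "zero\u200cwidth\u200d"
    for record in find_format_chars(sample):
        print(record)
```

The `cval` field is the code point formatted as at least four uppercase hex digits, which matches values such as `200C` and `FEFF` in the output shown above.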

:mag_right: Look for CONFUSABLE characters:

$> echo "hello" | dirtytext --confusables greek -v

will produce the following output:

Contains confusables characters: True
JSON:
[{"idx": 2, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, 
{"idx": 3, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, 
{"idx": 4, "char": "o", "cval": "006F", "infos": [{"target": "03BF", "description": "GREEK SMALL LETTER OMICRON"}, 
{"target": "03C3", "description": "GREEK SMALL LETTER SIGMA"}]}]
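The lookup behind this kind of report can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DirtyText's code: the table `GREEK_CONFUSABLES` is hypothetical and holds only the two mappings visible in the sample output above, whereas the real tool reads its mappings from its database.

```python
import unicodedata

# Illustrative table only: these two entries are copied from the sample
# output above. A real table would be much larger.
GREEK_CONFUSABLES = {
    "l": ["\u0399"],            # GREEK CAPITAL LETTER IOTA
    "o": ["\u03bf", "\u03c3"],  # GREEK SMALL LETTER OMICRON / SIGMA
}


def find_confusables(text, table=GREEK_CONFUSABLES):
    """Return one record per character that has look-alikes in the table."""
    records = []
    for idx, char in enumerate(text):
        targets = table.get(char)
        if not targets:
            continue
        records.append({
            "idx": idx,
            "char": char,
            "cval": format(ord(char), "04X"),
            "infos": [
                {
                    "target": format(ord(target), "04X"),
                    "description": unicodedata.name(target),
                }
                for target in targets
            ],
        })
    return records


if __name__ == "__main__":
    for record in find_confusables("hello"):
        print(record)
```

The descriptions come from `unicodedata.name`, so they follow the official Unicode character names.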

:mag_right: Look for and filter anomalies in LATIN text:

example.txt:

It ⅽan be argueⅾ that the ⅽomputer ⅰs humanⅰty’s attempt to repⅼⅰⅽate the human brain.
This ⅰs perhaps an unattainable goal. 
However, unattainable goals often lead to outstanding accomplishment.

$> dirtytext -f example.txt --lsubs --filter -s out.txt

out.txt:

It can be argued that the computer is humanity’s attempt to replicate the human brain.
This is perhaps an unattainable goal. 
However, unattainable goals often lead to outstanding accomplishment.
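The substituted letters in example.txt are Roman numeral characters (for instance U+217D in place of "c"). One way to approximate this kind of filtering in plain Python is Unicode compatibility normalization. The sketch below uses NFKC as a swapped-in technique and is not how `--lsubs` is known to work; the function name `fold_latin_lookalikes` is hypothetical.

```python
import unicodedata


def fold_latin_lookalikes(text):
    """Replace non-ASCII characters whose NFKC form is plain ASCII letters.

    Characters without such a form (for example the typographic
    apostrophe U+2019) are left untouched.
    """
    out = []
    for char in text:
        if ord(char) < 128:
            out.append(char)  # already ASCII, keep as-is
            continue
        folded = unicodedata.normalize("NFKC", char)
        if folded.isascii() and folded.isalpha():
            out.append(folded)  # e.g. U+2170 becomes "i"
        else:
            out.append(char)    # legitimate non-ASCII stays
    return "".join(out)


if __name__ == "__main__":
    print(fold_latin_lookalikes("It \u217dan be argue\u217e"))
```

Normalizing character by character, instead of the whole string at once, keeps punctuation and accented text intact while still folding the look-alike letters.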

UnicodeDB

The Unicode data that make up the dirtytext database are extracted from files published by the Unicode Consortium. The dirtytext/data directory contains two database files:

  • categories.json: built from data extracted from here
  • confusables.json: built from data extracted from here

If dirtytext/data doesn't exist, DirtyText downloads the data and builds the database before performing the requested operation. Afterwards, you can force a database update by adding the --update option.
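The build-if-missing behaviour described above can be sketched as follows. This is only an illustration of the control flow: `ensure_database` and `fake_build` are hypothetical names, and `fake_build` stands in for the real download-and-parse step, whose details are not shown here.

```python
import json
import tempfile
from pathlib import Path


def ensure_database(data_dir, build, update=False):
    """Build the database if data_dir is missing or an update is forced.

    `build` is a callable returning {filename: json_serializable_content}.
    Returns True when the files were (re)built, False when reused.
    """
    data_dir = Path(data_dir)
    if data_dir.exists() and not update:
        return False  # existing database is reused
    data_dir.mkdir(parents=True, exist_ok=True)
    for filename, content in build().items():
        (data_dir / filename).write_text(json.dumps(content), encoding="utf-8")
    return True


def fake_build():
    # Hypothetical stand-in for downloading and parsing the Unicode data.
    return {"categories.json": {}, "confusables.json": {}}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "data"
        print(ensure_database(data_dir, fake_build))               # first run builds
        print(ensure_database(data_dir, fake_build))               # second run reuses
        print(ensure_database(data_dir, fake_build, update=True))  # forced rebuild
```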

License

Released under GPL-3.0

Files for dirtytext, version 1.0.0
dirtytext-1.0.0.tar.gz (111.4 kB), file type: Source, Python version: None
