DirtyText

Searches for [ab]use of Unicode glyphs.

Installation

The DirtyText package can be installed with pip :snake::

$ pip install dirtytext

or downloaded from GitHub.

Quick tour:

Common options:

  • Read from file: -f <filename>
  • Save modified text: -s <file>
  • Text filter: --filter
  • Pipeline mode: -p

:mag_right: Look for ZERO-WIDTH characters:

$> echo "This text‌‌‌‌‍‌‬‌‌‌‌‌‍‬‍‍ ‌‌‌‌‍‬‌contains‌‌‌‌‍‬‌‌‌‌‌‍‬‌‌‌‌‌‬‌‌‌‌‌‌‍‍‍‌‌‌‌‍‬ ‌‌‌‌‍‌‬‌‌‌‌‍‬‌zero-width‌‌‌‌‍‬‍‌ chars" | dirtytext --zero -v

will produce the following output:

Contains zero-width characters: True
JSON:    
[{"idx": 0, "char": "\ufeff", "cval": "FEFF", "infos": null}, 
{"idx": 10, "char": "\u200c", "cval": "200C", "infos": null}, 
{"idx": 11, "char": "\u200c", "cval": "200C", "infos": null}, ...]
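The kind of check shown above can be approximated in a few lines of Python. This is only a minimal sketch, not DirtyText's actual implementation: the `ZERO_WIDTH` set and the `find_zero_width` helper are hypothetical names, and the set of code points is a hand-picked assumption rather than the package's real database. The record layout mirrors the JSON output shown above.

```python
import json

# Assumption: a small, hand-picked set of invisible code points.
# DirtyText's own list may differ.
ZERO_WIDTH = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}


def find_zero_width(text):
    """Return one record per zero-width character found in text."""
    return [
        {"idx": i, "char": ch, "cval": f"{ord(ch):04X}", "infos": None}
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]


hits = find_zero_width("zero\u200cwidth\u200d")
print("Contains zero-width characters:", bool(hits))
print("JSON:", json.dumps(hits))
```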

:mag_right: Look for CONFUSABLE characters:

$> echo "hello" | dirtytext --confusables greek -v

will produce the following output:

Contains confusables characters: True
JSON:
[{"idx": 2, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, 
{"idx": 3, "char": "l", "cval": "006C", "infos": [{"target": "0399", "description": "GREEK CAPITAL LETTER IOTA"}]}, 
{"idx": 4, "char": "o", "cval": "006F", "infos": [{"target": "03BF", "description": "GREEK SMALL LETTER OMICRON"}, 
{"target": "03C3", "description": "GREEK SMALL LETTER SIGMA"}]}]
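To make the output format concrete, here is a hedged sketch of a confusables lookup. It is not DirtyText's code: `GREEK_CONFUSABLES` is a hypothetical miniature table containing only the three mappings visible in the output above, and `find_confusables` is an illustrative helper. The descriptions come from Python's standard `unicodedata` module.

```python
import unicodedata

# Hypothetical miniature table: only the mappings that appear in the
# sample output above. The real confusables.json is far larger.
GREEK_CONFUSABLES = {
    "l": ["\u0399"],            # GREEK CAPITAL LETTER IOTA
    "o": ["\u03bf", "\u03c3"],  # GREEK SMALL LETTER OMICRON / SIGMA
}


def find_confusables(text, table=GREEK_CONFUSABLES):
    """Return one record per character that has look-alikes in table."""
    hits = []
    for i, ch in enumerate(text):
        targets = table.get(ch)
        if not targets:
            continue
        infos = [
            {"target": f"{ord(t):04X}", "description": unicodedata.name(t)}
            for t in targets
        ]
        hits.append({"idx": i, "char": ch, "cval": f"{ord(ch):04X}", "infos": infos})
    return hits


for hit in find_confusables("hello"):
    print(hit)
```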

:mag_right: Look for and filter anomalies in LATIN text:

example.txt:

It ⅽan be argueⅾ that the ⅽomputer ⅰs humanⅰty’s attempt to repⅼⅰⅽate the human brain.
This ⅰs perhaps an unattainable goal. 
However, unattainable goals often lead to outstanding accomplishment.

$> dirtytext -f example.txt --lsubs --filter -s out.txt

out.txt:

It can be argued that the computer is humanity’s attempt to replicate the human brain.
This is perhaps an unattainable goal. 
However, unattainable goals often lead to outstanding accomplishment.
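
The substituted letters in example.txt are Roman numeral characters such as U+217D (SMALL ROMAN NUMERAL ONE HUNDRED), which look like Latin letters. One possible way to fold them back is Unicode NFKC normalization, sketched below. This is an assumption about a technique that produces the same result on this sample, not a description of how `--lsubs --filter` is implemented; `fold_latin_lookalikes` is a hypothetical helper.

```python
import unicodedata


def fold_latin_lookalikes(text):
    """Replace characters whose NFKC form is plain ASCII letters."""
    out = []
    for ch in text:
        folded = unicodedata.normalize("NFKC", ch)
        # Only substitute when normalization yields ASCII letters, so
        # punctuation such as the typographic apostrophe is left untouched.
        if folded != ch and folded.isascii() and folded.isalpha():
            out.append(folded)
        else:
            out.append(ch)
    return "".join(out)


dirty = "It \u217dan be argue\u217e that the \u217domputer \u2170s human\u2170ty\u2019s attempt"
print(fold_latin_lookalikes(dirty))
```

Note that compatibility folding of this kind also rewrites other characters (for example full-width letters), so it is broader than a dedicated substitution table.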

UnicodeDB

The Unicode data that makes up the dirtytext database is extracted from the Unicode Consortium. In particular, there are two database files in the dirtytext/data directory:

  • categories.json: built from data extracted from here
  • confusables.json: built from data extracted from here

If dirtytext/data doesn't exist, DT downloads and builds the database before performing the requested operation. Afterwards, you can force a database update by adding the --update option.
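
The build-on-first-use behaviour described above could look roughly like the following sketch. Everything here is hypothetical: `ensure_database` and its `build` callback are illustrative names, and the actual download and build steps inside DirtyText are not shown in this README.

```python
from pathlib import Path
from typing import Callable


def ensure_database(data_dir: Path, build: Callable[[Path], None], update: bool = False) -> bool:
    """Run build(data_dir) if the directory is missing or update is requested.

    Returns True when the build step ran, False when the existing data was kept.
    """
    if data_dir.exists() and not update:
        return False
    data_dir.mkdir(parents=True, exist_ok=True)
    build(data_dir)  # hypothetical step: download the sources and write the JSON files
    return True
```

With this shape, a normal run would call `ensure_database(path, build)`, while a run with `--update` would pass `update=True`.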

License

Released under GPL-3.0
