
CLI tool to remove invalid chars from a corpus.

Project description

Demeuk


Demeuk is a simple tool to clean up corpora (such as dictionaries) or any dataset containing plain text strings. Example use cases include cleaning up language dictionaries, password sets (for example RockYou), or any file or stdin stream containing plain text strings.

Such corpora often contain encoding mistakes, or lines with parts that you want to remove. Instead of writing a huge bash one-liner, you can use demeuk to do all of your cleaning.

Example usages:

  • Cutting
  • Length checking
  • Encoding fixing
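
To make the three example usages concrete, here is a minimal Python sketch of what cutting, length checking, and encoding fixing can look like on a single line. It is an illustration only and not demeuk's implementation: the function names, the ":" delimiter, the length bounds, and the list of fallback encodings are all assumptions chosen for the example.

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT
# taken from demeuk's source code.

def cut(line: str, delimiter: str = ":") -> str:
    """Drop everything up to and including the first delimiter, if present."""
    parts = line.split(delimiter, 1)
    return parts[1] if len(parts) == 2 else line


def check_length(line: str, min_len: int = 1, max_len: int = 64) -> bool:
    """Return True when the line length is within the (assumed) bounds."""
    return min_len <= len(line) <= max_len


def fix_encoding(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Try each candidate encoding in order; fall back to lossy UTF-8."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def clean(raw_lines):
    """Decode, cut, and length-check each raw line; keep the survivors."""
    kept = []
    for raw in raw_lines:
        line = cut(fix_encoding(raw.rstrip(b"\r\n")))
        if check_length(line):
            kept.append(line)
    return kept
```

For example, `clean([b"a:caf\xe9\n", b"b:\n", b"plain\n"])` decodes the first line as cp1252 after UTF-8 fails, drops the second line because nothing remains after the cut, and keeps the third line unchanged.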

Demeuk is written in Python 3, so it is slower than a tool such as cut. However, demeuk is multithreaded and can therefore use all of your cores. Demeuk can also easily be extended to match your needs.
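
The idea of spreading the cleaning work over several cores can be sketched as follows. This is a hedged illustration, not demeuk's actual code: it uses the standard library's `multiprocessing.Pool` (worker processes rather than threads, which is an assumption made here so that CPU-bound work runs in parallel in CPython), and the helper names, the chunk size, and the trivial per-line cleaning step are all hypothetical.

```python
# Illustrative sketch only: chunked parallel cleaning with worker processes.
# Function names and defaults are assumptions, not demeuk's API.
from multiprocessing import Pool


def split_chunks(lines, size):
    """Split a list of lines into consecutive chunks of at most `size` items."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def clean_chunk(chunk):
    """Stand-in cleaning step: strip whitespace and drop empty lines."""
    cleaned = []
    for line in chunk:
        stripped = line.strip()
        if stripped:
            cleaned.append(stripped)
    return cleaned


def clean_parallel(lines, jobs=4, chunk_size=1000):
    """Clean all lines using `jobs` worker processes, preserving order."""
    chunks = split_chunks(lines, chunk_size)
    with Pool(processes=jobs) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(clean_chunk, chunks)
    return [line for chunk in results for line in chunk]
```

When trying this out, call `clean_parallel(lines, jobs=4)` from inside an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that start workers by spawning a fresh interpreter.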

This application is part of the CERBERUS project that has received funding from the European Union's Internal Security Fund - Police under grant agreement No. 82201

Please read the docs for more information.

Quick start

The recommended way to install demeuk is to install it in a virtual environment.

# Create virtual environment
virtualenv <virtual environment name>
# Activate the virtual environment
source <virtual environment name>/bin/activate
# Install the dependencies
pip3 install -r requirements.txt

Now you can run bin/demeuk.py:

Examples:

    demeuk -i inputfile.tmp -o outputfile.dict -l droppedfile.txt
    demeuk -i inputfile -o outputfile -j 24 -l logfile.log
    demeuk -i inputfile.tmp -o outputfile.dict -l droppedfile.txt --leak
    demeuk -i inputfile -o outputfile -j 24 -l logfile.log --leak-full
    demeuk -i inputdir/*.txt -o outputfile.dict -l logfile.log
    demeuk -o outputfile.dict -l logfile.log
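
If you want to drive these commands from a script, the argument lists can be assembled programmatically. The sketch below only builds the command; it uses just the flags that appear in the examples above (`-i`, `-o`, `-l`, `-j`, `--leak`, `--leak-full`). The helper name `build_demeuk_command` and its parameters are hypothetical, and reading `-j` as a job count is an assumption based on the examples.

```python
# Illustrative sketch only: build_demeuk_command is a hypothetical helper,
# not part of demeuk. It only assembles an argument list.
import shlex


def build_demeuk_command(output, log, inputs=None, jobs=None, extra_flags=()):
    """Return a demeuk argument list using flags shown in the examples."""
    cmd = ["demeuk"]
    if inputs is not None:
        # The last example above omits -i, so the input flag is optional here.
        cmd += ["-i", inputs]
    cmd += ["-o", output, "-l", log]
    if jobs is not None:
        cmd += ["-j", str(jobs)]  # assumed to be the number of jobs
    cmd += list(extra_flags)
    return cmd


if __name__ == "__main__":
    cmd = build_demeuk_command(
        "outputfile", "logfile.log",
        inputs="inputfile", jobs=24, extra_flags=["--leak-full"],
    )
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once demeuk is installed and on your `PATH`.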

Docs

The docs are available at: http://demeuk.rtfd.io/

Download files


Source Distribution

demeuk-4.3.0.tar.gz (20.1 kB)

Built Distribution

demeuk-4.3.0-py3-none-any.whl (19.3 kB, Python 3)

