CLI tool to remove invalid chars from a corpus.
Demeuk
Demeuk is a simple tool for cleaning up corpora (such as dictionaries) or any dataset containing plain-text strings. Example use cases include cleaning up language dictionaries, password sets (for example RockYou), or any file or stdin stream containing plain-text strings.
Such corpora often contain encoding mistakes, or lines from which you want to remove some parts. Instead of writing a huge bash one-liner, you can use demeuk to do all your cleaning.
Example usages:
- Cutting
- Length checking
- Encoding fixing
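To make these three operations concrete, here is a minimal plain-Python sketch of what cutting, length checking, and encoding fixing can look like for a single line. It is not demeuk's implementation: the function names, the `:` delimiter, the length bounds, and the UTF-8-then-Latin-1 fallback are all assumptions chosen for illustration.

```python
from typing import Optional


def decode_line(raw: bytes) -> str:
    """Encoding fixing (assumed strategy): try UTF-8, fall back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1")


def cut_line(line: str, delimiter: str = ":") -> str:
    """Cutting: keep the part after the first delimiter, like `cut -d: -f2-`."""
    _, found, rest = line.partition(delimiter)
    # Lines without the delimiter are passed through unchanged.
    return rest if found else line


def clean_line(raw: bytes, min_len: int = 1, max_len: int = 64) -> Optional[str]:
    """Return the cleaned line, or None if it fails the length check."""
    line = cut_line(decode_line(raw).rstrip("\r\n"))
    if not (min_len <= len(line) <= max_len):
        return None
    return line


# Hypothetical sample lines: valid UTF-8, a Latin-1 byte, and an empty value.
samples = [b"user:p\xc3\xa4ssword\n", b"user:caf\xe9\n", b"user:\n"]
cleaned = [clean_line(s) for s in samples]
```

Lines that return `None` here correspond to lines a cleaning tool would drop rather than write to the output.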
Demeuk is written in Python 3, which means it is slower than, for example, cut. However, Demeuk is multithreaded and can therefore use all your cores. Demeuk can also easily be extended to match your needs.
This application is part of the CERBERUS project that has received funding from the European Union's Internal Security Fund - Police under grant agreement No. 82201
Please read the docs for more information.
Quick start
The recommended way to install demeuk is to install it in a virtual environment.
# Create virtual environment
virtualenv <virtual environment name>
# Activate the virtual environment
source <virtual environment name>/bin/activate
# Install the dependencies
pip3 install -r requirements.txt
Now you can run bin/demeuk.py:
Examples:
demeuk -i inputfile.tmp -o outputfile.dict -l droppedfile.txt
demeuk -i inputfile -o outputfile -j 24 -l logfile.log
demeuk -i inputfile.tmp -o outputfile.dict -l droppedfile.txt --leak
demeuk -i inputfile -o outputfile -j 24 -l logfile.log --leak-full
demeuk -i inputdir/*.txt -o outputfile.dict -l logfile.log
demeuk -o outputfile.dict -l logfile.log
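If you drive demeuk from a Python script, the invocations above can be assembled programmatically. The sketch below uses only the flags that appear in the examples (`-i`, `-o`, `-l`, `-j`, `--leak`); the helper name `build_demeuk_command` and its parameters are hypothetical, and the reading of `-j` as a job count and `-l` as a log file is an assumption based on the example file names. Check the docs for the exact semantics.

```python
from typing import List, Optional


def build_demeuk_command(
    input_path: str,
    output_path: str,
    log_path: str,
    jobs: Optional[int] = None,
    leak: bool = False,
) -> List[str]:
    """Build an argument list mirroring the command-line examples above."""
    cmd = ["demeuk", "-i", input_path, "-o", output_path, "-l", log_path]
    if jobs is not None:
        # Assumed to control how many parallel jobs are used.
        cmd += ["-j", str(jobs)]
    if leak:
        cmd.append("--leak")
    return cmd


cmd = build_demeuk_command("inputfile", "outputfile", "logfile.log", jobs=24)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.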
Docs
The docs are available at: http://demeuk.rtfd.io/
File details
Details for the file demeuk-4.3.0.tar.gz.
File metadata
- Download URL: demeuk-4.3.0.tar.gz
- Upload date:
- Size: 20.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0edac69959fb076e756c97535dfbf952f0f33aad1767f83529506b44259d0867
MD5 | 2ac23281fa8b76f6af544a2babbbfa1f
BLAKE2b-256 | b8e2f1219b115c1131d5dc62603c6a8c4c4c5120692a3959319fcc8ac4047ede
File details
Details for the file demeuk-4.3.0-py3-none-any.whl.
File metadata
- Download URL: demeuk-4.3.0-py3-none-any.whl
- Upload date:
- Size: 19.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.10.14
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0742edf5180eee5d6d5368dc24e0424400aab7f7076ac8fc296553a3c2cd0ca8
MD5 | 1cc7402863ec1193549ccf5477da48e0
BLAKE2b-256 | 61f6bf5a5316d4772729b70220c277457ba6406c58308c139c6e1f0345287f4d