Cleaning text from noise for NLP tasks
cleantxt
installation
with pip
pip install cleantxt
install from source
git clone https://github.com/jemiaymen/cleantxt.git
go to the cleantxt directory
cd cleantxt
install with pip
pip install .
cli usage
cleantxt --doc=[path_to_doc] --out=[path_out_file] --f=[0] --t=[100] --do_lower=True --white_space=True --punctuation=True --duplicated_chars=True --alpha_num=True --accent=True --escape key,value ə,a œ,oe
check example
api usage
import text module
from cleantxt import text
clean text
txt = text.clean_text('mella 7ayawaaanéé hadddddda mta3@@@@@ @tfih')
printing txt gives
mella 7ayawane hada mta3 tfih
params
text: (str) raw text
whitespace: (bool) escape spaces [default True]
punctuation: (bool) escape punctuation [default True]
duplicated: (bool) escape duplicated chars [default True]
alphanum: (bool) escape non-alphanumeric chars [default True]
accent: (bool) escape accents [default True]
do_lower: (bool) lowercase the text [default True]
others: (list of tuples) escape rules [default [('ə', 'a')]]
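The options above can be pictured as a small pipeline. The block below is a minimal sketch using only the standard library. It is not cleantxt's implementation: the name sketch_clean, the order of the steps, and the rule that runs of three or more identical characters collapse to one are assumptions made for illustration, so its output may differ from text.clean_text on the same input.

```python
import re
import unicodedata


def sketch_clean(raw, others=(("ə", "a"),)):
    """Illustrative approximation of the documented options (hypothetical helper)."""
    out = raw.lower()  # do_lower

    # others: apply each (source, replacement) rule in turn
    for src, dst in others:
        out = out.replace(src, dst)

    # accent: decompose characters and drop the combining marks
    decomposed = unicodedata.normalize("NFKD", out)
    out = "".join(c for c in decomposed if not unicodedata.combining(c))

    # punctuation / alphanum: replace anything that is not a letter, digit or space
    out = re.sub(r"[^\w\s]|_", " ", out)

    # duplicated: collapse runs of 3+ identical characters (assumed threshold)
    out = re.sub(r"(.)\1{2,}", r"\1", out)

    # whitespace: collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", out).strip()


if __name__ == "__main__":
    print(sketch_clean("mella 7ayawaaanéé hadddddda mta3@@@@@ @tfih"))
```

Each step maps to one parameter in the list, which makes it easy to see what switching a flag off would leave in the text.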
new functions
word count
wc => (word count) params (path :str, unique=False, both=False)
from cleantxt import text
print( text.wc('file.txt',both=True) )
(51515,5547)
with both=True the output is a tuple (all words, unique words)
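To make the two numbers in that tuple concrete, here is a minimal sketch of the same kind of count. It assumes words are separated by whitespace, which may not match how text.wc tokenizes; count_words and sketch_wc are hypothetical names, not part of cleantxt.

```python
from pathlib import Path


def count_words(raw):
    """Return (all words, unique words) for a string, splitting on whitespace."""
    words = raw.split()
    return len(words), len(set(words))


def sketch_wc(path, encoding="utf-8"):
    """Hypothetical file wrapper: read the whole file, then count its words."""
    return count_words(Path(path).read_text(encoding=encoding))


if __name__ == "__main__":
    print(count_words("el fi el ya"))
```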
word frequency
word_frequency => params (path : str, top=100)
from cleantxt import text
print( text.word_frequency('file.txt') )
[('\n', 54898), ('', 48757), ('w', 27717), ('el', 16679), ('fi', 9399), ('ya', 8611)]
output: list of tuples (word, frequency)
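The sample output above contains entries for '\n' and for the empty string. The sketch below shows one way to get a similar ranking without them, by splitting on any whitespace before counting. It is an illustration under that assumption, not the code behind text.word_frequency, and sketch_word_frequency is a hypothetical name.

```python
from collections import Counter


def sketch_word_frequency(raw, top=100):
    """Return the `top` most frequent whitespace-separated words as (word, count)."""
    # str.split() with no argument drops empty strings and newlines
    counts = Counter(raw.split())
    return counts.most_common(top)


if __name__ == "__main__":
    print(sketch_word_frequency("el fi el ya el fi", top=2))
```

Running the text through clean_text first would likely also reduce noise tokens in the ranking.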