a lightweight python package to pre-process turkish twitter statuses (tweets).
twitter-turkish-preprocess
note: this package is not completely built yet. i'm publishing it for internal use; however, you are more than welcome to use it on your own. it is designed for processing text data fetched from twitter to feed turkish nlp models. if you have any questions or concerns, you can reach me at emskaplann@gmail.com.
installation
in your terminal, install the package into your local workspace via pip:
pip install turkish-twitter-preprocess
once the package is installed, you can import it and use it right away:
import ttp
stopwords = ['with', 'are', ...]
ttp.preprocess_sentence("Example sentence!", stopwords)
functions you can use with the package
lower(text)
this function should lowercase every character in the given string.
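one thing worth keeping in mind when lowercasing turkish text: python's built-in str.lower() is not locale-aware, so "I" becomes "i" (not "ı") and "İ" becomes "i" followed by a combining dot. the sketch below shows one common workaround. the name turkish_lower is hypothetical, and whether the package's own lower(text) does anything like this is not stated here.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish rules for the dotted/dotless i pair.

    Hypothetical helper for illustration; not the package's lower().
    """
    # Map the two Turkish-specific capitals by hand first,
    # then let str.lower() handle every other character.
    return text.replace("I", "ı").replace("İ", "i").lower()


print(turkish_lower("IŞIK İSTANBUL"))  # → ışık istanbul
```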
remove_emoji(text)
this function should remove every emoji from the given string.
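as a rough illustration of how emoji removal can work, the sketch below deletes characters that fall in a handful of unicode blocks. the ranges are an approximation chosen for this example, and strip_emoji is a hypothetical name; the package's remove_emoji may use a different list.

```python
import re

# Approximate emoji ranges (an assumption, not the package's own list):
# regional indicators (flags), the main pictograph blocks, misc symbols
# and dingbats, plus the variation selector and zero-width joiner that
# glue multi-part emoji together.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"
    "\U0001F300-\U0001FAFF"
    "\u2600-\u27BF"
    "\uFE0F\u200D"
    "]+"
)


def strip_emoji(text: str) -> str:
    """Delete every character matched by EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text)


# U+1F600 (grinning face) and U+2600 U+FE0F (sun) are removed;
# the trailing space before them is left in place.
print(repr(strip_emoji("gunaydin \U0001F600\u2600\uFE0F")))
```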
resubComma(text)
this function should replace commas in the given string with whitespace.
vanish_punc(text)
this function should remove every punctuation mark from the given string.
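a minimal sketch of punctuation removal using only the standard library is shown below. strip_punctuation is a hypothetical name, and string.punctuation only covers ascii punctuation, so this is an illustration rather than the package's vanish_punc.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation; letters, digits and spaces are kept."""
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("Merhaba, dunya!"))  # → Merhaba dunya
```

note that a table like this also deletes '@', '#' and apostrophes, so in a pipeline a step of this kind would presumably have to run after handles, hashtags and digit extensions have been dealt with.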
replace_emoticon(text, positive_str="SMILEYPOSITIVE", negative_str="SMILEYNEGATIVE")
this function should replace every emoticon in the given string as shown below.
positive emoticons such as :D, :) and :d are replaced with "SMILEYPOSITIVE".
negative emoticons such as -_-, =( and :( are replaced with "SMILEYNEGATIVE".
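the replacement described above can be sketched with two regular expressions. only the six emoticons named in this readme are covered; the package itself presumably recognises more ("and similar"), and replace_emoticon_sketch is a hypothetical name used for illustration.

```python
import re

# Only the emoticons listed in the README; a real list would be longer.
# The lookarounds require whitespace (or a string edge) on both sides,
# so ":)" inside a URL or a word is left alone.
POSITIVE = re.compile(r"(?<!\S)(?::D|:\)|:d)(?!\S)")
NEGATIVE = re.compile(r"(?<!\S)(?:-_-|=\(|:\()(?!\S)")


def replace_emoticon_sketch(
    text: str,
    positive_str: str = "SMILEYPOSITIVE",
    negative_str: str = "SMILEYNEGATIVE",
) -> str:
    """Swap known emoticons for sentiment placeholder tokens."""
    text = POSITIVE.sub(positive_str, text)
    return NEGATIVE.sub(negative_str, text)


print(replace_emoticon_sketch("harika :D ama yorgunum :("))
# → harika SMILEYPOSITIVE ama yorgunum SMILEYNEGATIVE
```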
remove_emoticon(text)
this function should remove every emoticon from the given string.
remove_user_handle(text)
this function should remove every word that starts with '@' from the given string. we use it to remove user handles from the twitter data.
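a small sketch of handle removal follows. it treats any whitespace-delimited token starting with '@' as a handle and then tidies the leftover spaces; strip_handles is a hypothetical name, and the package's remove_user_handle may treat edge cases differently.

```python
import re

# A token that begins with '@' at the start of the string or after whitespace.
HANDLE = re.compile(r"(?<!\S)@\S+")


def strip_handles(text: str) -> str:
    """Drop '@' tokens, then collapse the gaps they leave behind."""
    without_handles = HANDLE.sub(" ", text)
    return " ".join(without_handles.split())


print(strip_handles("@ali selam @veli_34 nasilsin"))  # → selam nasilsin
```

because the pattern requires the '@' to begin a token, an e-mail address such as user@example.com is left untouched.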
remove_digits_and_extensions(text)
this function should remove every digit and its extension (suffix) from the string. for example, given "100'de 1 sanslari yok!", this function would return "sanslari yok".
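one way to picture "digits and their extensions" is a number optionally followed by an apostrophe and a turkish case suffix, as in 100'de. the sketch below removes such tokens; strip_digits_and_suffixes is a hypothetical name and the pattern is an assumption about what counts as an extension. unlike the example above, it keeps the trailing "!", since punctuation is not its concern.

```python
import re

# One or more digits, optionally followed by an apostrophe
# (straight or curly) and a suffix such as "de" or "den".
DIGITS_WITH_SUFFIX = re.compile(r"\d+(?:['\u2019]\w+)?")


def strip_digits_and_suffixes(text: str) -> str:
    """Remove numbers together with any apostrophe-attached suffix."""
    cleaned = DIGITS_WITH_SUFFIX.sub(" ", text)
    # Collapse the whitespace left where the numbers used to be.
    return " ".join(cleaned.split())


print(strip_digits_and_suffixes("100'de 1 sanslari yok!"))  # → sanslari yok!
```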
remove_digits(text)
this function should remove every digit from the given string.
remove_hashtag_and_word(text)
this function should remove every word that starts with '#' from the given string. we use it to remove hashtags from the twitter data because hashtags do not mean much to an nlp model. hashtags are also short-lived, so keeping them can lower your accuracy score in the long run.
remove_newline_char(text)
this function should remove every "\n" character from the given string. we use it because newline characters carry no meaning for an nlp model.
remove_extra_spaces(text)
this function should remove every extra space from the given string.
dup_vanish(text)
this function should normalize words that contain repeated nonsense characters. for example, the string "boyylleeee hukumeetttinnnn gellmiisssini gecmisiniii!" would be transformed into "boyle hukumetin gelmisini gecmisini!".
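the example above can be reproduced with a single back-referencing regular expression that collapses every run of the same character to one. this is a deliberately naive sketch under a hypothetical name (collapse_repeats), not the package's dup_vanish: it also shortens words that legitimately contain a doubled letter, such as "saat" or "dikkat".

```python
import re

# A word character followed by one or more copies of itself.
REPEATED_CHAR = re.compile(r"(\w)\1+")


def collapse_repeats(text: str) -> str:
    """Reduce each run of identical word characters to a single one."""
    return REPEATED_CHAR.sub(r"\1", text)


print(collapse_repeats("boyylleeee hukumeetttinnnn gellmiisssini gecmisiniii!"))
# → boyle hukumetin gelmisini gecmisini!
```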
preprocess_sentence(text, stopwords)
this function should preprocess the given text and use the stopwords list to filter stop words out of it.
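which steps preprocess_sentence runs, and in what order, is not documented here. the sketch below is only one plausible composition, written with the standard library under a hypothetical name (preprocess_sketch), to show how a stop word list might be applied at the end of such a pipeline.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess_sketch(text: str, stopwords: list) -> str:
    """Hypothetical pipeline: lowercase, drop handles/hashtags,
    drop punctuation, then filter out stop words."""
    text = text.lower()
    # Remove '@handle' and '#hashtag' tokens before punctuation is
    # stripped, otherwise the '@' and '#' markers would be lost first.
    text = re.sub(r"(?<!\S)[@#]\S+", " ", text)
    text = text.translate(_PUNCT_TABLE)
    stop = set(stopwords)  # set lookup is O(1) per token
    tokens = [tok for tok in text.split() if tok not in stop]
    return " ".join(tokens)


print(preprocess_sketch("@ali Bu film ve oyuncular harika! #sinema", ["bu", "ve"]))
# → film oyuncular harika
```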
File details
Details for the file turkish-twitter-preprocess-0.0.7.tar.gz.
File metadata
- Download URL: turkish-twitter-preprocess-0.0.7.tar.gz
- Upload date:
- Size: 47.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/52.0.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f36de428724150ea3ec76151d3b28242379b1c142c5d8d430e7722df811fcc0f |
| MD5 | 174b55e5a11147def375f297b52d9029 |
| BLAKE2b-256 | d20b3eb8424b9a82a3a502cf4c3858cf4db715e57c9734c4a378560cf7ec09e9 |
File details
Details for the file turkish_twitter_preprocess-0.0.7-py3-none-any.whl.
File metadata
- Download URL: turkish_twitter_preprocess-0.0.7-py3-none-any.whl
- Upload date:
- Size: 48.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.24.0 setuptools/52.0.0 requests-toolbelt/0.9.1 tqdm/4.48.0 CPython/3.7.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7d7ffa43d59d7a77f16ba0b2368cea7636ca2d89ab21e797d5f24d621f233471 |
| MD5 | 7df02015512eaef8ed76e1c034c26c69 |
| BLAKE2b-256 | b98c297c872333cc9bd21918f009641a02399f7d5f01e8c24c99fb54476b5042 |