twitter-turkish-preprocess

a light-weight python package to pre-process turkish twitter statuses(tweets).

note: this package is not fully built yet. i'm publishing it for internal use, but you are more than welcome to use it on your own. it is designed for processing text fetched from twitter before feeding it to turkish nlp models. if you have any questions or concerns, you can reach me at emskaplann@gmail.com.

installation

in your terminal, install the package into your local workspace via pip:
pip install turkish-twitter-preprocess

once the package is installed, you can import it and use it right away:
import ttp
stopwords = ['with', 'are', ...]
ttp.preprocess_sentence("Example sentence!", stopwords)

functions you can use with the package

lower(text)

this function should convert every character in the given string to lowercase.
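
one detail worth knowing for turkish text: python's built-in str.lower() is not locale-aware, so "I" becomes "i" rather than "ı", and "İ" becomes "i" followed by a combining dot (U+0307). whether the package's lower() compensates for this is not stated here. the sketch below, under the hypothetical name turkish_lower, shows one way to handle it.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i mapping.

    Hypothetical helper, not part of the package's documented API.
    """
    # Map the two Turkish capital i letters by hand, then lowercase the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


if __name__ == "__main__":
    print(turkish_lower("ISPARTA İZMİR"))
```

note that this mapping is wrong for non-turkish words in the same tweet (for example english "IT" becomes "ıt"), so it only makes sense when the input is known to be turkish.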

remove_emoji(text)

this function should remove every emoji from the given string.
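
as an illustration of the idea, a regex over a few emoji code-point ranges can do this kind of removal. the ranges and the name strip_emoji below are assumptions made for this sketch; the list is not exhaustive and the package may work differently.

```python
import re

# Hypothetical pattern: a few common emoji code-point ranges, not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticon faces, transport, newer symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)


def strip_emoji(text: str) -> str:
    """Delete every character that falls inside the ranges above."""
    return EMOJI_PATTERN.sub("", text)


if __name__ == "__main__":
    print(repr(strip_emoji("harika 😀 gün")))
```

the removal leaves the surrounding spaces in place, so a whitespace clean-up step is usually run afterwards.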

resubComma(text)

this function should replace every comma in the given string with whitespace.

vanish_punc(text)

this function should remove every punctuation mark from the given string.
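
a minimal sketch of punctuation removal with the standard library, assuming that "punctuation" means the ascii set in string.punctuation. the name strip_punctuation is hypothetical, and the package's own set of characters may be wider.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation; typographic quotes and similar are kept."""
    return text.translate(_PUNCTUATION_TABLE)


if __name__ == "__main__":
    print(strip_punctuation("Merhaba, dünya!"))
```

because the apostrophe is part of this set, a word such as "Ankara'da" becomes "Ankarada", which may or may not be what a downstream model expects.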

replace_emoticon(text, positive_str="SMILEYPOSITIVE", negative_str="SMILEYNEGATIVE")

this function should replace every emoticon in the given string as follows: positive emoticons such as :D, :) and :d are replaced with "SMILEYPOSITIVE", and negative emoticons such as -_-, =( and :( are replaced with "SMILEYNEGATIVE". these are the default values of positive_str and negative_str.
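
the replacement can be sketched with two small lookup lists. the lists, the helper _alternation and the name swap_emoticons are assumptions for illustration only; the package's own emoticon lists are likely longer.

```python
import re

# Hypothetical, deliberately small emoticon lists.
POSITIVE = [":D", ":)", ":d", ":-)", "=)"]
NEGATIVE = ["-_-", "=(", ":(", ":-("]


def _alternation(emoticons):
    # Escape each emoticon and try longer ones first.
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))


POSITIVE_RE = _alternation(POSITIVE)
NEGATIVE_RE = _alternation(NEGATIVE)


def swap_emoticons(text, positive_str="SMILEYPOSITIVE", negative_str="SMILEYNEGATIVE"):
    """Replace known emoticons with the given placeholder tokens."""
    text = POSITIVE_RE.sub(positive_str, text)
    return NEGATIVE_RE.sub(negative_str, text)


if __name__ == "__main__":
    print(swap_emoticons("cok guzel :D ama bitti :("))
```

keeping the sentiment as a token rather than deleting it lets a model still use the signal that the emoticon carried.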

remove_emoticon(text)

this function should remove every emoticon from the given string.

remove_user_handle(text)

this function should remove every word that starts with '@' from the given string. we use this for removing user handles from the twitter data.

remove_digits_and_extensions(text)

this function should remove every number together with its suffix from the given string. for example, "100'de 1 sanslari yok!" would be transformed into "sanslari yok".
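
a rough sketch of the number-plus-suffix removal, assuming a suffix is an apostrophe followed by letters (as in "100'de"). the name strip_numbers_with_suffix is hypothetical. unlike the example above, this sketch leaves the "!" and the leftover spaces in place, on the assumption that other steps handle them.

```python
import re

# A run of digits at a word boundary, optionally followed by an
# apostrophe (straight or typographic) and a suffix such as 'de.
NUMBER_WITH_SUFFIX_RE = re.compile(r"\b\d+(?:['’]\w+)?")


def strip_numbers_with_suffix(text: str) -> str:
    """Remove standalone numbers and any apostrophe suffix attached to them."""
    return NUMBER_WITH_SUFFIX_RE.sub("", text)


if __name__ == "__main__":
    cleaned = strip_numbers_with_suffix("100'de 1 sanslari yok!")
    # Collapse the leftover spaces for display.
    print(" ".join(cleaned.split()))
```

digits inside a word, such as "a1", are left alone because the pattern only starts at a word boundary.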

remove_digits(text)

this function should remove every digit from the given string.

remove_hashtag_and_word(text)

this function should remove every word that starts with '#' from the given string. we use it to remove hashtags from the twitter data because hashtags do not mean much to an nlp model. hashtags are also short-lived, so keeping them can lower your accuracy score in the long run.
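
both this function and remove_user_handle drop whitespace-delimited words based on their first character, so one parameterized helper can illustrate the two. the name strip_prefixed_words and its prefix argument are assumptions for this sketch, not the package's api.

```python
import re


def strip_prefixed_words(text: str, prefix: str) -> str:
    """Remove every whitespace-delimited word that starts with `prefix`."""
    # (?<!\S) requires the prefix to sit at the start of a word, so an
    # e-mail address like a@b.com is not treated as a handle.
    pattern = re.compile(r"(?<!\S)" + re.escape(prefix) + r"\S+")
    return pattern.sub("", text)


if __name__ == "__main__":
    tweet = "@ali bugun #secim cok kalabalik"
    print(repr(strip_prefixed_words(strip_prefixed_words(tweet, "@"), "#")))
```

the words are deleted but the spaces around them remain, so the result still needs a whitespace clean-up.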

remove_newline_char(text)

this function should remove every "\n" character from the given string. we use it because newline characters carry no meaning for an nlp model.

remove_extra_spaces(text)

this function should remove every extra space from the given string.
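
newline removal and space clean-up are often done together. the sketch below assumes that each newline should become a space (so that words on adjacent lines do not merge) and that "extra" means runs of two or more spaces plus leading and trailing spaces. the name normalize_whitespace is hypothetical.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Turn newlines into spaces, collapse space runs, trim both ends."""
    text = text.replace("\n", " ")
    return re.sub(r" {2,}", " ", text).strip()


if __name__ == "__main__":
    print(repr(normalize_whitespace("  selam \n\n dunya  ")))
```

running this step last is convenient, because the removal steps above tend to leave double spaces behind.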

dup_vanish(text)

this function should normalize words that contain needlessly repeated characters. for example, the string "boyylleeee hukumeetttinnnn gellmiisssini gecmisiniii!" would be transformed into "boyle hukumetin gelmisini gecmisini!".
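
the simplest rule that reproduces the example above is to collapse every run of the same letter into a single letter. this is only a sketch under that assumption, with the hypothetical name collapse_repeats; the package may use a more careful rule.

```python
import re

# A letter (no digits, no underscore) followed by one or more copies of itself.
REPEATED_LETTER_RE = re.compile(r"([^\W\d_])\1+")


def collapse_repeats(text: str) -> str:
    """Collapse each run of an identical letter down to one letter."""
    return REPEATED_LETTER_RE.sub(r"\1", text)


if __name__ == "__main__":
    print(collapse_repeats("boyylleeee hukumeetttinnnn gellmiisssini gecmisiniii!"))
```

the trade-off is that legitimate double letters are collapsed too, for example "saat" becomes "sat", so a stricter rule (such as only collapsing runs of three or more) may be preferable for real data.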

preprocess_sentence(text, stopwords)

this function should preprocess the given text using the supplied list of stopwords.
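
how the stopword list is applied is not described here. a common approach, assumed in the sketch below, is to split the text on whitespace and drop every token that exactly matches a stopword. the name drop_stopwords is hypothetical and this is not the package's implementation.

```python
def drop_stopwords(text: str, stopwords) -> str:
    """Remove tokens that exactly match an entry in `stopwords`."""
    blocked = set(stopwords)  # set lookup keeps the filter fast for long lists
    return " ".join(word for word in text.split() if word not in blocked)


if __name__ == "__main__":
    print(drop_stopwords("bu film ve kitap cok iyi", ["bu", "ve", "cok"]))
```

matching is case-sensitive in this sketch, so the text is normally lowercased before the filter is applied.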
