NLP toolkit for tweets

tweet_nlp_toolkit

It can handle:

  • mentions
  • hashtags
  • emojis
  • emoticons
  • emails
  • HTML entities
  • digits
  • urls
  • punctuation
  • custom words to filter
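As a rough illustration of these categories, several of them can be recognized with simple regular expressions. This is a minimal sketch, not the library's actual implementation; `classify_token` is a hypothetical helper:

```python
import re

# Hypothetical helper: classify a single whitespace-delimited token into a
# few of the categories above. The library's real rules are more thorough.
def classify_token(token: str) -> str:
    if re.fullmatch(r"@\w+", token):
        return "mention"
    if re.fullmatch(r"#\w+", token):
        return "hashtag"
    if re.fullmatch(r"[\w.+-]+@[\w-]+\.[\w.]+", token):
        return "email"
    if re.fullmatch(r"(https?://|www\.)\S+", token):
        return "url"
    if token.isdigit():
        return "digit"
    return "other"

print([classify_token(t) for t in "123 @hello #world www.url.com abc@gmail.com".split()])
```

Note that the email check runs before the url check, so `abc@gmail.com` is not misread as a bare domain.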

Installation

python3 -m venv .env
source .env/bin/activate
python -m pip install -U pip
pip install tweet_nlp_toolkit

Usage

Text Parsing

>>> from tweet_nlp_toolkit import parse_text
>>> text = parse_text("123 @hello #world www.url.com 😰 :) abc@gmail.com")
>>> text.tokens
['123', '@hello', '#world', 'www.url.com', '😰', ':)', 'abc@gmail.com']
>>> text.hashtags
['world']
>>> text.mentions
['@hello']
>>> text.urls
['www.url.com']
>>> text.emojis
['😰']
>>> text.emoticons
[':)']
>>> text.digits
['123']
>>> text.emails
['abc@gmail.com']

Tagging entities

>>> from tweet_nlp_toolkit import parse_text
>>> parse_text(
...     "123 @hello #world www.url.com 😰 :) abc@gmail.com",
...     emojis="tag",
...     hashtags="tag",
...     mentions="tag"
... ).tokens
['123', '<MENTION>', '<HASHTAG>', 'www.url.com', '<EMOJI>', ':)', 'abc@gmail.com']
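The tagging behaviour above can be mimicked with plain string substitution. The following is a hedged sketch under the assumption that tagging simply replaces each matched entity with a fixed placeholder; it is not the library's code:

```python
import re

# Replace mentions and hashtags with placeholder tags, leaving other tokens
# untouched. Emoji handling is omitted here for brevity.
def tag_entities(text: str) -> str:
    # The negative lookbehind (?<!\w) keeps "@gmail" inside an email address
    # from being treated as a mention.
    text = re.sub(r"(?<!\w)@\w+", "<MENTION>", text)
    text = re.sub(r"#\w+", "<HASHTAG>", text)
    return text

print(tag_entities("123 @hello #world www.url.com"))
```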

Preprocessing

>>> from tweet_nlp_toolkit import prep
>>> prep(
...     "123 @hello #world www.url.com 😰 :) abc@gmail.com",
...     emojis="demojize",
...     mentions="remove",
...     hashtags="remove",
...     urls="remove",
...     digits="tag",
...     emails="remove"
... )
'<DIGIT> :anxious_face_with_sweat: :)'
>>> from tweet_nlp_toolkit import prep_file
>>> prep_file("input.txt", "output.txt")
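The demojize behaviour shown above (😰 → :anxious_face_with_sweat:) boils down to a lookup from emoji characters to textual aliases. Here is a minimal sketch with a hand-rolled two-entry table; the library presumably ships a full mapping:

```python
# Tiny illustrative emoji-to-alias table; a real mapping covers thousands
# of code points.
EMOJI_ALIASES = {
    "😰": ":anxious_face_with_sweat:",
    "🎹": ":musical_keyboard:",
}

def demojize(text: str) -> str:
    # Replace each known emoji character with its textual alias.
    for char, alias in EMOJI_ALIASES.items():
        text = text.replace(char, alias)
    return text

print(demojize("123 😰 :)"))  # 123 :anxious_face_with_sweat: :)
```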

More

parse_text, prep, and prep_file share the same parameters. parse_text returns an instance of ParsedText, prep returns the preprocessed string, and prep_file preprocesses a file.

Parameters
----------
text: str
    The text to preprocess.
tokenizer: Callable[[str], List[Token]]
    The tokenizer to use.
encoding: str
    The encoding of the text.
    Default "utf-8".
remove_unencodable_char: bool
    Characters that cannot be encoded are replaced with '�'. If True, the '�'
    characters are removed; if False, a sequence of '�' is collapsed into a
    single one.
    Default False
to_lower: bool
    Whether to convert the text to lowercase.
    Default True
strip_accents: bool
    Whether to remove accents from Latin characters.
    Default False
reduce_len: bool
    Whether to reduce repeated character sequences.
    Default False
filters: set
    Tokens to filter (case sensitive).
    Default None
emojis: Optional[str]
    How to handle emojis.
    Options:
        - "remove": delete all emojis
        - "tag": replaces the emoji with the tag <EMOJI>
        - "demojize": replaces the emoji with its textual representation, e.g. :musical_keyboard:
            (list of emojis: https://www.webfx.com/tools/emoji-cheat-sheet/)
        - "emojize": replaces the textual representation with the emoji character, e.g. 😰
    Default None
hashtags: Optional[str]
    How to handle hashtags.
    Options:
        - "remove": delete all hashtags
        - "tag": replaces the hashtag with the tag <HASHTAG>
    Default None
urls: Optional[str]
    How to handle urls.
    Options:
        - "remove": delete all urls
        - "tag": replaces the url with the tag <URL>
    Default None
mentions: Optional[str]
    How to handle mentions.
    Options:
        - "remove": delete all mentions
        - "tag": replaces the mention with the tag <MENTION>
    Default None
digits: Optional[str]
    How to handle digits.
    Options:
        - "remove": delete all digits
        - "tag": replaces the digit with the tag <DIGIT>
    Default None
emoticons: Optional[str]
    How to handle emoticons.
    Options:
        - "remove": delete all emoticons
        - "tag": replaces the emoticon with the tag <EMOTICON>
    Default None
puncts: Optional[str]
    How to handle punctuation.
    Options:
        - "remove": delete all punctuation
        - "tag": replaces each punctuation mark with the tag <PUNCT>
    Default None
emails: Optional[str]
    How to handle emails.
    Options:
        - "remove": delete all emails
        - "tag": replaces the email with the tag <EMAIL>
    Default None
html_tags: Optional[str]
    How to handle HTML tags like <div>.
    Options:
        - "remove": delete all HTML tags
    Default None
stop_words: Optional[str]
    How to handle stop words. Only English stop words are supported.
    Options:
        - "remove": delete all stop words
    Default None
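As a rough illustration of what strip_accents and reduce_len might do, here is a sketch under stated assumptions rather than the library's implementation; in particular, the choice to cap repeated characters at three is arbitrary here:

```python
import re
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose accented characters (NFD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def reduce_len(text: str) -> str:
    # Collapse runs of 3+ identical characters down to 3 ("soooo" -> "sooo").
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(strip_accents("café"))   # cafe
print(reduce_len("coooool"))   # coool
```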
