NLP toolkit for tweets

tweet_nlp_toolkit

Tweet NLP toolkit

It can handle:

  • mentions
  • hashtags
  • emojis
  • emoticons
  • emails
  • HTML entities
  • digits
  • URLs
  • punctuation
  • custom words to filter

Installation

python3 -m venv .env
source .env/bin/activate
python -m pip install -U pip
pip install tweet_nlp_toolkit

Usage

Text Parsing

>>> from tweet_nlp_toolkit import parse_text
>>> text = parse_text("123 @hello #world www.url.com 😰 :) abc@gmail.com")
>>> text.tokens
['123', '@hello', '#world', 'www.url.com', '😰', ':)', 'abc@gmail.com']
>>> text.hashtags
['world']
>>> text.mentions
['@hello']
>>> text.urls
['www.url.com']
>>> text.emojis
['😰']
>>> text.emoticons
[':)']
>>> text.digits
['123']
>>> text.emails
['abc@gmail.com']

Tagging entities

>>> from tweet_nlp_toolkit import parse_text
>>> parse_text(
...     "123 @hello #world www.url.com 😰 :) abc@gmail.com",
...     emojis="tag",
...     hashtags="tag",
...     mentions="tag"
... ).tokens
['123', '<MENTION>', '<HASHTAG>', 'www.url.com', '<EMOJI>', ':)', 'abc@gmail.com']
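
To make the "tag" and "remove" options concrete, here is a minimal stand-alone sketch of the idea. It is not the library's implementation: the `PATTERNS` table and the `apply_actions` helper are hypothetical names introduced for illustration, and the regular expressions are deliberately simplistic compared with what a real tweet tokenizer needs.

```python
import re

# Hypothetical patterns for illustration only; the library's own rules
# are assumed to be more thorough than these.
PATTERNS = {
    "mentions": (re.compile(r"^@\w+$"), "<MENTION>"),
    "hashtags": (re.compile(r"^#\w+$"), "<HASHTAG>"),
    "digits": (re.compile(r"^\d+$"), "<DIGIT>"),
}


def apply_actions(tokens, **actions):
    """Apply "tag" or "remove" to tokens, keyed by entity type."""
    result = []
    for token in tokens:
        for kind, (pattern, tag) in PATTERNS.items():
            if pattern.match(token):
                action = actions.get(kind)
                if action == "tag":
                    token = tag      # replace the token by its tag
                elif action == "remove":
                    token = None     # drop the token entirely
                break                # a token has at most one entity type
        if token is not None:
            result.append(token)
    return result


tokens = "123 @hello #world www.url.com".split()
print(apply_actions(tokens, mentions="tag", hashtags="remove"))
# ['123', '<MENTION>', 'www.url.com']
```

Entity types with no action (here, digits) pass through unchanged, which mirrors the documented default of None for each option.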

Preprocessing

>>> from tweet_nlp_toolkit import prep
>>> prep(
...     "123 @hello #world www.url.com 😰 :) abc@gmail.com",
...     emojis="demojize",
...     mentions="remove",
...     hashtags="remove",
...     urls="remove",
...     digits="tag",
...     emails="remove"
... )
'<DIGIT> :anxious_face_with_sweat: :)'
>>> from tweet_nlp_toolkit import prep_file
>>> prep_file("input.txt", "output.txt")
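
The README does not say how prep_file reads its input. As a rough sketch of the general pattern, assuming one text per line, the following applies a string transform to each line of a file and writes the results to another file. `preprocess_file` and its `transform` argument are hypothetical names for illustration, not part of the library.

```python
from typing import Callable


def preprocess_file(src: str, dst: str, transform: Callable[[str], str],
                    encoding: str = "utf-8") -> int:
    """Apply `transform` to every line of `src` and write the results to `dst`.

    Assumes one text per line. Returns the number of lines written.
    """
    count = 0
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            # Strip the trailing newline so the transform sees only the text.
            fout.write(transform(line.rstrip("\n")) + "\n")
            count += 1
    return count


# Example call with a trivial transform:
#     preprocess_file("input.txt", "output.txt", str.lower)
# With the toolkit installed, one could presumably pass something like
#     lambda s: prep(s, urls="remove")
# as the transform instead.
```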

More

parse_text, prep and prep_file share the same parameters. parse_text returns an instance of ParsedText, prep returns the preprocessed string, and prep_file preprocesses an input file and writes the result to an output file.

Parameters
----------
text: str
    The text to preprocess.
tokenizer: Callable[[str], List[Token]]
    The tokenizer used to split the text into tokens.
encoding: str
    The encoding of the text.
    Default "utf-8".
remove_unencodable_char: bool
    A character that causes an encoding error is replaced with '�'.
    If True, every '�' is removed; otherwise each sequence of '�' is collapsed into a single one.
    Default False
to_lower: bool
    Whether to convert the text to lowercase.
    Default True
strip_accents: bool
    Whether to remove accents from Latin characters.
    Default False
reduce_len: bool
    Whether to reduce repeated character sequences (e.g. in elongated words).
    Default False
filters: set
    Tokens to filter out (case-sensitive).
    Default None
emojis: Optional[str]
    How to handle emojis.
    Options:
        - "remove": remove all emojis
        - "tag": replaces the emoji by a tag <EMOJI>
        - "demojize": replaces the emoji by its textual representation, e.g. :musical_keyboard:
            list of emojis: https://www.webfx.com/tools/emoji-cheat-sheet/
        - "emojize": replaces the textual representation of an emoji (e.g. :anxious_face_with_sweat:) by its unicode character, e.g. 😰
    Default None
hashtags: Optional[str]
    How to handle hashtags.
    Options:
        - "remove": delete all hashtags
        - "tag": replaces the hashtag by a tag <HASHTAG>
    Default None
urls: Optional[str]
    How to handle urls.
    Options:
        - "remove": delete all urls
        - "tag": replaces the url by a tag <URL>
    Default None
mentions: Optional[str]
    How to handle mentions.
    Options:
        - "remove": delete all mentions
        - "tag": replaces the mention by a tag <MENTION>
    Default None
digits: Optional[str]
    How to handle digits.
    Options:
        - "remove": delete all digits
        - "tag": replaces the digit by a tag <DIGIT>
    Default None
emoticons: Optional[str]
    How to handle emoticons.
    Options:
        - "remove": delete all emoticons
        - "tag": replaces the emoticon by a tag <EMOTICON>
    Default None
puncts: Optional[str]
    How to handle punctuation.
    Options:
        - "remove": delete all punctuation
        - "tag": replaces the punctuation by a tag <PUNCT>
    Default None
emails: Optional[str]
    How to handle emails.
    Options:
        - "remove": delete all emails
        - "tag": replaces the email by a tag <EMAIL>
    Default None
html_tags: Optional[str]
    How to handle HTML tags like <div>.
    Options:
        - "remove": delete all HTML tags
    Default None
stop_words: Optional[str]
    How to handle stop words. Only English stop words are supported.
    Options:
        - "remove": delete all stop words
    Default None
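
For intuition about the strip_accents and reduce_len options, here is a minimal stand-alone sketch using only the standard library. It is not taken from the library's source: the function bodies are one common way to implement these normalizations, and the cap of three repeated characters (`max_repeat`) is an assumption, not a documented value.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition, e.g. 'café' -> 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def reduce_len(text: str, max_repeat: int = 3) -> str:
    """Cap runs of the same character at `max_repeat` occurrences.

    The default of 3 is an assumption made for this sketch.
    """
    # Match a character followed by at least `max_repeat` copies of itself.
    pattern = re.compile(r"(.)\1{%d,}" % max_repeat)
    return pattern.sub(lambda m: m.group(1) * max_repeat, text)


print(strip_accents("café naïve"))    # cafe naive
print(reduce_len("soooooo goooood"))  # sooo goood
```

Capping rather than fully collapsing repeated characters keeps legitimate double letters (as in "good") intact while still normalizing elongated words.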
