tweet_nlp_toolkit
Tweet NLP toolkit: an NLP toolkit for tweets.
It can handle:
- mentions
- hashtags
- emojis
- emoticons
- emails
- HTML entities
- digits
- urls
- punctuation
- custom words to filter
Installation
python3 -m venv .env
source .env/bin/activate
python -m pip install -U pip
pip install tweet_nlp_toolkit
Usage
Text Parsing
>>> from tweet_nlp_toolkit import parse_text
>>> text = parse_text("123 @hello #world www.url.com 😰 :) abc@gmail.com")
>>> text.tokens
['123', '@hello', '#world', 'www.url.com', '😰', ':)', 'abc@gmail.com']
>>> text.hashtags
['world']
>>> text.mentions
['@hello']
>>> text.urls
['www.url.com']
>>> text.emojis
['😰']
>>> text.emoticons
[':)']
>>> text.digits
['123']
>>> text.emails
['abc@gmail.com']
Tagging entities
>>> from tweet_nlp_toolkit import parse_text
>>> parse_text(
... "123 @hello #world www.url.com 😰 :) abc@gmail.com",
... emojis="tag",
... hashtags="tag",
... mentions="tag"
... ).tokens
['123', '<MENTION>', '<HASHTAG>', 'www.url.com', '<EMOJI>', ':)', 'abc@gmail.com']
Preprocessing
>>> from tweet_nlp_toolkit import prep
>>> prep(
...     "123 @hello #world www.url.com 😰 :) abc@gmail.com",
...     emojis="demojize",
...     mentions="remove",
...     hashtags="remove",
...     urls="remove",
...     digits="tag",
...     emails="remove"
... )
'<DIGIT> :anxious_face_with_sweat: :)'
>>> from tweet_nlp_toolkit import prep_file
>>> prep_file("input.txt", "output.txt")
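prep_file reads a file, preprocesses its contents, and writes the result out. The general read-transform-write pattern can be sketched in plain Python; here str.lower is only a placeholder for the real preprocessing, and prep_file_sketch is a hypothetical name, not the library's implementation:

```python
from pathlib import Path

def prep_file_sketch(in_path: str, out_path: str) -> None:
    """Read in_path line by line, transform each line, write to out_path.
    str.lower stands in for the library's actual preprocessing."""
    lines = Path(in_path).read_text(encoding="utf-8").splitlines()
    processed = [line.lower() for line in lines]  # placeholder transform
    Path(out_path).write_text("\n".join(processed), encoding="utf-8")
```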
More
parse_text, prep and prep_file share the same parameters. parse_text returns an instance of ParsedText,
prep returns the preprocessed string, and prep_file preprocesses a file.
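As a rough illustration of what the "remove" and "tag" options mean, here is a regex-only sketch (not the library's actual implementation; the patterns and function names are assumptions for illustration):

```python
import re

# Simplified stand-ins for entity handling:
# "tag" replaces each matched entity with a placeholder token,
# "remove" deletes it entirely.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def tag_mentions(text: str) -> str:
    """Replace every @mention with the <MENTION> tag."""
    return MENTION_RE.sub("<MENTION>", text)

def remove_hashtags(text: str) -> str:
    """Delete every #hashtag, then collapse leftover whitespace."""
    return re.sub(r"\s+", " ", HASHTAG_RE.sub("", text)).strip()

print(tag_mentions("123 @hello #world"))     # 123 <MENTION> #world
print(remove_hashtags("123 @hello #world"))  # 123 @hello
```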
Parameters
----------
text: str
The text to preprocess.
tokenizer: Callable[[str], List[Token]]
The tokenizer to use.
encoding: str
The encoding of the text.
Default "utf-8".
remove_unencodable_char: bool
When a character cannot be encoded, it is replaced with '�'. If True, these '�' characters are removed entirely;
otherwise a sequence of '�' is collapsed into a single one.
Default False
to_lower: bool
Whether to convert the text to lowercase.
Default True
strip_accents: bool
Whether to remove accents from Latin characters.
Default False
reduce_len: bool
Whether to shorten runs of repeated characters.
Default False
filters: set
Tokens to filter (case sensitive).
Default None
emojis: Optional[str]
How to handle emojis.
Options:
- "remove": remove all emojis
- "tag": replaces the emoji by a tag <EMOJI>
- "demojize": replaces the emoji by its textual representation, e.g. :musical_keyboard:
list of emojis: https://www.webfx.com/tools/emoji-cheat-sheet/
- "emojize": replaces the emoji by its unicode representation, e.g. 😰
Default None
hashtags: Optional[str]
How to handle hashtags.
Options:
- "remove": delete all hashtags
- "tag"replaces the hashtag by a tag <HASHTAG>
Default None
urls: Optional[str]
How to handle urls.
Options:
- "remove": delete all urls
- "tag"replaces the url by a tag <URL>
Default None
mentions: Optional[str]
How to handle mentions.
Options:
- "remove": delete all mentions
- "tag"replaces the mention by a tag <MENTION>
Default None
digits: Optional[str]
How to handle digits.
Options:
- "remove": delete all digits
- "tag"replaces the digit by a tag <DIGIT>
Default None
emoticons: Optional[str]
How to handle emoticons.
Options:
- "remove": delete all emoticons
- "tag"replaces the emoticon by a tag <EMOTICON>
Default None
puncts: Optional[str]
How to handle punctuation marks.
Options:
- "remove": delete all punctuation marks
- "tag": replaces each punctuation mark by a tag <PUNCT>
Default None
emails: Optional[str]
How to handle emails.
Options:
- "remove": delete all emails
- "tag": replaces the email by a tag <EMAIL>
Default None
html_tags: Optional[str]
How to handle HTML tags like <div>.
Options:
- "remove": delete all HTML tags
Default None
stop_words: Optional[str]
How to handle stop words. Only English stop words are supported.
Options:
- "remove": delete all stop words
Default None
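To make two of the less obvious parameters concrete, here is a stdlib-only sketch of roughly what strip_accents and reduce_len do. This is not the library's code, and the cap of three repeats in reduce_len_sketch is an assumption, not taken from the package:

```python
import re
import unicodedata

def strip_accents_sketch(text: str) -> str:
    """Drop combining accent marks after NFD normalization ('café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def reduce_len_sketch(text: str) -> str:
    """Cap runs of the same character at three (assumed behavior)."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(strip_accents_sketch("café naïve"))  # cafe naive
print(reduce_len_sketch("soooo good"))     # sooo good
```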
Download files
Source Distribution: tweet_nlp_toolkit-1.0.5.tar.gz (18.8 kB)
Built Distribution: tweet_nlp_toolkit-1.0.5-py3-none-any.whl (21.3 kB)
File details
Details for the file tweet_nlp_toolkit-1.0.5.tar.gz.
File metadata
- Download URL: tweet_nlp_toolkit-1.0.5.tar.gz
- Upload date:
- Size: 18.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.10.1 urllib3/1.26.15 tqdm/4.64.1 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.5 CPython/3.6.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a4c65488f7ee341acac88a4c48dd3d0af82a0beca1c2a5eae62f051be0e5741b |
| MD5 | 846c3576401042c5bfac0724d4ece767 |
| BLAKE2b-256 | 6a434e33346dfbf4939feddc9ca2eca3c741bc3b797b599e1e45d1293c6edf0f |
File details
Details for the file tweet_nlp_toolkit-1.0.5-py3-none-any.whl.
File metadata
- Download URL: tweet_nlp_toolkit-1.0.5-py3-none-any.whl
- Upload date:
- Size: 21.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.10.1 urllib3/1.26.15 tqdm/4.64.1 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.5 CPython/3.6.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 058601ed0ca6449ce5e8052e8ec471e6ffcb2b37451621584eaf57ec72786cd0 |
| MD5 | 04f7dc6ed2e33c9709351161b8fab1c4 |
| BLAKE2b-256 | 9dc6396980e30581ec64e326d7655becd7d3fe0d4c20a67c41878faac0941075 |