tweet_nlp_toolkit
Tweet NLP toolkit: an NLP toolkit for tweets.
It can handle:
- mentions
- hashtags
- emojis
- emoticons
- emails
- HTML entities
- digits
- urls
- punctuation
- custom words to filter
Installation
python3 -m venv .env
source .env/bin/activate
python -m pip install -U pip
pip install tweet_nlp_toolkit
Usage
Text Parsing
>>> from tweet_nlp_toolkit import parse_text
>>> text = parse_text("123 @hello #world www.url.com 😰 :) abc@gmail.com")
>>> text.tokens
['123', '@hello', '#world', 'www.url.com', '😰', ':)', 'abc@gmail.com']
>>> text.hashtags
['world']
>>> text.mentions
['@hello']
>>> text.urls
['www.url.com']
>>> text.emojis
['😰']
>>> text.emoticons
[':)']
>>> text.digits
['123']
>>> text.emails
['abc@gmail.com']
Tagging entities
>>> from tweet_nlp_toolkit import parse_text
>>> parse_text(
... "123 @hello #world www.url.com 😰 :) abc@gmail.com",
... emojis="tag",
... hashtags="tag",
... mentions="tag"
... ).tokens
['123', '<MENTION>', '<HASHTAG>', 'www.url.com', '<EMOJI>', ':)', 'abc@gmail.com']
Preprocessing
>>> from tweet_nlp_toolkit import prep
>>> prep(
...     "123 @hello #world www.url.com 😰 :) abc@gmail.com",
...     emojis="demojize",
...     mentions="remove",
...     hashtags="remove",
...     urls="remove",
...     digits="tag",
...     emails="remove"
... )
'<DIGIT> :anxious_face_with_sweat: :)'
>>> from tweet_nlp_toolkit import prep_file
>>> prep_file("input.txt", "output.txt")
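prep_file reads a file, preprocesses its contents, and writes the result out. The general read-transform-write pattern can be sketched in plain Python; here str.lower is only a placeholder for the real preprocessing, and prep_file_sketch is a hypothetical name, not the library's implementation:

```python
from pathlib import Path

def prep_file_sketch(in_path: str, out_path: str) -> None:
    """Read in_path line by line, transform each line, write to out_path.
    str.lower stands in for the library's actual preprocessing."""
    lines = Path(in_path).read_text(encoding="utf-8").splitlines()
    processed = [line.lower() for line in lines]  # placeholder transform
    Path(out_path).write_text("\n".join(processed), encoding="utf-8")
```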
More
parse_text, prep and prep_file share the same parameters. parse_text returns an instance of ParsedText,
prep returns the preprocessed string, and prep_file preprocesses a file.
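As a rough illustration of what the "remove" and "tag" options mean, here is a regex-only sketch (not the library's actual implementation; the patterns and function names are assumptions for illustration):

```python
import re

# Simplified stand-ins for entity handling:
# "tag" replaces each matched entity with a placeholder token,
# "remove" deletes it entirely.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def tag_mentions(text: str) -> str:
    """Replace every @mention with the <MENTION> tag."""
    return MENTION_RE.sub("<MENTION>", text)

def remove_hashtags(text: str) -> str:
    """Delete every #hashtag, then collapse leftover whitespace."""
    return re.sub(r"\s+", " ", HASHTAG_RE.sub("", text)).strip()

print(tag_mentions("123 @hello #world"))     # 123 <MENTION> #world
print(remove_hashtags("123 @hello #world"))  # 123 @hello
```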
Parameters
----------
text: str
The text to preprocess.
tokenizer: Callable[[str], List[Token]]
The tokenizer to use.
encoding: str
The encoding of the text.
Default "utf-8".
remove_unencodable_char: bool
When a character cannot be encoded, it is replaced with '�'. If True, these '�' characters are removed entirely;
otherwise a sequence of '�' is collapsed into a single one.
Default False
to_lower: bool
Whether to convert the text to lowercase.
Default True
strip_accents: bool
Whether to remove accents from Latin characters.
Default False
reduce_len: bool
Whether to shorten runs of repeated characters.
Default False
filters: set
Tokens to filter (case sensitive).
Default None
emojis: Optional[str]
How to handle emojis.
Options:
- "remove": remove all emojis
- "tag": replaces the emoji by a tag <EMOJI>
- "demojize": replaces the emoji by its textual representation, e.g. :musical_keyboard:
list of emojis: https://www.webfx.com/tools/emoji-cheat-sheet/
- "emojize": replaces the emoji by its unicode representation, e.g. 😰
Default None
hashtags: Optional[str]
How to handle hashtags.
Options:
- "remove": delete all hashtags
- "tag"replaces the hashtag by a tag <HASHTAG>
Default None
urls: Optional[str]
How to handle urls.
Options:
- "remove": delete all urls
- "tag"replaces the url by a tag <URL>
Default None
mentions: Optional[str]
How to handle mentions.
Options:
- "remove": delete all mentions
- "tag"replaces the mention by a tag <MENTION>
Default None
digits: Optional[str]
How to handle digits.
Options:
- "remove": delete all digits
- "tag"replaces the digit by a tag <DIGIT>
Default None
emoticons: Optional[str]
How to handle emoticons.
Options:
- "remove": delete all emoticons
- "tag"replaces the emoticon by a tag <EMOTICON>
Default None
puncts: Optional[str]
How to handle punctuation marks.
Options:
- "remove": delete all punctuation marks
- "tag": replaces each punctuation mark by a tag <PUNCT>
Default None
emails: Optional[str]
How to handle emails.
Options:
- "remove": delete all emails
- "tag": replaces the email by a tag <EMAIL>
Default None
html_tags: Optional[str]
How to handle HTML tags like <div>.
Options:
- "remove": delete all HTML tags
Default None
stop_words: Optional[str]
How to handle stop words. Only English stop words are supported.
Options:
- "remove": delete all stop words
Default None
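To make two of the less obvious parameters concrete, here is a stdlib-only sketch of roughly what strip_accents and reduce_len do. This is not the library's code, and the cap of three repeats in reduce_len_sketch is an assumption, not taken from the package:

```python
import re
import unicodedata

def strip_accents_sketch(text: str) -> str:
    """Drop combining accent marks after NFD normalization ('café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def reduce_len_sketch(text: str) -> str:
    """Cap runs of the same character at three (assumed behavior)."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(strip_accents_sketch("café naïve"))  # cafe naive
print(reduce_len_sketch("soooo good"))     # sooo good
```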
Download files
Source Distribution: tweet_nlp_toolkit-1.0.5.tar.gz (18.8 kB)
Built Distribution: tweet_nlp_toolkit-1.0.5-py3-none-any.whl (21.3 kB)
File details
Details for the file tweet_nlp_toolkit-1.0.5.tar.gz.
File metadata
- Download URL: tweet_nlp_toolkit-1.0.5.tar.gz
- Upload date:
- Size: 18.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.10.1 urllib3/1.26.15 tqdm/4.64.1 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.5 CPython/3.6.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a4c65488f7ee341acac88a4c48dd3d0af82a0beca1c2a5eae62f051be0e5741b |
| MD5 | 846c3576401042c5bfac0724d4ece767 |
| BLAKE2b-256 | 6a434e33346dfbf4939feddc9ca2eca3c741bc3b797b599e1e45d1293c6edf0f |
File details
Details for the file tweet_nlp_toolkit-1.0.5-py3-none-any.whl.
File metadata
- Download URL: tweet_nlp_toolkit-1.0.5-py3-none-any.whl
- Upload date:
- Size: 21.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.9.6 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.10.1 urllib3/1.26.15 tqdm/4.64.1 importlib-metadata/4.8.3 keyring/23.4.1 rfc3986/1.5.0 colorama/0.4.5 CPython/3.6.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 058601ed0ca6449ce5e8052e8ec471e6ffcb2b37451621584eaf57ec72786cd0 |
| MD5 | 04f7dc6ed2e33c9709351161b8fab1c4 |
| BLAKE2b-256 | 9dc6396980e30581ec64e326d7655becd7d3fe0d4c20a67c41878faac0941075 |