An open-source NLP library: fast text cleaning and preprocessing.
Overview
This library provides quick, ready-to-use tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.
Installation
Getting it
To get dobbi, either fork this GitHub repo or install it from PyPI via pip.
$ pip install dobbi
Usage
Import the library.
import dobbi
Interaction
The library uses method chaining in order to simplify text processing:
dobbi.clean()\
.hashtag()\
.nickname()\
.url()\
.execute('Check here: https://some-url.com')
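To illustrate the idea behind method chaining, here is a minimal sketch in plain Python that uses only the standard `re` module. It is not dobbi's implementation: the class name `ChainSketch`, its regular expressions, and the final whitespace collapsing are all assumptions made for illustration. Each method records a step and returns `self`, which is what lets the calls be chained.

```python
import re


class ChainSketch:
    """Hypothetical stand-in for a chained cleaner; not dobbi's actual code."""

    def __init__(self):
        # Ordered list of compiled patterns to remove.
        self._steps = []

    def _add(self, pattern):
        self._steps.append(re.compile(pattern))
        return self  # returning self is what enables chaining

    def hashtag(self):
        return self._add(r"#\w+")  # assumed hashtag pattern

    def nickname(self):
        return self._add(r"@\w+")  # assumed nickname pattern

    def url(self):
        return self._add(r"https?://\S+")  # assumed URL pattern

    def execute(self, text):
        # Apply the recorded steps in the order they were chained.
        for pattern in self._steps:
            text = pattern.sub("", text)
        # Collapse leftover whitespace (an assumption of this sketch).
        return " ".join(text.split())


cleaned = (
    ChainSketch()
    .hashtag()
    .nickname()
    .url()
    .execute("#fun Why @Alex33 is so funny? Check here: https://some-url.com")
)
print(cleaned)
```

Wrapping the chain in parentheses, as in the usage above, is an alternative to trailing backslashes for splitting a chain across lines.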
Supported patterns
The library supports the following patterns:
- URL
- Punctuation
- Emoji & emoticons
- Hashtags
- Whitespace
- Nicknames
- HTML
- Custom regexp
Examples
- Clean a Twitter message
dobbi.clean()\
.hashtag()\
.nickname()\
.url()\
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result: 'Why is so funny? Check here:'
- Replace nicknames and URLs with tokens
dobbi.replace()\
.hashtag('')\
.nickname()\
.url('CUSTOM_URL_TOKEN')\
.execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result: 'Why TOKEN_NICKNAME is so funny? Check here: CUSTOM_URL_TOKEN'
- Get a reusable text cleanup function
func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol Why @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')
Result: 'Why Alex33 is so funny Check here'
(!) Please try to avoid in-line (single-line) method chaining, as it is significantly less readable.
Additional
Note that the functions are applied in the order you specify, so it is usually best to chain .punctuation() as one of the last steps: removing punctuation earlier can break up text that other patterns, such as URLs, need to match.
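The effect of ordering can be shown with a small sketch in plain Python. It uses the standard `re` module rather than dobbi itself, and the patterns and helper names (`strip_urls`, `strip_punctuation`, `squeeze`) are hypothetical, so the exact behavior of the library may differ.

```python
import re

# Assumed patterns for illustration only; dobbi's own patterns may differ.
URL = re.compile(r"https?://\S+")
PUNCT = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace


def strip_urls(text):
    return URL.sub("", text)


def strip_punctuation(text):
    return PUNCT.sub("", text)


def squeeze(text):
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


text = "Check here: https://some-url.com"

# URLs first, punctuation last: the URL is matched and removed cleanly.
url_first = squeeze(strip_punctuation(strip_urls(text)))

# Punctuation first: "://", "-" and "." are stripped, so the URL pattern
# no longer matches and the remains of the URL stay in the text.
punct_first = squeeze(strip_urls(strip_punctuation(text)))

print(url_first)    # Check here
print(punct_first)  # Check here httpssomeurlcom
```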