An open-source NLP library: fast text cleaning and preprocessing.

Overview

This library provides quick, ready-to-use text preprocessing tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.

Installation

Getting it

To get dobbi, either fork the project's GitHub repo or simply install it from PyPI via pip.

$ pip install dobbi

Usage

Import the library.

import dobbi

Interaction

The library uses method chaining to simplify text processing:

dobbi.clean()\
    .hashtag()\
    .nickname()\
    .url()\
    .execute('Check here: https://some-url.com')
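
To illustrate the idea behind method chaining, here is a minimal sketch built only on the standard `re` module. The `Cleaner` class and its regular expressions are hypothetical stand-ins written for this example; they are not dobbi's actual implementation, and dobbi's own patterns may match differently.

```python
import re


class Cleaner:
    """Hypothetical stand-in for a chainable cleaner (not dobbi's real class)."""

    def __init__(self):
        self._patterns = []

    def _add(self, pattern):
        self._patterns.append(re.compile(pattern))
        return self  # returning self is what makes chaining possible

    def hashtag(self):
        return self._add(r'#\w+')

    def nickname(self):
        return self._add(r'@\w+')

    def url(self):
        return self._add(r'https?://\S+')

    def execute(self, text):
        # Apply the patterns in the order they were chained.
        for pattern in self._patterns:
            text = pattern.sub('', text)
        # Collapse leftover runs of whitespace (an assumption of this sketch).
        return ' '.join(text.split())


result = (
    Cleaner()
    .hashtag()
    .nickname()
    .url()
    .execute('#fun Why @Alex33 is so funny? Check here: https://some-url.com')
)
print(result)
```

Each intermediate method only records a pattern and returns the same object, so nothing is processed until the terminal `execute` call.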

Supported methods and patterns

The process consists of three stages:

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain the needed patterns in the needed order
  3. Terminal methods: run the chain on a string or return it as a reusable function

Initialization functions:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and "<...>" type markups
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticons() - emoticons
  • whitespace() - whitespace
  • nickname() - @-starting nicknames

Terminal methods:

  • execute(str) - executes chosen methods on the provided string.
  • function() - returns a function which is a combination of the chosen methods.
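
The difference between the two terminal methods can be sketched with plain Python: one style applies the steps to a single string right away, the other builds a callable once and reuses it. The `compose` helper and the three step functions below are hypothetical names introduced for this illustration and are not part of dobbi.

```python
import re
from functools import reduce


def compose(*steps):
    """Combine text-processing steps into one reusable function."""
    def pipeline(text):
        # Feed the output of each step into the next one, left to right.
        return reduce(lambda acc, step: step(acc), steps, text)
    return pipeline


def strip_urls(text):
    return re.sub(r'https?://\S+', '', text)


def strip_hashtags(text):
    return re.sub(r'#\w+', '', text)


def collapse_whitespace(text):
    return ' '.join(text.split())


# Build the function once, then apply it to as many strings as needed.
clean = compose(strip_urls, strip_hashtags, collapse_whitespace)
cleaned = [clean(t) for t in ['#fun  hello https://some-url.com', 'no  tags here']]
print(cleaned)
```

Building the function once is convenient when the same cleanup has to run over a whole dataset, for example inside a list comprehension or a `map` call.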

Examples

  1. Clean a Twitter message
dobbi.clean()\
    .hashtag()\
    .nickname()\
    .url()\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why is so funny? Check here:'

  2. Replace nicknames and URLs with tokens
dobbi.replace()\
    .hashtag('')\
    .nickname()\
    .url('CUSTOM_URL_TOKEN')\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why TOKEN_NICKNAME is so funny? Check here: CUSTOM_URL_TOKEN'
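
For comparison, the same token replacement can be approximated with the standard `re` module. This is only a sketch: the regular expressions and the `replace_tokens` helper are assumptions made for this example, while the token strings are taken from the example above.

```python
import re

# (pattern, replacement) pairs, applied in order. The patterns are
# illustrative guesses, not the ones dobbi uses internally.
REPLACEMENTS = [
    (re.compile(r'#\w+'), ''),
    (re.compile(r'@\w+'), 'TOKEN_NICKNAME'),
    (re.compile(r'https?://\S+'), 'CUSTOM_URL_TOKEN'),
]


def replace_tokens(text):
    for pattern, token in REPLACEMENTS:
        text = pattern.sub(token, text)
    # Collapse the whitespace left behind by removed hashtags.
    return ' '.join(text.split())


result = replace_tokens(
    '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'
)
print(result)
```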

  3. Get a text cleanup function
func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol    Why  @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')

Result: 'Why Alex33 is so funny Check here'

(!) Please try to avoid chaining methods on a single line, as it is significantly less readable.

  4. Chain regexp methods
dobbi.clean()\
    .regexp(r'#\w+')\
    .regexp(r'@\w+')\
    .regexp(r'https?://\S+')\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why is so funny? Check here:'

Additional

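Note that the methods are applied in the order you specify. It is therefore best to chain .punctuation() as one of the last methods, since removing punctuation early can prevent patterns that rely on it, such as URLs, hashtags and nicknames, from matching.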
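
The effect of ordering can be reproduced with a small standard-library sketch. The two helper functions are hypothetical and use `string.punctuation` and a simple URL regex as stand-ins; dobbi's own patterns may differ in detail, but the ordering issue is the same.

```python
import re
import string


def strip_punctuation(text):
    # Remove every ASCII punctuation character.
    return text.translate(str.maketrans('', '', string.punctuation))


def strip_urls(text):
    return re.sub(r'https?://\S+', '', text)


text = 'Check here: https://some-url.com'

# URLs first, punctuation last: the URL is removed cleanly.
url_first = strip_punctuation(strip_urls(text)).strip()

# Punctuation first: '://' and '.' are gone, so the URL pattern no longer
# matches and the mangled address stays in the text.
punct_first = strip_urls(strip_punctuation(text)).strip()

print(url_first)
print(punct_first)
```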
