
An open-source NLP library: fast text cleaning and preprocessing.

Overview

This library provides quick, ready-to-use text preprocessing tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.

Installation

Getting it

To get dobbi, either fork the GitHub repo or install it from PyPI with pip.

$ pip install dobbi

Usage

Import the library.

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

dobbi.clean()\
    .hashtag()\
    .nickname()\
    .url()\
    .execute('Check here: https://some-url.com')

Supported methods and patterns

The process consists of three stages:

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain needed patterns in the needed order
  3. Terminal methods: run the chain on a string, or return it as a reusable function

Initialization functions:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and other "<...>"-style markup
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticons() - emoticons
  • whitespace() - whitespace
  • nickname() - @-starting nicknames

Terminal methods:

  • execute(str) - executes chosen methods on the provided string.
  • function() - returns a function which is a combination of the chosen methods.
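The three stages can be pictured with a small stand-in class. This is only a sketch of the general method-chaining idea, built on the standard re module; CleanSketch, its attributes, and its patterns are hypothetical and are not dobbi's actual implementation.

```python
import re
from typing import Callable, List


class CleanSketch:
    """Hypothetical stand-in for a chained cleaner (not dobbi's code)."""

    def __init__(self) -> None:
        # Initialization stage: start with an empty list of patterns.
        self._patterns: List[str] = []

    def regexp(self, pattern: str) -> "CleanSketch":
        # Intermediate stage: record the pattern, return self so calls chain.
        self._patterns.append(pattern)
        return self

    def hashtag(self) -> "CleanSketch":
        # Assumed pattern for hashtags.
        return self.regexp(r"#\w+")

    def nickname(self) -> "CleanSketch":
        # Assumed pattern for @-starting nicknames.
        return self.regexp(r"@\w+")

    def function(self) -> Callable[[str], str]:
        # Terminal stage: freeze the current chain into a reusable function.
        compiled = [re.compile(p) for p in self._patterns]

        def run(text: str) -> str:
            for pattern in compiled:  # applied in the order they were chained
                text = pattern.sub("", text)
            return text

        return run

    def execute(self, text: str) -> str:
        # Terminal stage: build the function and apply it once.
        return self.function()(text)


cleaned = CleanSketch().hashtag().nickname().execute("#fun hi @bob!")
```

Each intermediate method returns the object itself, which is what makes the chained call style possible. This sketch does not normalize whitespace, so the gaps left by removed matches stay in the string.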

Examples

  1. Clean a twitter message
dobbi.clean()\
    .hashtag()\
    .nickname()\
    .url()\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why is so funny? Check here:'

  2. Replace nickname and URL with tokens
dobbi.replace()\
    .hashtag('')\
    .nickname()\
    .url('CUSTOM_URL_TOKEN')\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why TOKEN_NICKNAME is so funny? Check here: CUSTOM_URL_TOKEN'
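For comparison, the same substitution can be approximated with the standard re module. The patterns below and the final whitespace cleanup are assumptions chosen to reproduce the result shown above; they are not taken from dobbi's source.

```python
import re

text = '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'

# (pattern, replacement) pairs, applied in order. The patterns are assumed.
replacements = [
    (r'#\w+', ''),                         # drop hashtags entirely
    (r'@\w+', 'TOKEN_NICKNAME'),           # token shown in the result above
    (r'https?://\S+', 'CUSTOM_URL_TOKEN'),
]

for pattern, token in replacements:
    text = re.sub(pattern, token, text)

# Collapse runs of whitespace and trim both ends.
result = ' '.join(text.split())
```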

  3. Get text cleanup function
func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol    Why  @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')

Result: 'Why Alex33 is so funny Check here'

(!) Please try to avoid in-line method chaining, as it is significantly less readable.

  4. Chain regexp methods
dobbi.clean()\
    .regexp(r'#\w+')\
    .regexp(r'@\w+')\
    .regexp(r'https?://\S+')\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why is so funny? Check here:'
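As a rough cross-check, the same three patterns can be applied one after another with re.sub. The whitespace cleanup at the end is an assumption made so the output matches the result above; how dobbi itself tidies leftover spaces is not stated here.

```python
import re

text = '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'

# Raw strings keep backslashes such as \w and \S intact for the regex engine.
for pattern in (r'#\w+', r'@\w+', r'https?://\S+'):
    text = re.sub(pattern, '', text)

# Assumed cleanup step: collapse whitespace left behind by the removals.
result = ' '.join(text.split())
```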

Additional

Note that the methods are applied in the order you specify. It is therefore best to chain .punctuation() as one of the last methods, since removing punctuation early can strip characters (such as # or @) that other patterns rely on.
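The effect of ordering can be seen with plain re. This is a sketch under the assumption that punctuation removal covers the characters in string.punctuation and that hashtags match '#\w+'; dobbi's own patterns may differ.

```python
import re
import string

# Assumed patterns, for illustration only.
PUNCT = re.compile('[' + re.escape(string.punctuation) + ']')
HASHTAG = re.compile(r'#\w+')

text = 'great #fun day'

# Hashtags first: the whole '#fun' tag is removed, then punctuation.
hashtag_first = PUNCT.sub('', HASHTAG.sub('', text))

# Punctuation first: '#' is stripped, so 'fun' no longer looks like a hashtag.
punct_first = HASHTAG.sub('', PUNCT.sub('', text))
```

With punctuation removed first, the word 'fun' survives because the hashtag pattern has nothing left to match.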
