An open-source NLP library: fast text cleaning and preprocessing.

Overview

This library provides quick, ready-to-use text preprocessing tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.

Installation

Getting it

To get dobbi, either fork the GitHub repo or install it from PyPI via pip.

$ pip install dobbi

Usage

Import the library.

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

dobbi.clean()\
    .hashtag()\
    .nickname()\
    .url()\
    .execute('Check here: https://some-url.com')
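
To make the chaining idea concrete, here is a minimal sketch of how such an interface could be built on top of the standard `re` module. The class name `CleanSketch`, its regular expressions, and the whitespace handling in `execute` are assumptions made for illustration; this is not dobbi's actual implementation.

```python
import re


class CleanSketch:
    """Hypothetical stand-in for a chained cleaner; not dobbi's own code."""

    def __init__(self):
        self._patterns = []  # compiled regexes, in the order they were chained

    def _add(self, pattern):
        self._patterns.append(re.compile(pattern))
        return self  # returning self is what makes chaining possible

    def hashtag(self):
        return self._add(r'#\w+')

    def nickname(self):
        return self._add(r'@\w+')

    def url(self):
        return self._add(r'https?://\S+')

    def execute(self, text):
        # Remove every chained pattern, in chaining order.
        for pattern in self._patterns:
            text = pattern.sub('', text)
        # Collapse leftover runs of whitespace and trim the ends.
        return ' '.join(text.split())


result = CleanSketch().hashtag().nickname().url().execute(
    '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'
)
print(result)
```

Each intermediate method only records a pattern and returns the same object, so nothing touches the text until the terminal `execute` call.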

Supported methods and patterns

The process consists of three stages:

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain the patterns you need, in the order you need them
  3. Terminal methods: run the chain on a string, or return it as a reusable function

Initialization methods:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and "<...>" type markups
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticons() - emoticons
  • whitespace() - whitespace
  • nickname() - @-starting nicknames

Terminal methods:

  • execute(str) - executes chosen methods on the provided string.
  • function() - returns a function which is a combination of the chosen methods.
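
The difference between the two terminal methods can be sketched with plain Python: one applies the steps immediately, the other returns a callable you can reuse on many strings. The helper `make_cleaner` below is hypothetical and uses only the standard library; it illustrates the idea of a `function()`-style terminal, not how dobbi implements it.

```python
import re
from functools import reduce


def make_cleaner(*patterns):
    """Return one callable that removes every pattern, in the given order."""
    compiled = [re.compile(p) for p in patterns]

    def cleaner(text):
        # Fold the text through each regex in turn.
        stripped = reduce(lambda acc, rx: rx.sub('', acc), compiled, text)
        # Normalize whitespace left behind by the removals.
        return ' '.join(stripped.split())

    return cleaner


# Build the cleaner once, then reuse it across a batch of strings.
clean_tweet = make_cleaner(r'#\w+', r'https?://\S+')
cleaned = [clean_tweet(t) for t in ['#news Rain today', 'Read https://some-url.com now']]
print(cleaned)
```

Building the callable once is convenient when the same cleanup has to be mapped over a whole dataset.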

Examples

  1. Clean a Twitter message
dobbi.clean()\
    .hashtag()\
    .nickname()\
    .url()\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why is so funny? Check here:'

  2. Replace nicknames and URLs with tokens
dobbi.replace()\
    .hashtag('')\
    .nickname()\
    .url('CUSTOM_URL_TOKEN')\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why TOKEN_NICKNAME is so funny? Check here: CUSTOM_URL_TOKEN'
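
For readers who want to see what token replacement amounts to, here is a rough standard-library equivalent. The patterns, the `replace_sketch` helper, and the whitespace normalization are assumptions for illustration only; the token names are taken from the example above.

```python
import re

# Hypothetical patterns paired with replacement tokens; dobbi's own may differ.
REPLACEMENTS = [
    (re.compile(r'#\w+'), ''),                      # hashtags replaced with nothing
    (re.compile(r'@\w+'), 'TOKEN_NICKNAME'),        # token as shown in the result above
    (re.compile(r'https?://\S+'), 'CUSTOM_URL_TOKEN'),
]


def replace_sketch(text):
    # Substitute each pattern with its token, in list order.
    for pattern, token in REPLACEMENTS:
        text = pattern.sub(token, text)
    # Collapse leftover runs of whitespace and trim the ends.
    return ' '.join(text.split())


replaced = replace_sketch(
    '#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com'
)
print(replaced)
```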

  3. Get a text cleanup function
func = dobbi.clean().url().hashtag().punctuation().whitespace().html().function()
func('\t #fun #lol    Why  @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')

Result: 'Why Alex33 is so funny Check here'

(!) Please try to avoid in-line method chaining, as it is significantly less readable.

  4. Chain regexp methods
# Raw strings keep backslashes such as \w and \S intact for the regex engine.
dobbi.clean()\
    .regexp(r'#\w+')\
    .regexp(r'@\w+')\
    .regexp(r'https?://\S+')\
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result: 'Why is so funny? Check here:'

Additional

Note that the functions are applied in the order you specify, so it is best to chain .punctuation() as one of the last functions. If it runs earlier, it can strip characters such as '@' that other patterns need in order to match.
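
The effect of ordering can be reproduced with the standard library alone. The sketch below uses hypothetical URL and punctuation patterns (not dobbi's own) and a hypothetical `run_steps` helper to show how stripping punctuation first leaves a URL unrecognizable to a later URL step.

```python
import re
import string

# Illustrative patterns; the ones dobbi uses internally may differ.
URL = re.compile(r'https?://\S+')
PUNCT = re.compile('[' + re.escape(string.punctuation) + ']')


def run_steps(text, steps):
    # Apply each removal in the given order, then normalize whitespace.
    for rx in steps:
        text = rx.sub('', text)
    return ' '.join(text.split())


text = 'Check here: https://some-url.com'
url_first = run_steps(text, [URL, PUNCT])
punct_first = run_steps(text, [PUNCT, URL])

print(url_first)    # Check here
print(punct_first)  # Check here httpssomeurlcom
```

With punctuation removed first, the '://' that the URL pattern depends on is already gone, so the remains of the address stay in the text.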
