textpipe: clean and extract metadata from text

textpipe is a Python package for converting raw text into clean, readable text and extracting metadata from it. For example, it removes HTML tags and extracts metadata such as the number of words and the named entities in the text.

Vision: the zen of textpipe

  • Designed for use in production pipelines without adult supervision.
  • Rechargeable batteries included: provide sane defaults and clear examples to adapt.
  • A uniform interface with thin wrappers around state-of-the-art NLP packages.
  • As language-agnostic as possible.
  • Bring your own models.

Features

  • Clean raw text by removing HTML and other unreadable constructs
  • Identify the language of text
  • Extract the number of words, the number of sentences, and the named entities from a text
  • Calculate the complexity of a text
  • Obtain text metadata by specifying a pipeline containing all desired elements
  • Obtain sentiment (polarity and a subjectivity score)
  • Generate word counts
  • Compute minhash for cheap similarity estimation of documents
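
To illustrate what "minhash for cheap similarity estimation" means, here is a minimal standard-library sketch of the general technique. It is not textpipe's implementation; the helper names (`shingles`, `minhash_signature`, `estimate_jaccard`) and the parameters `k` and `num_perm` are hypothetical and chosen only for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """Hash an item together with a seed to simulate one of many hash functions."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """Keep, for each seeded hash function, the minimum hash over all items."""
    if not items:
        raise ValueError("cannot compute a signature for an empty set")
    return [min(_seeded_hash(seed, item) for item in items)
            for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
similarity = estimate_jaccard(sig_a, sig_b)  # approximates the shingle-set overlap
```

Signatures have a fixed length regardless of document size, so comparing two documents costs `num_perm` integer comparisons instead of a full set intersection.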

Installation

We recommend installing textpipe in a virtual environment. Create one with any of the following:

  • Using venv.
    python3 -m venv .venv
  • Using virtualenv.
    virtualenv venv -p python3.6
  • Using virtualenvwrapper.
    mkvirtualenv textpipe -p python3.6

Then install the package and its dependencies:

  • Install textpipe using pip.
    pip install textpipe
  • Install the required packages using requirements.txt.
    pip install -r requirements.txt

A note on spaCy download model requirement

The requirements.txt file that comes with the package calls for spaCy's en_core_web_sm model, but you can change this to whichever model and language your intended use requires. See spaCy.io's page on the available models for more information.

Usage example

>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
Sample text!
>>> print(document.language)
en
>>> print(document.nwords)
2

>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 3}

To extend the existing textpipe operations with your own custom operations:

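from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

# A custom operation receives the document plus optional context and settings.
def custom_op(doc, context=None, settings=None, **kwargs):
    return 1

# Register the operation under a name, then append it as a step with its arguments.
custom_argument = {'argument': 1}
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))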
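
To make the registration mechanism easier to picture, the following is a toy, self-contained sketch of a name-to-callable registry with ordered steps. It only mirrors the calling convention shown above (`doc`, `context`, `settings`); `MiniPipeline` and `n_tokens` are hypothetical names for this illustration and do not reflect how textpipe is actually implemented.

```python
class MiniPipeline:
    """Toy pipeline (hypothetical): runs registered operations in step order."""

    def __init__(self, steps=None):
        # Each step is either 'NAME' or a ('NAME', settings_dict) tuple.
        self.steps = list(steps or [])
        # Registry mapping operation names to callables.
        self.operations = {}

    def register_operation(self, name, func):
        """Make a callable available under the given step name."""
        self.operations[name] = func

    def __call__(self, text, context=None):
        results = {}
        for step in self.steps:
            name, settings = step if isinstance(step, tuple) else (step, {})
            if name not in self.operations:
                raise KeyError(f"operation {name!r} is not registered")
            results[name] = self.operations[name](
                text, context=context, settings=settings)
        return results


def n_tokens(doc, context=None, settings=None, **kwargs):
    """Count whitespace-separated tokens (a stand-in for a built-in operation)."""
    return len(doc.split())


def custom_op(doc, context=None, settings=None, **kwargs):
    """Return the configured argument to show that settings reach the operation."""
    return (settings or {}).get('argument')


pipe = MiniPipeline(['N_TOKENS'])
pipe.register_operation('N_TOKENS', n_tokens)
pipe.register_operation('CUSTOM_STEP', custom_op)
pipe.steps.append(('CUSTOM_STEP', {'argument': 1}))
result = pipe('Sample text!')
print(result)  # {'N_TOKENS': 2, 'CUSTOM_STEP': 1}
```

Keeping the registry as a dict keyed by step name means a custom step can be added without touching the pipeline class itself.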

Contributing

See CONTRIBUTING for contributor guidelines.

Changes

0.12.1

  • Bumps redis, tqdm, pylint

0.12.0

  • Bumps versions of many dependencies including textacy. Results for keyterm extraction changed.

0.11.9

  • Exposes arbitrary spaCy ents properties

0.11.8

  • Exposes spaCy's cats attribute

0.11.7

  • Bumps spaCy and redis versions

0.11.6

  • Fixes bug where Gensim model is not cached in pipeline

0.11.5

  • Raises TextpipeMissingModelException instead of KeyError

0.11.4

  • Bumps spaCy and datasketch dependencies

0.11.1

  • Replaces codacy with pylint on CI
  • Fixes pylint issues

0.11.0

  • Adds wrapper around Gensim keyed vectors to construct document embeddings from Redis cache

0.9.0

  • Adds functionality to compute document embeddings using a Gensim word2vec model

0.8.6

  • Removes non-standard UTF characters before detecting language

0.8.5

  • Bumps spaCy to 2.1.3

0.8.4

  • Fixes broken install command

0.8.3

  • Fixes broken install command

0.8.2

  • Fixes copy-paste error in word vector aggregation (#118)

0.8.1

  • Fixes bugs in several operations that didn't accept kwargs

0.8.0

  • Bumps spaCy to 2.1

0.7.2

  • Pins spaCy and Pattern versions (with pinned lxml)

0.7.0

  • Changes the operations registry from a list to a dict
  • Makes global pipeline data available across operations via the context kwarg
  • Loads custom operations using register_operation in pipeline
  • Supports custom steps (operations) with arguments
