textpipe: clean and extract metadata from text


textpipe is a Python package for converting raw text into clean, readable text and extracting metadata from it. It strips HTML tags and other unreadable constructs, and it extracts metadata such as the number of words and the named entities in the text.

Vision: the zen of textpipe

Designed for use in production pipelines without adult supervision.
Rechargeable batteries included: provide sane defaults and clear examples to adapt.
A uniform interface with thin wrappers around state-of-the-art NLP packages.
As language-agnostic as possible.
Bring your own models.

Features

Clean raw text by removing HTML and other unreadable constructs
Identify the language of text
Extract the number of words, the number of sentences, and the named entities from a text
Calculate the complexity of a text
Obtain text metadata by specifying a pipeline containing all desired elements
Obtain sentiment (polarity and a subjectivity score)
Generate word counts
Compute MinHash signatures for cheap similarity estimation of documents
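
The MinHash feature listed above can be illustrated without textpipe itself. The following is a minimal sketch in plain Python, assuming word-level shingles and salted SHA-1 hashes. The names `shingles`, `minhash_signature`, and `estimate_jaccard` are hypothetical and are not part of the textpipe API; textpipe's own implementation may differ.

```python
import hashlib


def shingles(text, size=2):
    """Return the set of word n-grams (shingles) of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item, salted with a seed."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_perm=128):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    if not items:
        raise ValueError("cannot compute a signature for an empty set")
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Two documents that differ in a single word (illustrative data only).
doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy dog"
sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
print(estimate_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of document size, which is what makes the comparison cheap: two documents are compared position by position instead of by intersecting their full shingle sets.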

Usage example

>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
Sample text!
>>> print(document.language)
en
>>> print(document.nwords)
2

>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 2}

To extend the existing textpipe operations with your own custom operations:

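from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

def custom_op(doc, context=None, settings=None, **kwargs):
    # A custom operation takes the document plus optional context and settings.
    return 1

custom_argument = {'argument': 1}
# Register the operation under a name, then append it as a pipeline step.
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))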
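
To make the registration pattern above concrete without depending on textpipe, here is a toy sketch of a pipeline with a dict-based operation registry. `MiniPipeline`, `count_words`, and the way step arguments reach the operation (passed as `settings`) are assumptions made for illustration; this is not textpipe's `Pipeline` class and does not describe its internals.

```python
class MiniPipeline:
    """Toy stand-in for a pipeline: a dict registry of operations plus ordered steps."""

    def __init__(self, steps=None, **context):
        self.registry = {}              # operation name -> callable
        self.steps = list(steps or [])  # ordered list of (name, settings) tuples
        self.context = context          # shared data visible to every operation

    def register_operation(self, name, func):
        """Make a callable available under the given step name."""
        self.registry[name] = func

    def __call__(self, text):
        """Run every step on the text and collect results keyed by step name."""
        results = {}
        for name, settings in self.steps:
            if name not in self.registry:
                raise KeyError(f"operation {name!r} is not registered")
            operation = self.registry[name]
            results[name] = operation(text, context=self.context, settings=settings)
        return results


def count_words(doc, context=None, settings=None, **kwargs):
    """Example operation: count whitespace-separated tokens."""
    return len(doc.split())


def custom_op(doc, context=None, settings=None, **kwargs):
    """Example custom operation: return the configured argument (assumed to arrive via settings)."""
    return (settings or {}).get('argument')


pipe = MiniPipeline()
pipe.register_operation('NWords', count_words)
pipe.register_operation('CUSTOM_STEP', custom_op)
pipe.steps.append(('NWords', None))
pipe.steps.append(('CUSTOM_STEP', {'argument': 1}))
print(pipe('Sample text!'))  # {'NWords': 2, 'CUSTOM_STEP': 1}
```

Keeping the registry as a dict means a step name resolves to its callable in one lookup, and registering a name twice simply replaces the earlier operation.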

Contributing

See CONTRIBUTING for guidelines for contributors.

Changes

0.11.1

Replaces codacy with pylint on CI
Fixes pylint issues

0.11.0

Adds wrapper around Gensim keyed vectors to construct document embeddings from Redis cache

0.9.0

Adds functionality to compute document embeddings using a Gensim word2vec model

0.8.6

Removes non-standard UTF characters before detecting language

0.8.5

Bump spaCy to 2.1.3

0.8.4

Fix broken install command

0.8.3

Fix broken install command

0.8.2

Fix copy-paste error in word vector aggregation (#118)

0.8.1

Fixes bugs in several operations that didn't accept kwargs

0.8.0

Bumps spaCy to 2.1

0.7.2

Pins spaCy and Pattern versions (with pinned lxml)

0.7.0

Changes the operation registry from a list to a dict
Makes global pipeline data available across operations via the context kwarg
Loads custom operations using register_operation in pipeline
Supports custom steps (operations) with arguments

Release history

Version	Release date
0.12.2	Jan 25, 2021
0.12.1	Oct 12, 2020
0.12.0	Oct 1, 2020
0.11.10	Feb 18, 2020
0.11.9	Feb 18, 2020
0.11.8	Feb 13, 2020
0.11.7	Feb 5, 2020
0.11.6	Dec 12, 2019
0.11.5	Aug 6, 2019
0.11.4	Jul 18, 2019
0.11.3 (this version)	Jul 18, 2019
0.11.2	Jul 18, 2019
0.10.1	Jul 11, 2019
0.10.0	Jun 13, 2019
0.9.0	May 9, 2019
0.8.6	Apr 1, 2019
0.8.5	Apr 1, 2019
0.8.4	Mar 29, 2019
0.8.3	Mar 29, 2019
0.8.2	Mar 29, 2019
0.8.1	Mar 27, 2019
0.8.0	Mar 27, 2019
0.7.2	Mar 27, 2019
0.7.1	Jan 14, 2019
0.7.0	Jan 14, 2019
0.6.3	Dec 4, 2018
0.6.2	Dec 4, 2018
0.6.1	Nov 1, 2018
0.6.0	Sep 28, 2018
0.5.2	Sep 27, 2018
0.5.1	Sep 26, 2018
0.4.1	Sep 25, 2018
0.3.2	Aug 28, 2018
0.3.1	Aug 28, 2018
0.3.0	Aug 27, 2018
0.2.0	Aug 7, 2018
0.1.2	Jul 24, 2018
0.1.0	Jun 28, 2018

Download files

Source Distribution

textpipe-0.11.3.tar.gz (35.2 kB)

Uploaded Jul 18, 2019 Source

File details

Details for the file textpipe-0.11.3.tar.gz.

File metadata

Download URL: textpipe-0.11.3.tar.gz
Upload date: Jul 18, 2019
Size: 35.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/3.6.7

File hashes

Hashes for textpipe-0.11.3.tar.gz
Algorithm	Hash digest
SHA256	`cf7f435c3b25ae26725bd286a0c0f23be17a1416d959e3aaf569180a1bbdc41f`
MD5	`80f26441e36f0f9059abd32c47eaa06f`
BLAKE2b-256	`5f7b21840c061defc809f7e3561d957c455653de2e6ff67724e59da5f867208c`