
NLP Preprocessing Wrappers


How to use

Install

Install the library from PyPI:

pip install nlp-preprocessing-wrappers

Usage

NLP Preprocessing Wrappers is a Python library that provides preprocessing wrappers around Stanza and spaCy, exposing a unified API for both libraries so that they can be used interchangeably.

Let's start with a simple example. Here we are using the SpacyTokenizer wrapper to preprocess a text:

from nlp_preprocessing_wrappers import SpacyTokenizer

spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""

You can load any spaCy model, either by its canonical name (e.g. en_core_web_sm) or by a simple alias like en, as we did here. By default, the alias loads the smaller version of each model. For a complete list of available models, see the spaCy documentation.
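
For example, both of the following should load an English pipeline (a minimal sketch based on the description above; which model the en alias resolves to depends on the defaults):

from nlp_preprocessing_wrappers import SpacyTokenizer

# Load by canonical spaCy model name.
tokenizer_full = SpacyTokenizer(language="en_core_web_sm")

# Load by alias; by default this resolves to the small model.
tokenizer_alias = SpacyTokenizer(language="en")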

In the same way, you can load any Stanza model using the StanzaTokenizer wrapper:

from nlp_preprocessing_wrappers import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""

For simpler scenarios, you can use the WhitespaceTokenizer wrapper, which simply splits the text on whitespace:

from nlp_preprocessing_wrappers import WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))

"""
0    Mary
1    sold
2    the
3    car
4    to
5    John
6    .
"""

Features

Complete preprocessing pipeline

SpacyTokenizer and StanzaTokenizer provide a unified API for both libraries, exposing most of their features: tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can enable or disable each of these with the return_pos_tags, return_lemmas, and return_deps flags. For example,

StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)

will return a list of Token objects with the pos and lemma fields filled, while

StanzaTokenizer(language="en")

will return a list of Token objects with only the text field filled.
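
A quick way to compare the two configurations is a sketch like the following (assuming fields that were not requested are simply left unset on each Token):

from nlp_preprocessing_wrappers import StanzaTokenizer

text = "Mary sold the car to John."

# Full pipeline: pos and lemma are populated for each token.
full = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
print([(w.text, w.pos, w.lemma) for w in full(text)])

# Tokenization only: only the text field is guaranteed to be filled.
plain = StanzaTokenizer(language="en")
print([w.text for w in plain(text)])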

GPU support

With use_gpu=True, the library will use the GPU if it is available. To set up the environment for the GPU, refer to the Stanza documentation and the spaCy documentation.
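
For instance, a minimal sketch (assuming the environment already has a working GPU setup for the chosen backend):

from nlp_preprocessing_wrappers import SpacyTokenizer, StanzaTokenizer

# Each wrapper uses the GPU if one is available, otherwise it runs on the CPU.
spacy_gpu = SpacyTokenizer(language="en", use_gpu=True)
stanza_gpu = StanzaTokenizer(language="en", use_gpu=True)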

API

Tokenizers

SpacyTokenizer

class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):

StanzaTokenizer

class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):

WhitespaceTokenizer

class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):

Sentence Splitter

SpacySentenceSplitter

class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):
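
A minimal usage sketch for the sentence splitter, assuming it is called on raw text like the tokenizers and returns the individual sentences (this calling convention is an assumption, not confirmed by the documentation above):

from nlp_preprocessing_wrappers import SpacySentenceSplitter

sentence_splitter = SpacySentenceSplitter(language="en", model_type="statistical")

# Assumed calling convention: one entry per sentence.
sentences = sentence_splitter("Mary sold the car to John. John paid in cash.")
for sentence in sentences:
    print(sentence)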
