NLP Preprocessing Pipeline Wrappers

🍺IPA: import, preprocess, accelerate

How to use

Install

Install the library from PyPI:

pip install ipa-core

Usage

IPA is a Python library that provides preprocessing wrappers for Stanza and spaCy, exposing a unified API that makes the two libraries interchangeable.

Let's start with a simple example. Here we are using the SpacyTokenizer wrapper to preprocess a text:

from ipa import SpacyTokenizer

spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""

You can load any spaCy model, either by its canonical name (e.g. en_core_web_sm) or by a simple alias like the en used here. By default, the alias loads the smallest version of each model. For a complete list of available models, see the spaCy documentation.
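
For instance, loading the pipeline by its canonical name should be equivalent to the alias above. A minimal sketch, assuming the language argument accepts canonical model names directly:

from ipa import SpacyTokenizer

# Assumption: `language` also accepts a canonical spaCy model name,
# here the small English pipeline loaded explicitly.
spacy_tokenizer = SpacyTokenizer(language="en_core_web_sm", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")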

In the very same way, you can load any model from Stanza using the StanzaTokenizer wrapper:

from ipa import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""

For simpler scenarios, you can use the WhitespaceTokenizer wrapper, which just splits the text on whitespace:

from ipa import WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))

"""
0    Mary
1    sold
2    the
3    car
4    to
5    John
6    .
"""

Features

Complete preprocessing pipeline

SpacyTokenizer and StanzaTokenizer provide a unified API for both libraries, exposing most of their features: tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can activate or deactivate each of these with the return_pos_tags, return_lemmas, and return_deps flags. So, for example,

StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)

will return a list of Token objects, with the pos and lemma fields filled.

while

StanzaTokenizer(language="en")

will return a list of Token objects, with only the text field filled.
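
Similarly, return_deps=True should populate the dependency-parsing fields. A minimal sketch; the attribute names head and dep are assumptions, since the dependency fields of the Token class are not documented here:

from ipa import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_deps=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    # `head` and `dep` are assumed names for the dependency head index
    # and relation label; the actual Token fields may differ.
    print(word.index, word.text, word.head, word.dep)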

GPU support

With use_gpu=True, the library will use the GPU if it is available. To set up the environment for the GPU, refer to the Stanza documentation and the spaCy documentation.
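
For example, a sketch of enabling GPU execution; per the note above, the flag falls back to CPU when no GPU is available:

from ipa import SpacyTokenizer

# use_gpu=True uses the GPU if one is available and the environment is
# set up as described in the linked documentation.
spacy_tokenizer = SpacyTokenizer(language="en", use_gpu=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")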

API

Tokenizers

SpacyTokenizer

class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):

StanzaTokenizer

class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
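
Both SpacyTokenizer and StanzaTokenizer also accept a split_on_spaces flag. A hedged sketch of the assumed semantics: when True, the input is treated as already split on whitespace, and the pipeline only adds tags and lemmas on top of those tokens:

from ipa import StanzaTokenizer

# Assumed behavior of split_on_spaces=True: tokens are taken verbatim
# from the whitespace splits instead of being re-tokenized.
stanza_tokenizer = StanzaTokenizer(language="en", split_on_spaces=True, return_pos_tags=True)
tokenized = stanza_tokenizer("Mary sold the car to John .")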

WhitespaceTokenizer

class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):

Sentence Splitter

SpacySentenceSplitter

class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):
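
A minimal usage sketch, assuming the splitter is callable like the tokenizers and returns one entry per sentence (the exact return type is not documented here):

from ipa import SpacySentenceSplitter

sentence_splitter = SpacySentenceSplitter(language="en", model_type="statistical")
# Assumption: calling the splitter yields an iterable with one item
# per detected sentence.
sentences = sentence_splitter("Mary sold the car to John. John was happy.")
for sentence in sentences:
    print(sentence)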

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ipa-core-0.1.3.tar.gz (14.7 kB)

Uploaded Source

Built Distribution

ipa_core-0.1.3-py3-none-any.whl (16.3 kB)

Uploaded Python 3

File details

Details for the file ipa-core-0.1.3.tar.gz.

File metadata

  • Download URL: ipa-core-0.1.3.tar.gz
  • Upload date:
  • Size: 14.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for ipa-core-0.1.3.tar.gz

Algorithm    Hash digest
SHA256       a267dceb7ef5c91802735d1a40f09300d03256f5a74fe0d08fae6beba81ab4ae
MD5          8226aa5196a8c1db0e46bafbd889ce82
BLAKE2b-256  624d9926c1f3dabec4aff2ccedb869b5db867908eda64a37d625104252c295b4


File details

Details for the file ipa_core-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: ipa_core-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 16.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.11.3

File hashes

Hashes for ipa_core-0.1.3-py3-none-any.whl

Algorithm    Hash digest
SHA256       690c3a8ca174ef79ce6df9d090a4097a1fef5bed812a42ae77f3f5b8b0523565
MD5          27358c68bce1365a3a75c0a5cdcb0b08
BLAKE2b-256  3ddbc45f50252666cd5ffe5a4b494806468179982d8267512ec0d65187d405ed

