NLP Preprocessing Pipeline Wrappers

Project description

🍺IPA: import, preprocess, accelerate


How to use

Install

Install the library from PyPI:

pip install ipa-core

Usage

IPA is a Python library that provides preprocessing wrappers for Stanza and spaCy, exposing a unified API that makes the two libraries interchangeable.

Let's start with a simple example. Here we use the SpacyTokenizer wrapper to preprocess a text:

from ipa import SpacyTokenizer

spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""

You can load any spaCy model, either by its canonical name, e.g. en_core_web_sm, or by a simple alias, like the en we used here. By default, the alias loads the smallest version of the model. For a complete list of available models, see the spaCy documentation.
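
For instance, to pin a specific model rather than rely on the alias, pass its canonical name as language (a minimal sketch based on the example above):

from ipa import SpacyTokenizer

# Load the full model name instead of the "en" alias.
spacy_tokenizer = SpacyTokenizer(language="en_core_web_sm", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")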

In the same way, you can load any model from Stanza using the StanzaTokenizer wrapper:

from ipa import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0    Mary       PROPN      Mary
1    sold       VERB       sell
2    the        DET        the
3    car        NOUN       car
4    to         ADP        to
5    John       PROPN      John
6    .          PUNCT      .
"""

For simpler scenarios, you can use the WhitespaceTokenizer wrapper, which just splits the text on whitespace:

from ipa import WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))

"""
0    Mary
1    sold
2    the
3    car
4    to
5    John
6    .
"""

Features

Complete preprocessing pipeline

SpacyTokenizer and StanzaTokenizer provide a unified API for both libraries, exposing most of their features: tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can enable or disable each of these with the return_pos_tags, return_lemmas, and return_deps flags. For example,

StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)

will return a list of Token objects with the pos and lemma fields filled,

while

StanzaTokenizer(language="en")

will return a list of Token objects, with only the text field filled.
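
As a quick check of the difference, a flag-free tokenizer yields tokens whose only populated field is text (a sketch):

from ipa import StanzaTokenizer

# No flags set: each Token carries only its surface form.
stanza_tokenizer = StanzaTokenizer(language="en")
for word in stanza_tokenizer("Mary sold the car to John."):
    print(word.text)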

GPU support

With use_gpu=True, the library will use the GPU if it is available. To set up the environment for the GPU, refer to the Stanza documentation and the spaCy documentation.
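
For example (a sketch; it assumes a working GPU environment set up as described in the linked documentation):

from ipa import StanzaTokenizer

# use_gpu=True runs the pipeline on the GPU if one is available.
stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, use_gpu=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")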

API

Tokenizers

SpacyTokenizer

class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
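
A usage sketch with split_on_spaces; judging by the name, the flag presumably makes the wrapper keep the whitespace-delimited tokens as given instead of re-tokenizing (an assumption, not confirmed above):

from ipa import SpacyTokenizer

# Assumed behavior: treat the input as pre-tokenized, splitting only on
# spaces, while still running the requested tagging.
spacy_tokenizer = SpacyTokenizer(language="en", split_on_spaces=True, return_pos_tags=True)
tokenized = spacy_tokenizer("Mary sold the car to John .")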

StanzaTokenizer

class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):

WhitespaceTokenizer

class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):

Sentence Splitter

SpacySentenceSplitter

class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):
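
A usage sketch, assuming the splitter is callable on raw text like the tokenizers and returns the split sentences (the exact return type is an assumption):

from ipa import SpacySentenceSplitter

# model_type selects the segmentation strategy; "statistical" is the
# default shown in the signature.
sentence_splitter = SpacySentenceSplitter(language="en", model_type="statistical")
for sentence in sentence_splitter("Mary sold the car to John. John paid in cash."):
    print(sentence)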

