NLP Preprocessing Pipeline Wrappers
Project description
🍺IPA: import, preprocess, accelerate
How to use
Install
Install the library from PyPI:
pip install ipa-core
Usage
IPA is a Python library that provides preprocessing wrappers for Stanza and spaCy. The wrappers expose a unified API for both libraries, making them interchangeable.
Let's start with a simple example, using the SpacyTokenizer wrapper to preprocess a text:
from ipa import SpacyTokenizer
spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))
"""
0 Mary PROPN Mary
1 sold VERB sell
2 the DET the
3 car NOUN car
4 to ADP to
5 John PROPN John
6 . PUNCT .
"""
You can load any spaCy model either by its canonical name, like en_core_web_sm, or by a simple alias, like the en we used here. By default, the alias loads the smallest version of each model. For a complete list of available models, see the spaCy documentation.
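To make the alias behavior concrete, it can be pictured as a lookup from a language code to the small model's canonical name, falling back to the given name when it is already canonical. The mapping and function below are an illustrative sketch, not the library's actual code:

```python
# Hypothetical alias table: each two-letter language code maps to the
# canonical name of the *small* spaCy model for that language.
SMALL_MODEL_ALIASES = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "it": "it_core_news_sm",
}

def resolve_model_name(language: str) -> str:
    """Expand a short alias to a canonical model name; pass canonical names through."""
    return SMALL_MODEL_ALIASES.get(language, language)

print(resolve_model_name("en"))              # en_core_web_sm
print(resolve_model_name("en_core_web_sm"))  # en_core_web_sm
```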
In the very same way, you can load any model from Stanza using the StanzaTokenizer wrapper:
from ipa import StanzaTokenizer
stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))
"""
0 Mary PROPN Mary
1 sold VERB sell
2 the DET the
3 car NOUN car
4 to ADP to
5 John PROPN John
6 . PUNCT .
"""
For simpler scenarios, you can use the WhitespaceTokenizer wrapper, which just splits the text on whitespace:
from ipa import WhitespaceTokenizer
whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))
"""
0 Mary
1 sold
2 the
3 car
4 to
5 John
6 .
"""
Features
Complete preprocessing pipeline
SpacyTokenizer and StanzaTokenizer provide a unified API for both libraries, exposing most of their features: tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can activate or deactivate each of these with return_pos_tags, return_lemmas, and return_deps. So, for example,

StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)

will return a list of Token objects with the pos and lemma fields filled, while

StanzaTokenizer(language="en")

will return a list of Token objects with only the text field filled.
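This flag pattern can be pictured as a Token whose optional fields default to None and are only populated when the corresponding return_* flag was set. The dataclass below is an illustrative sketch, not the library's actual Token class:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    index: int
    text: str
    pos: Optional[str] = None    # filled only when return_pos_tags=True
    lemma: Optional[str] = None  # filled only when return_lemmas=True

# With the flags off, only the index and text are available:
bare = Token(index=1, text="sold")
# With return_pos_tags=True and return_lemmas=True, the extra fields are set:
rich = Token(index=1, text="sold", pos="VERB", lemma="sell")
print(bare.pos, rich.pos)  # None VERB
```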
GPU support
With use_gpu=True, the library will use the GPU if one is available. To set up the environment for the GPU, refer to the Stanza documentation and the spaCy documentation.
API
Tokenizers
SpacyTokenizer
class SpacyTokenizer(BaseTokenizer):
def __init__(
self,
language: str = "en",
return_pos_tags: bool = False,
return_lemmas: bool = False,
return_deps: bool = False,
split_on_spaces: bool = False,
use_gpu: bool = False,
):
StanzaTokenizer
class StanzaTokenizer(BaseTokenizer):
def __init__(
self,
language: str = "en",
return_pos_tags: bool = False,
return_lemmas: bool = False,
return_deps: bool = False,
split_on_spaces: bool = False,
use_gpu: bool = False,
):
WhitespaceTokenizer
class WhitespaceTokenizer(BaseTokenizer):
def __init__(self):
Sentence Splitter
SpacySentenceSplitter
class SpacySentenceSplitter(BaseSentenceSplitter):
def __init__(self, language: str = "en", model_type: str = "statistical"):
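To show what a sentence splitter does, here is a tiny rule-based stand-in that splits after sentence-final punctuation. This is only a regex sketch of the task; the real SpacySentenceSplitter delegates to spaCy's statistical or rule-based models:

```python
import re
from typing import List

def naive_sentence_split(text: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace.

    Deliberately crude: it mis-splits abbreviations like 'Dr.',
    which is exactly why statistical splitters exist.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(naive_sentence_split("Mary sold the car to John. He was happy!"))
```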
File details
Details for the file ipa-core-0.1.3.tar.gz.
File metadata
- Download URL: ipa-core-0.1.3.tar.gz
- Upload date:
- Size: 14.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | a267dceb7ef5c91802735d1a40f09300d03256f5a74fe0d08fae6beba81ab4ae
MD5 | 8226aa5196a8c1db0e46bafbd889ce82
BLAKE2b-256 | 624d9926c1f3dabec4aff2ccedb869b5db867908eda64a37d625104252c295b4
File details
Details for the file ipa_core-0.1.3-py3-none-any.whl.
File metadata
- Download URL: ipa_core-0.1.3-py3-none-any.whl
- Upload date:
- Size: 16.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.3
File hashes
Algorithm | Hash digest
---|---
SHA256 | 690c3a8ca174ef79ce6df9d090a4097a1fef5bed812a42ae77f3f5b8b0523565
MD5 | 27358c68bce1365a3a75c0a5cdcb0b08
BLAKE2b-256 | 3ddbc45f50252666cd5ffe5a4b494806468179982d8267512ec0d65187d405ed