NLP Preprocessing Pipeline Wrappers
How to use
Install
Install the library from PyPI:
```bash
pip install nlp-preprocessing-wrappers
```
Usage
NLP Preprocessing Wrappers is a Python library that provides a set of preprocessing wrappers for Stanza and spaCy. It exposes a unified API for both libraries, making them interchangeable.
Let's start with a simple example. Here we are using the `SpacyTokenizer` wrapper to preprocess a text:
```python
from nlp_preprocessing_wrappers import SpacyTokenizer

spacy_tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = spacy_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0     Mary       PROPN      Mary
1     sold       VERB       sell
2     the        DET        the
3     car        NOUN       car
4     to         ADP        to
5     John       PROPN      John
6     .          PUNCT      .
"""
```
You can load any model from spaCy, either by its canonical name, e.g. `en_core_web_sm`, or by a simple alias like `en`, as we did here. By default, the simpler alias loads the smaller version of each model. For a complete list of available models, see the spaCy documentation.
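For instance, to load the canonical model explicitly (a sketch, assuming the canonical name goes through the same `language` parameter that accepts the short alias):

```python
from nlp_preprocessing_wrappers import SpacyTokenizer

# Assumption: the canonical model name is accepted by the same
# `language` parameter that takes the short alias.
spacy_tokenizer = SpacyTokenizer(language="en_core_web_sm", return_pos_tags=True)
```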
In the very same way, you can load any model from Stanza using the `StanzaTokenizer` wrapper:
```python
from nlp_preprocessing_wrappers import StanzaTokenizer

stanza_tokenizer = StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)
tokenized = stanza_tokenizer("Mary sold the car to John.")
for word in tokenized:
    print("{:<5} {:<10} {:<10} {:<10}".format(word.index, word.text, word.pos, word.lemma))

"""
0     Mary       PROPN      Mary
1     sold       VERB       sell
2     the        DET        the
3     car        NOUN       car
4     to         ADP        to
5     John       PROPN      John
6     .          PUNCT      .
"""
```
For simpler scenarios, you can use the `WhitespaceTokenizer` wrapper, which just splits the text on whitespace:
```python
from nlp_preprocessing_wrappers import WhitespaceTokenizer

whitespace_tokenizer = WhitespaceTokenizer()
tokenized = whitespace_tokenizer("Mary sold the car to John .")
for word in tokenized:
    print("{:<5} {:<10}".format(word.index, word.text))

"""
0     Mary
1     sold
2     the
3     car
4     to
5     John
6     .
"""
```
Features
Complete preprocessing pipeline
`SpacyTokenizer` and `StanzaTokenizer` provide a unified API for both libraries, exposing most of their features: tokenization, part-of-speech tagging, lemmatization, and dependency parsing. You can activate and deactivate any of these with `return_pos_tags`, `return_lemmas`, and `return_deps`. So, for example, `StanzaTokenizer(language="en", return_pos_tags=True, return_lemmas=True)` returns a list of `Token` objects with the `pos` and `lemma` fields filled, while `StanzaTokenizer(language="en")` returns a list of `Token` objects with only the `text` field filled.
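As a further example, enabling the full pipeline including dependency parsing (all flags below are the documented constructor arguments):

```python
from nlp_preprocessing_wrappers import SpacyTokenizer

# Enable every optional annotation step on top of tokenization.
full_tokenizer = SpacyTokenizer(
    language="en",
    return_pos_tags=True,
    return_lemmas=True,
    return_deps=True,
)
tokenized = full_tokenizer("Mary sold the car to John.")
```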
GPU support
With `use_gpu=True`, the library will use the GPU if it is available. To set up the environment for the GPU, refer to the Stanza documentation and the spaCy documentation.
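A minimal sketch; per the note above, the wrapper falls back to the CPU when no GPU is available:

```python
from nlp_preprocessing_wrappers import StanzaTokenizer

# use_gpu=True uses the GPU when one is available, otherwise the CPU.
stanza_tokenizer = StanzaTokenizer(language="en", use_gpu=True)
```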
API
Tokenizers
SpacyTokenizer
```python
class SpacyTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```
StanzaTokenizer
```python
class StanzaTokenizer(BaseTokenizer):
    def __init__(
        self,
        language: str = "en",
        return_pos_tags: bool = False,
        return_lemmas: bool = False,
        return_deps: bool = False,
        split_on_spaces: bool = False,
        use_gpu: bool = False,
    ):
```
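Both tokenizers also accept `split_on_spaces`; presumably this skips the model's own tokenizer and splits the input on whitespace (as `WhitespaceTokenizer` does) while still running the requested annotation steps. A sketch under that assumption:

```python
from nlp_preprocessing_wrappers import SpacyTokenizer

# Assumption: split_on_spaces=True treats the input as already
# tokenized on whitespace, so tagging runs on those tokens as-is.
tokenizer = SpacyTokenizer(language="en", return_pos_tags=True, split_on_spaces=True)
tokenized = tokenizer("Mary sold the car to John .")
```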
WhitespaceTokenizer
```python
class WhitespaceTokenizer(BaseTokenizer):
    def __init__(self):
```
Sentence Splitter
SpacySentenceSplitter
```python
class SpacySentenceSplitter(BaseSentenceSplitter):
    def __init__(self, language: str = "en", model_type: str = "statistical"):
```
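The sentence splitter is not covered in the usage examples above; here is a minimal sketch, assuming it is called on a raw string like the tokenizers and yields the split sentences:

```python
from nlp_preprocessing_wrappers import SpacySentenceSplitter

splitter = SpacySentenceSplitter(language="en", model_type="statistical")
# Assumption: the splitter is callable like the tokenizers and
# returns an iterable of sentences.
for sentence in splitter("Mary sold the car to John. John paid in cash."):
    print(sentence)
```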
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution: `nlp_preprocessing_wrappers-0.1.3.tar.gz`
Built Distribution: `nlp_preprocessing_wrappers-0.1.3-py3-none-any.whl`
File details
Details for the file `nlp_preprocessing_wrappers-0.1.3.tar.gz`.
File metadata
- Download URL: nlp_preprocessing_wrappers-0.1.3.tar.gz
- Upload date:
- Size: 12.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/33.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | `2e5bdb01e3e1accb34c8efff72c84a60cc27063efee6251769742a55859f905b`
MD5 | `b78b1df712f1ae55d6ab00343710e584`
BLAKE2b-256 | `dec875756fcdf9fba4b06dbc2d6dacfe0fad012b9be5871168b58b38fbc84da3`
File details
Details for the file `nlp_preprocessing_wrappers-0.1.3-py3-none-any.whl`.
File metadata
- Download URL: nlp_preprocessing_wrappers-0.1.3-py3-none-any.whl
- Upload date:
- Size: 16.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/33.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.2 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | `005ae7039951e7a9935240dc81656b1335f543a18f7de31d27497048181d3ad2`
MD5 | `4aaee7efe7d3387ea529e27adff96c23`
BLAKE2b-256 | `b6751a17f5bbba2ef68b3be2285a617fbbc1b193afb6cefc25fa5b4424e51cef`
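To check a downloaded file against the digests above, a minimal sketch using only the Python standard library:

```python
import hashlib

# Compute the SHA256 digest of a downloaded distribution and compare it
# with the value published in the "File hashes" tables above.
with open("nlp_preprocessing_wrappers-0.1.3-py3-none-any.whl", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
```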