NLPiper, a lightweight package integrated with a universe of frameworks to pre-process documents.

Project description

NLPiper is a package that brings together different NLP tools and applies their transformations to a target document.

Goal

Lightweight package integrated with a universe of frameworks to pre-process documents.


Installation

You can install NLPiper from PyPI with pip or your favorite package manager:

pip install nlpiper

Optional Dependencies

Some transformations require the installation of additional packages. The following table explains the optional dependencies that can be installed:

Package      Description
bs4          Used in CleanMarkup to remove HTML and XML from the document.
gensim       Used in GensimEmbeddings for document embedding extraction.
hunspell     Used in Stemmer and SpellCheck to normalize the document.
nltk         Used in RemoveStopWords to remove stop words from the document.
numpy        Used in some document transformations.
sacremoses   Used in MosesTokenizer to tokenize the document.
spacy        Used in SpacyTokenizer to tokenize the document; can also be used to extract entities, tags, etc.
stanza       Used in StanzaTokenizer to tokenize the document; can also be used to extract entities, tags, etc.
torchtext    Used in TorchTextEmbeddings for document embedding extraction.

To install the optional dependency needed for your purpose you can run:

pip install nlpiper[<package>]

You can install all of these dependencies at once with:

pip install nlpiper[all]
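
Before building a pipeline it can help to check which optional dependencies are importable in the current environment. The following is a minimal sketch using only the standard library; `OPTIONAL_PACKAGES` and `available_optional_packages` are illustrative names and are not part of NLPiper. It assumes each package's import name matches the name listed in the table above.

```python
from importlib.util import find_spec

# Import names taken from the optional-dependency table above (assumed to
# match the names used in `import` statements).
OPTIONAL_PACKAGES = (
    "bs4", "gensim", "hunspell", "nltk", "numpy",
    "sacremoses", "spacy", "stanza", "torchtext",
)


def available_optional_packages(names=OPTIONAL_PACKAGES):
    """Map each top-level package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, installed in available_optional_packages().items():
        print(f"{name:<12}{'installed' if installed else 'missing'}")
```

Any package reported as missing can then be installed with the matching extra, as shown above.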


Usage

Define a Pipeline:

>>> from nlpiper.core import Compose
>>> from nlpiper.transformers import cleaners, normalizers, tokenizers
>>> pipeline = Compose([
...                    cleaners.CleanNumber(),
...                    tokenizers.BasicTokenizer(),
...                    normalizers.CaseTokens()
... ])
>>> pipeline
Compose([CleanNumber(), BasicTokenizer(), CaseTokens(mode='lower')])

Create a Document and inspect its structure:

>>> from nlpiper.core import Document
>>> doc = Document("The following character is a number: 1 and the next one is not a.")
>>> doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number: 1 and the next one is not a.',
    tokens=None,
    embedded=None,
    steps=[]
)

Apply Pipeline to a Document:

>>> doc = pipeline(doc)
>>> doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number:  and the next one is not a.',
    tokens=[
        Token(original='The', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='following', cleaned='following', lemma=None, stem=None, embedded=None),
        Token(original='character', cleaned='character', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='a', cleaned='a', lemma=None, stem=None, embedded=None),
        Token(original='number:', cleaned='number:', lemma=None, stem=None, embedded=None),
        Token(original='and', cleaned='and', lemma=None, stem=None, embedded=None),
        Token(original='the', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='next', cleaned='next', lemma=None, stem=None, embedded=None),
        Token(original='one', cleaned='one', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='not', cleaned='not', lemma=None, stem=None, embedded=None),
        Token(original='a.', cleaned='a.', lemma=None, stem=None, embedded=None)
    ],
    embedded=None,
    steps=['CleanNumber()', 'BasicTokenizer()', "CaseTokens(mode='lower')"]
)

Available Transformers

Cleaners

Clean the document as a whole, e.g. remove HTML, remove accents, remove emails, etc.

  • CleanURL: remove URL from the text.
  • CleanEmail: remove email from the text.
  • CleanNumber: remove numbers from text.
  • CleanPunctuation: remove punctuation from text.
  • CleanEOF: remove end of file from text.
  • CleanMarkup: remove HTML or XML from text.
  • CleanAccents: remove accents from the text.
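
To make the idea of a document-level cleaner concrete, here is a minimal standard-library sketch of what number and accent removal could look like. The functions `clean_number` and `clean_accents` are hypothetical helpers written for illustration; they are not NLPiper's implementation, which may handle more cases.

```python
import re
import unicodedata


def clean_number(text: str) -> str:
    """Remove every run of digits from the text (illustrative only)."""
    return re.sub(r"\d+", "", text)


def clean_accents(text: str) -> str:
    """Strip combining accent marks after Unicode decomposition (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(clean_number("The following character is a number: 1 and the next one is not a."))
# The following character is a number:  and the next one is not a.
print(clean_accents("Tomás Osório"))
# Tomas Osorio
```

Note that removing the digit leaves the surrounding spaces in place, which is why the cleaned text in the Usage example contains a double space.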

Tokenizers

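Tokenize a document after cleaning is done, splitting it into tokens. The tokenizers named in this document are BasicTokenizer, MosesTokenizer, SpacyTokenizer and StanzaTokenizer.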

Normalizers

Applied at the token level, e.g. remove stop-words, spell-check, etc.

  • CaseTokens: lower or upper case all tokens.
  • RemovePunctuation: Remove punctuation from resulting tokens.
  • RemoveStopWords: Remove stop-words as tokens.
  • VocabularyFilter: Only allow tokens from a pre-defined vocabulary.
  • Stemmer: Get the stem from the tokens.
  • SpellCheck: Spell check the token. If a maximum distance is given, the Levenshtein distance between the token and the suggested word is calculated; if it is lower than the maximum, the token is replaced by the suggestion, otherwise the token is kept. If no maximum distance is given, a misspelt token is replaced by an empty string.
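
The SpellCheck replacement rule can be sketched in plain Python as follows. This is an illustration of the rule as described above, not NLPiper's code: `levenshtein` and `spell_check` are hypothetical helpers, the suggestion is passed in by hand rather than obtained from hunspell, and accepting a distance equal to the maximum (`<=`) is an assumption, since the description does not say whether the comparison is strict.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def spell_check(token, suggestion, max_distance=None):
    """Apply the replacement rule; `suggestion` is None when the token is spelt correctly."""
    if suggestion is None:
        return token                      # nothing to correct
    if max_distance is None:
        return ""                         # misspelt and no distance given
    if levenshtein(token, suggestion) <= max_distance:  # assumption: inclusive bound
        return suggestion
    return token                          # suggestion too far away, keep the token


print(spell_check("helo", "hello", max_distance=1))  # hello
print(spell_check("helo", "hello", max_distance=0))  # helo
print(repr(spell_check("helo", "hello")))            # ''
```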

Embeddings

Applied at the token level, converting words into embeddings.

  • GensimEmbeddings: Use Gensim word embeddings.
  • TorchTextEmbeddings: Apply word embeddings using the torchtext models GloVe, CharNGram and FastText.

Document

Document is a dataclass that contains all the information used during text preprocessing.

Document attributes:

  • original: original text to be processed.
  • cleaned: set to the original text when the document is created; this is the attribute on which Cleaners and Tokenizers operate.
  • tokens: list of tokens that were obtained using a Tokenizer.
  • steps: list of transforms applied on the document.
  • embedded: document embedding.

Token attributes:

  • original: original token.
  • cleaned: set to the original token at creation, then modified by the Normalizers.
  • lemma: token lemma (need to use a normalizer or tokenizer to obtain).
  • stem: token stem (need to use a normalizer to obtain).
  • ner: token entity (need to use a normalizer or tokenizer to obtain).
  • embedded: token embedding.
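
The attribute lists above can be pictured as two small dataclasses. The sketch below is an illustration only: the class names `TokenSketch` and `DocumentSketch`, the type annotations and the default values are assumptions, not NLPiper's actual definitions. It shows the one behaviour the lists describe, namely that `cleaned` starts as a copy of `original`.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class TokenSketch:
    original: str
    cleaned: Optional[str] = None    # filled from `original` on creation
    lemma: Optional[str] = None
    stem: Optional[str] = None
    ner: Optional[str] = None
    embedded: Any = None             # embedding type is left open here

    def __post_init__(self):
        if self.cleaned is None:
            self.cleaned = self.original


@dataclass
class DocumentSketch:
    original: str
    cleaned: Optional[str] = None    # filled from `original` on creation
    tokens: Optional[List[TokenSketch]] = None
    embedded: Any = None
    steps: List[str] = field(default_factory=list)

    def __post_init__(self):
        if self.cleaned is None:
            self.cleaned = self.original


doc = DocumentSketch("A number: 1")
print(doc.cleaned)   # A number: 1
print(doc.tokens)    # None
print(doc.steps)     # []
```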

Compose

Compose applies the chosen transformers to a given document. It restricts the order in which the transformers can be applied: first the Cleaners, then the Tokenizers, and lastly the Normalizers and Embeddings.
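
One way to picture this ordering restriction is as a check that the stage of each transformer never goes backwards. The sketch below is a hypothetical illustration, not how Compose is implemented: `Stage` and `validate_order` are invented names, Normalizers and Embeddings are treated as a single final stage, and repeated stages are allowed, both of which are assumptions.

```python
from enum import IntEnum


class Stage(IntEnum):
    CLEANER = 0
    TOKENIZER = 1
    NORMALIZER_OR_EMBEDDING = 2


def validate_order(stages):
    """Raise ValueError if any stage is followed by an earlier stage."""
    for earlier, later in zip(stages, stages[1:]):
        if later < earlier:
            raise ValueError(f"{later.name} cannot follow {earlier.name}")


# Accepted: cleaner, then tokenizer, then normalizer.
validate_order([Stage.CLEANER, Stage.TOKENIZER, Stage.NORMALIZER_OR_EMBEDDING])

# Rejected: a cleaner placed after a tokenizer.
try:
    validate_order([Stage.TOKENIZER, Stage.CLEANER])
except ValueError as error:
    print(error)  # CLEANER cannot follow TOKENIZER
```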

It is possible to create a Compose from the steps of a processed document:

>>> doc.steps
['CleanNumber()', 'BasicTokenizer()', "CaseTokens(mode='lower')"]
>>> new_pipeline = Compose.create_from_steps(doc.steps)
>>> new_pipeline
Compose([CleanNumber(), BasicTokenizer(), CaseTokens(mode='lower')])

It is also possible to roll back the steps applied to a document. The original document is left unchanged:

>>> new_doc = Compose.rollback_document(doc, 2)
>>> new_doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number:  and the next one is not a.',
    tokens=None,
    embedded=None,
    steps=['CleanNumber()']
)
>>> doc
Document(
    original='The following character is a number: 1 and the next one is not a.',
    cleaned='The following character is a number:  and the next one is not a.',
    tokens=[
        Token(original='The', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='following', cleaned='following', lemma=None, stem=None, embedded=None),
        Token(original='character', cleaned='character', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='a', cleaned='a', lemma=None, stem=None, embedded=None),
        Token(original='number:', cleaned='number:', lemma=None, stem=None, embedded=None),
        Token(original='and', cleaned='and', lemma=None, stem=None, embedded=None),
        Token(original='the', cleaned='the', lemma=None, stem=None, embedded=None),
        Token(original='next', cleaned='next', lemma=None, stem=None, embedded=None),
        Token(original='one', cleaned='one', lemma=None, stem=None, embedded=None),
        Token(original='is', cleaned='is', lemma=None, stem=None, embedded=None),
        Token(original='not', cleaned='not', lemma=None, stem=None, embedded=None),
        Token(original='a.', cleaned='a.', lemma=None, stem=None, embedded=None)
    ],
    embedded=None,
    steps=['CleanNumber()', 'BasicTokenizer()', "CaseTokens(mode='lower')"]
)

Development Installation

git clone https://github.com/dlite-tools/NLPiper.git
cd NLPiper
poetry install

To install an optional dependency you can run:

poetry install --extras <package>

To install all the optional dependencies run:

poetry install --extras all

Contributions

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview on how to contribute can be found in the contributing guide on GitHub.


Issues

Feature requests and bug reports can be submitted through the project's issue tracker on GitHub.


License and Credits

NLPiper is licensed under the MIT license and is written and maintained by Tomás Osório (@tomassosorio), Daniel Ferrari (@FerrariDG), Carlos Alves (@cmalves) and João Cunha (@jfecunha).
