Simplify NLP pre-processing.

cbc-nlp : The "consileon NLP framework"

Installation

  • Install via: py -m pip install --index-url https://test.pypi.org/simple/ --no-deps cbc-nlp or by using the requirements.txt file
  • Install the relevant spaCy model via: python -m spacy download [model]. For further details, see the spaCy website
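spaCy models are installed as regular Python packages, so one way to confirm that the download step worked is to check whether the model package can be found. The following is a minimal sketch using only the standard library; the helper name model_is_installed is hypothetical, and en_core_web_sm is only an example model name, so substitute the model you actually downloaded.

```python
import importlib.util


def model_is_installed(model_name: str) -> bool:
    """Return True if a package with this name can be found by the import system."""
    return importlib.util.find_spec(model_name) is not None


# "en_core_web_sm" is an example model name, not one prescribed by cbc-nlp.
print(model_is_installed("en_core_web_sm"))
```

If this prints False, the model is most likely missing from the active environment and the download command needs to be run again there.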

Why Consileon NLP Framework?

NLP models are developed from text sources that contain (long) sequences of text. A major part of the development is pre-processing the input data: most of the effort and time goes into transforming text into other objects (such as lists of tokens) that NLP algorithms can handle. This is where Consileon’s NLP Framework comes into play.

Consileon NLP Framework contains packages that simplify the development of NLP models through modularization and encapsulation of frequent pre-processing tasks. That way, you avoid repeating yourself or ending up with a pile of unstructured sample code that you might not understand or be able to explain later on. Focus on your concept and leave the implementation to us.

Features:

Consileon NLP Framework covers the common preprocessing tasks you need to develop your own NLP model:

  • Split texts into smaller chunks (sentences, paragraphs)
  • Split chunks of text into tokens (e.g. single words)
  • Bring tokens into a canonical form (lower-casing)
  • Filter out unwanted tokens and remove stop words
  • "Lemmatization": map words to their base/dictionary form (important also for many non-English languages)
  • Apply other kinds of mappings to tokens
  • Remove "garbage", i.e. artifacts which are contained in the source but don’t add meaning to the use case at hand (e.g. remove tables of numbers from texts when spoken language is required)
  • Append tags to tokens (e.g. specify the source or some semantic information)
  • Choose subsets of the input sequence for development (or other) reasons
  • Merge several data sources

and many more.

All these transformation steps can be pipelined in a few lines of code and fed into NLP algorithms to generate your NLP model.
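To illustrate the idea of chaining such steps, here is a minimal sketch in plain Python. It does not use the cbc-nlp API: the function names, the pipeline helper, and the tiny stop-word list are all hypothetical and exist only to show how sentence splitting, tokenization, lower-casing, and stop-word removal can be composed as a sequence of generator steps.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}


def split_sentences(texts: Iterable[str]) -> Iterator[str]:
    """Split each text into sentences after '.', '!' or '?'."""
    for text in texts:
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                yield sentence


def tokenize(sentences: Iterable[str]) -> Iterator[List[str]]:
    """Turn each sentence into a list of word tokens."""
    for sentence in sentences:
        yield re.findall(r"\w+", sentence)


def lowercase(token_lists: Iterable[List[str]]) -> Iterator[List[str]]:
    """Bring tokens into a canonical (lower-case) form."""
    for tokens in token_lists:
        yield [token.lower() for token in tokens]


def remove_stop_words(token_lists: Iterable[List[str]]) -> Iterator[List[str]]:
    """Drop tokens that appear in the stop-word list."""
    for tokens in token_lists:
        yield [token for token in tokens if token not in STOP_WORDS]


def pipeline(source: Iterable, *steps: Callable[[Iterable], Iterator]) -> Iterable:
    """Feed the source through each step in order; evaluation stays lazy."""
    stream = source
    for step in steps:
        stream = step(stream)
    return stream


texts = ["The cat sat. A dog barked!"]
result = list(pipeline(texts, split_sentences, tokenize, lowercase, remove_stop_words))
print(result)  # [['cat', 'sat'], ['dog', 'barked']]
```

Because every step is a generator, long input sequences are processed item by item rather than being loaded into memory at once, and steps can be added, removed, or reordered without touching the others.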

Getting started:

The following tutorial walks you through developing your own NLP model using Consileon’s NLP Framework:
See getting_startet.ipynb

License

cbc-nlp is licensed under Apache 2.0 as described in the LICENSE file.


Developer Notes

Set-up

Create a virtual environment

py -3 -m venv .venv
.venv\scripts\activate

Now install the package i) as an editable install (so code changes take effect without a re-install) and ii) with the dev option (to have access to dev requirements such as pytest)

python -m pip install -e .[dev]

Distribution / Versioning

If necessary, update the version number in the pyproject.toml.

Next, upgrade the build tool and build the package into the dist\ folder

pip install --upgrade build
python -m build

Finally, upload to the distribution archive using twine. Note that for experimental changes you can upload to TestPyPI first, before uploading to PyPI.

pip install --upgrade twine
python -m twine upload --repository testpypi dist/*

When asked, set the username to "__token__" and the password to your API token.

If this doesn't work, pass the token directly in the CLI command

python -m twine upload --repository testpypi dist/* -u __token__ -p YOUR_RESPECTIVE_TOKEN

requirements.txt files

For development purposes, there is also a set of requirements.txt files; the dev-requirements.txt file includes additional packages such as pytest.

The requirements.txt file is maintained and updated via pip-compile using the following command

pip-compile --no-annotate --output-file=requirements.txt pyproject.toml

To update the dev-requirements.txt, use

pip-compile --no-annotate --extra dev --output-file=dev-requirements.txt pyproject.toml
