Simplify NLP pre-processing.
Project description
cbc-nlp: The "Consileon NLP Framework"
Installation
- Install via:
py -m pip install --index-url https://test.pypi.org/simple/ --no-deps cbc-nlp
or by using the requirements.txt.
- Install the relevant spaCy model through:
python -m spacy download [model]
For further details, see the spaCy website.
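spaCy models are installed as regular Python packages, so a quick way to confirm that the download step worked is to check whether the package can be found. The following is a minimal sketch and not part of cbc-nlp; the helper name `is_model_installed` is hypothetical, and `en_core_web_sm` only stands in for whichever model you chose.

```python
import importlib.util


def is_model_installed(model_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(model_name) is not None


# "en_core_web_sm" is only an example; substitute the model you need.
if not is_model_installed("en_core_web_sm"):
    print("Model missing; run: python -m spacy download en_core_web_sm")
```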
Why Consileon NLP Framework?
NLP models are developed from text sources that contain (long) sequences of text. A major part of the development is the pre-processing of input data: most of the effort and time is spent on transforming text into other objects (e.g. lists of tokens) so that NLP algorithms can handle them. This is where Consileon’s NLP Framework comes into play.
Consileon NLP Framework contains packages that simplify the development of NLP models through modularization and encapsulation of frequent pre-processing tasks. That way, you avoid repeating yourself or ending up with a mass of unstructured sample code that you might not understand or be able to explain later on. Focus on your concept and leave the implementation to us.
Features:
Consileon NLP Framework offers the pre-processing tasks you need to develop your own NLP model:
- Split texts into smaller chunks (sentences, paragraphs)
- Split chunks of text into tokens (e.g. single words)
- Bring tokens into a canonical form (lower-casing)
- Filter out unwanted tokens and remove stop words
- "Lemmatization": map words to their base/dictionary form (important also for many non-English languages)
- Perform (other kinds of) mappings to tokens
- Remove "garbage", i.e. artifacts which are contained in the source but don’t add meaning to the use case at hand (e.g. remove tables of numbers from texts when spoken language is required)
- Append tags to tokens (e.g. specify the source or some semantic information)
- Choose subsets of the input sequence for development (or other) reasons
- Merge several data sources.
and many more.
All these transformation steps can be pipelined in a few lines of code and fed into NLP algorithms to generate your NLP model.
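To illustrate the idea of pipelining such steps, here is a minimal sketch in plain Python. It does not use the cbc-nlp API: every function name, the `pipeline` helper, and the tiny stop-word list are hypothetical and chosen for illustration only. Each step is a generator that consumes the output of the previous one.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}


def split_sentences(texts: Iterable[str]) -> Iterator[str]:
    """Split each text into sentences after '.', '!' or '?'."""
    for text in texts:
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                yield sentence


def tokenize(sentences: Iterable[str]) -> Iterator[List[str]]:
    """Split each sentence into word tokens."""
    for sentence in sentences:
        yield re.findall(r"\w+", sentence)


def lowercase(token_lists: Iterable[List[str]]) -> Iterator[List[str]]:
    """Bring tokens into a canonical lower-case form."""
    for tokens in token_lists:
        yield [token.lower() for token in tokens]


def remove_stop_words(token_lists: Iterable[List[str]]) -> Iterator[List[str]]:
    """Drop tokens that appear in the stop-word list."""
    for tokens in token_lists:
        yield [token for token in tokens if token not in STOP_WORDS]


def pipeline(source: Iterable, *steps: Callable) -> Iterable:
    """Chain the steps so that each one consumes the previous one's output."""
    for step in steps:
        source = step(source)
    return source


texts = ["The cat sat. The dog barked!"]
result = list(
    pipeline(texts, split_sentences, tokenize, lowercase, remove_stop_words)
)
print(result)
```

Because every step is lazy, long text sources are processed item by item rather than loaded into memory at once.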
Getting started:
The following tutorial walks you through developing your own NLP model using Consileon’s NLP Framework:
See getting_startet.ipynb
License
cbc-nlp is licensed under Apache 2.0 as described in the LICENSE file.
Developer Notes
Set-up
Create a virtual environment
py -3 -m venv .venv
.venv\scripts\activate
Now install the package i) as an editable install (so code changes take effect without a re-install) and ii) with the dev option (to have access to dev requirements such as pytest):
python -m pip install -e .[dev]
Distribution/Versioning
If necessary, update the version number in the pyproject.toml.
Next, upgrade the build tool and build the package into the dist\ folder:
pip install --upgrade build
python -m build
Finally, upload to the distribution archive using twine. Note that for experimental changes you can upload to TestPyPI first, before uploading to PyPI.
pip install --upgrade twine
python -m twine upload --repository testpypi dist/*
When asked, set the username to "__token__" and the password to the respective token.
If this doesn't work, pass the token directly in the CLI command:
python -m twine upload --repository testpypi dist/* -u __token__ -p YOUR_RESPECTIVE_TOKEN
requirements.txt file
For development purposes, there is also a set of requirements.txt files; the dev-requirements.txt file additionally includes packages such as pytest.
Generally, the requirements.txt files are maintained and updated via pip-compile using the following command:
pip-compile --no-annotate --output-file=requirements.txt pyproject.toml
To update the dev-requirements.txt, use:
pip-compile --no-annotate --extra dev --output-file=dev-requirements.txt pyproject.toml
File details
Details for the file cbc-nlp-0.0.1.tar.gz.
File metadata
- Download URL: cbc-nlp-0.0.1.tar.gz
- Upload date:
- Size: 163.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 2c8590f6da876dbc373fc2a9b9fd199b091dff6c04209701d50f754866d74001
MD5 | 5eaae723348ebbbd57232cb455561fab
BLAKE2b-256 | f62656f27b4afe37670767ab7934d7b742b8dc435e72fa77596bf7d2e01f42e0
File details
Details for the file cbc_nlp-0.0.1-py3-none-any.whl.
File metadata
- Download URL: cbc_nlp-0.0.1-py3-none-any.whl
- Upload date:
- Size: 165.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 86994d3290fc7af2d6584a3d781bc02c6581d7a3cb6033a84b0e159565d07d8f
MD5 | e4ded56b75e8d002df6d81ebbffc86c2
BLAKE2b-256 | 3dcad6af16d19630970a559f812d4128149d8b3edaadaafa079e0cab27b7b7f7
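If you download one of the files manually, you may want to compare its SHA256 digest with the value listed above. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_published` are hypothetical, and the file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published(Path("cbc-nlp-0.0.1.tar.gz"), "2c8590f6...")` with the full digest from the table should return True for an unmodified download.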