Python extension for the NLP++ text analysis engine

NLPPlus

NLP++ Textbook

First Textbook on the NLP++ Programming Language

The first textbook on NLP++ is now available worldwide from BPB Online. NLP++ can replace LLMs when used in agentic flows. As with any other programming language, the code must be written by a human, and this book facilitates that process. NLP++ is not a statistical system that needs training. It relies on the ingenuity of the programmer to create a program that parses text and extracts information in a deterministic way.

The NLPPlus Python Package

The NLPPlus Python package allows Python scripts to call text and NLP analyzers created using NLP++. The package uses the C++ libraries of the NLP Engine, which makes calls more efficient than using the NLP++ Python class that calls the command-line version of the NLP Engine, "nlp.exe".

The major advantage of NLPPlus over other NLP packages is that it is 100% rule-based and modifiable, and it allows any programmer without a linguistics background to create text analyzers 100% tailored to their needs.

Analyzers can be run in two modes: interpreted (the default, runs straight from the .nlp source) or compiled (analyzer code is compiled to a native shared library once and loaded at runtime). See Compiled Mode below for the cloud_compile() one-call build path.

Long-Term, Open-Source, Glass-Box Project

NLP++ allows any programmer to write text and NLP programs that can be shared by everyone. It represents the first universal programming language for text and NLP. As the community grows, the number of open-source solutions including dictionaries, knowledge bases, and analyzers will grow - all of which can be modified by any programmer using the NLP++ Language Extension for VSCode.

READ FIRST

It is important to understand that the NLPPlus package for Python is very different from ALL other NLP packages in a very important and practical way.

Current NLP python packages have the "intention" of being plug-and-play systems that perform natural language tasks without modification. The problem is that when these systems ultimately fail in critical situations, coders are left with no real way to fix these systems and they are quickly abandoned.

The underlying cause is that almost all of these packages rely on statistical methods such as machine learning or neural networks, or in the simpler cases, they rely on Regex. Statistical systems cannot logically be corrected, and Regex is extremely limited, unreadable, and impossible to maintain or extend. Plus, these systems offer little if any means to modify them, even though every NLP task is slightly different in important ways.

The NLPPlus Python Package is different from all other NLP Python packages. All its analyzers are 100% human-readable and modifiable code, which allows any non-NLP coder to become an NLP programmer using the NLP++ VSCode Language Extension, appropriately called "VisualText". The VisualText extension allows for the visualization of any NLP process. Coders can "see" the syntactic parse tree along each step of the process, see rule matches directly in the text, and print out the knowledge base at any point in the process. Plus, dictionaries and knowledge bases are human readable, unlike JSON files or databases.

NLPPlus comes with five starter analyzers: telephone numbers, links, emails, addresses, and a full English parser. And because NLP++ is a glass box, all analyzers can easily be modified by any coder.

If, for example, the telephone number analyzer is not working properly for your application, you can use the NLP++ VSCode extension to edit and test the NLP++ code, and then use the updated code instantly. Universities around the world are starting to use NLP++ to write human digital readers for many different applications.

Learn More About NLP++

Requirements

  • Python 3.10 or newer

Installation

The NLPPlus Python package is published on PyPI (pypi.org) and can be installed using pip:

pip install nlpplus

Installing By Downloading the Package Manually

You can find the installable "wheel" files under each release on the Releases page. Choose the correct file for your platform and Python version based on the filename. For instance, wheels for Python 3.12 have cp312 in the filename, together with macos for MacOS, win for Windows, or linux for Linux. These files can be installed with pip on the command line, for example:

pip install nlpplus-0.1.2-cp310-cp310-win_amd64.whl

For the most recent version you can also download them from the GitHub actions page. Click on the link at the top of the list of "workflow run results" under "Build and upload to PyPI". After scrolling to the bottom of the page, you should see a section marked "Artifacts". Click on the appropriate link for your platform:

  • For Linux: cibw-wheels-linux
  • For MacOS 11 and later: cibw-wheels-macos
  • For Windows 10 and later: cibw-wheels-windows

This will download a ZIP file containing installation files for each supported version of Python on your platform. The version number is shown in the filename, for instance, for Python 3.10 on Windows you will see a file with a name like nlpplus-0.1.dev1+g55d691d-cp310-cp310-win_amd64.whl - the cp310 means Python 3.10. For Python 3.12 it would be cp312, and so forth.
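The filename convention described above can be checked programmatically. The following is a minimal sketch that splits a wheel filename into its dash-separated parts; the helper names parse_wheel_filename and python_version_from_tag are hypothetical and are not part of NLPPlus.

```python
def parse_wheel_filename(filename: str) -> dict:
    """Split a wheel filename into its dash-separated naming components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def python_version_from_tag(python_tag: str) -> str:
    """Turn a CPython tag such as 'cp310' into a version string such as '3.10'."""
    digits = python_tag[2:]
    if not python_tag.startswith("cp") or not digits.isdigit() or len(digits) < 2:
        raise ValueError(f"not a CPython tag: {python_tag}")
    # The first digit is the major version, the rest is the minor version.
    return f"{digits[0]}.{digits[1:]}"


info = parse_wheel_filename("nlpplus-0.1.2-cp310-cp310-win_amd64.whl")
print(info["platform_tag"], python_version_from_tag(info["python_tag"]))
```

Comparing the result against the running interpreter (for example via sys.version_info) is one way to pick the right file from a downloaded ZIP before calling pip.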

For specific instructions on setting up Python on your platform please consult the Python documentation.

If your platform is not supported you can also compile it from source, which will require a working C++ compiler. See the platform specific instructions below for the requirements to build.

Why Use NLP++?

There are many reasons to consider using NLP++. Whether you want to write Regex-like rule patterns, modify 100% of the NLP code, or visualize the NLP analyzer in an intuitive way, NLP++ should be in every coder's and programmer's toolkit.

To put it simply, NLP++ turns any coder or programmer into an NLP engineer.

1000 Times Better than Regex

For matching patterns in text, NLP++ is a Regex killer. The rule matching system in NLP++ is human readable and is performed by calling rules in a sequence, making creating and debugging rule-based patterns a breeze. Along with

100% Modifiable

The main reason to use NLP++ is to engineer an NLP system for a specific task. Almost all extraction or understanding tasks in NLP require specific processing that is never included in "generic" systems. NLP++ allows for the creation or modification of any NLP++ system.

It must be emphasized that what separates NLPPlus from all the other NLP packages in Python is the fact that all parsers are 100% modifiable using the VSCode NLP++ Language Extension. Other NLP packages use regex patterns, which are impossible to modify, or use trained machine learning or neural network systems, which cannot be fixed when

VisualText Editor

Writing an NLP system from scratch is thought to be only for those in computational linguistics. But VisualText, NLP++, and the conceptual grammar change all that.

Taking full advantage of the familiar VSCode environment, the NLP++ language extension makes NLP a visual and logical process that is easy to understand.

Using the NLPPlus Python Package

Very basic usage, which runs the default parser for US English and returns parsing results as XML:

import NLPPlus
xml = NLPPlus.analyze("Hello world.")

This may be less useful than using a domain-specific analyzer. Several of these are included with the module:

  • address-parser: Extract addresses from text
  • emailaddress: Extract email addresses from text
  • links: Extract hyperlinks from text
  • telephone: Extract telephone numbers from text

In contrast to the default analyzer these do not return any text by default. You will have to use the extended API to get the parse tree or JSON output from them:

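import NLPPlus
# Name the analyzer explicitly; the extended API returns a results object
results = NLPPlus.engine.analyze("Reach me at hello@example.com", "emailaddress")
# results.output is the parsed output.json written by the analyzer
parsed_address = results.output["email_address"][0]
# results.final_tree is the final parse tree in the NLP++ tree format
parse_tree = results.final_tree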

NLPPlus Engine Functions

These are the current functions that come with the NLPPlus package.

set_analyzer_folder(analyzer_folder_path: str)

This is used to set the folder where your analyzers are located.

analyze(text: str, parser: str = "parse-en-us", develop: bool = False, compiled: bool = False): str

This calls one of the analyzers in the analyzer folder on the text. If the analyzer folder was not set, it will use the library analyzers that come with NLPPlus. If you are planning to modify the library analyzers, it is recommended that you use the function copy_library_analyzers to copy the analyzers to avoid having them overwritten when a new version of NLPPlus is installed.

If compiled=True, the engine loads the analyzer's compiled shared libraries (bin/run.<ext> and bin/kb.<ext>) instead of running interpreted from the .nlp source. See compile() and cloud_compile() below for producing those libraries.
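Before passing compiled=True, it can help to confirm that the libraries are actually staged. The sketch below only inspects the file system; the helper names and the platform-to-extension mapping (so on Linux, dylib on MacOS, dll on Windows) are assumptions for illustration and are not part of the NLPPlus API.

```python
import sys
from pathlib import Path

# Assumed mapping from sys.platform prefixes to shared-library extensions.
_LIB_EXT = {"linux": "so", "darwin": "dylib", "win32": "dll"}


def shared_lib_ext(platform: str = sys.platform) -> str:
    """Return the assumed shared-library extension for a platform string."""
    for prefix, ext in _LIB_EXT.items():
        if platform.startswith(prefix):
            return ext
    raise ValueError(f"unsupported platform: {platform}")


def missing_compiled_libs(analyzer_dir, kb_only: bool = False,
                          platform: str = sys.platform) -> list:
    """List the expected bin/run.<ext> and bin/kb.<ext> files that are absent."""
    ext = shared_lib_ext(platform)
    stems = ["kb"] if kb_only else ["run", "kb"]
    bin_dir = Path(analyzer_dir) / "bin"
    expected = [bin_dir / f"{stem}.{ext}" for stem in stems]
    return [path for path in expected if not path.is_file()]


# Hypothetical analyzer folder; an empty result means both files are present.
missing = missing_compiled_libs("analyzers/parse-en-us")
print("ready for compiled=True" if not missing else f"missing: {missing}")
```

A caller could fall back to interpreted mode whenever the returned list is not empty.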

The analyze function returns a results object that makes the analyzer output files easily accessible to Python (see NLPPlus Engine Results below).

compile(analyzer: str = "parse-en-us", develop: bool = False, kb_only: bool = False)

Generates C++ source files for the analyzer by running the engine in -COMPILE mode. The output lands under <analyzer>/run/*.cpp and <analyzer>/kb/*.cpp (or just <analyzer>/kb/*.cpp if kb_only=True). The generated files still need to be built into shared libraries before analyze(..., compiled=True) can load them — see cloud_compile() for the one-call end-to-end path.

cloud_compile(analyzer: str = "parse-en-us", dispatcher_url: Optional[str] = None, kb_only: bool = False, develop: bool = False, poll_interval: float = 2.0, timeout: float = 1800, skip_local_compile: bool = False)

End-to-end compile via the public nlp-compile-service cloud build: runs compile() to produce the C++ trees, tars them up, submits to a Cloudflare-Worker dispatcher, polls the GitHub-Actions runner build, downloads the resulting shared library and stages it into <analyzer>/bin/ as run.<ext> + runu.<ext> + kb.<ext> + kbu.<ext> (or just kb.<ext> + kbu.<ext> for kb_only=True). After it returns, analyze(..., compiled=True) will pick up the staged libraries.

dispatcher_url defaults to the same public Cloudflare-Worker the VSCode NLP++ extension uses; override per-call to point at a self-hosted deployment. timeout caps the wait for the runner build (default 30 minutes — GitHub-Actions Windows free-tier queues can stall 5-10 minutes before the build even starts).
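The interplay of poll_interval and timeout can be pictured with a generic polling loop. This is a sketch of the general pattern only, not the package's actual implementation; poll_until and its check callback are hypothetical names.

```python
import time
from typing import Callable, Optional


def poll_until(check: Callable[[], Optional[str]],
               poll_interval: float = 2.0,
               timeout: float = 1800.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> str:
    """Call check() every poll_interval seconds until it returns a value.

    Raises TimeoutError once `timeout` seconds have passed without a result.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        sleep(poll_interval)


# Simulated build that reports completion on the third status check.
attempts = []

def fake_status() -> Optional[str]:
    attempts.append(1)
    return "build finished" if len(attempts) >= 3 else None

print(poll_until(fake_status, poll_interval=0.01, timeout=5.0))
```

A shorter poll_interval notices a finished build sooner at the cost of more status requests, while timeout bounds the total wait regardless of how long the runner queue stalls.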

copy_library_analyzers(self, to_dir: str, overwrite: bool=True)

This function copies the NLPPlus library analyzers into a safe folder where they cannot be overwritten by newer versions of the NLPPlus package. This allows coders to edit and modify the analyzers to their liking. Remember to call set_analyzer_folder if you want to run your versions of these library analyzers using the NLPPlus package.

input_text(analyzer_name: str, file_name: str)

When developing or editing NLP++ analyzers and calling them from Python, it is convenient to test your Python code on the text you used to develop your analyzer in the NLP++ VisualText extension for VSCode. This function retrieves the text from a file in the analyzer's input directory for easy access while developing your Python code in conjunction with an NLP++ analyzer.

NLPPlus Engine Results

output

This returns a JSON object parsed from the output.json file produced by the analyzer. The analyzer has to purposely construct the output.json file for this to work.

output.json

The output file produced by the analyzer, returned as a string rather than a JSON object. This file must be explicitly created by the analyzer.
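The difference between the two fields is the difference between raw text and parsed data. The sketch below uses only the standard json module; the sample payload is hypothetical and merely mirrors the "email_address" key from the usage example earlier, and first_value is an illustrative helper rather than part of NLPPlus.

```python
import json

# Hypothetical contents of an output.json file written by an analyzer.
raw_output_json = '{"email_address": ["hello@example.com"]}'


def first_value(raw: str, key: str):
    """Parse output.json text and return the first entry under `key`, or None."""
    if not raw.strip():
        return None  # the analyzer did not write an output.json file
    data = json.loads(raw)
    values = data.get(key) or []
    return values[0] if values else None


print(first_value(raw_output_json, "email_address"))
```

Guarding against an empty string and a missing key matters because an analyzer only produces output.json when its NLP++ code writes it on purpose.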

final.tree

All analyzers output a final tree of the text that is being processed. This file is in the NLP++ tree format.

Compiled Mode

Analyzers normally run interpreted from their .nlp source. That is fine for development, but it is slower on large inputs, and because every run reads the current source, you can't ship a "frozen" version without bundling the sources. NLPPlus now supports compiled mode: generate native shared libraries from the analyzer's .nlp files once, then load them at analyze time. Source edits made after the build don't change the output until you re-compile.

The simplest path is one call to cloud_compile, which uses the public nlp-compile-service to build the right shared library for your platform:

import NLPPlus

# Generate run/*.cpp + kb/*.cpp, ship to the cloud builder, download
# the .so/.dylib/.dll, stage into <analyzer>/bin/.
NLPPlus.cloud_compile("parse-en-us")

# Now run with the compiled artifacts instead of the interpreter.
xml = NLPPlus.analyze("Hello world.", compiled=True)

The cloud build takes anywhere from ~1 minute (small analyzer, cache hit) up to ~10 minutes (parse-en-us, cold Windows runner queue). The first build for a given source hash is the slow one — subsequent builds against the same code hit the dispatcher's cache.

If you'd rather generate the C++ trees and build them yourself (e.g. air-gapped, custom toolchain), use compile() for the codegen step and run cmake against the engine's published compile-libs to produce the shared library, then stage the result as <analyzer>/bin/run.<ext> and <analyzer>/bin/kb.<ext>. See the nlp-compile-service emit-cmake.sh for the exact CMake invocation the cloud uses.

NLP++ Development

By default the NLPPlus module will create a temporary working directory with the default parser and the small set of analyzers mentioned above. If you are developing NLP++ code, you can also point it at an existing working folder using set_working_folder:

import NLPPlus
NLPPlus.set_working_folder("somewhere/else")

This working folder is expected to contain the directories analyzers and data. If you wish to initialize a new working folder with the default analyzers and data, you can pass initialize=True:

import NLPPlus
NLPPlus.set_working_folder("somewhere/else", initialize=True)
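Since the working folder is expected to contain the analyzers and data directories, a quick check before pointing NLPPlus at it can catch a wrong path early. This is a minimal sketch using only pathlib; missing_working_subdirs is a hypothetical helper, not part of the NLPPlus API.

```python
from pathlib import Path

# Directory names the working folder is expected to contain.
REQUIRED_SUBDIRS = ("analyzers", "data")


def missing_working_subdirs(folder) -> list:
    """Return the required subdirectory names that are absent from `folder`."""
    root = Path(folder)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


missing = missing_working_subdirs("somewhere/else")
if missing:
    print(f"working folder is missing: {', '.join(missing)}")
```

If anything is reported missing, passing initialize=True as shown above is the documented way to populate a new folder with the default analyzers and data.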

Module Development

This module is built using scikit-build-core and nanobind. To set up for development, make sure you have a C++ compiler that works, and clone the source with:

git clone --recurse-submodules https://github.com/VisualText/py-package-nlpengine.git

For development it is convenient to disable build isolation, so install the necessary build dependencies. We suggest doing this in a virtual environment:

cd py-package-nlpengine
python -m venv venv
. venv/bin/activate
pip install -r requirements-dev.txt

Linux Setup

On Linux, generally, you can simply install the ICU development libraries system-wide:

# On Ubuntu / Debian / etc
sudo apt install libicu-dev
# On CentOS / RHEL / etc
sudo yum install libicu-devel

Now you can build the module as an "editable" install, which will allow you to test changes as you make them:

pip install --no-build-isolation -ve .

MacOS and other Unix Setup

If you were not able to install ICU above (such as on MacOS), you have to use vcpkg:

git clone --depth 1 https://github.com/Microsoft/vcpkg.git
./vcpkg/bootstrap-vcpkg.sh

Additionally, on MacOS, you'll probably need a whole lot of other things to use vcpkg:

brew install autoconf-archive autoconf automake pkg-config

Now you can install with this somewhat more complicated command:

pip install --no-build-isolation \
    -C cmake.args=-DCMAKE_TOOLCHAIN_FILE=./nlp-engine/vcpkg/scripts/buildsystems/vcpkg.cmake \
    -ve .

Windows Setup

On Windows, everything is vastly more complicated for a number of reasons:

  • The ICU library on which NLP++ depends is built as DLLs, and these have to be included with the package
  • Python won't load arbitrary DLLs from the current directory, unlike the rest of Windows (this is a good thing)
  • Builds take 10x longer on Windows than on reasonable operating systems, so you will wait a long time to find out that the module you built actually doesn't work

For this reason "editable" installs (the -e option to pip install) do not work on Windows and can't be expected to work. Instead it is necessary to build a wheel file and "repair" it with delvewheel to package the DLLs correctly, then install that wheel.

If that sounds like too much trouble, then just install from PyPI or the wheel files as described above.

Testing

Verify that it works:

python -m unittest discover -s tests

Note that you might get undefined C++ symbols if you are using Python from miniconda on Linux. In this case, please use the system Python instead.

Making a release

For developer reference: the release process is managed using GitHub actions. To make a release from the main branch, make an annotated tag (with -m and -a, this is important) of the form vX.Y or vX.Y.Z (e.g. v0.1.3) and push the tag and the branch:

git tag -m 'Release 0.1.3' -a v0.1.3
git push --follow-tags
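A tag name can be checked against the vX.Y or vX.Y.Z form before pushing. The pattern below is a sketch derived from the description above; whether the release workflow matches tags with exactly this rule is an assumption, and is_release_tag is a hypothetical helper.

```python
import re

# "v", then two or three dot-separated numeric components.
RELEASE_TAG = re.compile(r"v\d+\.\d+(\.\d+)?")


def is_release_tag(tag: str) -> bool:
    """Return True if `tag` has the form vX.Y or vX.Y.Z."""
    return RELEASE_TAG.fullmatch(tag) is not None


for candidate in ("v0.1.3", "v0.1", "0.1.3", "v1"):
    print(candidate, is_release_tag(candidate))
```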

