Asyncio support for Stanford CoreNLP

aiocorenlp

High-fidelity, asyncio-capable Stanford CoreNLP library.

Heavily based on ner and nltk.

Rationale and differences from nltk

For every tag operation (in other words, every call to StanfordTagger.tag*), nltk runs a Stanford JAR (stanford-ner.jar/stanford-postagger.jar) in a newly spawned Java subprocess. In order to pass the input text to these JARs, nltk first writes it to a tempfile and includes its path in the Java command line using the -textFile flag.
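The per-call pattern described above can be sketched roughly as follows. This is an illustration, not nltk's actual code: the Python interpreter stands in for the Java process, and `FAKE_TAGGER` and `tag_once` are hypothetical names. Only the `-textFile` flag comes from the description above.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the Java tagger (an assumption for this sketch): a tiny script
# that reads the file named after -textFile and prints each token with a
# dummy "O" tag.
FAKE_TAGGER = (
    "import sys\n"
    "path = sys.argv[sys.argv.index('-textFile') + 1]\n"
    "with open(path, encoding='utf-8') as f:\n"
    "    print(' '.join(tok + '/O' for tok in f.read().split()))\n"
)


def tag_once(text):
    """Tag `text` by writing it to a temp file and spawning a fresh process."""
    fd, path = tempfile.mkstemp(text=True)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # One new interpreter per call, mirroring one new JVM per tag call.
        result = subprocess.run(
            [sys.executable, "-c", FAKE_TAGGER, "-textFile", path],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    finally:
        os.remove(path)  # mkstemp never deletes the file on its own


if __name__ == "__main__":
    print(tag_once("Bill Gates"))
```

Every call pays for one temporary file and one process start, which is the cost the rest of this section is concerned with.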

This method works well in sequential applications. Once it is scaled up with concurrency and put under stress, however, problems begin to arise:

  1. Python's tempfile.mkstemp doesn't work very well on Windows to begin with, and it starts to break down under stress.
    • Calls to tempfile.mkstemp start to fail, which in turn causes the Stanford code to fail (there is no input file to read).
    • Temporary files get leaked, which increases disk usage.
  2. Repeated calls to subprocess mean:
    • Multiple Java processes run in parallel, which increases CPU and memory usage.
    • OS-level subprocess and Java startup code has to run every time, which adds further CPU overhead.

All this causes unnecessary slowdowns and poor reliability in user-written code.

Patching nltk's code to use tempfile.TemporaryDirectory instead of tempfile.mkstemp seemed to resolve issue 1 but issue 2 would require more work.
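The difference between the two tempfile APIs can be illustrated with a minimal sketch. It is not the patch mentioned above; the function names are hypothetical and only the standard-library calls are real.

```python
import os
import tempfile


def write_with_mkstemp(text):
    """mkstemp returns an open descriptor and a path; cleanup is manual."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path  # the caller must call os.remove(path), or the file leaks


def write_in_temp_dir(text):
    """TemporaryDirectory removes the directory and its contents on exit."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "input.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        existed_inside = os.path.exists(path)
    # By this point the directory and the file inside it are gone.
    return path, existed_inside
```

With mkstemp, a forgotten os.remove or an exception between creation and removal leaves the file behind. The context manager ties cleanup to leaving the block.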

This library runs the Stanford code in a server mode and sends input text over TCP, meaning:

  1. Filesystem operations and temporary files/directories are avoided entirely.
  2. There's no need to run a Java subprocess more than once.
  3. The only synchronization bottleneck is offloaded to Java's ServerSocket class, which the Stanford code uses.
  4. CPU, memory and disk usage are greatly reduced.
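The idea of one long-running server answering many requests over TCP can be sketched with asyncio streams. This is not this library's implementation. The stand-in server below only imitates a tagger, and the one-line-per-connection exchange is an assumption made for the sketch. `handle_client`, `tag_over_tcp` and `demo` are hypothetical names.

```python
import asyncio


async def handle_client(reader, writer):
    """Stand-in tagger: read one line, reply with dummy slash tags, close."""
    line = (await reader.readline()).decode("utf-8")
    reply = " ".join(f"{token}/O" for token in line.split())
    writer.write((reply + "\n").encode("utf-8"))
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def tag_over_tcp(host, port, text):
    """Send one line of text and return everything the server writes back."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write((text + "\n").encode("utf-8"))
    await writer.drain()
    response = await reader.read()  # read until the server closes its side
    writer.close()
    await writer.wait_closed()
    return response.decode("utf-8").strip()


async def demo():
    # Port 0 asks the operating system for any free port.
    server = await asyncio.start_server(handle_client, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        # Both requests go to the same server; no file or new process is needed.
        return await asyncio.gather(
            tag_over_tcp("127.0.0.1", port, "Bill Gates"),
            tag_over_tcp("127.0.0.1", port, "I complained"),
        )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```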

Differences from ner

Basic Usage

>>> from aiocorenlp import ner_tag
>>> await ner_tag("I complained to Microsoft about Bill Gates.")
[('O', 'I'), ('O', 'complained'), ('O', 'to'), ('ORGANIZATION', 'Microsoft'), ('O', 'about'), ('PERSON', 'Bill'), ('PERSON', 'Gates.')]

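This usage doesn't require interfacing with the server and socket directly and is suitable for low-frequency or one-time tagging. The >>> session above assumes a REPL that allows top-level await, such as the one started by python -m asyncio; in a script, call ner_tag from inside a coroutine.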
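The sample output above lists (tag, token) pairs, tag first. A small post-processing sketch, assuming that shape, merges adjacent tokens that share a tag into phrases. `group_entities` is a hypothetical helper, not part of this library.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Merge runs of adjacent tokens sharing a tag into (tag, phrase) pairs."""
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[0]):
        if tag != outside:
            entities.append((tag, " ".join(token for _, token in run)))
    return entities


# The tagged list is copied from the sample output above.
tagged = [
    ("O", "I"), ("O", "complained"), ("O", "to"),
    ("ORGANIZATION", "Microsoft"), ("O", "about"),
    ("PERSON", "Bill"), ("PERSON", "Gates."),
]
print(group_entities(tagged))
# prints [('ORGANIZATION', 'Microsoft'), ('PERSON', 'Bill Gates.')]
```

The trailing period stays attached to "Gates." because that is how the token appears in the sample output. Two distinct entities that share a tag and sit next to each other would be merged by this simple approach.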

Advanced Usage

To take full advantage of this library's benefits, use the AsyncNerServer and AsyncPosServer classes:

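import asyncio

from aiocorenlp.async_ner_server import AsyncNerServer
from aiocorenlp.async_corenlp_socket import AsyncCorenlpSocket


async def main():
    # Start the server once; it is reused for every request below.
    server = AsyncNerServer()
    port = server.start()
    print(f"Server started on port {port}")

    socket: AsyncCorenlpSocket = server.get_socket()

    try:
        while True:
            text = input("> ")
            if text == "exit":
                break

            print(await socket.tag(text))
    finally:
        # Stop the server even if tagging raises.
        server.stop()


# await is only valid inside a coroutine, so run the loop through asyncio.
asyncio.run(main())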

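An async context manager is supported as well:

import asyncio

from aiocorenlp.async_ner_server import AsyncNerServer


async def main():
    # The context manager takes the place of the explicit start()/stop() calls.
    server: AsyncNerServer
    async with AsyncNerServer() as server:
        socket = server.get_socket()

        while True:
            text = input("> ")
            if text == "exit":
                break

            print(await socket.tag(text))


asyncio.run(main())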
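Since tagging is awaitable, several texts can be submitted concurrently. The sketch below shows one way to do that with asyncio.gather and a semaphore. `FakeSocket` is a hypothetical stand-in that only mimics the awaitable tag(text) call used above, so the example runs without Java. Whether a single real socket object may be shared across concurrent tasks is an assumption that this README does not confirm.

```python
import asyncio


class FakeSocket:
    """Hypothetical stand-in exposing the same awaitable tag(text) call."""

    async def tag(self, text):
        await asyncio.sleep(0.01)  # simulate the network round trip
        return [("O", token) for token in text.split()]


async def tag_many(socket, texts, limit=4):
    """Tag all texts concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def tag_one(text):
        async with semaphore:
            return await socket.tag(text)

    # gather returns results in the same order as the input texts.
    return await asyncio.gather(*(tag_one(text) for text in texts))


if __name__ == "__main__":
    print(asyncio.run(tag_many(FakeSocket(), ["Bill Gates", "Microsoft"])))
```

The semaphore caps the number of simultaneous requests so that a large batch does not open an unbounded number of connections at once.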

Configuration

As seen above, all classes and functions this library exposes may be used without arguments (default values).

Optionally, the following arguments may be passed to AsyncNerServer (and by extension ner_tag/pos_tag):

  • port: Server bind port. Leave as None for a random port.
  • model_path: Path to the language model. Leave as None to let nltk find the model (supports the STANFORD_MODELS environment variable).
  • jar_path: Path to stanford-*.jar. Leave as None to let nltk find the jar (supports the STANFORD_POSTAGGER environment variable, for NER as well).
  • output_format: Output format. See OutputFormat enum for values. Default is slashTags.
  • encoding: Output encoding.
  • java_options: Additional JVM options.

It is not possible to configure the server bind interface. This is a limitation imposed by the Stanford code.
