Asyncio support for Stanford CoreNLP

aiocorenlp

High-fidelity asyncio capable Stanford CoreNLP library.

Heavily based on ner and nltk.

Rationale and differences from nltk

For every tag operation (in other words, every call to StanfordTagger.tag*), nltk runs a Stanford JAR (stanford-ner.jar/stanford-postagger.jar) in a newly spawned Java subprocess. In order to pass the input text to these JARs, nltk first writes it to a tempfile and includes its path in the Java command line using the -textFile flag.
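The per-call flow described above can be sketched with the standard library alone. This is a simplified illustration of the pattern, not nltk's actual code; the jar, class, and model names are placeholders and the java command is built but not executed:

```python
import os
import tempfile

# Write the input text to a temporary file, as nltk does before each tag call.
text = "I complained to Microsoft about Bill Gates."
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(text)

# Build a java command line of the same general shape nltk uses
# (paths here are placeholders, not real files).
cmd = [
    "java",
    "-cp", "stanford-ner.jar",
    "edu.stanford.nlp.ie.crf.CRFClassifier",
    "-loadClassifier", "english.all.3class.distsim.crf.ser.gz",
    "-textFile", path,
]

# A fresh subprocess.run(cmd) here for *every* tag call is the overhead
# this library avoids; if an exception fires before cleanup, the tempfile leaks.
os.unlink(path)
```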

This approach works well in sequential applications, but under concurrency and load, problems begin to arise:

  1. Python's tempfile.mkstemp doesn't work well on Windows in the first place and breaks down further under stress.
    • Calls to tempfile.mkstemp start to fail, which in turn causes the Stanford code to fail (no input file to read).
    • Temporary files are leaked, negatively impacting disk usage.
  2. Spawning a new subprocess for every call means:
    • Multiple Java processes run in parallel, increasing CPU and memory usage.
    • OS-level subprocess creation and JVM startup code runs every time, adding further CPU overhead.

All of this makes user-written code needlessly slow and unreliable.

Patching nltk's code to use tempfile.TemporaryDirectory instead of tempfile.mkstemp seemed to resolve issue 1 but issue 2 would require more work.

This library instead runs the Stanford code in server mode and sends input text over TCP, meaning:

  1. Filesystem operations and temporary files/directories are avoided entirely.
  2. There's no need to run a Java subprocess more than once.
  3. The only synchronization bottleneck is offloaded to Java's SocketServer class, which the Stanford code uses.
  4. CPU, memory, and disk usage are greatly reduced.
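The server-plus-socket approach can be illustrated with a toy asyncio round trip. This is not the Stanford wire protocol, just the shape of the technique: one long-lived server, with input text sent over a socket instead of through tempfiles and fresh subprocesses:

```python
import asyncio

async def handle(reader, writer):
    # A real server would tag the line here; this toy handler just marks it.
    line = await reader.readline()
    writer.write(b"tagged: " + line)
    await writer.drain()
    writer.close()

async def main():
    # One long-lived server bound to a random free port (port 0).
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    # Each tag request is a cheap socket round trip, not a new process.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"some input text\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()

    server.close()
    await server.wait_closed()
    return reply

reply = asyncio.run(main())
```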

Differences from ner

Differences from stanza

  • asyncio support.
  • Stanza aims to provide a much wider range of functionality.

Basic Usage

>>> from aiocorenlp import ner_tag
>>> await ner_tag("I complained to Microsoft about Bill Gates.")
[('O', 'I'), ('O', 'complained'), ('O', 'to'), ('ORGANIZATION', 'Microsoft'), ('O', 'about'), ('PERSON', 'Bill'), ('PERSON', 'Gates.')]

This usage doesn't require interfacing with the server and socket directly and is suitable for low-frequency or one-off tagging.
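Note that top-level await only works in an async REPL. In an ordinary script, wrap the call in asyncio.run:

```python
import asyncio

from aiocorenlp import ner_tag

async def main():
    # One-shot tagging; the server is started and stopped behind the scenes.
    return await ner_tag("I complained to Microsoft about Bill Gates.")

print(asyncio.run(main()))
```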

Advanced Usage

To take full advantage of this library's benefits, use the AsyncNerServer and AsyncPosServer classes:

from aiocorenlp.async_ner_server import AsyncNerServer
from aiocorenlp.async_corenlp_socket import AsyncCorenlpSocket

server = AsyncNerServer()
port = server.start()
print(f"Server started on port {port}")

socket: AsyncCorenlpSocket = server.get_socket()

while True:
    text = input("> ")
    if text == "exit":
        break

    print(await socket.tag(text))

server.stop()
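The payoff of a single long-lived server is that many tag requests can be in flight at once. The fan-out pattern looks like this; a stub coroutine stands in for socket.tag so the sketch is self-contained and runnable:

```python
import asyncio

async def tag(text: str) -> list[tuple[str, str]]:
    # Stand-in for `await socket.tag(text)`: pretend each call spends time on I/O.
    await asyncio.sleep(0.01)
    return [("O", word) for word in text.split()]

async def main():
    texts = ["first sentence", "second sentence", "third sentence"]
    # All requests run concurrently against the one server
    # instead of spawning a Java subprocess per call.
    return await asyncio.gather(*(tag(t) for t in texts))

results = asyncio.run(main())
```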

The context manager protocol is supported as well:

from aiocorenlp.async_ner_server import AsyncNerServer

server: AsyncNerServer
async with AsyncNerServer() as server:
    socket = server.get_socket()
    
    while True:
        text = input("> ")
        if text == "exit":
            break
    
        print(await socket.tag(text))

Configuration

As seen above, all classes and functions this library exposes can be used without arguments (sensible defaults are provided).

Optionally, the following arguments may be passed to AsyncNerServer (and by extension ner_tag/pos_tag):

  • port: Server bind port. Leave None for random port.
  • model_path: Path to language model. Leave None to let nltk find the model (supports STANFORD_MODELS environment variable).
  • jar_path: Path to stanford-*.jar. Leave None to let nltk find the jar (supports STANFORD_POSTAGGER environment variable, for NER as well).
  • output_format: Output format. See OutputFormat enum for values. Default is slashTags.
  • encoding: Output encoding.
  • java_options: Additional JVM options.

It is not possible to configure the server bind interface. This is a limitation imposed by the Stanford code.
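Putting the options together, a configured server might look like the following sketch. The paths and the JVM option value are placeholders, and the keyword names are taken from the list above; the exact accepted types are an assumption:

```python
from aiocorenlp.async_ner_server import AsyncNerServer

# All arguments are optional; paths below are placeholders, not real files.
server = AsyncNerServer(
    port=9199,                                            # fixed port instead of a random one
    model_path="english.all.3class.distsim.crf.ser.gz",   # else nltk/STANFORD_MODELS finds it
    jar_path="stanford-ner.jar",                          # else nltk/STANFORD_POSTAGGER finds it
    java_options="-Xmx1g",                                # extra JVM options
)
```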
