Asyncio support for Stanford CoreNLP
aiocorenlp
High-fidelity asyncio-capable Stanford CoreNLP library.
Heavily based on ner and nltk.
Rationale and differences from nltk
For every tag operation (in other words, every call to StanfordTagger.tag*), nltk runs a Stanford JAR (stanford-ner.jar/stanford-postagger.jar) in a newly spawned Java subprocess.
In order to pass the input text to these JARs, nltk first writes it to a tempfile and includes its path in the Java command line using the -textFile flag.
This method works well in sequential applications. However, once it is scaled up with concurrency and put under stress, problems begin to arise:
- Python's tempfile.mkstemp doesn't work very well on Windows to begin with and starts to break down under stress.
  - Calls to tempfile.mkstemp start to fail, which in turn results in Stanford code failing (no input file to read).
  - Temporary files get leaked, resulting in negative impact on disk usage.
- Repeated calls to subprocess mean:
  - Multiple Java processes run in parallel, causing negative impact on CPU and memory usage.
  - OS-level subprocess and Java startup code has to be run every time, causing additional negative impact on CPU usage.
All of this makes user-written code unnecessarily slow and unreliable.
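The per-call pattern described above can be sketched roughly as follows. This is an approximation for illustration, not nltk's actual code: build_tag_command is a hypothetical helper, the JAR and model paths are placeholders, and the Java class name and -loadClassifier flag reflect how the Stanford NER classifier is commonly invoked from the command line.

```python
import os
import tempfile
from typing import List, Tuple


def build_tag_command(text: str, jar_path: str, model_path: str) -> Tuple[List[str], str]:
    """Write `text` to a temporary file and build the Java command line for one tag call."""
    # mkstemp returns an open file descriptor plus the path of the new file.
    fd, input_path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)

    # A fresh JVM would be spawned with this command for every tag operation.
    command = [
        "java", "-cp", jar_path,
        "edu.stanford.nlp.ie.crf.CRFClassifier",
        "-loadClassifier", model_path,
        "-textFile", input_path,
    ]
    return command, input_path
```

The caller would pass the command to subprocess and then has to delete input_path itself. Any failure between those two steps leaves the file behind, which is the leak described above.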
Patching nltk's code to use tempfile.TemporaryDirectory instead of tempfile.mkstemp seemed to resolve the first issue (temporary files), but the second issue (repeated subprocess spawning) would require more work.
This library runs the Stanford code in a server mode and sends input text over TCP, meaning:
- Filesystem operations and temporary files/directories are avoided entirely.
- There's no need to run a Java subprocess more than once.
- The only synchronization bottleneck is offloaded to Java's ServerSocket class, which is used in the Stanford code.
- CPU, memory and disk usage is greatly reduced.
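To make the TCP approach concrete, here is a minimal sketch of a client that sends one piece of text to an already running server using asyncio streams. It assumes the server reads a single line per connection and replies with the tagged text before closing the connection; that is an assumption about the wire protocol, and tag_over_tcp is a hypothetical helper, not part of this library's API.

```python
import asyncio


async def tag_over_tcp(text: str, host: str, port: int, encoding: str = "utf-8") -> str:
    """Send one line of text to a tagging server and return its raw reply."""
    reader, writer = await asyncio.open_connection(host, port)
    try:
        # The assumed protocol is line based, so embedded newlines are flattened.
        line = text.replace("\n", " ") + "\n"
        writer.write(line.encode(encoding))
        await writer.drain()
        # Read until the server closes its side of the connection.
        reply = await reader.read()
        return reply.decode(encoding).strip()
    finally:
        writer.close()
        await writer.wait_closed()
```

No file is written and no process is spawned per request; the only per-call cost is a short-lived socket connection.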
Differences from ner
- asyncio support.
- Method name mangling is inexplicably enabled in the ner.client.NER class, making subclassing impractical.
- The ner library appears to be abandoned.
Differences from stanza
- asyncio support.
- Stanza aims to provide a wider range of uses.
Basic Usage
>>> from aiocorenlp import ner_tag
>>> await ner_tag("I complained to Microsoft about Bill Gates.")
[('O', 'I'), ('O', 'complained'), ('O', 'to'), ('ORGANIZATION', 'Microsoft'), ('O', 'about'), ('PERSON', 'Bill'), ('PERSON', 'Gates.')]
This usage doesn't require interfacing with the server and socket directly and is suitable for low-frequency or one-time tagging.
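The result shown above is a list of (tag, token) pairs. As a sketch of how such output might be post-processed, the hypothetical helper below (not part of aiocorenlp) merges runs of consecutive tokens that share a tag into single entities and drops tokens tagged O.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def group_entities(tagged: Iterable[Tuple[str, str]], outside: str = "O") -> List[Tuple[str, str]]:
    """Merge consecutive (tag, token) pairs with the same tag into (tag, phrase) pairs."""
    entities = []
    # groupby yields one group per run of adjacent pairs sharing the same tag.
    for tag, run in groupby(tagged, key=lambda pair: pair[0]):
        if tag != outside:
            entities.append((tag, " ".join(token for _, token in run)))
    return entities
```

Applied to the output above, this yields one ORGANIZATION entity ("Microsoft") and one PERSON entity ("Bill Gates."). Note that two distinct entities of the same type that happen to be adjacent would be merged by this simple approach.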
Advanced Usage
To take full advantage of this library's benefits, the AsyncNerServer and AsyncPosServer classes should be used:
import asyncio

from aiocorenlp.async_ner_server import AsyncNerServer
from aiocorenlp.async_corenlp_socket import AsyncCorenlpSocket


async def main() -> None:
    server = AsyncNerServer()
    port = server.start()  # start() returns the port the server is bound to
    print(f"Server started on port {port}")
    socket: AsyncCorenlpSocket = server.get_socket()
    # Tag each line typed by the user until "exit" is entered.
    while True:
        text = input("> ")
        if text == "exit":
            break
        print(await socket.tag(text))
    server.stop()


asyncio.run(main())
An async context manager is supported as well:
import asyncio

from aiocorenlp.async_ner_server import AsyncNerServer


async def main() -> None:
    server: AsyncNerServer
    # Async context manager form of the previous example.
    async with AsyncNerServer() as server:
        socket = server.get_socket()
        while True:
            text = input("> ")
            if text == "exit":
                break
            print(await socket.tag(text))


asyncio.run(main())
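Since the benefit of a long-lived server shows up mainly under concurrency, here is a sketch of tagging many texts at once with a bounded number of requests in flight. tag_all is a hypothetical helper, not part of aiocorenlp; it accepts any coroutine function, so something like socket.tag from the examples above could be passed in, and the default limit of 8 is an arbitrary choice.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def tag_all(tag: Callable[[str], Awaitable[T]], texts: Iterable[str], limit: int = 8) -> List[T]:
    """Run `tag` over all texts concurrently, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def tag_one(text: str) -> T:
        # The semaphore caps how many requests hit the server at the same time.
        async with semaphore:
            return await tag(text)

    # gather preserves input order, so results line up with `texts`.
    return await asyncio.gather(*(tag_one(text) for text in texts))
```

Inside a coroutine this might be called as `results = await tag_all(socket.tag, texts)`.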
Configuration
As seen above, all classes and functions this library exposes may be used without arguments (default values).
Optionally, the following arguments may be passed to AsyncNerServer (and by extension ner_tag/pos_tag):
- port: Server bind port. Leave None for a random port.
- model_path: Path to language model. Leave None to let nltk find the model (supports STANFORD_MODELS environment variable).
- jar_path: Path to stanford-*.jar. Leave None to let nltk find the jar (supports STANFORD_POSTAGGER environment variable, for NER as well).
- output_format: Output format. See OutputFormat enum for values. Default is slashTags.
- encoding: Output encoding.
- java_options: Additional JVM options.
It is not possible to configure the server bind interface. This is a limitation imposed by the Stanford code.
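As an illustration of these options, the sketch below collects some of the documented keyword names into a dictionary. The port number and directory paths are hypothetical placeholders, and the value types each argument accepts are assumptions. The AsyncNerServer call is left as a comment because running it needs Java and the Stanford JARs installed.

```python
import os

# Hypothetical install locations. nltk consults these variables when
# model_path and jar_path are left as None.
os.environ.setdefault("STANFORD_MODELS", "/opt/stanford-ner/classifiers")
os.environ.setdefault("STANFORD_POSTAGGER", "/opt/stanford-ner")

server_options = {
    "port": 9199,        # fixed port instead of a random one
    "model_path": None,  # let nltk locate the model
    "jar_path": None,    # let nltk locate the jar
    "encoding": "utf-8",
}

# from aiocorenlp.async_ner_server import AsyncNerServer
# server = AsyncNerServer(**server_options)
```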