
Use the latest StanfordNLP research models directly in spaCy


spaCy + StanfordNLP

This package wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labelled dependency parsing in 58 languages.


Using this wrapper, you'll be able to use the following annotations, computed by your pretrained stanfordnlp model:

  • Statistical tokenization (reflected in the Doc and its tokens)
  • Lemmatization (token.lemma and token.lemma_)
  • Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
  • Dependency parsing (token.dep, token.dep_, token.head)
  • Sentence segmentation (doc.sents)

⌛️ Installation

pip install spacy-stanfordnlp

Make sure to also install one of the pre-trained StanfordNLP models. It's recommended to run StanfordNLP on Python 3.6.8+ or Python 3.7.2+.
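
For example, here's a minimal sketch of fetching the English model with stanfordnlp's own download helper (it may prompt for a download directory, which defaults to ~/stanfordnlp_resources):

import stanfordnlp

# Download the pre-trained English model (only needed once)
stanfordnlp.download("en")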

📖 Usage & Examples

The StanfordNLPLanguage class can be initialized with a loaded StanfordNLP pipeline and returns a spaCy Language object, i.e. the nlp object you can use to process text and create a Doc object.

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

If language data for the given language is available in spaCy, the respective language class will be used as the base for the nlp object – for example, English(). This lets you use spaCy's lexical attributes like is_stop or like_num. The nlp object follows the same API as any other spaCy Language class – so you can visualize the Doc objects with displaCy, add custom components to the pipeline, use the rule-based matcher and do pretty much anything else you'd normally do in spaCy.

# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])

# Visualize dependencies
from spacy import displacy
displacy.serve(doc)  # or displacy.render if you're in a Jupyter notebook

# Efficient processing with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)

# Combine with your own custom pipeline components
def custom_component(doc):
    # Do something to the doc here
    return doc

nlp.add_pipe(custom_component)

# Serialize it to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])

Experimental: Mixing and matching pipeline components

By default, the nlp object's pipeline will be empty, because all attributes are computed once and set in the custom Tokenizer. But since it's a regular nlp object, you can add your own components to the pipeline.
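
To see this, a quick illustrative check shows that no components are registered out of the box:

# All StanfordNLP annotations are set by the custom tokenizer,
# so the component pipeline itself starts out empty
print(nlp.pipe_names)  # []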

For example, the entity recognizer from one of spaCy's pre-trained models:

import spacy
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(lang="en", models_dir="./models")
nlp = StanfordNLPLanguage(snlp)

# Load spaCy's pre-trained en_core_web_sm model, get the entity recognizer and
# add it to the StanfordNLP model's pipeline
spacy_model = spacy.load("en_core_web_sm")
ner = spacy_model.get_pipe("ner")
nlp.add_pipe(ner)

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Barack Obama', 'PERSON'), ('Hawaii', 'GPE'), ('2008', 'DATE')]

You could also add and train your own custom text classification component.
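
As a rough sketch using the spaCy v2 API (the labels here are purely illustrative), adding an untrained text classifier could look like this:

# Create a text classification component and add it to the pipeline
textcat = nlp.create_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
nlp.add_pipe(textcat, last=True)

# The component still needs to be trained on your own data
# (e.g. via nlp.begin_training() and nlp.update()) before its
# predictions are meaningful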

Advanced: serialization and entry points

The spaCy nlp object created by StanfordNLPLanguage exposes its language as stanfordnlp_xx, where xx is the language code of the loaded StanfordNLP model.

from spacy.util import get_lang_class
lang_cls = get_lang_class("stanfordnlp_en")

Normally, the above would fail because spaCy doesn't include a language class stanfordnlp_en. But because this package exposes a spacy_languages entry point in its setup.py that points to StanfordNLPLanguage, spaCy knows how to initialize it.
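
For illustration, the declaration in setup.py looks roughly like this (the spacy_languages group and the StanfordNLPLanguage target are as described above; the exact entry point name is an assumption here):

from setuptools import setup

setup(
    name="spacy-stanfordnlp",
    # ...
    entry_points={
        # Register the custom language class with spaCy
        "spacy_languages": ["stanfordnlp = spacy_stanfordnlp:StanfordNLPLanguage"]
    },
)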

This means that saving to and loading from disk works:

snlp = stanfordnlp.Pipeline(lang="en")
nlp = StanfordNLPLanguage(snlp)
nlp.to_disk("./stanfordnlp-spacy-model")

Additional arguments on spacy.load are automatically passed down to the language class and pipeline components. So when loading the saved model, you can pass in the snlp argument:

snlp = stanfordnlp.Pipeline(lang="en")
nlp = spacy.load("./stanfordnlp-spacy-model", snlp=snlp)

Note that this will not save any model data by default. The StanfordNLP models are very large, so for now, this package expects that you load them separately.
