Use the latest Stanza (StanfordNLP) research models directly in spaCy
spaCy + Stanza (formerly StanfordNLP)
This package wraps the Stanza (formerly StanfordNLP) library, so you can use Stanford's models as a spaCy pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labelled dependency parsing in 58 languages. As of v1.0, Stanza also supports named entity recognition for selected languages.
⚠️ Previous versions of this package were available as spacy-stanfordnlp.
Using this wrapper, you'll be able to use the following annotations, computed by your pretrained stanza model:
- Statistical tokenization (reflected in the Doc and its tokens)
- Lemmatization (token.lemma and token.lemma_)
- Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
- Dependency parsing (token.dep, token.dep_, token.head)
- Named entity recognition (doc.ents, token.ent_type, token.ent_type_, token.ent_iob, token.ent_iob_)
- Sentence segmentation (doc.sents)
⌛️ Installation
pip install spacy-stanza
Make sure to also install one of the pre-trained Stanza models.
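One way to fetch a model is from Python. The sketch below assumes Stanza v1.0 or later, where stanza.download is available. The helper names stanza_is_installed and download_model are made up for this example and are not part of Stanza or spaCy.

```python
import importlib.util


def stanza_is_installed():
    """Return True if the stanza package can be imported in this environment."""
    return importlib.util.find_spec("stanza") is not None


def download_model(lang="en"):
    """Download the pretrained Stanza models for one language code.

    Hypothetical helper: it only wraps stanza.download and adds a
    friendlier error when the package is missing.
    """
    if not stanza_is_installed():
        raise RuntimeError(
            "stanza is not installed; run `pip install spacy-stanza` first"
        )
    import stanza  # imported lazily so this module loads without stanza

    stanza.download(lang)
```

Calling download_model("en") should fetch the English models into Stanza's default resources directory. This needs network access and can take a while, since the models are large.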
📖 Usage & Examples
The StanzaLanguage class can be initialized with a loaded Stanza pipeline and returns a spaCy Language object, i.e. the nlp object you can use to process text and create a Doc object.
import stanza
from spacy_stanza import StanzaLanguage
snlp = stanza.Pipeline(lang="en")
nlp = StanzaLanguage(snlp)
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)
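The example above prints per-token attributes. Sentence boundaries and entity spans can be read from the same Doc. The following is a small sketch using only standard spaCy attributes (doc.sents, doc.ents, span.label_, token.ent_iob_, token.ent_type_); the three function names are invented for illustration.

```python
def sentence_texts(doc):
    """Return the text of each sentence exposed through doc.sents."""
    return [sent.text for sent in doc.sents]


def entity_pairs(doc):
    """Return (text, label) pairs for every entity span in doc.ents."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def iob_tags(doc):
    """Return (text, IOB tag, entity type) for every token in the doc."""
    return [(token.text, token.ent_iob_, token.ent_type_) for token in doc]


# Usage with the `doc` created above (requires a loaded Stanza pipeline):
#   print(sentence_texts(doc))
#   print(entity_pairs(doc))
#   print(iob_tags(doc))
```

Which entity labels appear depends on the Stanza model you loaded, and entities are only available for languages where Stanza ships an NER model.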
If language data for the given language is available in spaCy, the respective language class will be used as the base for the nlp object, for example English(). This lets you use spaCy's lexical attributes like is_stop or like_num. The nlp object follows the same API as any other spaCy Language class, so you can visualize the Doc objects with displaCy, add custom components to the pipeline, use the rule-based matcher and do pretty much anything else you'd normally do in spaCy.
# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])
# Visualize dependencies
from spacy import displacy
displacy.serve(doc) # or displacy.render if you're in a Jupyter notebook
# Efficient processing with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)
# Combine with your own custom pipeline components
def custom_component(doc):
    # Do something to the doc here
    return doc
nlp.add_pipe(custom_component)
# Serialize it to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
Experimental: Mixing and matching pipeline components
By default, the nlp object's pipeline will be empty, because all attributes are computed once and set in the custom Tokenizer. But since it's a regular nlp object, you can add your own components to the pipeline. For example, you could add and train your own custom text classification component and use nlp.add_pipe to add it to the pipeline, or augment the named entities with your own rule-based patterns using the EntityRuler component.
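As a rough sketch of the EntityRuler route, the helpers below build token patterns and attach a ruler to an existing nlp object. This assumes the spaCy v2-style API, in which the component object is created first and then passed to nlp.add_pipe. The names token_pattern, build_patterns and add_entity_ruler are hypothetical, and the combination has not been verified against every Stanza model.

```python
def token_pattern(phrase):
    """Build a case-insensitive token pattern from a whitespace-separated phrase.

    Splitting on whitespace is a simplification: phrases containing
    punctuation may be tokenized differently by Stanza.
    """
    return [{"LOWER": word} for word in phrase.lower().split()]


def build_patterns(phrases_by_label):
    """Turn {"LABEL": ["some phrase", ...]} into EntityRuler pattern dicts."""
    return [
        {"label": label, "pattern": token_pattern(phrase)}
        for label, phrases in phrases_by_label.items()
        for phrase in phrases
    ]


def add_entity_ruler(nlp, phrases_by_label):
    """Attach an EntityRuler with the given phrases to the nlp pipeline."""
    # Assumption: spaCy v2, where EntityRuler is instantiated directly.
    from spacy.pipeline import EntityRuler  # lazy import; needs spaCy installed

    ruler = EntityRuler(nlp)
    ruler.add_patterns(build_patterns(phrases_by_label))
    nlp.add_pipe(ruler)
    return ruler
```

With the nlp object from the usage example, add_entity_ruler(nlp, {"ORG": ["Stanford NLP Group"]}) would register the ruler as a pipeline component. Tokens that Stanza has already labelled should be left alone, because the ruler's overwrite_ents setting defaults to False.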
Advanced: serialization and entry points
The spaCy nlp object created by StanzaLanguage exposes its language as stanza_xx, for example stanza_en for English.
from spacy.util import get_lang_class
lang_cls = get_lang_class("stanza_en")
Normally, the above would fail because spaCy doesn't include a language class stanza_en. But because this package exposes a spacy_languages entry point in its setup.py that points to StanzaLanguage, spaCy knows how to initialize it.
This means that saving to and loading from disk works:
snlp = stanza.Pipeline(lang="en")
nlp = StanzaLanguage(snlp)
nlp.to_disk("./stanza-spacy-model")
Additional arguments on spacy.load are automatically passed down to the language class and pipeline components. So when loading the saved model, you can pass in the snlp argument:
import spacy
import stanza
snlp = stanza.Pipeline(lang="en")
nlp = spacy.load("./stanza-spacy-model", snlp=snlp)
Note that saving to disk will not include any Stanza model data by default. The Stanza models are very large, so for now, this package expects that you load them separately.
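To see whether the spacy_languages entry point described earlier is visible in your environment, you can list the registered entries with the standard library. This is a minimal sketch assuming Python 3.8 or later (for importlib.metadata); the function name registered_spacy_languages is made up for this example.

```python
from importlib.metadata import entry_points


def registered_spacy_languages():
    """Return {entry point name: target} for the "spacy_languages" group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a selectable collection.
        group = eps.select(group="spacy_languages")
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        group = eps.get("spacy_languages", [])
    return {ep.name: ep.value for ep in group}


# Prints an empty dict if no installed package registers this group.
print(registered_spacy_languages())
```

If spacy-stanza is installed, you would expect entries whose names start with stanza_ and whose targets point at StanzaLanguage. The exact names depend on the installed version.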
Hashes for spacy_stanza-0.2.4-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 8b050d5ee337ef8b0f2d7852e85a1553ae91e1a29fca57231f745720e51fe4d9
MD5 | 30e9ef18909b26169bc89a18dcf47469
BLAKE2b-256 | d48eb35e8275ce2658a16484570c9a4acdcbd99de22a49adb3a9db3fa9844072