A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.
Crosslingual Coreference
Coreference is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore relies on the assumption that a model trained on English data, combined with cross-lingual embeddings, should also work for languages with similar sentence structures.
Install
pip install crosslingual-coreference
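The examples below also assume the small English spaCy pipeline is installed; if it is not yet available on your machine:

python -m spacy download en_core_web_sm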
Quickstart
from crosslingual_coreference import Predictor
text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)
print(predictor.predict(text)["resolved_text"])
# Note you can also get 'cluster_heads' and 'clusters'
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
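Besides "resolved_text", the prediction dictionary also exposes the 'clusters' and 'cluster_heads' mentioned above. A minimal sketch of inspecting them; the exact shape of these values is an assumption based on the spaCy output further below (clusters as lists of token-index pairs, cluster heads as a head-text-to-span mapping):

prediction = predictor.predict(text)

# 'clusters': one list of [start, end] token-index pairs per coreference cluster
print(prediction["clusters"])

# 'cluster_heads': mapping from each cluster's head text to its token span
print(prediction["cluster_heads"])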
Models
As of now, there are four models available: "spanbert", "info_xlm", "xlm_roberta" and "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.
- The "minilm" model is the best quality speed trade-off for both mult-lingual and english texts.
- The "info_xlm" model produces the best quality for multi-lingual texts.
- The AllenNLP "spanbert" model produces the best quality for english texts.
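For non-English input you can keep the same API and simply swap in a multi-lingual model and the spaCy pipeline for your language. A minimal sketch, assuming the Dutch model nl_core_news_sm is installed (the Dutch example sentence is illustrative only):

from crosslingual_coreference import Predictor

# "info_xlm" gives the best multi-lingual quality; pair it with the spaCy
# model for your language (here Dutch, installed separately)
predictor = Predictor(
    language="nl_core_news_sm", device=-1, model_name="info_xlm"
)

text_nl = "Vergeet Momofuku Ando niet! Hij bedacht instantnoedels in Osaka."
print(predictor.predict(text_nl)["resolved_text"])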
Chunking/batching to resolve memory OOM errors
from crosslingual_coreference import Predictor
predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
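The parameter names suggest that long inputs are split into chunks of roughly chunk_size tokens with chunk_overlap tokens of overlap, so each forward pass fits in memory; this reading is an assumption, not documented behaviour. Prediction itself looks the same as before:

# any text long enough to exhaust memory in a single forward pass
long_text = " ".join(
    ["Do not forget about Momofuku Ando! He created instant noodles in Osaka."] * 200
)

# the input is chunked internally, so the call is identical to the short-text case
print(predictor.predict(long_text)["resolved_text"])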
Use spaCy pipeline
import spacy
import crosslingual_coreference  # noqa: F401 -- importing registers the "xx_coref" spaCy component
text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)
doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
print(doc._.cluster_heads)
# Output
#
# {Momofuku Ando: [5, 6],
# instant noodles: [11, 12],
# Osaka: [14, 14],
# Nissin: [21, 21],
# Many students: [26, 27]}
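The indices above look like zero-based, inclusive token offsets into the Doc, so the clusters can be mapped back to spaCy spans. A minimal sketch under that assumption (the inclusive-offset interpretation is inferred from the printed output, not from documentation):

# reconstruct each mention as a spaCy Span, assuming [start, end] are
# zero-based, inclusive token offsets into the Doc
for cluster in doc._.coref_clusters:
    mentions = [doc[start : end + 1].text for start, end in cluster]
    print(mentions)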