A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.
Crosslingual Coreference
Coreference resolution is amazing, but the data required to train a model is very scarce. In our case, the training data available for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore relies on the assumption that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.
Install
pip install crosslingual-coreference
Quickstart
from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# device=-1 runs on CPU; pass a GPU index (e.g. 0) to use a GPU
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="info_xlm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
Chunking/batching to avoid out-of-memory (OOM) errors
from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,  # run on GPU 0
    model_name="info_xlm",
    chunk_size=2500,  # maximum length of a single chunk
    chunk_overlap=2,  # overlap between consecutive chunks
)
Use spaCy pipeline
import spacy

import crosslingual_coreference  # registers the "xx_coref" pipeline component

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)

print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
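Each entry in doc._.coref_clusters is one cluster: a list of [start, end] token spans (end index inclusive) that all refer to the same entity. A small helper like the one below (hypothetical, not part of the package) turns those index pairs into readable mention strings; with a real spaCy Doc you would slice doc[start : end + 1] instead of joining tokens with spaces:

```python
# Hypothetical helper (not part of crosslingual-coreference): convert
# token-index clusters into mention strings. Each [start, end] pair uses
# an inclusive end index, matching the coref_clusters output above.
def cluster_mentions(tokens, clusters):
    return [
        [" ".join(tokens[start : end + 1]) for start, end in cluster]
        for cluster in clusters
    ]

# Tiny invented example (not the README sentence) to show the shape:
tokens = ["Anna", "lost", "her", "phone", "."]
clusters = [[[0, 0], [2, 2]]]
print(cluster_mentions(tokens, clusters))  # [['Anna', 'her']]
```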
Available models
As of now, two models are available, "info_xlm" and "xlm_roberta", which scored 77 and 74, respectively, on OntoNotes Release 5.0 English data.
Hashes for crosslingual-coreference-0.2.0.tar.gz

Algorithm | Hash digest
---|---
SHA256 | 5ab4d9201d012bd525b2641e4565b811384dcae0299bbf6a731ce81ae5d650d3
MD5 | 89fe8914ea3b39b78a720124180143f4
BLAKE2b-256 | 5b1318c80fd221e6684e03dc87b755b15deeb540194a32c6fcca61827777682e
Hashes for crosslingual_coreference-0.2.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 2a62689f7c17ee03780b8af29b4b389f138a484cc4f210563a8f332c4b422c40
MD5 | 853934be5d613c8439add09ef05f1a48
BLAKE2b-256 | e4661bfd16a1d1c808db33cd4364fca0cba56ffe483838484b705a71d609f0fa