Analyze and characterize your Spans. Integrated with spaCy.
spacy-span-analyzer
A simple tool to analyze the Spans in your dataset. It's tightly integrated with spaCy, so you can easily incorporate it into existing NLP pipelines. This is also a reproduction of Papay et al.'s work on Dissecting Span Identification Tasks with Performance Prediction (EMNLP 2020).
⏳ Install
Using pip:
pip install spacy-span-analyzer
Directly from source (I highly recommend running this within a virtual environment):
git clone git@github.com:ljvmiranda921/spacy-span-analyzer.git
cd spacy-span-analyzer
pip install .
⏯ Usage
You can use the Span Analyzer as a command-line tool:
spacy-span-analyzer ./path/to/dataset.spacy
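The command-line tool reads a serialized DocBin. If your annotated Docs only exist in memory, a minimal sketch of writing them to a .spacy file could look like the following. The example text, the span offsets, and the output path are made up for illustration:

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span

nlp = spacy.blank("en")

# Hypothetical example document with one annotated span under the "sc" key
doc = nlp("Welcome to the Bank of China.")
doc.spans["sc"] = [Span(doc, 3, 6, label="ORG")]

# Serialize the Docs into a DocBin and write it to disk
output_path = Path(tempfile.mkdtemp()) / "dataset.spacy"
doc_bin = DocBin(docs=[doc])
doc_bin.to_disk(output_path)

# Read the file back to check that the spans survived the round trip
loaded = list(DocBin().from_disk(output_path).get_docs(nlp.vocab))
print([(span.text, span.label_) for span in loaded[0].spans["sc"]])
```

The resulting file path is what you would pass to the command-line tool.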
Or as an imported library:
import spacy
from spacy.tokens import DocBin
from spacy_span_analyzer import SpanAnalyzer
nlp = spacy.blank("en") # or any Language model
# Ensure that your dataset is a DocBin
doc_bin = DocBin().from_disk("./path/to/data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
# Run SpanAnalyzer and get span characteristics
analyze = SpanAnalyzer(docs)
analyze.frequency
analyze.length
analyze.span_distinctiveness
analyze.boundary_distinctiveness
Inputs are expected to be a list of spaCy Docs or a DocBin (if you're using the command-line tool).
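The four properties correspond to the span characteristics studied by Papay et al. As a rough illustration only, and assuming the paper's definitions as I understand them (span length as a geometric mean of token counts, span distinctiveness as a KL divergence between the unigram distribution inside spans and that of the whole corpus), two of them can be sketched in plain Python. This is not the package's implementation, and the helper names and toy data are hypothetical:

```python
import math
from collections import Counter


def geometric_mean_length(spans):
    """Geometric mean of span lengths, measured in tokens."""
    logs = [math.log(len(span)) for span in spans]
    return math.exp(sum(logs) / len(logs))


def span_distinctiveness(spans, corpus_tokens):
    """KL divergence D(P_span || P_corpus) over unigram distributions."""
    span_counts = Counter(tok for span in spans for tok in span)
    corpus_counts = Counter(corpus_tokens)
    n_span = sum(span_counts.values())
    n_corpus = sum(corpus_counts.values())
    kl = 0.0
    for tok, count in span_counts.items():
        p_span = count / n_span
        # Assumes every span token also occurs in the corpus
        p_corpus = corpus_counts[tok] / n_corpus
        kl += p_span * math.log(p_span / p_corpus)
    return kl


# Toy data (hypothetical): a tokenized corpus and spans as lists of tokens
corpus_tokens = "the patient took aspirin and the patient took ibuprofen".split()
spans = [["aspirin"], ["ibuprofen"], ["the", "patient"]]

print(len(spans))                                  # frequency: number of spans
print(geometric_mean_length(spans))                # span length
print(span_distinctiveness(spans, corpus_tokens))  # span distinctiveness
```

Under this reading, a higher distinctiveness means the tokens inside the spans look less like the rest of the corpus. Boundary distinctiveness would apply the same comparison to the tokens at the span boundaries.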
Working with Spans
In spaCy, you'd want to store your Spans in the doc.spans property, under a particular spans_key (sc by default). Unlike the doc.ents property, doc.spans allows overlapping entities. This is especially useful for downstream tasks like Span Categorization.
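To make the difference with doc.ents concrete, here is a small sketch using a blank English pipeline. The sentence and the labels are only illustrative:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Two overlapping spans: "Bank of China" (tokens 3-5) and "China" (token 5)
outer = Span(doc, 3, 6, label="ORG")
inner = Span(doc, 5, 6, label="GPE")

# doc.spans accepts overlapping spans under a spans_key
doc.spans["sc"] = [outer, inner]

# doc.ents rejects them, because each token can belong to one entity only
try:
    doc.ents = [outer, inner]
    ents_ok = True
except ValueError:
    ents_ok = False

print(len(doc.spans["sc"]), ents_ok)
```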
A common way to do this is to use char_span to define a slice from your Doc:
doc = nlp(text)
spans = []
for annotation in annotations:
    span = doc.char_span(
        annotation["start"],
        annotation["end"],
        annotation["label"],
    )
    # char_span returns None if the offsets don't align with token boundaries
    if span is not None:
        spans.append(span)
# Put all spans under a spans_key
doc.spans["sc"] = spans
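One thing to watch out for: char_span returns None when the character offsets do not line up with token boundaries. The sketch below shows this on a made-up sentence, along with the alignment_mode argument that spaCy provides for snapping offsets to the nearest tokens:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Offsets 15-28 cover exactly "Bank of China"
aligned = doc.char_span(15, 28, label="ORG")

# Offsets 15-26 end in the middle of "China", so strict alignment fails
misaligned = doc.char_span(15, 26, label="ORG")

# "expand" grows the span to cover every partially included token,
# "contract" shrinks it to the tokens that are fully included
expanded = doc.char_span(15, 26, label="ORG", alignment_mode="expand")
contracted = doc.char_span(15, 26, label="ORG", alignment_mode="contract")

print(aligned, misaligned, expanded, contracted)
```

Whether to expand, contract, or drop a misaligned annotation depends on your data, so it is worth checking how many annotations are affected before picking one.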
You can also achieve the same thing by using set_ents or by creating a SpanGroup.
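As a minimal sketch of the SpanGroup route, assuming the same kind of toy sentence and labels as a stand-in for real annotations:

```python
import spacy
from spacy.tokens import Span, SpanGroup

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")

# Build the spans from token indices (end index is exclusive)
spans = [
    Span(doc, 3, 6, label="ORG"),
    Span(doc, 5, 6, label="GPE"),
]

# Wrap them in a named SpanGroup and store it under the spans_key
group = SpanGroup(doc, name="sc", spans=spans)
doc.spans["sc"] = group

print([(span.text, span.label_) for span in doc.spans["sc"]])
```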
Hashes for spacy-span-analyzer-0.3.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 089cd3ef0db03d4981d546b64d250adf80c6efbe28b320786c5869ca8597c8e1
MD5 | 3cc6836aa3fff53a1548eb13ce03a002
BLAKE2b-256 | bff380317c5b1fdd5df5f84c0f3a58a6d6b55d23b9f442364cf3af77cfbaf036
Hashes for spacy_span_analyzer-0.3.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | db14c8c0d7bbc2b4db8ee0e6aef0c320988a0d2c9c470c829f357e99a86128ad
MD5 | 9257eb3f0a8932eab804bd33881fd80f
BLAKE2b-256 | f57dbd6d44e7e059814eb091050d268a64d46fe709163dd04497063e7f5884db