A small tool to parse and process annotated text corpora
DSeqMap for NLP
An NLP utility for handling named entity recognition (NER) annotation data between discretized and character-wise representations.
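To make the two representations concrete, here is a minimal pure-Python sketch of mapping a character-based span onto token indices. It does not use dseqmap4nlp; the regex tokenizer and the helper names `tokenize_with_offsets` and `expand_to_tokens` are assumptions made for illustration only, and the package's own behavior may differ.

```python
import re

def tokenize_with_offsets(text):
    """Return (start, stop, token) triples using a simple regex tokenizer."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def expand_to_tokens(span, tokens):
    """Map a (start, stop, label) char span to (first_tok, stop_tok, label).

    Every token sharing at least one character with the span is included,
    so a span that cuts into a token is widened to that token's bounds.
    """
    start, stop, label = span
    hit = [i for i, (t_start, t_stop, _) in enumerate(tokens)
           if t_start < stop and start < t_stop]
    if not hit:
        return None  # the span covers no token at all (e.g. whitespace only)
    return (hit[0], hit[-1] + 1, label)

text = "Das ist gut."
tokens = tokenize_with_offsets(text)
for span in [(0, 3, "a"), (2, 7, "b"), (6, 8, "b")]:
    print(span, "->", expand_to_tokens(span, tokens))
```

With this tokenizer, the span `(2, 7, "b")` starts inside "Das" and ends at the end of "ist", so it is widened to cover both tokens.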
Install
Run pip install dseqmap4nlp to install the package.
Example
The following script demonstrates several transformations of annotation data.
from dseqmap4nlp import SpacySequenceMapper, LabelLoader, CharSequenceMapper

# Sample data (e.g. as loaded from a JSONL file)
samples = [
    {"text": "Das ist gut.", "label": [
        (0, 3, "a"),
        (2, 7, "b"),
        (6, 8, "b")
    ]}
]

for sample in samples:
    text = sample["text"]
    anns = sample["label"]

    # Mapper @ Chars <-> SpaCy Tokens
    mapper = SpacySequenceMapper(text, nlp="de")

    # (trivial) Mapper @ Chars <-> Chars
    charmapper = CharSequenceMapper(text=text)

    # Load annotation data,
    # assuming the format: [(start_idx, stop_idx, label_class), ...],
    # and bind it to a given (char <-> discrete sequence) mapper
    annotationset = LabelLoader.from_text_spans(anns, mapper)

    # Determine the number of overlaps
    print("Overlaps:", annotationset.countOverlaps())

    # Apply the following transformations to the annotation data:
    # - Transform char-based labels onto discretized sequence items (e.g. tokens)
    #   -> Expand if a label's char bounds are not exactly at token bounds
    # - Remove shorter spans in case of overlapping spans
    #   -> Note: New overlaps could also be introduced by span expansion!
    filtered_spans = annotationset\
        .toDSeqSpans(strategy=["expand"])\
        .withoutOverlaps(strategy="prefer_longest", merge_same_classes=True)

    # Check overlaps again (no overlap should exist anymore!)
    print("Overlaps:", filtered_spans.countOverlaps())

    # Transform annotation data into an IOB2-formatted sequence.
    print("Sequence:")
    print(filtered_spans.toFormattedSequence(schema="IOB2"))

    # Try to generate an IOB2 sequence with overlaps. (It should fail!)
    print("Previous sequence (should fail):")
    try:
        # Should raise an error...
        print(annotationset.toFormattedSequence(schema="IOB2"))
    except ValueError as e:
        print("Error raised: " + repr(e))
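The overlap filtering and IOB2 steps in the script above can be sketched in plain Python as follows. This is an illustrative approximation, not the package's implementation: `without_overlaps` and `to_iob2` are hypothetical helpers, the greedy "keep the longest span first" rule is an assumption about what `prefer_longest` means, and the `merge_same_classes` option is not modeled. The token spans are the example annotations after expansion to token bounds, assuming the text splits into four tokens.

```python
def without_overlaps(spans):
    """Greedy 'prefer longest': visit spans from longest to shortest and
    keep a span only if it does not overlap any span kept so far."""
    kept = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or k[1] <= span[0] for k in kept):
            kept.append(span)
    return sorted(kept)

def to_iob2(spans, n_tokens):
    """Encode (start, stop, label) token spans as an IOB2 tag sequence."""
    tags = ["O"] * n_tokens
    for start, stop, label in spans:
        if any(tag != "O" for tag in tags[start:stop]):
            raise ValueError("overlapping spans cannot be encoded as IOB2")
        tags[start] = "B-" + label
        for i in range(start + 1, stop):
            tags[i] = "I-" + label
    return tags

# Token-level spans (stop index exclusive) for "Das ist gut."
token_spans = [(0, 1, "a"), (0, 2, "b"), (1, 2, "b")]

filtered = without_overlaps(token_spans)
print("Filtered:", filtered)
print("Sequence:", to_iob2(filtered, n_tokens=4))

try:
    to_iob2(token_spans, n_tokens=4)
except ValueError as e:
    print("Error raised: " + repr(e))
```

A flat tagging scheme such as IOB2 assigns exactly one tag per token, which is why overlapping spans have to be resolved before the sequence can be generated.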