A small tool to parse and process annotated text corpora
DSeqMap for NLP
An NLP utility for converting named entity recognition (NER) annotation data between character-wise and discretized (e.g. token-based) representations.
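To illustrate the two representations, here is a minimal sketch in plain Python that does not use dseqmap4nlp itself. It assumes whitespace tokenization and exclusive stop indices; the helper names `tokenize_with_offsets` and `expand_to_tokens` are hypothetical and only serve the illustration.

```python
import re


def tokenize_with_offsets(text):
    """Split on whitespace and return (start, stop, token) triples (char offsets)."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def expand_to_tokens(span, tokens):
    """Map a char span (start, stop, label) to a token span (first, last_exclusive, label).

    Every token sharing at least one character with the span is included,
    so a span that ends inside a token is expanded to that token's bounds.
    Assumption: stop indices are exclusive.
    """
    start, stop, label = span
    hits = [
        i
        for i, (t_start, t_stop, _) in enumerate(tokens)
        if t_start < stop and start < t_stop
    ]
    if not hits:
        return None  # the span covers whitespace only
    return (hits[0], hits[-1] + 1, label)


text = "Das ist gut."
tokens = tokenize_with_offsets(text)

# Character-wise annotations: (start_idx, stop_idx, label_class)
char_spans = [(0, 3, "a"), (2, 7, "b"), (6, 8, "b")]

# Discretized annotations: the same labels expressed in token indices
token_spans = [expand_to_tokens(span, tokens) for span in char_spans]
print(token_spans)
```

In this sketch the second span starts inside the first token, so expansion makes it cover the first two tokens, which is how expansion can introduce overlaps that were smaller or absent at the character level.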
Install
Use pip install dseqmap4nlp to install the package.
Example
The following script demonstrates several transformations of annotation data.
from dseqmap4nlp import SpacySequenceMapper, LabelLoader, CharSequenceMapper

# Sample data (as it would be loaded from JSONL)
samples = [
    {"text": "Das ist gut.", "label": [
        (0, 3, "a"),
        (2, 7, "b"),
        (6, 8, "b")
    ]}
]

for sample in samples:
    text = sample["text"]
    anns = sample["label"]

    # Mapper @ Chars <-> SpaCy Tokens
    mapper = SpacySequenceMapper(text, nlp="de")

    # (trivial) Mapper @ Chars <-> Chars
    charmapper = CharSequenceMapper(text=text)

    # Load annotation data
    # assuming the format: [ (start_idx, stop_idx, label_class), ...]
    # and bind it to a given (char <-> discrete sequence) mapper
    annotationset = LabelLoader.from_text_spans(anns, mapper)

    # Determine the number of overlaps
    print("Overlaps:", annotationset.countOverlaps())

    # Apply the following transformations to the annotation data:
    # - Transform char-based labels onto discretized sequence items (e.g. tokens)
    #   -> Expand if a label's char bounds are not exactly at token bounds
    # - Remove shorter spans in case of overlapping spans
    #   -> Note: New overlaps could also be introduced by span expansion!
    filtered_spans = annotationset\
        .toDSeqSpans(strategy=["expand"])\
        .withoutOverlaps(strategy="prefer_longest", merge_same_classes=True)

    # Check overlaps again (no overlap should exist anymore!)
    print("Overlaps:", filtered_spans.countOverlaps())

    # Transform annotation data into an IOB2-formatted sequence.
    print("Sequence:")
    print(filtered_spans.toFormattedSequence(schema="IOB2"))

    # Try to generate an IOB2 sequence with overlaps. (It should fail!)
    print("Previous sequence (should fail):")
    try:
        # Should raise an error...
        print(annotationset.toFormattedSequence(schema="IOB2"))
    except ValueError as e:
        print("Error raised: " + repr(e))
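The overlap handling and IOB2 formatting used in the script can be sketched in plain Python as follows. This is not the package's implementation: the function names are hypothetical, spans are assumed to be token-index triples with exclusive stop indices, "prefer longest" is approximated by a simple greedy pass, and merging of same-class spans is left out.

```python
def count_overlaps(spans):
    """Count pairs of spans that share at least one sequence item."""
    return sum(
        1
        for i, (s1, e1, _) in enumerate(spans)
        for (s2, e2, _) in spans[i + 1:]
        if s1 < e2 and s2 < e1
    )


def prefer_longest(spans):
    """Greedily keep the longest spans; drop any span overlapping a kept one."""
    kept = []
    # Longest first; ties are broken by the earlier start index.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or k[1] <= span[0] for k in kept):
            kept.append(span)
    return sorted(kept)


def to_iob2(spans, length):
    """Render non-overlapping token spans as a list of IOB2 tags."""
    if count_overlaps(spans) > 0:
        # One tag per token cannot encode two labels on the same token.
        raise ValueError("IOB2 cannot represent overlapping spans")
    tags = ["O"] * length
    for start, stop, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, stop):
            tags[i] = "I-" + label
    return tags


# Token spans (first, last_exclusive, label) over a three-token sentence
token_spans = [(0, 1, "a"), (0, 2, "b"), (1, 2, "b")]

filtered = prefer_longest(token_spans)
print("Overlaps before:", count_overlaps(token_spans))
print("Overlaps after:", count_overlaps(filtered))
print("Sequence:", to_iob2(filtered, length=3))
```

The sketch raises a ValueError for overlapping input because IOB2 assigns exactly one tag per sequence item, which is why overlaps have to be resolved before formatting.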