A small tool to parse and process annotated text corpora
Project description
DSeqMap for NLP
An NLP utility for converting named entity recognition (NER) annotation data between character-wise representations and discretized ones (e.g. token sequences).
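To make the two representations concrete, here is a minimal sketch in plain Python that does not use dseqmap4nlp at all. It assumes whitespace tokenization and half-open `(start, stop)` character spans; the helper names `tokenize_with_offsets`, `char_span_to_token_span` and `token_span_to_char_span` are hypothetical and are not part of the package.

```python
import re


def tokenize_with_offsets(text):
    """Return (char_start, char_stop, token) triples for whitespace-delimited tokens."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(char_start, char_stop, tokens):
    """Map a half-open char span to a half-open token span.

    The result is expanded so that it covers every token the char span touches.
    Returns None if the char span touches no token at all.
    """
    touched = [
        i
        for i, (tok_start, tok_stop, _) in enumerate(tokens)
        if tok_start < char_stop and char_start < tok_stop
    ]
    if not touched:
        return None
    return touched[0], touched[-1] + 1


def token_span_to_char_span(tok_start, tok_stop, tokens):
    """Map a half-open token span back to the char span it covers."""
    return tokens[tok_start][0], tokens[tok_stop - 1][1]


text = "Das ist gut."
tokens = tokenize_with_offsets(text)

# The char span (2, 7) starts inside "Das" and ends at the end of "ist",
# so it expands to the first two tokens.
token_span = char_span_to_token_span(2, 7, tokens)
print(token_span)                                    # (0, 2)
print(token_span_to_char_span(*token_span, tokens))  # (0, 7)
```

A real tokenizer (such as a spaCy pipeline) would split "gut." into two tokens; the whitespace split is used here only to keep the sketch dependency-free.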
Install
Install the package with pip:
pip install dseqmap4nlp
Example
The following script demonstrates several transformations of annotation data.
from dseqmap4nlp import SpacySequenceMapper, LabelLoader, CharSequenceMapper

# Sample data, e.g. as loaded from a JSONL file
samples = [
    {"text": "Das ist gut.", "label": [
        (0, 3, "a"),
        (2, 7, "b"),
        (6, 8, "b"),
    ]}
]

for sample in samples:
    text = sample["text"]
    anns = sample["label"]

    # Mapper @ Chars <-> SpaCy Tokens
    mapper = SpacySequenceMapper(text, nlp="de")

    # (trivial) Mapper @ Chars <-> Chars
    charmapper = CharSequenceMapper(text=text)

    # Load annotation data,
    # assuming the format: [(start_idx, stop_idx, label_class), ...],
    # and bind it to a given (char <-> discrete sequence) mapper
    annotationset = LabelLoader.from_text_spans(anns, mapper)

    # Determine the number of overlaps
    print("Overlaps:", annotationset.countOverlaps())

    # Apply the following transformations to the annotation data:
    # - Transform char-based labels onto discretized sequence items (e.g. tokens)
    #   -> Expand if a label's char bounds are not exactly at token bounds
    # - Remove shorter spans in case of overlapping spans
    #   -> Note: New overlaps could also be introduced by span expansion!
    filtered_spans = (
        annotationset
        .toDSeqSpans(strategy=["expand"])
        .withoutOverlaps(strategy="prefer_longest", merge_same_classes=True)
    )

    # Check overlaps again (no overlap should exist anymore!)
    print("Overlaps:", filtered_spans.countOverlaps())

    # Transform annotation data into an IOB2-formatted sequence.
    print("Sequence:")
    print(filtered_spans.toFormattedSequence(schema="IOB2"))

    # Try to generate an IOB2 sequence with overlaps. (It should fail!)
    print("Previous sequence (should fail):")
    try:
        # Should raise an error...
        print(annotationset.toFormattedSequence(schema="IOB2"))
    except ValueError as e:
        print("Error raised: " + repr(e))
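The two ideas the example relies on, dropping shorter spans when spans overlap and encoding the remainder as IOB2, can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: spans are half-open token-index triples, the functions `without_overlaps` and `to_iob2` are hypothetical names, and the merging of same-class spans (`merge_same_classes`) is left out.

```python
def without_overlaps(spans):
    """Greedy 'prefer longest' filter over (start, stop, label) token spans.

    Longer spans are considered first; a span is kept only if it does not
    overlap any span that has already been kept.
    """
    kept = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, stop, _ = span
        if all(stop <= k_start or k_stop <= start for k_start, k_stop, _ in kept):
            kept.append(span)
    return sorted(kept)


def to_iob2(spans, n_tokens):
    """Encode non-overlapping token spans as a list of IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, stop, label in spans:
        if any(tag != "O" for tag in tags[start:stop]):
            # IOB2 assigns exactly one tag per token, so overlaps are not representable.
            raise ValueError("overlapping spans cannot be encoded as IOB2")
        tags[start] = "B-" + label
        for i in range(start + 1, stop):
            tags[i] = "I-" + label
    return tags


# Token-level spans over three tokens; all three spans overlap with each other
# or with the longest one.
token_spans = [(0, 1, "a"), (0, 2, "b"), (1, 2, "b")]

filtered = without_overlaps(token_spans)
print(filtered)               # [(0, 2, 'b')]
print(to_iob2(filtered, 3))   # ['B-b', 'I-b', 'O']

try:
    to_iob2(token_spans, 3)
except ValueError as err:
    print("Error raised:", err)
```

The greedy filter is one simple way to realize a "prefer longest" policy; it does not guarantee maximal total coverage, which would need an interval-scheduling style search.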
Download files
Source Distribution
dseqmap4nlp-0.0.4.tar.gz (10.4 kB)
Built Distribution
dseqmap4nlp-0.0.4-py3-none-any.whl (11.6 kB)
File details
Details for the file dseqmap4nlp-0.0.4.tar.gz.
File metadata
- Download URL: dseqmap4nlp-0.0.4.tar.gz
- Upload date:
- Size: 10.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 042bc8c306995784339e0fb28be85570f4151b9664df88dcaaf615c603f9bb65
MD5 | b92f34cb5b7e94d8f7d34faca4e10d43
BLAKE2b-256 | 6f57b79cc8551c0fecd598512c18e9b4dd0499ec96b8816434ad2e8b23d4c287
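A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the helper `sha256_of_file` is a hypothetical name, and the local path is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest as listed for dseqmap4nlp-0.0.4.tar.gz
EXPECTED_SHA256 = "042bc8c306995784339e0fb28be85570f4151b9664df88dcaaf615c603f9bb65"


def sha256_of_file(path, chunk_size=65536):
    """Compute the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Assumed download location; adjust to wherever the archive was saved.
archive = Path("dseqmap4nlp-0.0.4.tar.gz")
if archive.exists():
    actual = sha256_of_file(archive)
    print("OK" if actual == EXPECTED_SHA256 else "MISMATCH: " + actual)
```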
File details
Details for the file dseqmap4nlp-0.0.4-py3-none-any.whl.
File metadata
- Download URL: dseqmap4nlp-0.0.4-py3-none-any.whl
- Upload date:
- Size: 11.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | d0db508533ff4de018d48bb5bd39e0eac807bd22fc3d80c1e6ba88b8bf690069
MD5 | c8ecdb0f9d9e662dcceb502ff87c689b
BLAKE2b-256 | 5359ae3ba3043be76736b0458da8f349be7ad0bfff5170ca238ca12b18dc9b3f