A small tool to parse and process annotated text corpora

DSeqMap for NLP

An NLP utility for converting named entity recognition (NER) annotation data between character-wise and discretized (e.g. token-based) representations.
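To make the two representations concrete, here is a minimal sketch in plain Python that does not use dseqmap4nlp. It maps character spans onto token indices and expands a span to every token it touches. The regex tokenizer is a stand-in for a real tokenizer such as spaCy, the helper names `tokenize_with_offsets` and `expand_to_tokens` are hypothetical, and the stop index is assumed to be exclusive.

```python
import re


def tokenize_with_offsets(text):
    """Return (start, stop, token) triples; a crude stand-in for a real tokenizer."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def expand_to_tokens(char_span, tokens):
    """Map a (start, stop) char span to a (first, last_exclusive) token span.

    The span is expanded to cover every token it touches.
    Assumption: stop indices are exclusive.
    """
    start, stop = char_span
    hits = [
        i for i, (t_start, t_stop, _) in enumerate(tokens)
        if t_start < stop and start < t_stop
    ]
    if not hits:
        return None  # the span covers no token, e.g. whitespace only
    return hits[0], hits[-1] + 1


text = "Das ist gut."
tokens = tokenize_with_offsets(text)
char_spans = [(0, 3), (2, 7), (6, 8)]
token_spans = [expand_to_tokens(span, tokens) for span in char_spans]
print(token_spans)
```

The span (2, 7) starts inside "Das" and ends at the end of "ist", so expansion makes it cover both tokens. This is also how expansion can create overlaps between spans that only partly overlapped at character level.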

Install

Use pip install dseqmap4nlp to install the package.

Example

The following script maps character-based annotation spans onto spaCy tokens, removes overlapping spans, and prints the result as an IOB2 sequence.

from dseqmap4nlp import SpacySequenceMapper, LabelLoader, CharSequenceMapper

# Sample records, shaped like parsed JSONL lines
samples = [
    {"text": "Das ist gut.", "label": [
        (0,3, "a"),
        (2,7, "b"),
        (6,8, "b")
    ]}
]

for sample in samples:
    text = sample["text"]
    anns = sample["label"]
    # Mapper between characters and spaCy tokens
    mapper = SpacySequenceMapper(text, nlp="de")
    # (Trivial) mapper between characters and characters
    charmapper = CharSequenceMapper(text=text)

    # Load annotation data,
    # assuming the format: [(start_idx, stop_idx, label_class), ...],
    # and bind it to a specific (char <-> discrete sequence) mapper
    annotationset = LabelLoader.from_text_spans(anns, mapper)

    # Determine the number of overlaps
    print("Overlaps:", annotationset.countOverlaps())

    # Apply the following transformations to the annotation data:
    # - transform char-based labels onto discretized sequence items (e.g. tokens)
    #   -> Expand if a label's char bounds are not exactly at token bounds
    # - Remove shorter spans in case of overlapping spans
    #   -> Note: New overlaps could also be introduced by span expansion!
    filtered_spans = annotationset\
        .toDSeqSpans(strategy=["expand"])\
        .withoutOverlaps(strategy="prefer_longest", merge_same_classes=True)

    # Check overlaps again (No overlap should exist anymore!)
    print("Overlaps:", filtered_spans.countOverlaps())

    # Transform annotation data into IOB2-formatted sequence.
    print("Sequence:")
    print(filtered_spans.toFormattedSequence(schema="IOB2"))

    # Try to generate an IOB2 sequence with overlaps. (It should fail!)
    print("Previous sequence (should fail):")
    try:
        # Should raise an error...
        print(annotationset.toFormattedSequence(schema="IOB2"))
    except ValueError as e:
        print("Error raised: " + repr(e))
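The overlap filtering and IOB2 steps above can be sketched in plain Python. This is an illustration of the general technique under stated assumptions, not the library's implementation: `without_overlaps` and `to_iob2` are hypothetical helpers, token spans use an exclusive stop index, "prefer longest" is modelled as a simple greedy pass, and the `merge_same_classes` option is not modelled at all.

```python
def without_overlaps(spans):
    """Greedy 'prefer longest': keep longer spans first, drop anything overlapping them."""
    kept = []
    # Longest spans first; ties are broken by start position.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, stop, _ = span
        if all(stop <= k_start or k_stop <= start for k_start, k_stop, _ in kept):
            kept.append(span)
    return sorted(kept)


def to_iob2(spans, n_tokens):
    """Encode non-overlapping token spans as an IOB2 tag sequence."""
    tags = ["O"] * n_tokens
    for start, stop, label in spans:
        if any(tag != "O" for tag in tags[start:stop]):
            raise ValueError("overlapping spans cannot be encoded as IOB2")
        tags[start] = "B-" + label
        for i in range(start + 1, stop):
            tags[i] = "I-" + label
    return tags


# Token-level spans for "Das ist gut." after expanding the example's char spans
# (assumed tokens: "Das", "ist", "gut", ".").
token_spans = [(0, 1, "a"), (0, 2, "b"), (1, 2, "b")]
n_tokens = 4

filtered = without_overlaps(token_spans)
tags = to_iob2(filtered, n_tokens)
print(filtered)
print(tags)
```

IOB2 assigns exactly one tag per token, which is why the unfiltered spans cannot be encoded: calling `to_iob2(token_spans, n_tokens)` raises a `ValueError` in this sketch, mirroring the failure the script expects.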
