seqlabel: Flexible Rule-based Text Labeling

seqlabel is a rule-based text labeling framework aiming at flexibility.

Installation

To install seqlabel:

pip install seqlabel

Requirements

  • Python 3.8+

Usage

For a normal text

First, import the classes you need.

from seqlabel import Text
from seqlabel.matchers import DictionaryMatcher
from seqlabel.entity_filters import LongestMatchFilter, MaximizedMatchFilter
from seqlabel.serializers import IOB2Serializer

Initialize Text with the text you want to label.

text = Text("Tokyo is the capital of Japan.")

Prepare a matcher for the patterns you supply. Patterns are given as a dictionary that maps each string to its label. You can define your own matcher by subclassing seqlabel.matchers.Matcher.

Then, call matcher.match on the text.

# Preparing Matcher
matcher = DictionaryMatcher()
# Adding patterns
matcher.add({"Tokyo": "LOC", "Japan": "LOC"})
# Matching
entities = matcher.match(text)
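
To make the idea of dictionary matching concrete, here is a minimal, library-independent sketch. It is not seqlabel's implementation: the function name find_matches and the (start, end, label) tuple layout are assumptions chosen for illustration, and the library's own entity objects may look different.

```python
from typing import Dict, List, Tuple

# Hypothetical span layout for this sketch: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def find_matches(text: str, patterns: Dict[str, str]) -> List[Span]:
    """Return every occurrence of every pattern as a character span."""
    spans: List[Span] = []
    for pattern, label in patterns.items():
        start = text.find(pattern)
        while start != -1:
            spans.append((start, start + len(pattern), label))
            # Step one character forward so overlapping hits are also found.
            start = text.find(pattern, start + 1)
    return sorted(spans)


spans = find_matches(
    "Tokyo is the capital of Japan.",
    {"Tokyo": "LOC", "Japan": "LOC"},
)
print(spans)  # [(0, 5, 'LOC'), (24, 29, 'LOC')]
```

Because every occurrence is reported, patterns that share characters produce overlapping spans, which is why a filtering step usually follows matching.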

Filter out unwanted entities. LongestMatchFilter resolves overlaps by keeping the longer entity. MaximizedMatchFilter resolves overlaps by keeping as many entities as possible. You can define your own filter by subclassing seqlabel.entity_filters.EntityFilter.

filter_a = LongestMatchFilter()
filtered_entities_a = filter_a(entities)

filter_b = MaximizedMatchFilter()
filtered_entities_b = filter_b(entities)
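
The difference between the two strategies is easiest to see on a small overlapping example. The sketch below is an assumption-laden illustration, not the library's code: keep_longest and keep_most are hypothetical helpers, spans are plain (start, end, label) tuples, and seqlabel's actual tie-breaking rules may differ.

```python
from typing import List, Tuple

# Hypothetical span layout for this sketch: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def keep_longest(spans: List[Span]) -> List[Span]:
    """Greedy: visit spans longest first and keep those that still fit."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)


def keep_most(spans: List[Span]) -> List[Span]:
    """Interval scheduling: earliest end first keeps the most spans."""
    kept: List[Span] = []
    last_end = 0
    for span in sorted(spans, key=lambda s: (s[1], s[0])):
        if span[0] >= last_end:
            kept.append(span)
            last_end = span[1]
    return kept


# Candidates over "New York City": one long span overlapping two short ones.
candidates = [(0, 13, "LOC"), (0, 3, "MISC"), (4, 8, "LOC")]
print(keep_longest(candidates))  # [(0, 13, 'LOC')]
print(keep_most(candidates))     # [(0, 3, 'MISC'), (4, 8, 'LOC')]
```

Preferring the longest span suits cases where a multi-word name should win over its parts, while maximizing the count suits cases where fine-grained labels matter more.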

After matching and filtering, convert the entities to IOB2 format. See seqlabel.serializers if you want to use other formats.

serializer = IOB2Serializer()
serializer.save(text, filtered_entities_a)
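
For readers unfamiliar with IOB2: the first token of an entity is tagged B-LABEL, later tokens of the same entity I-LABEL, and everything else O. The following sketch shows that tagging scheme on its own. It makes several assumptions that are not taken from the library: to_iob2 is a hypothetical helper, tokens are found with a simple regular expression, spans are (start, end, label) tuples aligned to token boundaries, and the tab-separated printing is only one possible layout.

```python
import re
from typing import List, Tuple

# Hypothetical span layout for this sketch: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def to_iob2(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Tag each token B-/I-/O according to the span that contains it."""
    rows: List[Tuple[str, str]] = []
    # Words and single punctuation marks count as tokens in this sketch.
    for m in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if start <= m.start() and m.end() <= end:
                # A token that opens the span gets B-, later ones get I-.
                tag = ("B-" if m.start() == start else "I-") + label
                break
        rows.append((m.group(), tag))
    return rows


rows = to_iob2("Tokyo is the capital of Japan.", [(0, 5, "LOC"), (24, 29, "LOC")])
for token, tag in rows:
    print(f"{token}\t{tag}")
```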

For a tokenized text

To process a tokenized text, use TokenizedText instead of Text. Import it as follows:

from seqlabel import TokenizedText

Initialize TokenizedText with the tokens you want to label and their space_after flags. tokens is a list of strings, and space_after is a list of booleans indicating whether each token is followed by a space.

tokenized_text = TokenizedText(
  ["Tokyo", "is", "the", "capital", "of", "Japan", "."],
  [True, True, True, True, True, False, False]
)
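
The two lists together carry enough information to rebuild the original string and the character offsets of each token. As a rough illustration (detokenize is a hypothetical helper written for this example, not part of seqlabel), the relationship can be sketched as:

```python
from typing import List, Tuple


def detokenize(
    tokens: List[str], space_after: List[bool]
) -> Tuple[str, List[Tuple[int, int]]]:
    """Rebuild the text and return each token's (start, end) offsets."""
    if len(tokens) != len(space_after):
        raise ValueError("tokens and space_after must have the same length")
    pieces: List[str] = []
    offsets: List[Tuple[int, int]] = []
    pos = 0
    for token, space in zip(tokens, space_after):
        offsets.append((pos, pos + len(token)))
        # Append the token plus its trailing space, if it has one.
        pieces.append(token + (" " if space else ""))
        pos += len(pieces[-1])
    return "".join(pieces), offsets


text, offsets = detokenize(
    ["Tokyo", "is", "the", "capital", "of", "Japan", "."],
    [True, True, True, True, True, False, False],
)
print(text)        # Tokyo is the capital of Japan.
print(offsets[5])  # (24, 29)
```

Note that "Japan" has space_after set to False because the period follows it directly.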

You can use the matcher, filter, and serializer exactly as with a normal text, as shown above.

# Matching
entities = matcher.match(tokenized_text)
# Filtering
filtered_entities = filter_a(entities)
# Serializing
serializer.save(tokenized_text, filtered_entities)
