Rule-based Text Labeling Framework Aiming at Flexibility

seqlabel: Flexible Rule-based Text Labeling

seqlabel is a rule-based text labeling framework aiming at flexibility.

Installation

To install seqlabel:

pip install seqlabel

Requirements

Python 3.8+

Usage

For a normal text

First, import the classes you need.

from seqlabel import Text
from seqlabel.matchers import DictionaryMatcher
from seqlabel.entity_filters import LongestMatchFilter, MaximizedMatchFilter
from seqlabel.serializers import IOB2Serializer

Initialize Text with the text you want to label.

text = Text("Tokyo is the capital of Japan.")

Prepare a matcher that matches the patterns you supply. Patterns are given as a dictionary mapping each string to its label. You can define your own matcher by inheriting from seqlabel.matchers.Matcher.

Then apply matcher.match to the text.

# Preparing Matcher
matcher = DictionaryMatcher()
# Adding patterns
matcher.add({"Tokyo": "LOC", "Japan": "LOC"})
# Matching
entities = matcher.match(text)
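
To make the idea of dictionary matching concrete, here is a minimal standalone sketch. It does not use seqlabel and is not its implementation: the function name dictionary_match and the (start, end, label) tuple layout are assumptions made for illustration only.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical span layout for this sketch: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def dictionary_match(text: str, patterns: Dict[str, str]) -> List[Span]:
    """Return every occurrence of every pattern as a (start, end, label) span."""
    spans: List[Span] = []
    for surface, label in patterns.items():
        if not surface:
            continue  # an empty pattern would match at every position
        # re.escape makes the pattern a literal string rather than a regex.
        for m in re.finditer(re.escape(surface), text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)


spans = dictionary_match(
    "Tokyo is the capital of Japan.",
    {"Tokyo": "LOC", "Japan": "LOC"},
)
print(spans)  # [(0, 5, 'LOC'), (24, 29, 'LOC')]
```

Patterns that overlap each other (for example "New York" and "New York City") all come back as separate spans in this sketch, which is why a filtering step is useful afterwards.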

Filter out unwanted entities. LongestMatchFilter resolves overlaps by keeping the longer entity. MaximizedMatchFilter resolves overlaps by keeping as many entities as possible. You can define your own filter by inheriting from seqlabel.entity_filters.EntityFilter.

filter_a = LongestMatchFilter()
filtered_entities_a = filter_a(entities)

filter_b = MaximizedMatchFilter()
filtered_entities_b = filter_b(entities)
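
The difference between the two strategies is easiest to see on overlapping spans. The following is a standalone sketch, not seqlabel's code: the function names, the (start, end, label) tuples, and the tie-breaking rules are all assumptions. It uses a greedy longest-first pass for the first strategy and earliest-end interval scheduling for the second.

```python
from typing import List, Tuple

# Hypothetical span layout for this sketch: (start, end, label), end exclusive.
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def longest_match(spans: List[Span]) -> List[Span]:
    """Greedy: visit spans longest first, keep a span if it overlaps nothing kept."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if not any(overlaps(span, k) for k in kept):
            kept.append(span)
    return sorted(kept)


def maximized_match(spans: List[Span]) -> List[Span]:
    """Interval scheduling: taking the earliest-ending span first maximizes the count."""
    kept: List[Span] = []
    for span in sorted(spans, key=lambda s: (s[1], s[0])):
        if not kept or span[0] >= kept[-1][1]:
            kept.append(span)
    return kept


# Candidate spans over "New York City Hall".
candidates = [(0, 13, "LOC"), (0, 8, "LOC"), (9, 18, "FAC")]
print(longest_match(candidates))    # [(0, 13, 'LOC')]
print(maximized_match(candidates))  # [(0, 8, 'LOC'), (9, 18, 'FAC')]
```

Here the longest-match strategy keeps the single span "New York City", while the maximizing strategy keeps two shorter spans, "New York" and "City Hall".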

After matching and filtering, convert the entities to IOB2 format. See seqlabel.serializers if you want to use other formats.

serializer = IOB2Serializer()
serializer.save(text, filtered_entities_a)
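
For readers unfamiliar with IOB2: the first token of an entity is tagged B-LABEL, the following tokens of the same entity I-LABEL, and every other token O. The sketch below illustrates that tagging scheme on its own. It is not IOB2Serializer's implementation, and the function name iob2_tags, the precomputed token offsets, and the one-token-per-line output are assumptions.

```python
from typing import List, Tuple

# Hypothetical span layout for this sketch: character offsets, end exclusive.
Span = Tuple[int, int, str]


def iob2_tags(token_offsets: List[Tuple[int, int]], entities: List[Span]) -> List[str]:
    """Assign an IOB2 tag to each token, given token and entity character offsets."""
    tags = ["O"] * len(token_offsets)
    for start, end, label in entities:
        # Indices of the tokens that lie entirely inside the entity span.
        inside = [
            i for i, (s, e) in enumerate(token_offsets) if s >= start and e <= end
        ]
        for n, i in enumerate(inside):
            tags[i] = ("B-" if n == 0 else "I-") + label
    return tags


tokens = ["Tokyo", "is", "the", "capital", "of", "Japan", "."]
offsets = [(0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 29), (29, 30)]
entities = [(0, 5, "LOC"), (24, 29, "LOC")]

for token, tag in zip(tokens, iob2_tags(offsets, entities)):
    print(f"{token}\t{tag}")
```

The sketch assumes the entities no longer overlap, which is what the filtering step is for.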

For a tokenized text

To process text that is already tokenized, use TokenizedText instead of Text. Import it as follows:

from seqlabel import TokenizedText

Initialize TokenizedText with the tokens you want to label and their space_after flags. tokens is a list of strings, and space_after is a list of booleans indicating whether each token is followed by a space.

tokenized_text = TokenizedText(
  ["Tokyo", "is", "the", "capital", "of", "Japan", "."],
  [True, True, True, True, True, False, False]
)
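
The tokens and space_after lists together carry enough information to rebuild the original string and each token's character offsets. The following standalone sketch shows that relationship. The function name detokenize is hypothetical, and this is not necessarily how TokenizedText works internally.

```python
from typing import List, Tuple


def detokenize(
    tokens: List[str], space_after: List[bool]
) -> Tuple[str, List[Tuple[int, int]]]:
    """Rebuild the text and per-token (start, end) character offsets."""
    if len(tokens) != len(space_after):
        raise ValueError("tokens and space_after must have the same length")
    pieces: List[str] = []
    offsets: List[Tuple[int, int]] = []
    pos = 0
    for token, has_space in zip(tokens, space_after):
        offsets.append((pos, pos + len(token)))
        pieces.append(token + (" " if has_space else ""))
        pos += len(token) + (1 if has_space else 0)
    return "".join(pieces), offsets


text, offsets = detokenize(
    ["Tokyo", "is", "the", "capital", "of", "Japan", "."],
    [True, True, True, True, True, False, False],
)
print(text)         # Tokyo is the capital of Japan.
print(offsets[-2])  # (24, 29), the offsets of "Japan"
```

Because "Japan" has space_after set to False, the final period attaches directly to it in the rebuilt text.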

You can use the matcher, filter, and serializer exactly as with a normal text, as shown above.

# Matching
entities = matcher.match(tokenized_text)
# Filtering
filtered_entities = filter_a(entities)
# Serializing
serializer.save(tokenized_text, filtered_entities)

Release history

0.2.4 (this version): Oct 4, 2021
0.2.3: Sep 15, 2021
0.2.2: Sep 15, 2021
0.2.1: Sep 10, 2021
0.2.0: Sep 2, 2021
0.1.0: Sep 2, 2021

Download files

Source Distribution

seqlabel-0.2.4.tar.gz (6.1 kB view details)

Uploaded Oct 4, 2021 Source

Built Distribution

seqlabel-0.2.4-py3-none-any.whl (6.8 kB view details)

Uploaded Oct 4, 2021 Python 3

File details

Details for the file seqlabel-0.2.4.tar.gz.

File metadata

Download URL: seqlabel-0.2.4.tar.gz
Upload date: Oct 4, 2021
Size: 6.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.10 CPython/3.9.7 Linux/5.8.0-1042-azure

File hashes

Hashes for seqlabel-0.2.4.tar.gz
Algorithm	Hash digest
SHA256	`33fd50599a55c4fa3b369f9e6866192f880ee61d14e48bb0a5274911ea8f74d0`
MD5	`c1e3ae224b4e81b3792a7469dca18ac6`
BLAKE2b-256	`36272eba515cebd939849f2c93b75433925c7abb074abe4da0769d80da1bfbbd`

File details

Details for the file seqlabel-0.2.4-py3-none-any.whl.

File metadata

Download URL: seqlabel-0.2.4-py3-none-any.whl
Upload date: Oct 4, 2021
Size: 6.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.1.10 CPython/3.9.7 Linux/5.8.0-1042-azure

File hashes

Hashes for seqlabel-0.2.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f247c54343a22f550e463c176a6861aada737f58f4bc2c7205b0b3a878084c31`
MD5	`70db8a2baba075ee84d420431a6767cf`
BLAKE2b-256	`6038af516f7eae5e6d39df6078f35f01ccbf61c65654f236480982b66e5f26ee`

