Framework for building NLP information extraction systems using regular expressions.

Project description

konsepy

Framework for building NLP information extraction systems using regular expressions. konsepy then lets you use that NLP system to create a silver standard for fine-tuning a transformer model.

Installation

  • konsepy is designed to be used with the konsepy_nlp_template
    • See the README there for current installation instructions.
  • To use konsepy as a standalone entity:
    • Install with pip:
      • pip install konsepy[all]
      • To sentence-split corpora for fine-tuning a sentence-based transformer, spacy will also need to be installed and configured.

Usage

The package provides a centralized CLI tool, konsepy.

Building your NLP Package

To use konsepy, you need to create an NLP package (e.g., my_nlp_package) with the following structure. The best way to get this format is to clone the konsepy_nlp_template:

my_nlp_package/
├── __init__.py
└── concepts/
    ├── __init__.py
    └── my_concept.py

Each concept file (e.g., my_concept.py) must define:

  • REGEXES: A list of regex-category pairs (and optional context functions).
  • RUN_REGEXES_FUNC: A function that executes the regexes and returns categories/matches (see the search functions below).
  • CategoryEnum: An Enum defining the possible categories for the concept.

Regex Arguments

When defining REGEXES, you can supply a variable number of arguments. These can be entirely customized by your own search function, but the standard argument list is:

  • Position 0: Compiled pattern (e.g., re.compile(r'score: (?P<val>\d+)'))
  • Position 1: Default value (enum) if the compile pattern matches (e.g., MyCategory.SCORE)
  • Position 2: Post-processing function(s) (use a list/tuple if > 1) (e.g., [is_negated])
    • This function can accept contextual information provided as:
      • m: regex match object
      • precontext: text from m.start() - window up to m.start() (window defaults to 20 characters)
      • postcontext: text from m.end() up to m.end() + window (window defaults to 20 characters)
      • text: full text
      • window: character window (int)
      • word_window: word window (int)
      • around: text in m.start() - window to m.end() + window
  • Position 3: Pre-processing function(s) (use a list/tuple if > 1)
    • The functions should return start/end indices of the text that should be processed.
    • They can return (or yield) None or start_index == end_index if no text should be searched.
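
The context keywords listed for position 2 can be pictured with plain re slicing. This is a minimal sketch, not konsepy's implementation: build_context and preceded_by_no are hypothetical names, and the sketch assumes the window is counted in characters and clipped at the text boundaries.

```python
import re


def build_context(m, text, window=20):
    """Hypothetical helper: slice the context strings described above."""
    start = max(m.start() - window, 0)
    end = min(m.end() + window, len(text))
    return {
        'm': m,
        'precontext': text[start:m.start()],
        'postcontext': text[m.end():end],
        'around': text[start:end],
        'text': text,
        'window': window,
    }


def preceded_by_no(*, precontext, **_):
    """Example check that only looks at the text before the match."""
    return re.search(r'\bno\b', precontext, re.I) is not None


text = 'Patient reports no history of diabetes at this time.'
m = re.search(r'diabetes', text)
context = build_context(m, text, window=20)

print(repr(context['precontext']))
print(preceded_by_no(**context))
```

Accepting **_ lets a function name only the keywords it needs and ignore the rest.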

Regex search helpers

rxsearch provides small utilities for classifying or extracting values from text with ordered regex definitions.

The canonical search functions are:

search_all_regex()
search_first_regex()

Regex definition format

Each regex definition may contain up to four positions:

(regex, default_value, postprocessors, preprocessors)

Position 0: regex

A compiled regex pattern.

re.compile(r'score:\s*(?P<target>\d+)')

A None regex acts as a sentinel. If a non-UNKNOWN result has already been found, searching stops at the sentinel.

REGEXES = [
    (KNOWN_REGEX, 'KNOWN'),
    (None, None),
    (UNKNOWN_REGEX, 'UNKNOWN'),
]
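
One way to read the sentinel rule is sketched here in plain Python. It is an illustration only, not konsepy's code: search_with_sentinel is a hypothetical name, the two patterns are made up, and the sketch assumes that any result yielded before the sentinel counts as "already found".

```python
import re

KNOWN_REGEX = re.compile(r'\bdiabetes\b', re.I)   # made-up pattern
UNKNOWN_REGEX = re.compile(r'\bsugar\b', re.I)    # made-up fallback pattern

REGEXES = [
    (KNOWN_REGEX, 'KNOWN'),
    (None, None),  # sentinel row
    (UNKNOWN_REGEX, 'UNKNOWN'),
]


def search_with_sentinel(regexes, text):
    """Hypothetical stand-in for the sentinel rule described above."""
    found = False
    for regex, value, *_ in regexes:
        if regex is None:
            if found:
                return  # an earlier definition matched: skip the fallbacks
            continue
        for _match in regex.finditer(text):
            found = True
            yield value


print(list(search_with_sentinel(REGEXES, 'diabetes and high sugar')))
print(list(search_with_sentinel(REGEXES, 'high sugar')))
```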

Position 1: default value

The value yielded when the regex matches and no postprocessor overrides or skips the result.

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

Position 2: postprocessors

Optional function, list, or tuple of functions.

Postprocessors receive contextual keyword arguments, including:

  • m: regex match object
  • precontext: text before the match
  • postcontext: text after the match
  • text: full text
  • window: character context window
  • word_window: word context window
  • around: text around the match

A postprocessor may return:

Return value      Meaning
None              no override; try the next postprocessor, then fall back to the default value
SKIP              skip this match entirely
value             yield value instead of the default value
(value, match)    yield value and use match for match/index output
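
The return-value rules can be sketched as a small resolver. This is a hedged illustration, not the library's code: resolve and the three postprocessors are hypothetical, and SKIP here is a local stand-in object rather than konsepy.rxsearch.SKIP.

```python
import re

SKIP = object()  # local stand-in for konsepy.rxsearch.SKIP


def resolve(default, postprocessors, m, **context):
    """Hypothetical resolver that follows the return-value rules above."""
    for func in postprocessors:
        result = func(m=m, **context)
        if result is None:
            continue  # no override: try the next postprocessor
        if result is SKIP:
            return SKIP  # skip this match entirely
        if isinstance(result, tuple) and len(result) == 2:
            return result  # (value, match): use the supplied match
        return result, m  # plain value replaces the default
    return default, m  # nothing overrode the default


def no_opinion(**_):
    return None


def negate(*, precontext, **_):
    return 'NO_DIABETES' if 'no ' in precontext.lower() else None


def skip_family(*, precontext, **_):
    return SKIP if 'mother' in precontext.lower() else None


text = 'no diabetes'
m = re.search(r'diabetes', text)
value, match = resolve('DIABETES', [no_opinion, negate], m,
                       precontext=text[:m.start()])
print(value)
```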

Example:

import re

from konsepy.rxsearch import SKIP, search_all_regex


def skip_negated(*, precontext, **_):
    if 'no ' in precontext.lower():
        return SKIP
    return None


REGEXES = [
    (re.compile(r'diabetes'), 'DIABETES', skip_negated),
]

search = search_all_regex(REGEXES)

print(list(search('diabetes')))
print(list(search('no diabetes')))

Output:

['DIABETES']
[]

Position 3: preprocessors

Optional function, list, or tuple of functions.

Preprocessors receive the full text and should return or yield searchable (start, end) regions.

They may return or yield:

  • None, which is ignored
  • (start, end), which is searched
  • (start, start), which is ignored

Example:

import re

from konsepy.rxsearch import search_all_regex


def first_sentence_only(text):
    end = text.find('.')
    if end == -1:
        yield 0, len(text)
    else:
        yield 0, end


REGEXES = [
    (
        re.compile(r'score:\s*\d+'),
        'SCORE',
        None,
        first_sentence_only,
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10. score: 20.')))

Output:

['SCORE']
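
Region-limited searching can be approximated with the pos and endpos arguments of Pattern.finditer, which keep match indices relative to the full text. The sketch below is an assumption about the general technique, not konsepy's internals: search_regions is a hypothetical helper, and it only handles preprocessors that yield regions or return a bare None.

```python
import re


def first_sentence_only(text):
    end = text.find('.')
    yield (0, len(text)) if end == -1 else (0, end)


def search_regions(pattern, text, preprocessor):
    """Hypothetical helper: search only the regions a preprocessor yields."""
    for region in preprocessor(text) or ():
        if region is None:
            continue  # None regions are ignored
        start, end = region
        if start == end:
            continue  # empty regions are ignored
        # pos/endpos restrict the search without slicing the text,
        # so m.start() and m.end() still index into the original string
        yield from pattern.finditer(text, start, end)


pattern = re.compile(r'score:\s*\d+')
matches = list(search_regions(pattern, 'score: 10. score: 20.', first_sentence_only))
print([(m.group(), m.start(), m.end()) for m in matches])
```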

Basic classification

Use search_all_regex() to yield every matching result.

import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
    (re.compile(r'Kalevala'), 'PLACE'),
]

search = search_all_regex(REGEXES)

print(list(search('Väinämöinen sang in Kalevala.')))

Output:

['HERO', 'PLACE']

First result only

Use search_first_regex() to yield at most one result.

import re

from konsepy.rxsearch import search_first_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
    (re.compile(r'Kalevala'), 'PLACE'),
]

search = search_first_regex(REGEXES)

print(list(search('Väinämöinen sang in Kalevala.')))

Output:

['HERO']

Include match objects

Pass include_match=True to receive (result, match) tuples.

import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

search = search_all_regex(REGEXES)

for value, match in search('old Väinämöinen sang', include_match=True):
    print(value, match.group(), match.start(), match.end())

Output:

HERO Väinämöinen 4 15

Return matched text and indices

Use get_all_regex_by_index() to yield:

(result, match_text, start, end)

Example:

import re

from konsepy.rxsearch import get_all_regex_by_index

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

get_by_index = get_all_regex_by_index(REGEXES)

print(list(get_by_index('old Väinämöinen sang')))

Output:

[('HERO', 'Väinämöinen', 4, 15)]
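
Because start and end index into the original text, the tuples can drive simple annotation. The sketch below does not call konsepy: the hits list is copied from the output above, annotate is a hypothetical helper, and it assumes the spans do not overlap.

```python
text = 'old Väinämöinen sang'
# Tuples in the documented (result, match_text, start, end) shape,
# copied from the output above rather than produced by konsepy here.
hits = [('HERO', 'Väinämöinen', 4, 15)]


def annotate(text, hits):
    """Wrap each matched span as [LABEL: span], working right to left."""
    out = text
    for label, match_text, start, end in sorted(hits, key=lambda h: h[2], reverse=True):
        # the indices refer to the original, unmodified text
        assert text[start:end] == match_text
        out = f'{out[:start]}[{label}: {match_text}]{out[end:]}'
    return out


print(annotate(text, hits))
```

Working from the rightmost span backwards keeps the earlier indices valid while the string grows.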

Extracting (?P<target>...)

Use extract_all_regex_target() or extract_first_regex_target() to return regex group values instead of default classification values.

By default, these helpers extract the named group target.

import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]

extract = extract_all_regex_target(REGEXES)

print(list(extract('score: 10 score: 25')))

Output:

['10', '25']

Extract and transform

Use transform to convert extracted values.

import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]

extract = extract_all_regex_target(REGEXES, transform=int)

print(list(extract('score: 10 score: 25')))

Output:

[10, 25]

Falsey transformed values, such as 0, are preserved.

print(list(extract('score: 0')))

Output:

[0]
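
Why this matters can be seen with the standard library alone: a filter based on truthiness drops a legitimate 0, while an explicit None check keeps it. This is a generic sketch, and the variable names are illustrative rather than part of konsepy.

```python
import re

pattern = re.compile(r'score:\s*(?P<target>\d+)')
text = 'score: 0 score: 7'
values = [int(m.group('target')) for m in pattern.finditer(text)]

kept_by_truthiness = [v for v in values if v]               # loses the 0
kept_by_none_check = [v for v in values if v is not None]   # keeps the 0

print(kept_by_truthiness, kept_by_none_check)
```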

Extract a different group

Use target to extract a different group name or group index.

import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'hero:\s*(?P<name>\w+)'), 'HERO'),
]

extract = extract_all_regex_target(REGEXES, target='name')

print(list(extract('hero: Aino')))

Output:

['Aino']

Configure extraction fallback

Extraction skips matches by default when the group is missing or unmatched.

from konsepy.rxsearch import SKIP

extract = extract_all_regex_target(
    REGEXES,
    missing=SKIP,
    unmatched=SKIP,
)

To fall back to the regex default value, use None.

extract = extract_all_regex_target(
    REGEXES,
    missing=None,
    unmatched=None,
)

If extraction returns None, later postprocessors may still run. If no postprocessor returns a value, the default value is yielded.

Extraction is handled before postprocessors.

When using extract_all_regex_target() or extract_first_regex_target(), the extracted value is passed to postprocessors as:

  • extracted
  • extracted_value

If a postprocessor returns None, the extracted value is returned.

If a postprocessor returns SKIP, the match is skipped.

If a postprocessor returns any other value, that value replaces the extracted value.
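
The difference between a missing and an unmatched group can be shown with plain re. This sketch assumes "missing" means the pattern defines no such group and "unmatched" means the group exists but took no part in the match. extract is a hypothetical function and SKIP is a local stand-in, not konsepy.rxsearch.SKIP.

```python
import re

SKIP = object()  # local stand-in for konsepy.rxsearch.SKIP


def extract(m, target='target', missing=SKIP, unmatched=SKIP):
    """Hypothetical extractor separating 'missing' from 'unmatched' groups."""
    try:
        value = m.group(target)
    except IndexError:
        return missing  # the pattern defines no such group
    if value is None:
        return unmatched  # the group exists but took no part in the match
    return value


pattern = re.compile(r'score:\s*(?:(?P<target>\d+)|n/a)')

print(extract(pattern.search('score: 10')))
print(extract(pattern.search('score: n/a')) is SKIP)
print(extract(pattern.search('score: n/a'), unmatched=None))
```

In this reading, returning None hands the decision on to whatever runs next, while SKIP drops the match.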

Use extraction as a postprocessor

Use extract_group() directly in position 2 when you want extraction behavior inside regular search_all_regex() or search_first_regex() calls.

import re

from konsepy.rxsearch import extract_group, search_all_regex

REGEXES = [
    (
        re.compile(r'score:\s*(?P<target>\d+)'),
        'SCORE',
        extract_group(),
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10')))

Output:

['10']

Use extract_group_as() to transform the group.

import re

from konsepy.rxsearch import extract_group_as, search_all_regex

REGEXES = [
    (
        re.compile(r'score:\s*(?P<target>\d+)'),
        'SCORE',
        extract_group_as(transform=int),
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10')))

Output:

[10]

Prevent overlapping duplicate matches

Pass suppress_overlaps=True to let earlier matches claim spans of text. Later matches that overlap already-claimed spans are skipped.

This is useful when a specific pattern should override a more general one.

import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'not\s+x'), 'NEGATED_X'),
    (re.compile(r'x'), 'X'),
]

search = search_all_regex(REGEXES)

print(list(search('not x')))
print(list(search('not x', suppress_overlaps=True)))

Output:

['NEGATED_X', 'X']
['NEGATED_X']

The original text is not modified, so match indices and context windows remain stable.

Non-overlapping later matches are still returned.

print(list(search('not x and x', suppress_overlaps=True)))

Output:

['NEGATED_X', 'X']
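
The span-claiming idea can be sketched without konsepy. search_suppressing_overlaps is a hypothetical function that assumes two-position definitions and treats two spans as overlapping when each starts before the other ends.

```python
import re

REGEXES = [
    (re.compile(r'not\s+x'), 'NEGATED_X'),
    (re.compile(r'x'), 'X'),
]


def search_suppressing_overlaps(regexes, text):
    """Hypothetical sketch: earlier definitions claim spans, later overlaps are dropped."""
    claimed = []  # (start, end) spans already taken
    for regex, value in regexes:
        for m in regex.finditer(text):
            if any(m.start() < end and start < m.end() for start, end in claimed):
                continue  # overlaps a span claimed earlier
            claimed.append(m.span())
            yield value


print(list(search_suppressing_overlaps(REGEXES, 'not x')))
print(list(search_suppressing_overlaps(REGEXES, 'not x and x')))
```

Tracking spans instead of blanking out matched text is what keeps indices and context windows unchanged.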

Ignore preprocessing regions

Pass ignore_indices=True to search the whole text even when preprocessors are defined. This is mainly useful in tests.

import re

from konsepy.rxsearch import search_all_regex


def no_regions(text):
    return None


REGEXES = [
    (
        re.compile(r'Väinämöinen'),
        'HERO',
        None,
        no_regions,
    ),
]

search = search_all_regex(REGEXES)

print(list(search('Väinämöinen')))
print(list(search('Väinämöinen', ignore_indices=True)))

Output:

[]
['HERO']

Deprecated compatibility names

Use these names for new code:

search_all_regex()
search_first_regex()

These older names remain available for compatibility, but emit DeprecationWarning:

search_all_regex_func()
search_first_regex_func()
search_all_regex_match_func()
search_and_replace_regex_func()

search_and_replace_regex_func() now delegates to overlap-suppressed search instead of modifying the searched text. Prefer:

search = search_all_regex(REGEXES)

results = list(search(text, suppress_overlaps=True))

Regex Utilities

konsepy includes KonsepyRegex in konsepy.rxutils to allow for duplicate named groups in alternation branches:

import re
from konsepy.rxutils import KonsepyRegex

pattern = KonsepyRegex(
    r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))',
    flags=re.I,
    allow_dupe_names=True,
)
# m.group("val") will return whichever branch matched

You can also use the shorthand helper rx_compile:

from konsepy.rxutils import rx_compile

pattern = rx_compile(r'(?:this: (?P<val>\d+)|results: (?P<val>\d+))')
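
For comparison, the standard re module rejects a group name that is defined twice, so without a helper each branch needs its own name. The sketch below uses only the standard library. get_val and the group names val_a and val_b are hypothetical, and it does not show how KonsepyRegex works internally.

```python
import re

# The standard library refuses to compile a pattern that reuses a group name.
try:
    re.compile(r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))')
    stdlib_accepts_duplicates = True
except re.error:
    stdlib_accepts_duplicates = False

# Workaround without konsepy: distinct names, then take whichever branch matched.
pattern = re.compile(r'(?:score: (?P<val_a>\d+)|results: (?P<val_b>\d+))', re.I)


def get_val(m):
    """Hypothetical helper: return the value from whichever branch matched."""
    return next(v for v in m.groupdict().values() if v is not None)


print(stdlib_accepts_duplicates)
print(get_val(pattern.search('Results: 42')))
```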

Example of my_concept.py:

import re
from enum import Enum
from konsepy.rxsearch import search_all_regex_func
from konsepy.context.negation import check_if_negated
from konsepy.context.other_subject import check_if_other_subject


class CategoryEnum(Enum):
    MENTION = 1
    NO = 0
    OTHER = 3


REGEXES = [
    (re.compile(r'my pattern', re.I),
     CategoryEnum.MENTION,
     [
         lambda **kwargs: check_if_negated(neg_concept=CategoryEnum.NO, **kwargs),
         lambda **kwargs: check_if_other_subject(other_concept=CategoryEnum.OTHER, **kwargs),
     ]
     ),
]

# word_window specifies the number of words to retrieve for context functions (instead of characters):
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, word_window=5)
# alternatively, to alter the character-based window:
# RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, window=50)  # defaults to 30

Custom Search Functions

You can create your own search function by defining a factory that takes the regex definitions and returns a generator function:

def my_custom_search(regexes):
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            for m in regex.finditer(text):
                yield (category, m) if include_match else category

    return _search
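
A usage sketch for such a custom search function follows. The patterns and category strings are illustrative only, and this variant adds one assumption of its own: it ignores rows whose regex is None so that sentinel rows do not raise an AttributeError.

```python
import re


def my_custom_search(regexes):
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            if regex is None:
                continue  # this sketch simply ignores sentinel rows
            for m in regex.finditer(text):
                yield (category, m) if include_match else category

    return _search


REGEXES = [
    (re.compile(r'score:\s*\d+', re.I), 'SCORE'),
    (None, None),
    (re.compile(r'grade:\s*[a-f]', re.I), 'GRADE'),
]

RUN_REGEXES_FUNC = my_custom_search(REGEXES)

print(list(RUN_REGEXES_FUNC('Score: 10, grade: B')))
for category, m in RUN_REGEXES_FUNC('Score: 10', include_match=True):
    print(category, m.group(), m.start(), m.end())
```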

Running konsepy

# Run all concepts in a package against input files
konsepy run-all --package-name my_nlp_package --input-files data.csv --outdir output/

# Run and output individual matches as JSONL (useful for match-level analysis)
konsepy run-all-matches --package-name my_nlp_package --input-files data.csv --outdir output/

# Extract snippets for manual review
konsepy run4snippets --package-name my_nlp_package --input-files data.csv --outdir snippets/

# Generate BIO tagged data for model training
konsepy bio-tag --package-name my_nlp_package --input-files data.csv --outdir bio_data/

For more detailed documentation and a template, see konsepy_nlp_template.

Roadmap

  • Change labels to some metadata object to allow more diverse input sources and run info
