konsepy

Framework for building NLP information extraction systems using regular expressions. konsepy then lets you use that NLP system to create a silver standard for fine-tuning a transformer model.

Installation

konsepy is designed to be used with the konsepy_nlp_template.
- See the README there for current installation instructions.
To use konsepy as a standalone package:
- Install with pip:
  - pip install konsepy[all]
  - To sentence-split corpora for fine-tuning a sentence-based transformer, spaCy must also be installed and configured.

Usage

The package provides a centralized CLI tool, konsepy.

Building your NLP Package

To use konsepy, you need to create an NLP package (e.g., my_nlp_package) with the following structure. The best way to get this format is to clone the konsepy_nlp_template:

my_nlp_package/
├── __init__.py
└── concepts/
    ├── __init__.py
    └── my_concept.py

Each concept file (e.g., my_concept.py) must define:

REGEXES: A list of regex-category pairs (and optional context functions).
- See Regex Arguments
RUN_REGEXES_FUNC: A function that executes the regexes and returns categories/matches (see the search functions below).
CategoryEnum: An Enum defining the possible categories for the concept.
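The three required names can be checked before a concept file is run. The following is a minimal sketch under stated assumptions: missing_concept_names and REQUIRED_NAMES are hypothetical helpers written for this illustration and are not part of konsepy, and the concept module is built in memory rather than imported from a real package.

```python
import enum
import re
import types

# Names each concept file is expected to define (taken from the list above).
REQUIRED_NAMES = ('REGEXES', 'RUN_REGEXES_FUNC', 'CategoryEnum')


def missing_concept_names(module):
    """Hypothetical check: list required names the module does not define."""
    return [name for name in REQUIRED_NAMES if not hasattr(module, name)]


class CategoryEnum(enum.Enum):
    MENTION = 1


# Build a stand-in module in memory instead of importing a real package.
concept = types.ModuleType('my_concept')
concept.CategoryEnum = CategoryEnum
concept.REGEXES = [(re.compile(r'my pattern', re.I), CategoryEnum.MENTION)]

# RUN_REGEXES_FUNC has not been assigned yet, so it is reported as missing.
print(missing_concept_names(concept))
```

A check like this could run in a unit test for each concept file, so that a forgotten RUN_REGEXES_FUNC is caught before a full run.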

Regex Arguments

When defining REGEXES, you can supply a variable number of arguments. These can be entirely customized by your own search function, but the standard argument list is:

Position 0: Compiled pattern (e.g., re.compile(r'score: (?P<val>\d+)'))
Position 1: Default value (enum) if the compiled pattern matches (e.g., MyCategory.SCORE)
Position 2: Post-processing function(s) (use a list/tuple if > 1) (e.g., [is_negated])
- This function can accept contextual information provided as:
  - m: regex match object
  - precontext: text from m.start() - window up to m.start() (window defaults to 20 characters)
  - postcontext: text from m.end() up to m.end() + window (window defaults to 20 characters)
  - text: full text
  - window: character window (int)
  - word_window: word window (int)
  - around: text from m.start() - window to m.end() + window
Position 3: Pre-processing function(s) (use a list/tuple if > 1)
- The functions should return start/end indices of the text that should be processed.
- They can return (or yield) None or start_index == end_index if no text should be searched.
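The contextual arguments listed for position 2 can be pictured as plain string slices around the match. This is a minimal sketch, assuming character-based slicing clipped to the text boundaries; build_context is a hypothetical helper written for this illustration, not a konsepy function, and the real package may compute these values differently.

```python
import re


def build_context(m, text, window=20):
    """Hypothetical helper: slice character context around a regex match."""
    start = max(0, m.start() - window)      # clip at the start of the text
    end = min(len(text), m.end() + window)  # clip at the end of the text
    return {
        'm': m,
        'precontext': text[start:m.start()],
        'postcontext': text[m.end():end],
        'around': text[start:end],
        'text': text,
        'window': window,
    }


text = 'Patient reports no history of diabetes at this time.'
m = re.search(r'diabetes', text)
ctx = build_context(m, text, window=10)

print(repr(ctx['precontext']))
print(repr(ctx['postcontext']))
print(repr(ctx['around']))
```

Because the slices are taken from the original text, around is always precontext, the matched text, and postcontext joined together.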

Regex search helpers

rxsearch provides small utilities for classifying or extracting values from text with ordered regex definitions.

The canonical search functions are:

search_all_regex()
search_first_regex()

Regex definition format

Each regex definition may contain up to four positions:

(regex, default_value, postprocessors, preprocessors)

Position 0: regex

A compiled regex pattern.

re.compile(r'score:\s*(?P<target>\d+)')

A None regex acts as a sentinel. If a non-UNKNOWN result has already been found, searching stops at the sentinel.

REGEXES = [
    (KNOWN_REGEX, 'KNOWN'),
    (None, None),
    (UNKNOWN_REGEX, 'UNKNOWN'),
]
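The sentinel rule can be sketched with the standard library alone. This is an illustration of the described behaviour, not konsepy's implementation: search_with_sentinel, the two patterns, and the plain-string 'UNKNOWN' label are assumptions made for the example.

```python
import re

# Hypothetical patterns for the illustration.
KNOWN_REGEX = re.compile(r'\bpositive\b', re.I)
UNKNOWN_REGEX = re.compile(r'\bresult\b', re.I)

REGEXES = [
    (KNOWN_REGEX, 'KNOWN'),
    (None, None),
    (UNKNOWN_REGEX, 'UNKNOWN'),
]


def search_with_sentinel(regexes, text, unknown='UNKNOWN'):
    """Illustrative only: stop at a None sentinel once a real result exists."""
    found = []
    for regex, value, *rest in regexes:
        if regex is None:
            # Sentinel: stop if anything other than the unknown label was found.
            if any(v != unknown for v in found):
                break
            continue
        found.extend(value for _match in regex.finditer(text))
    return found


print(search_with_sentinel(REGEXES, 'Result: positive'))  # ['KNOWN']
print(search_with_sentinel(REGEXES, 'Result: pending'))   # ['UNKNOWN']
```

Placing the fallback pattern after the sentinel means it only contributes when none of the more specific patterns matched.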

Position 1: default value

The value yielded when the regex matches and no postprocessor overrides or skips the result.

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

Position 2: postprocessors

Optional function, list, or tuple of functions.

Postprocessors receive contextual keyword arguments, including:

m: regex match object
precontext: text before the match
postcontext: text after the match
text: full text
window: character context window
word_window: word context window
around: text around the match

A postprocessor may return:

Return value	Meaning
`None`	no override; try the next postprocessor, then fall back to the default value
`SKIP`	skip this match entirely
`value`	yield `value` instead of the default value
`(value, match)`	yield `value` and use `match` for match/index output
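The return-value rules in the table can be sketched as a small resolver. This is a simplified illustration, not konsepy's code: SKIP here is a local stand-in object, resolve is a hypothetical name, the sketch assumes the first non-None result wins, and it treats any 2-tuple as a (value, match) pair.

```python
import re

SKIP = object()  # local stand-in for konsepy.rxsearch.SKIP


def resolve(default, postprocessors, m, **context):
    """Illustrative only: apply the return-value rules to one match."""
    for func in postprocessors:
        result = func(m=m, **context)
        if result is None:
            continue              # no override; try the next postprocessor
        if result is SKIP:
            return SKIP           # skip this match entirely
        if isinstance(result, tuple) and len(result) == 2:
            return result         # (value, match) override
        return result, m          # plain value override
    return default, m             # nothing overrode the default value


def skip_family_history(*, precontext, **_):
    return SKIP if 'mother' in precontext.lower() else None


def mark_negated(*, precontext, **_):
    return 'NEGATED' if 'no ' in precontext.lower() else None


def classify(text):
    m = re.search(r'diabetes', text)
    return resolve('DIABETES', [skip_family_history, mark_negated], m,
                   precontext=text[:m.start()])


print(classify('diabetes')[0])      # DIABETES
print(classify('no diabetes')[0])   # NEGATED
print(classify('mother has diabetes') is SKIP)  # True
```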

Example:

import re

from konsepy.rxsearch import SKIP, search_all_regex


def skip_negated(*, precontext, **_):
    if 'no ' in precontext.lower():
        return SKIP
    return None


REGEXES = [
    (re.compile(r'diabetes'), 'DIABETES', skip_negated),
]

search = search_all_regex(REGEXES)

print(list(search('diabetes')))
print(list(search('no diabetes')))

Output:

['DIABETES']
[]

Position 3: preprocessors

Optional function, list, or tuple of functions.

Preprocessors receive the full text and should return or yield searchable (start, end) regions.

They may return or yield:

None, which is ignored
(start, end), which is searched
(start, start), which is ignored
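One way to picture region handling is with the pos and endpos arguments of a compiled pattern's finditer(), which restrict the search while keeping match indices relative to the full text. This is a sketch under that assumption; normalize_regions and iter_region_matches are hypothetical helpers, and konsepy may implement regions differently.

```python
import re


def normalize_regions(result):
    """Hypothetical helper: turn a preprocessor result into usable regions."""
    if result is None:
        return []
    is_single = (
        isinstance(result, tuple)
        and len(result) == 2
        and all(isinstance(i, int) for i in result)
    )
    regions = [result] if is_single else list(result)
    # Drop None entries and empty (start, start) regions.
    return [r for r in regions if r is not None and r[0] != r[1]]


def iter_region_matches(pattern, text, preprocessor):
    """Illustrative only: search each region without slicing the text."""
    for start, end in normalize_regions(preprocessor(text)):
        yield from pattern.finditer(text, start, end)


def after_first_sentence(text):
    end = text.find('.')
    yield None                      # ignored
    if end != -1:
        yield end + 1, len(text)    # searched


pattern = re.compile(r'score:\s*\d+')
text = 'score: 10. score: 20.'
matches = list(iter_region_matches(pattern, text, after_first_sentence))

print([(m.group(), m.start()) for m in matches])  # [('score: 20', 11)]
```

Searching with pos and endpos rather than slicing keeps m.start() and m.end() valid for the original text, which matters for context windows.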

Example:

import re

from konsepy.rxsearch import search_all_regex


def first_sentence_only(text):
    end = text.find('.')
    if end == -1:
        yield 0, len(text)
    else:
        yield 0, end


REGEXES = [
    (
        re.compile(r'score:\s*\d+'),
        'SCORE',
        None,
        first_sentence_only,
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10. score: 20.')))

Output:

['SCORE']

Basic classification

Use search_all_regex() to yield every matching result.

import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
    (re.compile(r'Kalevala'), 'PLACE'),
]

search = search_all_regex(REGEXES)

print(list(search('Väinämöinen sang in Kalevala.')))

Output:

['HERO', 'PLACE']

First result only

Use search_first_regex() to yield at most one result.

import re

from konsepy.rxsearch import search_first_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
    (re.compile(r'Kalevala'), 'PLACE'),
]

search = search_first_regex(REGEXES)

print(list(search('Väinämöinen sang in Kalevala.')))

Output:

['HERO']

Include match objects

Pass include_match=True to receive (result, match) tuples.

import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

search = search_all_regex(REGEXES)

for value, match in search('old Väinämöinen sang', include_match=True):
    print(value, match.group(), match.start(), match.end())

Output:

HERO Väinämöinen 4 15

Return matched text and indices

Use get_all_regex_by_index() to yield:

(result, match_text, start, end)

Example:

import re

from konsepy.rxsearch import get_all_regex_by_index

REGEXES = [
    (re.compile(r'Väinämöinen'), 'HERO'),
]

get_by_index = get_all_regex_by_index(REGEXES)

print(list(get_by_index('old Väinämöinen sang')))

Output:

[('HERO', 'Väinämöinen', 4, 15)]

Extracting `(?P<target>...)`

Use extract_all_regex_target() or extract_first_regex_target() to return regex group values instead of default classification values.

By default, these helpers extract the named group target.

import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]

extract = extract_all_regex_target(REGEXES)

print(list(extract('score: 10 score: 25')))

Output:

['10', '25']

Extract and transform

Use transform to convert extracted values.

import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]

extract = extract_all_regex_target(REGEXES, transform=int)

print(list(extract('score: 10 score: 25')))

Output:

[10, 25]

Falsey transformed values, such as 0, are preserved.

print(list(extract('score: 0')))

Output:

[0]
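Preserving falsey values comes down to testing against None rather than testing truthiness. The sketch below uses only the standard library to show the difference; extract_values and extract_values_truthy are hypothetical names for this illustration and do not reflect how konsepy is written.

```python
import re

pattern = re.compile(r'score:\s*(?P<target>\d+)')


def extract_values(text, transform=int):
    """Illustrative only: keep every transformed value that is not None."""
    for m in pattern.finditer(text):
        value = transform(m.group('target'))
        if value is not None:   # 0 passes this check
            yield value


def extract_values_truthy(text, transform=int):
    """Counter-example: a truthiness check silently drops 0."""
    for m in pattern.finditer(text):
        value = transform(m.group('target'))
        if value:               # 0 fails this check
            yield value


print(list(extract_values('score: 0 score: 7')))         # [0, 7]
print(list(extract_values_truthy('score: 0 score: 7')))  # [7]
```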

Extract a different group

Use target to extract a different group name or group index.

import re

from konsepy.rxsearch import extract_all_regex_target

REGEXES = [
    (re.compile(r'hero:\s*(?P<name>\w+)'), 'HERO'),
]

extract = extract_all_regex_target(REGEXES, target='name')

print(list(extract('hero: Aino')))

Output:

['Aino']

Configure extraction fallback

Extraction skips matches by default when the group is missing or unmatched.

from konsepy.rxsearch import SKIP

extract = extract_all_regex_target(
    REGEXES,
    missing=SKIP,
    unmatched=SKIP,
)

To fall back to the regex default value, use None.

extract = extract_all_regex_target(
    REGEXES,
    missing=None,
    unmatched=None,
)
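The difference between a missing group and an unmatched group can be seen with the standard re module. This sketch assumes those two terms map onto the behaviour below (a group name absent from the pattern versus a group that did not participate in the match); the pattern and group names are made up for the example.

```python
import re

# The target group is optional, so it may not participate in a match.
pattern = re.compile(r'score(?::\s*(?P<target>\d+))?')
m = pattern.search('score')

# Unmatched: the group exists in the pattern but did not participate.
unmatched_value = m.group('target')

# Missing: the pattern has no group with that name at all.
try:
    m.group('value')
    is_missing = False
except IndexError:
    is_missing = True

print(unmatched_value)  # None
print(is_missing)       # True
```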

If extraction returns None, later postprocessors may still run. If no postprocessor returns a value, the default value is yielded.

Extraction is handled before postprocessors.

When using extract_all_regex_target() or extract_first_regex_target(), the extracted value is passed to postprocessors as:

extracted
extracted_value

If a postprocessor returns None, the extracted value is returned.

If a postprocessor returns SKIP, the match is skipped.

If a postprocessor returns any other value, that value replaces the extracted value.

Use extraction as a postprocessor

Use extract_group() directly in position 2 when you want extraction behavior inside regular search_all_regex() or search_first_regex() calls.

import re

from konsepy.rxsearch import extract_group, search_all_regex

REGEXES = [
    (
        re.compile(r'score:\s*(?P<target>\d+)'),
        'SCORE',
        extract_group(),
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10')))

Output:

['10']

Use extract_group_as() to transform the group.

import re

from konsepy.rxsearch import extract_group_as, search_all_regex

REGEXES = [
    (
        re.compile(r'score:\s*(?P<target>\d+)'),
        'SCORE',
        extract_group_as(transform=int),
    ),
]

search = search_all_regex(REGEXES)

print(list(search('score: 10')))

Output:

[10]

Labeled extraction results

Extraction concepts can return an enum label plus an extracted value. This lets classification and extraction concepts appear similarly in category count files, while still preserving extracted values in separate extraction files.

import enum
import re

from konsepy.results import ExtractionResult
from konsepy.rxsearch import extract_all_regex_target


class ScoreCategory(enum.Enum):
    SCORE = 1
    UNKNOWN = -1


def label_score(*, extracted, **_):
    return ExtractionResult(
        label=ScoreCategory.SCORE,
        value=extracted,
    )


REGEXES = [
    (
        re.compile(r'\bscore\s*:\s*(?P<target>\d+)\b', re.I),
        None,
        label_score,
    ),
]


RUN_REGEXES_FUNC = extract_all_regex_target(REGEXES, transform=int)

The standard category output counts ScoreCategory.SCORE. Extraction-specific outputs store the numeric value.

Prevent overlapping duplicate matches

Pass suppress_overlaps=True to let earlier matches claim spans of text. Later matches that overlap already-claimed spans are skipped.

This is useful when a specific pattern should override a more general one.

import re

from konsepy.rxsearch import search_all_regex

REGEXES = [
    (re.compile(r'not\s+x'), 'NEGATED_X'),
    (re.compile(r'x'), 'X'),
]

search = search_all_regex(REGEXES)

print(list(search('not x')))
print(list(search('not x', suppress_overlaps=True)))

Output:

['NEGATED_X', 'X']
['NEGATED_X']

The original text is not modified, so match indices and context windows remain stable.

Non-overlapping later matches are still returned.

print(list(search('not x and x', suppress_overlaps=True)))

Output:

['NEGATED_X', 'X']
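The span-claiming idea can be sketched in a few lines of standard-library Python. This is an illustration of the described behaviour only: search_without_overlaps is a hypothetical function, and konsepy's own bookkeeping may differ.

```python
import re

REGEXES = [
    (re.compile(r'not\s+x'), 'NEGATED_X'),
    (re.compile(r'x'), 'X'),
]


def search_without_overlaps(regexes, text):
    """Illustrative only: earlier matches claim spans; overlaps are skipped."""
    claimed = []  # (start, end) spans already taken by earlier matches
    for regex, value in regexes:
        for m in regex.finditer(text):
            overlaps = any(
                m.start() < end and start < m.end()
                for start, end in claimed
            )
            if overlaps:
                continue
            claimed.append(m.span())
            yield value


print(list(search_without_overlaps(REGEXES, 'not x')))        # ['NEGATED_X']
print(list(search_without_overlaps(REGEXES, 'not x and x')))  # ['NEGATED_X', 'X']
```

Tracking claimed spans instead of deleting matched text is what keeps indices and context windows unchanged.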

Ignore preprocessing regions

Pass ignore_indices=True to search the whole text even when preprocessors are defined. This is mainly useful in tests.

import re

from konsepy.rxsearch import search_all_regex


def no_regions(text):
    return None


REGEXES = [
    (
        re.compile(r'Väinämöinen'),
        'HERO',
        None,
        no_regions,
    ),
]

search = search_all_regex(REGEXES)

print(list(search('Väinämöinen')))
print(list(search('Väinämöinen', ignore_indices=True)))

Output:

[]
['HERO']

Deprecated compatibility names

Use these names for new code:

search_all_regex()
search_first_regex()

These older names remain available for compatibility, but emit DeprecationWarning:

search_all_regex_func()
search_first_regex_func()
search_all_regex_match_func()
search_and_replace_regex_func()

search_and_replace_regex_func() now delegates to overlap-suppressed search instead of modifying the searched text. Prefer:

search = search_all_regex(REGEXES)

results = list(search(text, suppress_overlaps=True))

Regex Utilities

konsepy includes KonsepyRegex in konsepy.rxutils to allow for duplicate named groups in alternation branches:

import re
from konsepy.rxutils import KonsepyRegex

pattern = KonsepyRegex(
    r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))',
    flags=re.I,
    allow_dupe_names=True,
)
# m.group("val") will return whichever branch matched

You can also use the shorthand helper rx_compile:

from konsepy.rxutils import rx_compile

pattern = rx_compile(r'(?:this: (?P<val>\d+)|results: (?P<val>\d+))')
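For comparison, the standard re module has historically rejected a group name that is defined twice, which is the gap these helpers address. The sketch below shows that behaviour and a common standard-library workaround with uniquely suffixed names; group_any and the val_1/val_2 naming are assumptions made for this illustration and say nothing about how KonsepyRegex works internally.

```python
import re

# Try the duplicate-name pattern with the standard library.
try:
    re.compile(r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))')
    stdlib_accepts_duplicates = True
except re.error:
    stdlib_accepts_duplicates = False

# Workaround: give each branch a unique name that shares a prefix.
pattern = re.compile(
    r'(?:score: (?P<val_1>\d+)|results: (?P<val_2>\d+))',
    re.I,
)


def group_any(m, prefix):
    """Hypothetical helper: return the first matched group with this prefix."""
    for name, value in m.groupdict().items():
        if name.startswith(prefix) and value is not None:
            return value
    return None


print(stdlib_accepts_duplicates)
print(group_any(pattern.search('Results: 42'), 'val_'))  # 42
print(group_any(pattern.search('score: 7'), 'val_'))     # 7
```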

Example of my_concept.py:

import re
from enum import Enum
from konsepy.rxsearch import search_all_regex_func
from konsepy.context.negation import check_if_negated
from konsepy.context.other_subject import check_if_other_subject


class CategoryEnum(Enum):
    MENTION = 1
    NO = 0
    OTHER = 3


REGEXES = [
    (re.compile(r'my pattern', re.I),
     CategoryEnum.MENTION,
     [
         lambda **kwargs: check_if_negated(neg_concept=CategoryEnum.NO, **kwargs),
         lambda **kwargs: check_if_other_subject(other_concept=CategoryEnum.OTHER, **kwargs),
     ]
     ),
]

# word_window sets the number of words (rather than characters) retrieved for context functions:
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, word_window=5)
# Alternatively, to change the character-based window (use one of these assignments, not both):
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, window=50)  # defaults to 30

Custom Search Functions

You can create your own search function by defining a function that returns a generator:

def my_custom_search(regexes):
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            for m in regex.finditer(text):
                yield (category, m) if include_match else category

    return _search
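A short usage sketch for a custom search function follows. It repeats the factory above with one addition, a guard that skips None sentinel entries, which is an assumption about how a custom function might want to treat them; the patterns and labels are made up for the example.

```python
import re


def my_custom_search(regexes):
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            if regex is None:
                continue  # assumption: ignore sentinel entries
            for m in regex.finditer(text):
                yield (category, m) if include_match else category

    return _search


REGEXES = [
    (re.compile(r'fever', re.I), 'FEVER'),
    (None, None),
    (re.compile(r'cough', re.I), 'COUGH'),
]

RUN_REGEXES_FUNC = my_custom_search(REGEXES)

text = 'Fever and cough, then fever again.'
print(list(RUN_REGEXES_FUNC(text)))  # ['FEVER', 'FEVER', 'COUGH']

for category, m in RUN_REGEXES_FUNC(text, include_match=True):
    print(category, m.start(), m.end())
```

Results are ordered by regex definition first and by position in the text second, because the outer loop walks the definitions.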

Running konsepy

# Run all concepts in a package against input files
konsepy run-all --package-name my_nlp_package --input-files data.csv --outdir output/

# Run and output individual matches as JSONL (useful for match-level analysis)
konsepy run-all-matches --package-name my_nlp_package --input-files data.csv --outdir output/

# Extract snippets for manual review
konsepy run4snippets --package-name my_nlp_package --input-files data.csv --outdir snippets/

# Generate BIO tagged data for model training
konsepy bio-tag --package-name my_nlp_package --input-files data.csv --outdir bio_data/

For more detailed documentation and a template, see konsepy_nlp_template.

Testing

# end-to-end BIO train/predict test (requires a local model path)
python -m pytest test/test_train_predict_e2e.py -k test_train_predict_e2e --bio-model-path /my/huggingface/models/roberta-base

Note: By default, prediction output merges adjacent subword spans that share the same entity label into a single result to produce word-level captures. To preserve raw token-level spans for debugging, pass --no-merge-subwords to the prediction CLI.
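The merging step in the note can be sketched as follows. This is a minimal illustration, assuming spans are (start, end, label) tuples and that "adjacent" means one span ends exactly where the next begins; merge_adjacent_spans and the span format are hypothetical and are not taken from konsepy's prediction code.

```python
def merge_adjacent_spans(spans):
    """Illustrative only: join touching spans that share the same label."""
    merged = []
    for start, end, label in sorted(spans):
        previous = merged[-1] if merged else None
        if previous and previous[2] == label and previous[1] == start:
            # Extend the previous span instead of starting a new one.
            merged[-1] = (previous[0], end, label)
        else:
            merged.append((start, end, label))
    return merged


# Three subword pieces of one word, then a separate entity.
token_spans = [
    (0, 3, 'HERO'),
    (3, 7, 'HERO'),
    (7, 11, 'HERO'),
    (20, 28, 'PLACE'),
]

print(merge_adjacent_spans(token_spans))
# [(0, 11, 'HERO'), (20, 28, 'PLACE')]
```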

Roadmap

Change labels to some metadata object to allow more diverse input sources and run info
