Framework for build NLP information extraction systems using regular expressions.
Project description
konsepy
Framework for build NLP information extraction systems using regular expressions. konsepy then enables leveraging the
NLP system to create a silver standard for fine-tuning a transformer model.
Installation
konsepyis designed to be used with theknosepy_nlp_template- See the README there for current installation instructions.
- To use
konsepyas a standalone entity:- Install with
pip:pip install konsepy[all]- For sentence-splitting corpora from fine-tuning a sentence based transformer,
spacywill also need to be installed and configured.
- Install with
Usage
The package provides a centralized CLI tool konsepy.
Building your NLP Package
To use konsepy, you need to create an NLP package (e.g., my_nlp_package) with the following structure. The best way
to get this format is to clone the konsepy_nlp_template:
my_nlp_package/
├── __init__.py
└── concepts/
├── __init__.py
└── my_concept.py
Each concept file (e.g., my_concept.py) must define:
REGEXES: A list of regex-category pairs (and optional context functions).- See Regex Arguments
RUN_REGEXES_FUNC: A function that executes the regexes and returns categories/matches ( see search functions, below)CategoryEnum: AnEnumdefining the possible categories for the concept.
Regex Arguments
When defining REGEXES, you can supply a variable number of arguments. The can be entirely customized by your
own search function, but the standard argument list is:
- Position 0: Compile pattern (e.g.,
re.compile('score: (?P<val>\d+))) - Position 1: Default value (enum) if the compile pattern matches (e.g.,
MyCategory.SCORE) - Position 2: Post-processing function(s) (use a list/tuple if > 1) (e.g.,
[is_negated])- This function can accept contextual information provided as:
- m: regex match object
- precontext: text in
m.start() - window(default to 20 characters) - postcontext: text in
m.end() + window(default to 20 characters) - text: full text
- window: character window (int)
- word_window: word window (int)
- around: text in
m.start() - window to m.end() + window
- This function can accept contextual information provided as:
- Position 3: Pre-processing function(s) (use a list/tuple if > 1)
- The functions should return start/end indices of the text that should be processed.
- They can return (or yield)
Noneorstart_index == end_indexif not text should be searched.
Regex search helpers
rxsearch provides small utilities for classifying or extracting values from text
with ordered regex definitions.
The canonical search functions are:
search_all_regex()
search_first_regex()
Regex definition format
Each regex definition may contain up to four positions:
(regex, default_value, postprocessors, preprocessors)
Position 0: regex
A compiled regex pattern.
re.compile(r'score:\s*(?P<target>\d+)')
A None regex acts as a sentinel. If a non-UNKNOWN result has already been
found, searching stops at the sentinel.
REGEXES = [
(KNOWN_REGEX, 'KNOWN'),
(None, None),
(UNKNOWN_REGEX, 'UNKNOWN'),
]
Position 1: default value
The value yielded when the regex matches and no postprocessor overrides or skips the result.
REGEXES = [
(re.compile(r'Väinämöinen'), 'HERO'),
]
Position 2: postprocessors
Optional function, list, or tuple of functions.
Postprocessors receive contextual keyword arguments, including:
m: regex match objectprecontext: text before the matchpostcontext: text after the matchtext: full textwindow: character context windowword_window: word context windowaround: text around the match
A postprocessor may return:
| Return value | Meaning |
|---|---|
None |
no override; try the next postprocessor, then fall back to the default value |
SKIP |
skip this match entirely |
value |
yield value instead of the default value |
(value, match) |
yield value and use match for match/index output |
Example:
import re
from konsepy.rxsearch import SKIP, search_all_regex
def skip_negated(*, precontext, **_):
if 'no ' in precontext.lower():
return SKIP
return None
REGEXES = [
(re.compile(r'diabetes'), 'DIABETES', skip_negated),
]
search = search_all_regex(REGEXES)
print(list(search('diabetes')))
print(list(search('no diabetes')))
Output:
['DIABETES']
[]
Position 3: preprocessors
Optional function, list, or tuple of functions.
Preprocessors receive the full text and should return or yield searchable
(start, end) regions.
They may return or yield:
None, which is ignored(start, end), which is searched(start, start), which is ignored
Example:
import re
from konsepy.rxsearch import search_all_regex
def first_sentence_only(text):
end = text.find('.')
if end == -1:
yield 0, len(text)
else:
yield 0, end
REGEXES = [
(
re.compile(r'score:\s*\d+'),
'SCORE',
None,
first_sentence_only,
),
]
search = search_all_regex(REGEXES)
print(list(search('score: 10. score: 20.')))
Output:
['SCORE']
Basic classification
Use search_all_regex() to yield every matching result.
import re
from konsepy.rxsearch import search_all_regex
REGEXES = [
(re.compile(r'Väinämöinen'), 'HERO'),
(re.compile(r'Kalevala'), 'PLACE'),
]
search = search_all_regex(REGEXES)
print(list(search('Väinämöinen sang in Kalevala.')))
Output:
['HERO', 'PLACE']
First result only
Use search_first_regex() to yield at most one result.
import re
from konsepy.rxsearch import search_first_regex
REGEXES = [
(re.compile(r'Väinämöinen'), 'HERO'),
(re.compile(r'Kalevala'), 'PLACE'),
]
search = search_first_regex(REGEXES)
print(list(search('Väinämöinen sang in Kalevala.')))
Output:
['HERO']
Include match objects
Pass include_match=True to receive (result, match) tuples.
import re
from konsepy.rxsearch import search_all_regex
REGEXES = [
(re.compile(r'Väinämöinen'), 'HERO'),
]
search = search_all_regex(REGEXES)
for value, match in search('old Väinämöinen sang', include_match=True):
print(value, match.group(), match.start(), match.end())
Output:
HERO
Väinämöinen
4
15
Return matched text and indices
Use get_all_regex_by_index() to yield:
(result, match_text, start, end)
Example:
import re
from konsepy.rxsearch import get_all_regex_by_index
REGEXES = [
(re.compile(r'Väinämöinen'), 'HERO'),
]
get_by_index = get_all_regex_by_index(REGEXES)
print(list(get_by_index('old Väinämöinen sang')))
Output:
[('HERO', 'Väinämöinen', 4, 15)]
Extracting (?P<target>...)
Use extract_all_regex_target() or extract_first_regex_target() to return
regex group values instead of default classification values.
By default, these helpers extract the named group target.
import re
from konsepy.rxsearch import extract_all_regex_target
REGEXES = [
(re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]
extract = extract_all_regex_target(REGEXES)
print(list(extract('score: 10 score: 25')))
Output:
['10', '25']
Extract and transform
Use transform to convert extracted values.
import re
from konsepy.rxsearch import extract_all_regex_target
REGEXES = [
(re.compile(r'score:\s*(?P<target>\d+)'), 'SCORE'),
]
extract = extract_all_regex_target(REGEXES, transform=int)
print(list(extract('score: 10 score: 25')))
Output:
[10, 25]
Falsey transformed values, such as 0, are preserved.
print(list(extract('score: 0')))
Output:
[0]
Extract a different group
Use target to extract a different group name or group index.
import re
from konsepy.rxsearch import extract_all_regex_target
REGEXES = [
(re.compile(r'hero:\s*(?P<name>\w+)'), 'HERO'),
]
extract = extract_all_regex_target(REGEXES, target='name')
print(list(extract('hero: Aino')))
Output:
['Aino']
Configure extraction fallback
Extraction skips matches by default when the group is missing or unmatched.
from konsepy.rxsearch import SKIP
extract = extract_all_regex_target(
REGEXES,
missing=SKIP,
unmatched=SKIP,
)
To fall back to the regex default value, use None.
extract = extract_all_regex_target(
REGEXES,
missing=None,
unmatched=None,
)
If extraction returns None, later postprocessors may still run. If no
postprocessor returns a value, the default value is yielded.
Extraction is handled before postprocessors.
When using extract_all_regex_target() or extract_first_regex_target(), the
extracted value is passed to postprocessors as:
extractedextracted_value
If a postprocessor returns None, the extracted value is returned.
If a postprocessor returns SKIP, the match is skipped.
If a postprocessor returns any other value, that value replaces the extracted value.
Use extraction as a postprocessor
Use extract_group() directly in position 2 when you want extraction behavior
inside regular search_all_regex() or search_first_regex() calls.
import re
from konsepy.rxsearch import extract_group, search_all_regex
REGEXES = [
(
re.compile(r'score:\s*(?P<target>\d+)'),
'SCORE',
extract_group(),
),
]
search = search_all_regex(REGEXES)
print(list(search('score: 10')))
Output:
['10']
Use extract_group_as() to transform the group.
import re
from konsepy.rxsearch import extract_group_as, search_all_regex
REGEXES = [
(
re.compile(r'score:\s*(?P<target>\d+)'),
'SCORE',
extract_group_as(transform=int),
),
]
search = search_all_regex(REGEXES)
print(list(search('score: 10')))
Output:
[10]
Labeled extraction results
Extraction concepts can return an enum label plus an extracted value. This lets classification and extraction concepts appear similarly in category count files, while still preserving extracted values in separate extraction files.
import enum
import re
from konsepy.results import ExtractionResult
from konsepy.rxsearch import extract_all_regex_target
class ScoreCategory(enum.Enum):
SCORE = 1
UNKNOWN = -1
def label_score(*, extracted, **_):
return ExtractionResult(
label=ScoreCategory.SCORE,
value=extracted,
)
REGEXES = [
(
re.compile(r'\bscore\s*:\s*(?P<target>\d+)\b', re.I),
None,
label_score,
),
]
RUN_REGEXES_FUNC = extract_all_regex_target(REGEXES, transform=int)
The standard category output counts ScoreCategory.SCORE. Extraction-specific
outputs store the numeric value.
Prevent overlapping duplicate matches
Pass suppress_overlaps=True to let earlier matches claim spans of text.
Later matches that overlap already-claimed spans are skipped.
This is useful when a specific pattern should override a more general one.
import re
from konsepy.rxsearch import search_all_regex
REGEXES = [
(re.compile(r'not\s+x'), 'NEGATED_X'),
(re.compile(r'x'), 'X'),
]
search = search_all_regex(REGEXES)
print(list(search('not x')))
print(list(search('not x', suppress_overlaps=True)))
Output:
['NEGATED_X', 'X']
['NEGATED_X']
The original text is not modified, so match indices and context windows remain stable.
Non-overlapping later matches are still returned.
print(list(search('not x and x', suppress_overlaps=True)))
Output:
['NEGATED_X', 'X']
Ignore preprocessing regions
Pass ignore_indices=True to search the whole text even when preprocessors are
defined. This is mainly useful in tests.
import re
from konsepy.rxsearch import search_all_regex
def no_regions(text):
return None
REGEXES = [
(
re.compile(r'Väinämöinen'),
'HERO',
None,
no_regions,
),
]
search = search_all_regex(REGEXES)
print(list(search('Väinämöinen')))
print(list(search('Väinämöinen', ignore_indices=True)))
Output:
[]
['HERO']
Deprecated compatibility names
Use these names for new code:
search_all_regex()
search_first_regex()
These older names remain available for compatibility, but emit
DeprecationWarning:
search_all_regex_func()
search_first_regex_func()
search_all_regex_match_func()
search_and_replace_regex_func()
search_and_replace_regex_func() now delegates to overlap-suppressed search
instead of modifying the searched text. Prefer:
search = search_all_regex(REGEXES)
results = list(search(text, suppress_overlaps=True))
Regex Utilities
konsepy includes KonsepyRegex in konsepy.rxutils to allow for duplicate named groups in alternation branches:
import re
from konsepy.rxutils import KonsepyRegex
pattern = KonsepyRegex(
r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))',
flags=re.I,
allow_dupe_names=True,
)
# m.group("val") will return whichever branch matched
You can also use the shorthand helper rx_compile:
from konsepy.rxutils import rx_compile
pattern = rx_compile(r'(?:this: (?P<val>\d+)|results: (?P<val>\d+))')
Example of my_concept.py:
import re
from enum import Enum
from konsepy.rxsearch import search_all_regex_func
from konsepy.context.negation import check_if_negated
from konsepy.context.other_subject import check_if_other_subject
class CategoryEnum(Enum):
MENTION = 1
NO = 0
OTHER = 3
REGEXES = [
(re.compile(r'my pattern', re.I),
CategoryEnum.MENTION,
[
lambda **kwargs: check_if_negated(neg_concept=CategoryEnum.NO, **kwargs),
lambda **kwargs: check_if_other_subject(other_concept=CategoryEnum.OTHER, **kwargs),
]
),
]
# word_window specifies the number of words to retrieve for context functions (instead of character):
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, word_window=5)
# to alter the character-based window:
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, window=50) # defaults to 30
Custom Search Functions
You can create your own search function by defining a function that returns a generator:
def my_custom_search(regexes):
def _search(text, include_match=False):
for regex, category, *other in regexes:
for m in regex.finditer(text):
yield (category, m) if include_match else category
return _search
Running konsepy
# Run all concepts in a package against input files
konsepy run-all --package-name my_nlp_package --input-files data.csv --outdir output/
# Run and output individual matches as JSONL (useful for match-level analysis)
konsepy run-all-matches --package-name my_nlp_package --input-files data.csv --outdir output/
# Extract snippets for manual review
konsepy run4snippets --package-name my_nlp_package --input-files data.csv --outdir snippets/
# Generate BIO tagged data for model training
konsepy bio-tag --package-name my_nlp_package --input-files data.csv --outdir bio_data/
For more detailed documentation and a template, see konsepy_nlp_template.
Testing
# end-to-end BIO train/predict test (requires a local model path)
python -m pytest test/test_train_predict_e2e.py -k test_train_predict_e2e --bio-model-path /my/huggingface/models/roberta-base
Note: By default, prediction output merges adjacent subword spans that share the same
entity label into a single result to produce word-level captures. To preserve
raw token-level spans for debugging, pass --no-merge-subwords to the
prediction CLI.
Roadmap
- Change labels to some metadata object to allow more diverse input sources and run info
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file konsepy-0.5.7.tar.gz.
File metadata
- Download URL: konsepy-0.5.7.tar.gz
- Upload date:
- Size: 54.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
e87ffcfe90d9287bef0e5128208970588a10c6cf035535ca6c2063b082b6aaeb
|
|
| MD5 |
e40a2f3213ca7f85580aca62ae725fba
|
|
| BLAKE2b-256 |
5fc481b5a5826e518d9110755a76d62c5fd7b81d7d5245b918364cc750094f53
|
File details
Details for the file konsepy-0.5.7-py3-none-any.whl.
File metadata
- Download URL: konsepy-0.5.7-py3-none-any.whl
- Upload date:
- Size: 47.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7950acdf13e1adaa4ca837fca177274eba941f36bb50aa2e646286dff106ea19
|
|
| MD5 |
8793375bee6e68ad0c75f2981c8f3f38
|
|
| BLAKE2b-256 |
90e867231f622c8e133c6dd2be2565c6f3a6f02bb277a5a793df4b783e32a91f
|