konsepy

Framework for building NLP information extraction systems using regular expressions. Once the regex-based system is built, konsepy can use its output to create a silver standard for fine-tuning a transformer model.

Installation

  • konsepy is designed to be used with the konsepy_nlp_template
    • See the README there for current installation instructions.
  • To use konsepy as a standalone entity:
    • Install with pip:
      • pip install konsepy[all]
      • To split corpora into sentences for fine-tuning a sentence-based transformer, spaCy must also be installed and configured.

Usage

The package provides a single, centralized CLI tool: konsepy.

Building your NLP Package

To use konsepy, you need to create an NLP package (e.g., my_nlp_package) with the following structure. The best way to get this format is to clone the konsepy_nlp_template:

my_nlp_package/
├── __init__.py
└── concepts/
    ├── __init__.py
    └── my_concept.py

Each concept file (e.g., my_concept.py) must define:

  • REGEXES: A list of regex-category pairs (and optional context functions).
  • RUN_REGEXES_FUNC: A function that executes the regexes and returns categories/matches.
  • CategoryEnum: An Enum defining the possible categories for the concept.

Search Functions

konsepy provides several pre-built search functions in konsepy.rxsearch:

Some simple ones:

  • search_all_regex: Finds all occurrences of each regex in the list.
  • search_first_regex: Finds only the first occurrence of each regex.

Probably the most useful:

  • search_and_replace_regex_func: Prevents double-matching by replacing found text with dots before proceeding to the next regex.
  • search_all_regex_func: Supports "sentinel" values (None) to stop processing if a match was found earlier.
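
The double-matching prevention described for search_and_replace_regex_func can be illustrated with the standard library alone. The following is a minimal sketch of the general technique, not konsepy's implementation; the function name search_and_mask, the string categories, and the sample patterns are all hypothetical.

```python
import re


def search_and_mask(regexes, text):
    """Yield (category, matched_text) pairs, masking each match with dots
    so that later regexes cannot match the same span again."""
    for regex, category in regexes:
        for m in list(regex.finditer(text)):
            yield category, m.group()
        # Replace every match with dots of equal length so offsets are preserved.
        text = regex.sub(lambda m: '.' * len(m.group()), text)


# Hypothetical patterns: the broad 'pain' regex would otherwise re-match 'Chest pain'.
regexes = [
    (re.compile(r'chest pain', re.I), 'CHEST_PAIN'),
    (re.compile(r'pain', re.I), 'PAIN'),
]
results = list(search_and_mask(regexes, 'Chest pain today; no back pain.'))
print(results)
```

Ordering the more specific regex first is what makes the masking useful: the broader pattern only sees text that nothing earlier has claimed.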

Regex Utilities

konsepy includes KonsepyRegex in konsepy.rxutils to allow for duplicate named groups in alternation branches:

import re
from konsepy.rxutils import KonsepyRegex

pattern = KonsepyRegex(
  r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))',
  flags=re.I,
  allow_dupe_names=True,
)
# for a match object m produced by this pattern, m.group("val") returns whichever branch matched

You can also use the shorthand helper rx_compile:

from konsepy.rxutils import rx_compile

pattern = rx_compile(r'(?:this: (?P<val>\d+)|results: (?P<val>\d+))')
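
For background, Python's standard re module rejects a pattern that defines the same group name twice, which is the limitation KonsepyRegex is described as working around. The sketch below shows the standard-library error and one possible workaround (renaming each duplicate and returning the first branch that matched). It is an assumption about how such a wrapper could work, not konsepy's actual code; DupeNameRegex and search_group are hypothetical names.

```python
import re

PATTERN = r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))'

# The standard library refuses duplicate group names.
try:
    re.compile(PATTERN)
    stdlib_accepts = True
except re.error:
    stdlib_accepts = False


class DupeNameRegex:
    """Hypothetical wrapper: renames repeated group names to name__1, name__2, ...

    This sketch ignores escaped parentheses and (?P=name) backreferences.
    """

    def __init__(self, pattern, flags=0):
        self.aliases = {}  # original name -> list of renamed groups

        def rename(m):
            name = m.group(1)
            renamed = f'{name}__{len(self.aliases.setdefault(name, [])) + 1}'
            self.aliases[name].append(renamed)
            return f'(?P<{renamed}>'

        self.regex = re.compile(re.sub(r'\(\?P<(\w+)>', rename, pattern), flags)

    def search_group(self, text, name):
        """Return the value of whichever renamed branch matched, or None."""
        m = self.regex.search(text)
        if m is None:
            return None
        return next((m.group(g) for g in self.aliases[name] if m.group(g) is not None), None)


rx = DupeNameRegex(PATTERN, flags=re.I)
print(stdlib_accepts, rx.search_group('Results: 42', 'val'))
```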

Example of my_concept.py:

import re
from enum import Enum
from konsepy.rxsearch import search_all_regex_func
from konsepy.context.negation import check_if_negated
from konsepy.context.other_subject import check_if_other_subject


class CategoryEnum(Enum):
  MENTION = 1
  NO = 0
  OTHER = 3


REGEXES = [
  (re.compile(r'my pattern', re.I),
   CategoryEnum.MENTION,
   [
     lambda **kwargs: check_if_negated(neg_concept=CategoryEnum.NO, **kwargs),
     lambda **kwargs: check_if_other_subject(other_concept=CategoryEnum.OTHER, **kwargs),
   ]
   ),
]

# word_window sets how many words (rather than characters) of context are passed to the context functions:
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, word_window=5)
# alternatively, to change the character-based window (keep only one of these assignments):
# RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, window=50)  # window defaults to 30
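
The difference between the character-based window and word_window can be pictured with plain Python. This is a rough sketch that assumes the window is measured on each side of the match; konsepy's exact behavior may differ, and the helper names char_context and word_context are hypothetical.

```python
import re


def char_context(text, m, window=30):
    """Return up to `window` characters before and after the match."""
    return text[max(0, m.start() - window):m.start()], text[m.end():m.end() + window]


def word_context(text, m, word_window=5):
    """Return up to `word_window` whitespace-separated words before and after the match."""
    before = text[:m.start()].split()[-word_window:]
    after = text[m.end():].split()[:word_window]
    return ' '.join(before), ' '.join(after)


text = 'The patient denies any history of chest pain or shortness of breath.'
m = re.search(r'chest pain', text)

print(char_context(text, m, window=10))
print(word_context(text, m, word_window=2))
```

A word-based window avoids cutting context in the middle of a word, which can matter when a context function looks for whole trigger terms such as "denies".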

Custom Search Functions

You can create your own search function by defining a factory that takes the regex list and returns a generator function:

def my_custom_search(regexes):
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            for m in regex.finditer(text):
                yield (category, m) if include_match else category
    return _search
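
To show how such a function behaves when called, here is a small usage sketch. The factory is repeated so the snippet runs on its own; the string category 'COUGH' and the sample text are hypothetical stand-ins (a real concept file would use its CategoryEnum members).

```python
import re


def my_custom_search(regexes):
    # Same factory as above, repeated so this snippet is self-contained.
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            for m in regex.finditer(text):
                yield (category, m) if include_match else category
    return _search


# Two-element entries work because *other absorbs any optional context functions.
regexes = [(re.compile(r'\bcough\w*', re.I), 'COUGH')]
search = my_custom_search(regexes)

text = 'Coughing at night; cough improved.'
categories = list(search(text))
spans = [(cat, m.group(), m.span()) for cat, m in search(text, include_match=True)]
print(categories)
print(spans)
```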

Running konsepy

# Run all concepts in a package against input files
konsepy run-all --package-name my_nlp_package --input-files data.csv --outdir output/

# Run and output individual matches as JSONL (useful for match-level analysis)
konsepy run-all-matches --package-name my_nlp_package --input-files data.csv --outdir output/

# Extract snippets for manual review
konsepy run4snippets --package-name my_nlp_package --input-files data.csv --outdir snippets/

# Generate BIO tagged data for model training
konsepy bio-tag --package-name my_nlp_package --input-files data.csv --outdir bio_data/
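
For readers unfamiliar with BIO tagging, each token is labelled as the Beginning of a span, Inside a span, or Outside any span. The sketch below illustrates the general scheme with whitespace tokens and regex match spans. It is not konsepy's implementation, and the tokenization and output format of the bio-tag command may differ; bio_tag and the 'PAIN' label are hypothetical.

```python
import re


def bio_tag(text, regexes):
    """Return (token, tag) pairs using whitespace tokens and regex match spans."""
    spans = [(m.start(), m.end(), label)
             for regex, label in regexes
             for m in regex.finditer(text)]
    tagged = []
    for tok in re.finditer(r'\S+', text):
        tag = 'O'
        for start, end, label in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok.start() < end and tok.end() > start:
                # The token covering the span's first character begins it.
                tag = f'B-{label}' if tok.start() <= start else f'I-{label}'
                break
        tagged.append((tok.group(), tag))
    return tagged


regexes = [(re.compile(r'chest pain', re.I), 'PAIN')]
tags = bio_tag('Reports chest pain today.', regexes)
print(tags)
```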

For more detailed documentation and a template, see konsepy_nlp_template.

Roadmap

  • Change labels to some metadata object to allow more diverse input sources and run info
