konsepy

Framework for building NLP information extraction systems using regular expressions. konsepy then lets you use the resulting NLP system to create a silver standard for fine-tuning a transformer model.

Installation

  • konsepy is designed to be used with the konsepy_nlp_template
    • See the README there for current installation instructions.
  • To use konsepy as a standalone entity:
    • Install with pip:
      • pip install konsepy[all]
      • To sentence-split corpora for fine-tuning a sentence-based transformer, spaCy will also need to be installed and configured.

Usage

The package provides a centralized CLI tool, konsepy.

Building your NLP Package

To use konsepy, you need to create an NLP package (e.g., my_nlp_package) with the following structure. The best way to get this format is to clone the konsepy_nlp_template:

my_nlp_package/
├── __init__.py
└── concepts/
    ├── __init__.py
    └── my_concept.py

Each concept file (e.g., my_concept.py) must define:

  • REGEXES: A list of regex-category pairs (and optional context functions).
  • RUN_REGEXES_FUNC: A function that executes the regexes and returns categories/matches.
  • CategoryEnum: An Enum defining the possible categories for the concept.

Search Functions

konsepy provides several pre-built search functions in konsepy.rxsearch:

Some simple ones:

  • search_all_regex: Finds all occurrences of each regex in the list.
  • search_first_regex: Finds only the first occurrence of each regex.

Probably the most useful:

  • search_and_replace_regex_func: Prevents double-matching by replacing found text with dots before proceeding to the next regex.
  • search_all_regex_func: Supports "sentinel" values (None) to stop processing if a match was found earlier.
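
The "replace with dots" idea behind search_and_replace_regex_func can be illustrated with the standard library alone. The following is a minimal sketch of the general technique, not konsepy's implementation: the function name search_with_masking, the string category labels, and the sample text are all assumptions made for illustration.

```python
import re


def search_with_masking(regexes, text):
    """Hypothetical helper: run regexes in order, masking each match with dots.

    Masking keeps the text length unchanged, so offsets stay valid, while
    later regexes can no longer match text an earlier regex already claimed.
    """
    results = []
    for regex, category in regexes:
        matches = list(regex.finditer(text))
        for m in matches:
            results.append((category, m.group()))
        for m in matches:
            # Replace the matched span with the same number of dots.
            text = text[:m.start()] + '.' * (m.end() - m.start()) + text[m.end():]
    return results


# Illustrative patterns: the more specific pattern comes first.
regexes = [
    (re.compile(r'chest pain', re.I), 'CHEST_PAIN'),
    (re.compile(r'pain', re.I), 'PAIN'),
]
found = search_with_masking(regexes, 'Chest pain reported; no back pain.')
```

Without masking, the second pattern would also match the "pain" inside "Chest pain"; with masking, it only matches the later "back pain" mention. Ordering the regexes from most to least specific is what makes this approach useful.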

Regex Utilities

konsepy includes KonsepyRegex in konsepy.rxutils to allow for duplicate named groups in alternation branches:

import re
from konsepy.rxutils import KonsepyRegex

pattern = KonsepyRegex(
  r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))',
  flags=re.I,
  allow_dupe_names=True,
)
# For a match object m produced by this pattern, m.group("val") returns the value from whichever branch matched

You can also use the shorthand helper rx_compile:

from konsepy.rxutils import rx_compile

pattern = rx_compile(r'(?:this: (?P<val>\d+)|results: (?P<val>\d+))')
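
For context, Python's built-in re module rejects a pattern that reuses a group name, even when the duplicates sit in separate alternation branches. The sketch below uses only the standard library to show that error and the usual workaround of giving each branch its own name; the names val_a and val_b are hypothetical and are not part of konsepy.

```python
import re

pattern_src = r'(?:score: (?P<val>\d+)|results: (?P<val>\d+))'

# The standard library raises re.error for the duplicate group name.
try:
    re.compile(pattern_src)
    stdlib_accepts = True
except re.error:
    stdlib_accepts = False

# Standard-library workaround: one unique name per branch,
# then pick whichever group actually captured something.
workaround = re.compile(
    r'(?:score: (?P<val_a>\d+)|results: (?P<val_b>\d+))',
    flags=re.I,
)
m = workaround.search('Results: 42')
value = m.group('val_a') or m.group('val_b')
```

Reusing a single group name across branches removes the need for the final "pick whichever group captured" step.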

Example of my_concept.py:

import re
from enum import Enum
from konsepy.rxsearch import search_all_regex_func
from konsepy.context.negation import check_if_negated
from konsepy.context.other_subject import check_if_other_subject


class CategoryEnum(Enum):
  MENTION = 1
  NO = 0
  OTHER = 3


REGEXES = [
  (re.compile(r'my pattern', re.I),
   CategoryEnum.MENTION,
   [
     lambda **kwargs: check_if_negated(neg_concept=CategoryEnum.NO, **kwargs),
     lambda **kwargs: check_if_other_subject(other_concept=CategoryEnum.OTHER, **kwargs),
   ]
   ),
]

# Choose one of the following assignments (a second assignment replaces the first).
# word_window sets the number of words (rather than characters) retrieved for context functions:
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, word_window=5)
# Alternatively, alter the character-based window:
RUN_REGEXES_FUNC = search_all_regex_func(REGEXES, window=50)  # defaults to 30
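
To make the difference between a character window and a word window concrete, here is a minimal standard-library sketch. It is an assumption about what such windows typically look like, not konsepy's actual context extraction; the helper names char_context and word_context are hypothetical.

```python
import re


def char_context(text, m, window=30):
    """Return the match plus up to `window` characters on each side."""
    start = max(0, m.start() - window)
    end = min(len(text), m.end() + window)
    return text[start:end]


def word_context(text, m, n_words=5):
    """Return the match plus up to `n_words` whole words on each side."""
    before = text[:m.start()].split()[-n_words:] if n_words else []
    after = text[m.end():].split()[:n_words]
    return ' '.join(before + [m.group()] + after)


text = 'the patient denies any history of my pattern in the family'
m = re.search(r'my pattern', text)
by_chars = char_context(text, m, window=5)
by_words = word_context(text, m, n_words=2)
```

A character window can cut words in half (here, "history" and "the"), whereas a word window always returns whole tokens. This matters when a context function looks for complete trigger words such as negation terms.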

Custom Search Functions

You can create your own search function by defining a function that returns a generator:

def my_custom_search(regexes):
    def _search(text, include_match=False):
        for regex, category, *other in regexes:
            for m in regex.finditer(text):
                yield (category, m) if include_match else category
    return _search
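
As a quick check of how such a function behaves, the sketch below repeats the definition above and calls it directly. The string category labels and the sample text are assumptions for illustration; a real concept file would use its CategoryEnum members.

```python
import re


def my_custom_search(regexes):
    def _search(text, include_match=False):
        # *other absorbs any optional context functions in the tuple.
        for regex, category, *other in regexes:
            for m in regex.finditer(text):
                yield (category, m) if include_match else category
    return _search


# Entries may be 2-tuples or carry extra items; both unpack correctly.
regexes = [
    (re.compile(r'cough', re.I), 'COUGH'),
    (re.compile(r'fever', re.I), 'FEVER', []),
]
search = my_custom_search(regexes)
text = 'Fever and cough; cough persists.'

categories = list(search(text))
matches = [(cat, m.group()) for cat, m in search(text, include_match=True)]
```

Results are grouped by regex order rather than by position in the text, because the outer loop iterates over the regexes.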

Running konsepy

# Run all concepts in a package against input files
konsepy run-all --package-name my_nlp_package --input-files data.csv --outdir output/

# Run and output individual matches as JSONL (useful for match-level analysis)
konsepy run-all-matches --package-name my_nlp_package --input-files data.csv --outdir output/

# Extract snippets for manual review
konsepy run4snippets --package-name my_nlp_package --input-files data.csv --outdir snippets/

# Generate BIO tagged data for model training
konsepy bio-tag --package-name my_nlp_package --input-files data.csv --outdir bio_data/

For more detailed documentation and a template, see konsepy_nlp_template.

Roadmap

  • Change labels to some metadata object to allow more diverse input sources and run info
