Runrex

Library to aid in organizing, running, and debugging regular expressions against large bodies of text.

About the Project

The goal of this library is to simplify the deployment of regular expressions on large bodies of text, in a variety of input formats.

Getting Started

To get a local copy up and running, follow these steps.

Prerequisites

Installation

  1. Clone the repo
    git clone https://github.com/kpwhri/runrex.git
    
  2. Install requirements (requirements-dev is for test packages)
    pip install -r requirements.txt -r requirements-dev.txt
    
  3. To read text from SAS or a SQL database, install the matching additional requirements file:
    • ODBC-connection: requirements-db.txt
    • Postgres: requirements-psql.txt
    • SAS: requirements-sas.txt
  4. Run tests.
    export PYTHONPATH=src   # on Windows: set PYTHONPATH=src
    pytest tests
    

Usage

Example Implementations

Build Customized Algorithm

  • Create 4 files:
    • patterns.py: defines regular expressions of interest
      • See examples/example_patterns.py for some examples
    • test_patterns.py: tests for those regular expressions
      • Why? Make sure the patterns do what you think they do
    • algorithm.py: defines algorithm (how to use regular expressions); returns a Result
      • See examples/example_algorithm.py for guidance
    • config.(py|json|yaml): configuration options, as defined in schema.py
      • See examples/example_config.py for a basic config
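
As a rough illustration of how patterns.py and test_patterns.py relate, the sketch below uses only Python's standard re module. It does not use runrex's own classes (see examples/example_patterns.py for those), and the names SMOKER, NEGATED, and has_smoking_mention are hypothetical, chosen only for this example.

```python
import re

# patterns.py (sketch): compile the regular expressions of interest once.
# SMOKER and NEGATED are hypothetical names, not part of runrex.
SMOKER = re.compile(r'\b(?:current|former)\s+smoker\b', re.IGNORECASE)
NEGATED = re.compile(r'\b(?:no|denies|never)\b', re.IGNORECASE)


def has_smoking_mention(sentence):
    """Return True if the sentence mentions smoking status with no negation cue."""
    return bool(SMOKER.search(sentence)) and not NEGATED.search(sentence)


# test_patterns.py (sketch): pytest collects functions named test_*.
def test_matches_positive_mention():
    assert has_smoking_mention('Patient is a current smoker.')


def test_skips_negated_mention():
    assert not has_smoking_mention('Denies being a former smoker.')
```

Keeping the tests next to the patterns makes it cheap to add a failing sentence as a new test case before adjusting the regular expression.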

Input Data

Runrex accepts a variety of input formats, but every input must supply at least a document_id and a document_text. The names of these fields are configurable.
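
To make the two required fields concrete, here is a minimal sketch that reads them from a CSV source with configurable column names. It uses only the standard library and is not runrex's own loader; iter_documents and the note_id/note_text column names are assumptions made for illustration.

```python
import csv
import io


def iter_documents(fh, id_column='document_id', text_column='document_text'):
    """Yield (document_id, document_text) pairs from a CSV file handle.

    Hypothetical helper for illustration; not part of runrex.
    """
    for row in csv.DictReader(fh):
        yield row[id_column], row[text_column]


# In-memory CSV standing in for a real file; its column names differ from the defaults.
sample = io.StringIO('note_id,note_text\n1,First note.\n2,Second note.\n')
documents = list(iter_documents(sample, id_column='note_id', text_column='note_text'))
print(documents)  # [('1', 'First note.'), ('2', 'Second note.')]
```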

Sentence Splitting

By default, each sentence of the input document text is expected to be on its own line. If a different sentence splitting scheme is desired, it must be supplied to the application.
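
If the source text is not already one sentence per line, it could be preprocessed before being handed to the application. The sketch below is a deliberately naive splitter built on the standard re module; the boundary rule and the name one_sentence_per_line are assumptions for illustration, and abbreviations such as "Dr. Smith" will be split incorrectly.

```python
import re

# Naive boundary (an assumption): ., ! or ? followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def one_sentence_per_line(text):
    """Return text with each (naively detected) sentence on its own line."""
    lines = []
    for sentence in SENTENCE_BOUNDARY.split(text):
        sentence = ' '.join(sentence.split())  # collapse internal whitespace and newlines
        if sentence:
            lines.append(sentence)
    return '\n'.join(lines)


print(one_sentence_per_line('No fever. Denies\ncough.  Follow up in 2 weeks.'))
```

For clinical or otherwise irregular text, a dedicated sentence splitter is likely to do better than a single regular expression.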

Schema/Examples

For more details, see the example config (examples/example_config.py) or consult the schema (schema.py).

Output Format

  • Recommended output format is jsonl
    • The data can be extracted using Python:
        import json
        with open('output.jsonl') as fh:
            for line in fh:
                data = json.loads(line)  # each line is one JSON object (a dict)
  • Output variables are configurable and can include:

    • id: unique id for line
    • name: document name
    • algorithm: name of algorithm with finding
    • value
    • category: name of category (usually the pattern; multiple categories contribute to an algorithm)
    • date
    • extras
    • matches: pattern matches
    • text: captured text
    • start: start index/offset of match
    • end: end index/offset of match
  • Scripts to accomplish useful tasks with the output are included in the scripts directory.
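
As one example of working with the jsonl output, the sketch below counts findings per algorithm and category. It is not one of the scripts shipped in the scripts directory; count_findings is a hypothetical helper, and the sample records (including the "smoking" algorithm and its categories) are invented for illustration and use only a subset of the fields listed above.

```python
import io
import json
from collections import Counter


def count_findings(fh):
    """Count records per (algorithm, category) in a jsonl file handle."""
    counts = Counter()
    for line in fh:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[(record.get('algorithm'), record.get('category'))] += 1
    return counts


# Invented sample standing in for output.jsonl.
sample = io.StringIO(
    '{"name": "doc1", "algorithm": "smoking", "category": "CURRENT"}\n'
    '{"name": "doc2", "algorithm": "smoking", "category": "CURRENT"}\n'
    '{"name": "doc2", "algorithm": "smoking", "category": "FORMER"}\n'
)
counts = count_findings(sample)
for (algorithm, category), n in counts.most_common():
    print(algorithm, category, n)
```

With a real file, replace the in-memory sample with `open('output.jsonl')`.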

Versions

Runrex uses semantic versioning (SemVer).

See https://github.com/kpwhri/runrex/releases.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License.

See LICENSE or https://kpwhri.mit-license.org for more information.

Contact

Please use the issue tracker.

Acknowledgements
