
Pawpaw: High Performance Text Processing & Segmentation Framework

Botanical Drawing: Asimina triloba: the American papaw

Pawpaw is a high performance text segmentation framework that allows you to quickly create parsers whose outputs are tree graphs. The resulting parse trees can be serialized, traversed, and searched using a powerful structured query language.

  • Indexed str and substr representation
    • Efficient memory utilization
    • Fast processing
    • Pythonic relative indexing and slicing
    • Runtime & polymorphic value extraction
  • Rules Pipelining Engine
    • Develop complex lexical parsers with just a few lines of code
    • Quickly and easily convert unstructured text into structured, indexed, & searchable tree graphs
    • Pre-process text for downstream NLP/AI/ML consumers
  • Search and Query
    • Hierarchical data structure for all indexed text
    • Search using extensive structured query language
    • Optionally pre-compile queries for reuse to improve performance
  • XML Processing
    • Features a drop-in replacement for ElementTree.XMLParser
    • Full text indexes for all Elements, Attributes, Tags, Text, etc.
    • Search XML using both XPATH and the included, structured query language
  • Efficient pickling and JSON persistence
    • Security option enables persistence of index-only data, with reference strings re-injected during deserialization
  • Stable & Defect Free
    • Over 3,000 unit tests and counting!
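
The index-only persistence idea in the feature list can be pictured with a small standard-library sketch. This is not Pawpaw's API or its file format: dump_index_only, load_with_reference, and the 'start'/'stop'/'label' keys are hypothetical names chosen for illustration. The point is only that the serialized payload holds offsets, while the text is supplied separately when loading.

```python
import json

def dump_index_only(spans):
    """Serialize (start, stop, label) triples only; the text itself is left out."""
    return json.dumps(
        [{'start': start, 'stop': stop, 'label': label} for start, stop, label in spans]
    )

def load_with_reference(payload, reference):
    """Rebuild (label, substring) pairs by slicing a separately supplied string."""
    return [
        (entry['label'], reference[entry['start']:entry['stop']])
        for entry in json.loads(payload)
    ]

# Hypothetical data: offsets into a string that should not be stored with the index.
secret = 'nine 9 ten 10'
spans = [(0, 4, 'word'), (5, 6, 'number'), (7, 10, 'word'), (11, 13, 'number')]

payload = dump_index_only(spans)                 # contains offsets and labels only
restored = load_with_reference(payload, secret)  # text re-injected at load time
print(restored)
```

Anyone holding only the payload sees structure but no content, which is the property the security option describes.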

Usage

Pawpaw has extensive features and capabilities you can read about in the Docs. As a quick example, say you have some text on which you would like to perform NLP-like segmentation.

>>> s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'

You can use a regular expression for segmentation as follows:

>>> import regex
>>> pattern = regex.compile(r'(?P<phrase>(?P<word>(?P<char>\w)+) (?P<number>(?P<digit>\d)+)\s*)+')
>>> match = pattern.fullmatch(s)

The resulting match can then be fed into Pawpaw as follows:

>>> from pawpaw import Ito
>>> doc = Ito.from_match(match)

With this single line of code, Pawpaw generates a fully hierarchical tree of phrases, words, chars, numbers, and digits. This tree can be traversed, and even searched using a powerful XPATH-like structured query language:

>>> print(*doc.find_all('**[d:dig]'), sep=', ')  # all digits
9, 1, 0, 1, 1, 1, 2, 1, 3
>>> print(*doc.find_all('**[d:num]{</*[s:i]}'), sep=', ')  # all numbers with 'i' in their name
9, 13

This example uses regular expressions as a source; however, Pawpaw can work with many other input types. Pawpaw also includes a library of parser components that can be easily chained together to help you quickly develop large, sophisticated parsers.
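
The example above compiles its pattern with the third-party regex module rather than the standard library. One likely reason (an assumption, not something stated here) is that the standard re module keeps only the last capture of a repeated group, so earlier words and numbers cannot be read back from a single match. The sketch below uses only the standard library to show that limitation, then rebuilds the two query results by hand. The names phrases and text are hypothetical, the pattern is a simplified version of the one above, and none of this reflects how Pawpaw is implemented.

```python
import re

s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'

# With the standard library, a repeated group remembers only its LAST capture.
pattern = r'(?P<phrase>(?P<word>\w+) (?P<number>\d+)\s*)+'
last_only = re.fullmatch(pattern, s)
print(last_only.group('word'), last_only.group('number'))  # thirteen 13

# Hand-rolled substitute: one match per phrase, keeping (start, stop) spans
# into `s` instead of copied substrings.
phrases = [
    {'word': m.span('word'), 'number': m.span('number')}
    for m in re.finditer(r'(?P<word>\w+) (?P<number>\d+)', s)
]

def text(span):
    """Return the substring of `s` covered by a (start, stop) span."""
    start, stop = span
    return s[start:stop]

# All digits, in document order.
digits = [ch for p in phrases for ch in text(p['number'])]
print(*digits, sep=', ')  # 9, 1, 0, 1, 1, 1, 2, 1, 3

# All numbers whose sibling word contains an 'i'.
numbers_with_i = [text(p['number']) for p in phrases if 'i' in text(p['word'])]
print(*numbers_with_i, sep=', ')  # 9, 13
```

The hand-written loops reproduce the two outputs shown earlier, but every new question needs new traversal code, which is the work a query language over the parse tree is meant to remove.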


Getting Started

Prerequisites

Pawpaw has been written and tested using Python 3.10. The only dependency is regex, which will be fetched and installed automatically if you install Pawpaw with pip or conda.

Installation Options

There are several ways to install Pawpaw:

  1. Install with pip from PyPI:

    pip install pawpaw
    
  2. Install with pip from GitHub:

    pip install git+https://github.com/rlayers/pawpaw.git
    
  3. Install with conda from PyPI:

    conda activate myenv
    conda install git pip
    pip install pawpaw
    
  4. Install with conda from GitHub:

    conda activate myenv
    conda install git pip
    pip install git+https://github.com/rlayers/pawpaw.git
    
  5. Clone the repo with git from GitHub:

    git clone https://github.com/rlayers/pawpaw
    

Verify

Open a Python prompt and type:

>>> from pawpaw import Ito
>>> Ito('Hello, World!')
Ito('Hello, World!', 0, 13, None)

If your last line looks like this, you are up and running with Pawpaw!
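
For a non-interactive check, for example in a setup script, the standard library's importlib.metadata can report whether a distribution is installed. This is a minimal sketch: installed_version is a hypothetical helper, and the distribution names 'pawpaw' and 'regex' are taken from the pip commands and the prerequisites above.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Report Pawpaw and its single dependency.
for name in ('pawpaw', 'regex'):
    print(name, installed_version(name) or 'not installed')
```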


Contributing

Contributions to Pawpaw are greatly appreciated. Please refer to the contributing guidelines for details.


License

Distributed under the MIT License. See LICENSE for more information.


Contacts

Robert L. Ayers:  a.nov.guy@gmail.com



