Pawpaw High Performance Text Processing & Segmentation Framework

Botanical Drawing: Asimina triloba: the American papaw

Pawpaw is a high performance text segmentation framework. Segments are organized into tree graphs that can be serialized, traversed, and searched using a powerful structured query language. Pawpaw also features a framework for quickly and easily building complex, pipelined parsers.

  • Indexed str and substr representation
    • Efficient memory utilization
    • Fast processing
    • Pythonic relative indexing and slicing
    • Runtime & polymorphic value extraction
    • Tree graphs for all indexed text
  • Search and Query
    • Search trees using plumule: a powerful structured query language similar to XPATH
    • Combine multiple axes, filters, and subqueries sequentially and recursively to any depth
    • Optionally pre-compile queries for increased performance
  • Rules Pipelining Engine
    • Develop complex lexical parsers with just a few lines of code
    • Quickly and easily convert unstructured text into structured, indexed, & searchable tree graphs
    • Pre-process text for downstream NLP/AI/ML consumers
  • XML Processing
    • Features a drop-in replacement for ElementTree.XmlParser
    • Full text indexes for all Elements, Attributes, Tags, Text, etc.
    • Extract both ElementTree and Pawpaw datastructures in one go
      • The ElementTree and Pawpaw structures are cross-linked at each Element
      • Search the resulting XML using both XPATH and Plumule
      • Access the raw XML corresponding to ElementTree elements, attributes, text, etc.
  • NLP Support:
    • Pawpaw is ideal for both a) preprocessing unstructured text for downstream NLP consumption and b) storing and searching NLP generated content
    • Works with NLTK
  • Efficient pickling and JSON persistence
    • Security option enables persistence of index-only data, with reference strings re-injected during deserialization
  • Stable & Defect Free
    • Over 3,100 unit tests and counting!
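
To make the "indexed str and substr representation" and "index-only persistence" ideas concrete, here is a minimal standard-library sketch. The `Segment` class and its `to_index` / `from_index` methods are hypothetical names invented for this illustration; they are not Pawpaw's `Ito` class or its serialization API. The sketch only shows the general technique: store offsets into a shared string rather than copies of the text, and persist those offsets alone.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for an indexed substring (not Pawpaw's Ito)."""
    source: str   # reference to the full string; substrings are never copied
    start: int
    stop: int
    desc: str = ""
    children: list = field(default_factory=list)

    def __str__(self) -> str:
        # The substring is materialized only when it is asked for.
        return self.source[self.start:self.stop]

    def to_index(self) -> dict:
        # Index-only form: offsets and descriptors, none of the source text.
        return {
            "span": [self.start, self.stop],
            "desc": self.desc,
            "children": [c.to_index() for c in self.children],
        }

    @classmethod
    def from_index(cls, data: dict, source: str) -> "Segment":
        # Re-inject the reference string while rebuilding the tree.
        start, stop = data["span"]
        seg = cls(source, start, stop, data["desc"])
        seg.children = [cls.from_index(c, source) for c in data["children"]]
        return seg


text = "nine 9 ten 10"
root = Segment(text, 0, len(text), "doc")
root.children.append(Segment(text, 0, 4, "word"))
root.children.append(Segment(text, 5, 6, "number"))

payload = json.dumps(root.to_index())   # offsets and descriptors only
restored = Segment.from_index(json.loads(payload), text)
restored_words = [str(c) for c in restored.children]
print(restored_words)
```

Because the serialized payload holds only spans and descriptors, it can be stored or transmitted separately from the text it describes, which is the motivation behind the security option listed above.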

Usage

Pawpaw has extensive features and capabilities you can read about in the Docs. As a quick example, say you have some text that you would like to perform NLP-like segmentation on.

>>> s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'

You can use a regular expression for segmentation as follows:

>>> import regex
>>> re = regex.compile(r'(?P<phrase>(?P<word>(?P<char>\w)+) (?P<number>(?P<digit>\d)+)\s*)+')

You can then use this regex to feed Pawpaw:

>>> from pawpaw import Ito
>>> doc = Ito.from_match(re.fullmatch(s))

With this single line of code, Pawpaw generates a fully hierarchical tree of phrases, words, chars, numbers, and digits. The tree can be traversed and even searched using plumule, Pawpaw's powerful XPATH-like structured query language:

>>> print(*doc.find_all('**[d:digit]'), sep=', ')  # all digits
9, 1, 0, 1, 1, 1, 2, 1, 3
>>> print(*doc.find_all('**[d:number]{</*[s:i]}'), sep=', ')  # all numbers with 'i' in their name
9, 13

This example uses regular expressions as a source, but Pawpaw can work with many other input types. For example, you can use libraries such as NLTK to grow Pawpaw trees, or you can use Pawpaw's included parser framework to build your own sophisticated parsers quickly and easily.
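
To make the hierarchy in the regex example above more concrete, the following sketch rebuilds a similar phrase / word / char / number / digit structure using only the standard library. It is an approximation for illustration, not how Pawpaw builds its trees: the standard `re` module keeps only the last capture of a repeated group, so the sketch iterates over phrases with `finditer` instead of relying on a single full match. The dictionary layout and variable names are assumptions made for this example.

```python
import re

s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'

# One phrase = a word, a space, a number, and optional trailing whitespace.
phrase_pat = re.compile(r'(?P<word>\w+) (?P<number>\d+)\s*')

tree = []
for m in phrase_pat.finditer(s):
    w_start, w_stop = m.span('word')
    n_start, n_stop = m.span('number')
    tree.append({
        'phrase': m.span(),
        'word': (w_start, w_stop),
        'chars': [(i, i + 1) for i in range(w_start, w_stop)],
        'number': (n_start, n_stop),
        'digits': [(i, i + 1) for i in range(n_start, n_stop)],
    })

# All digits, read back from the source string through their spans.
digits = [s[a:b] for node in tree for a, b in node['digits']]
print(', '.join(digits))

# Numbers whose preceding word contains the letter 'i'.
with_i = [s[slice(*node['number'])] for node in tree
          if any(s[a:b] == 'i' for a, b in node['chars'])]
print(', '.join(with_i))
```

Every node here is just a pair of offsets into `s`, and the two list comprehensions play the role of the two queries in the example. A query language such as plumule replaces this kind of hand-written traversal with a declarative expression.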

Getting Started

Prerequisites

Pawpaw has been written and tested using Python 3.10. The only dependency is regex, which will be fetched and installed automatically if you install Pawpaw with pip or conda.

Installation Options

Pawpaw can be installed in several ways:

  1. Install with pip from PyPI:

    pip install pawpaw
    
  2. Install with pip from GitHub:

    pip install git+https://github.com/rlayers/pawpaw.git
    
  3. Install with conda from PyPI:

    conda activate myenv
    conda install git pip
    pip install pawpaw
    
  4. Install with conda from GitHub:

    conda activate myenv
    conda install git pip
    pip install git+https://github.com/rlayers/pawpaw.git
    
  5. Clone the repo with git from GitHub:

    git clone https://github.com/rlayers/pawpaw
    

Verify

Open a Python prompt and type:

>>> from pawpaw import Ito
>>> Ito('Hello, World!')
Ito('Hello, World!', 0, 13, None)

If your last line looks like this, you are up and running with Pawpaw!
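
If the import fails instead, a quick environment check can help narrow down the cause. The sketch below uses only the standard library; the `check_environment` helper is a hypothetical name introduced here, and treating Python 3.10 as a minimum version is an assumption based on the version Pawpaw was written and tested with.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 10), packages=('regex', 'pawpaw')):
    """Report whether the interpreter and the named packages look usable."""
    report = {'python_ok': sys.version_info[:2] >= min_version}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = check_environment()
print(report)
```

A `False` entry for `regex` or `pawpaw` suggests the package is not installed in the interpreter you are running, which commonly happens when pip installed into a different environment than the one that is active.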

History & Roadmap

Pawpaw is a rewrite of desponia, a now-deprecated Python 2.x segmentation framework that was itself based on a prior framework called Ito. Pawpaw is currently in alpha, although many components and features are production ready. Documentation is still being written, and some newer features are still undergoing work. A rough outline of which components are production ready is as follows:

  • arborform
  • core (Span & Ito)
    • itorator
    • postorator
  • documentation & examples
  • query
    • radicle query engine
    • plumule
  • NLP
  • visualization
    • ascibox
    • highlighter
    • pepo
    • sgr
  • xml
    • XmlHelper
    • XmlParser

Contributing

Contributions to Pawpaw are greatly appreciated. Please refer to the contributing guidelines for details.

License

Distributed under the MIT License. See LICENSE for more information.

Contacts

Robert L. Ayers:  a.nov.guy@gmail.com
