High Performance Text Processing & Segmentation Framework
Pawpaw
Pawpaw is a high performance text segmentation framework that allows you to quickly create parsers whose outputs are tree graphs. The resulting parse trees can be serialized, traversed, and searched using a powerful structured query language.
- Indexed str and substr representation
  - Efficient memory utilization
  - Fast processing
  - Pythonic relative indexing and slicing
  - Runtime & polymorphic value extraction
- Rules Pipelining Engine
  - Develop complex lexical parsers with just a few lines of code
  - Quickly and easily convert unstructured text into structured, indexed, & searchable tree graphs
  - Pre-process text for downstream NLP/AI/ML consumers
- Search and Query
  - Hierarchical data structure for all indexed text
  - Search using an extensive structured query language
  - Optionally pre-compile queries for reuse to improve performance
- XML Processing
  - Features a drop-in replacement for ElementTree.XMLParser
  - Full text indexes for all Elements, Attributes, Tags, Text, etc.
  - Search XML using both XPATH and the included structured query language
- Efficient pickling and JSON persistence
  - Security option enables persistence of index-only data, with reference strings re-injected during deserialization
- Stable & Defect Free
  - Over 3,000 unit tests and counting!
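The indexed-substring and index-only persistence ideas in the list above can be illustrated with a minimal standard-library sketch. This is not Pawpaw's implementation: the `Span` class and its `to_index` / `from_index` methods are hypothetical names chosen for illustration, and the only assumption is that a segment can be described by offsets into a shared reference string.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for an indexed substring (not Pawpaw's Ito)."""
    source: str                 # shared reference string, never sliced up front
    start: int
    stop: int
    desc: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def __str__(self) -> str:
        # The substring is materialized only when it is asked for.
        return self.source[self.start:self.stop]

    def to_index(self) -> dict:
        # Index-only form: offsets and descriptors, no text.
        return {
            "start": self.start,
            "stop": self.stop,
            "desc": self.desc,
            "children": [c.to_index() for c in self.children],
        }

    @classmethod
    def from_index(cls, data: dict, source: str) -> "Span":
        # Re-inject the reference string while rebuilding the tree.
        node = cls(source, data["start"], data["stop"], data["desc"])
        node.children = [cls.from_index(c, source) for c in data["children"]]
        return node


text = "nine 9 ten 10"
root = Span(text, 0, len(text), "doc")
root.children.append(Span(text, 0, 4, "word"))
root.children.append(Span(text, 5, 6, "number"))

payload = json.dumps(root.to_index())        # contains no source text
restored = Span.from_index(json.loads(payload), text)
print([str(c) for c in restored.children])   # ['nine', '9']
```

Because the serialized payload holds only offsets and descriptors, it can be stored or transmitted separately from the text it describes.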
Usage
Pawpaw has extensive features and capabilities that you can read about in the Docs. As a quick example, say you have some text that you would like to perform NLP-like segmentation on.
>>> s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'
You can use a regular expression for segmentation as follows:
>>> import regex
>>> re = regex.compile(r'(?P<phrase>(?P<word>(?P<char>\w)+) (?P<number>(?P<digit>\d)+)\s*)+')
>>> match = re.fullmatch(s)
The resulting match can then be fed into Pawpaw as follows:
>>> from pawpaw import Ito
>>> doc = Ito.from_match(match)
With this single line of code, Pawpaw generates a fully hierarchical tree of phrases, words, chars, numbers, and digits. This tree can be traversed and searched using a powerful XPATH-like structured query language:
>>> print(*doc.find_all('**[d:digit]'), sep=', ') # all digits
9, 1, 0, 1, 1, 1, 2, 1, 3
>>> print(*doc.find_all('**[d:number]{</*[s:i]}'), sep=', ') # all numbers with 'i' in their name
9, 13
This example uses regular expressions as a source; however, Pawpaw can work with many other input types. Pawpaw also includes a library of parser components that can be chained together to help you quickly develop large, sophisticated parsers.
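To make the shape of the resulting tree concrete, here is a minimal sketch that rebuilds the same phrase / word / char / number / digit hierarchy using only the standard-library `re` module. It is not Pawpaw's API: `Node`, `walk`, and `parse` are hypothetical names, the nodes copy text rather than storing offsets, and the nesting is assembled by hand because `re` does not expose repeated captures the way `regex` does.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical tree node: a descriptor, its text, and child nodes."""
    desc: str
    text: str
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal over all descendants.
        for child in self.children:
            yield child
            yield from child.walk()


PHRASE = re.compile(r"(?P<word>\w+) (?P<number>\d+)")


def parse(s: str) -> Node:
    doc = Node("doc", s)
    for m in PHRASE.finditer(s):
        word = Node("word", m["word"], [Node("char", c) for c in m["word"]])
        number = Node("number", m["number"], [Node("digit", d) for d in m["number"]])
        doc.children.append(Node("phrase", m[0], [word, number]))
    return doc


s = "nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13"
doc = parse(s)

# All digits, in document order.
digits = [n.text for n in doc.walk() if n.desc == "digit"]
print(", ".join(digits))   # 9, 1, 0, 1, 1, 1, 2, 1, 3

# Numbers whose sibling word contains the letter 'i'.
with_i = [p.children[1].text for p in doc.children if "i" in p.children[0].text]
print(", ".join(with_i))   # 9, 13
```

The two list comprehensions play the role of the two queries shown earlier; a query language replaces this kind of hand-written traversal with a declarative path expression.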
Getting Started
Prerequisites
Pawpaw has been written and tested using Python 3.10. The only dependency is regex, which will be fetched and installed automatically if you install Pawpaw with pip or conda.
Installation Options
There are lots of ways to install Pawpaw:
- Install with pip from PyPI:
  pip install pawpaw
- Install with pip from GitHub:
  pip install git+https://github.com/rlayers/pawpaw.git
- Install with conda from PyPI:
  conda activate myenv
  conda install git pip
  pip install pawpaw
- Install with conda from GitHub:
  conda activate myenv
  conda install git pip
  pip install git+https://github.com/rlayers/pawpaw.git
- Clone the repo with git from GitHub:
  git clone https://github.com/rlayers/pawpaw
Verify
Open a Python prompt and type:
>>> from pawpaw import Ito
>>> Ito('Hello, World!')
Ito('Hello, World!', 0, 13, None)
If your last line looks like this, you are up and running with Pawpaw!
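If you prefer a scripted check, for example in a CI job, the following sketch reports whether the distributions are installed without importing them. It relies only on the standard-library `importlib.metadata` module; the helper name `installed_version` is hypothetical, and the distribution names `pawpaw` and `regex` are taken from the pip commands and prerequisites above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("pawpaw", "regex"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```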
Contributing
Contributions to Pawpaw are greatly appreciated. Please refer to the contributing guidelines for details.
License
Distributed under the MIT License. See LICENSE for more information.
Contacts
Robert L. Ayers: a.nov.guy@gmail.com