High Performance Text Processing & Segmentation Framework

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.10
- Python :: 3.11
Topic
- Software Development :: Libraries :: Python Modules
- Text Processing

Project description

Pawpaw

Pawpaw is a high performance parsing & text segmentation framework that allows you to quickly and easily build complex, pipelined parsers. Segments are automatically organized into tree graphs that can be serialized, traversed, and searched using a powerful structured query language called plumule.

Botanical Drawing: Asimina triloba: the American papaw

Indexed string and substring representation
- Efficient memory utilization
- Fast processing
- Pythonic relative indexing and slicing
- Runtime & polymorphic value extraction
- Tree graphs for all indexed text
Search and Query
- Search trees using plumule: a powerful structured query language similar to XPATH
- Combine multiple axes, filters, and subqueries sequentially and recursively to any depth
- Optionally pre-compile queries for increased performance
Rules Pipelining Engine
- Develop complex lexical parsers with just a few lines of code
- Quickly and easily convert unstructured text into structured, indexed, & searchable tree graphs
- Pre-process text for downstream NLP/AI/ML consumers
XML Processing
- Features a drop-in replacement for ElementTree.XMLParser
- Full text indices for all elements, attributes, tags, text, etc.
- Search the resulting XML using XPATH, plumule, or both
- Extract both ElementTree and Pawpaw data structures in one go, with cross-linked nodes between trees
NLP Support
- Pawpaw is ideal for both a) preprocessing unstructured text for downstream NLP consumption and b) storing and searching NLP generated content
- Works with other libraries, such as NLTK
Efficient pickling and JSON persistence
- Security option enables persistence of index-only data, with reference strings re-injected during de-serialization
Stable & Defect Free
- Over 5,000 unit tests and counting!
- Pure Python, with only one external dependency: regex
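
The idea behind an indexed substring representation can be sketched in plain Python. The `Segment` class below is hypothetical and is not Pawpaw's `Ito` implementation. It only illustrates, under that assumption, how a segment can hold a reference to the original string plus a span (so no text is copied until a value is requested), how relative slicing works, and how index-only data can be persisted as JSON with the reference string re-injected on load.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical span-based segment (illustration only, not Pawpaw's Ito)."""
    source: str          # shared reference string, never copied
    start: int
    stop: int
    desc: str = ''
    children: list = field(default_factory=list)

    def __str__(self) -> str:
        # The substring is materialized only when it is asked for
        return self.source[self.start:self.stop]

    def sub(self, start: int, stop: int, desc: str = '') -> 'Segment':
        # Relative slicing: offsets are measured from this segment's start
        child = Segment(self.source, self.start + start, self.start + stop, desc)
        self.children.append(child)
        return child

    def to_index_only(self) -> dict:
        # Persist spans and descriptors, but not the text itself
        return {'span': [self.start, self.stop], 'desc': self.desc,
                'children': [c.to_index_only() for c in self.children]}

    @classmethod
    def from_index_only(cls, data: dict, source: str) -> 'Segment':
        # Re-inject the reference string during de-serialization
        seg = cls(source, data['span'][0], data['span'][1], data['desc'])
        seg.children = [cls.from_index_only(c, source) for c in data['children']]
        return seg


text = 'nine 9 ten 10'
phrase = Segment(text, 7, 13, 'phrase')        # covers 'ten 10'
word = phrase.sub(0, 3, 'word')                # covers 'ten'
payload = json.dumps(phrase.to_index_only())   # holds spans only, no source text
restored = Segment.from_index_only(json.loads(payload), text)
print(str(word), '|', str(restored.children[0]))
```

Because the JSON payload holds only spans and descriptors, it is useless without the original string, which is the point of an index-only persistence option.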

Explore the docs • Request a feature or report a bug • Explore the code

Example

With Pawpaw, you can start with flattened text like this:

ARTICLE I
Section 1: Congress
All legislative Powers herein granted shall be vested in a Congress of the United States,
which shall consist of a Senate and House of Representatives.

Section 2: The House of Representatives
The House of Representatives shall be composed of Members chosen every second Year by the
People of the several States, and the Electors in each State shall have the Qualifications
requisite for Electors of the most numerous Branch of the State Legislature.

No Person shall be a Representative who shall not have attained to the Age of twenty five
Years, and been seven Years a Citizen of the United States, and who shall not, when elected,
be an Inhabitant of that State in which he shall be chosen.

and quickly and easily produce a tree that looks like this:

graph TD;
  A1["[article]<br/>#quot;ARTICLE I…#quot;"]:::dark_brown --> A1_k["[key]<br/>#quot;I#quot;"]:::dark_brown;
  A1--->Sc1["[section]<br/>#quot;Section 1…#quot;"]:::light_brown;
  Sc1-->Sc1_k["[key]<br/>#quot;1#quot;"]:::light_brown
  Sc1--->Sc1_p1["[paragraph]<br/>#quot;All legislative Powers…#quot;"]:::peach
  Sc1_p1-->Sc1_p1_s1["[sentence]<br/>#quot;All legislative Powers…#quot;"]:::dark_green
  Sc1_p1_s1-->Sc1_p1_s1_w1["[word]<br/>#quot;All#quot;"]:::light_green
  Sc1_p1_s1-->Sc1_p1_s1_w2["[word]<br/>#quot;legislative#quot;"]:::light_green
  Sc1_p1_s1-->Sc1_p1_s1_w3["[word]<br/>#quot;Powers#quot;"]:::light_green
  Sc1_p1_s1-->Sc1_p1_s1_w4["..."]:::ellipsis

  A1--->Sc2["[section]<br/>#quot;Section 2#quot;"]:::light_brown;
  Sc2-->Sc2_k["[key]<br/>#quot;2#quot;"]:::light_brown
  Sc2--->Sc2_p1["[paragraph]<br/>#quot;The House of…#quot;"]:::peach
  Sc2_p1---->Sc2_p1_s1["[sentence]<br/>#quot;The House of…#quot;"]:::dark_green
  Sc2_p1_s1-->Sc2_p1_s1_w1["[word]<br/>#quot;The#quot;"]:::light_green
  Sc2_p1_s1-->Sc2_p1_s1_w2["[word]<br/>#quot;House#quot;"]:::light_green
  Sc2_p1_s1-->Sc2_p1_s1_w3["[word]<br/>#quot;of#quot;"]:::light_green
  Sc2_p1_s1-->Sc2_p1_s1_w4["..."]:::ellipsis
  Sc2--->Sc2_p2["[paragraph]<br/>#quot;No Person shall…#quot;"]:::peach
  Sc2_p2---->Sc2_p2_s1["[sentence]<br/>#quot;No Person shall…#quot;"]:::dark_green
  Sc2_p2_s1-->Sc2_p2_s1_w1["[word]<br/>#quot;No#quot;"]:::light_green
  Sc2_p2_s1-->Sc2_p2_s1_w2["[word]<br/>#quot;Person#quot;"]:::light_green
  Sc2_p2_s1-->Sc2_p2_s1_w3["[word]<br/>#quot;shall#quot;"]:::light_green
  Sc2_p2_s1-->Sc2_p2_s1_w4["..."]:::ellipsis

  classDef dark_brown fill:#533E30,stroke:#000000,color:#FFFFFF;
  classDef light_brown fill:#D2AC70,stroke:#000000,color:#000000;
  classDef peach fill:#E4D1AE,stroke:#000000,color:#000000;
  classDef dark_green fill:#517D3D,stroke:#000000,color:#FFFFFF;
  classDef light_green fill:#90C246,stroke:#000000,color:#FFFFFF;

  classDef ellipsis fill:#FFFFFF,stroke:#FFFFFF,color:#000000;

You can then search your tree using plumule, a powerful structured query language:

'**[d:section]{**[d:word] & [lcs:power,right]}'  # Plumule query to find sections that contain the words 'power' or 'right'

Try out this demo yourself, which shows how easy it is to parse, visualize, and query the US Constitution using Pawpaw.

Usage

Pawpaw has extensive features and capabilities you can read about in the Docs. As a quick example, say you have some text that you would like to perform NLP-like segmentation on.

>>> s = 'nine 9 ten 10 eleven 11 TWELVE 12 thirteen 13'

You can use a regular expression for segmentation as follows:

>>> import regex 
>>> re = regex.compile(r'(?:(?P<phrase>(?P<word>(?P<char>\w)+) (?P<number>(?P<digit>\d)+))\s*)+')

You can then use this regex to feed Pawpaw:

>>> import pawpaw 
>>> doc = pawpaw.Ito.from_match(re.fullmatch(s))[0]

With this single line of code, Pawpaw generates a fully hierarchical tree of phrases, words, chars, numbers, and digits. You can visualize the tree:

>>> tree_vis = pawpaw.visualization.pepo.Tree()
>>> print(tree_vis.dumps(doc))
(0, 45) '0' : 'nine 9 ten 10 eleven…ELVE 12 thirteen 13'
├──(0, 6) 'phrase' : 'nine 9'
│  ├──(0, 4) 'word' : 'nine'
│  │  ├──(0, 1) 'char' : 'n'
│  │  ├──(1, 2) 'char' : 'i'
│  │  ├──(2, 3) 'char' : 'n'
│  │  └──(3, 4) 'char' : 'e'
│  └──(5, 6) 'number' : '9'
│     └──(5, 6) 'digit' : '9'
├──(7, 13) 'phrase' : 'ten 10'
│  ├──(7, 10) 'word' : 'ten'
│  │  ├──(7, 8) 'char' : 't'
│  │  ├──(8, 9) 'char' : 'e'
│  │  └──(9, 10) 'char' : 'n'
│  └──(11, 13) 'number' : '10'
│     ├──(11, 12) 'digit' : '1'
│     └──(12, 13) 'digit' : '0'
├──(14, 23) 'phrase' : 'eleven 11'
│  ├──(14, 20) 'word' : 'eleven'
│  │  ├──(14, 15) 'char' : 'e'
│  │  ├──(15, 16) 'char' : 'l'
│  │  ├──(16, 17) 'char' : 'e'
│  │  ├──(17, 18) 'char' : 'v'
│  │  ├──(18, 19) 'char' : 'e'
│  │  └──(19, 20) 'char' : 'n'
│  └──(21, 23) 'number' : '11'
│     ├──(21, 22) 'digit' : '1'
│     └──(22, 23) 'digit' : '1'
├──(24, 33) 'phrase' : 'TWELVE 12'
│  ├──(24, 30) 'word' : 'TWELVE'
│  │  ├──(24, 25) 'char' : 'T'
│  │  ├──(25, 26) 'char' : 'W'
│  │  ├──(26, 27) 'char' : 'E'
│  │  ├──(27, 28) 'char' : 'L'
│  │  ├──(28, 29) 'char' : 'V'
│  │  └──(29, 30) 'char' : 'E'
│  └──(31, 33) 'number' : '12'
│     ├──(31, 32) 'digit' : '1'
│     └──(32, 33) 'digit' : '2'
└──(34, 45) 'phrase' : 'thirteen 13'
   ├──(34, 42) 'word' : 'thirteen'
   │  ├──(34, 35) 'char' : 't'
   │  ├──(35, 36) 'char' : 'h'
   │  ├──(36, 37) 'char' : 'i'
   │  ├──(37, 38) 'char' : 'r'
   │  ├──(38, 39) 'char' : 't'
   │  ├──(39, 40) 'char' : 'e'
   │  ├──(40, 41) 'char' : 'e'
   │  └──(41, 42) 'char' : 'n'
   └──(43, 45) 'number' : '13'
      ├──(43, 44) 'digit' : '1'
      └──(44, 45) 'digit' : '3'
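
To see why nested named groups are enough to define a tree, here is a stdlib-only sketch. It is not how Pawpaw is implemented, and the helpers `build_tree` and `dump` are hypothetical. It uses the standard `re` module on a single phrase, because `re` keeps only the last capture of a repeated group, whereas the third-party `regex` module used above retains every capture.

```python
import re

# One phrase only: the stdlib re module records a single span per group
pattern = re.compile(r'(?P<phrase>(?P<word>\w+) (?P<number>\d+))')
m = pattern.fullmatch('ten 10')

# Collect (start, stop, name) for every named group in the pattern
spans = [(*m.span(name), name) for name in pattern.groupindex]


def build_tree(spans):
    """Nest spans by containment (hypothetical helper, illustration only)."""
    # Sort by start, then widest first, so a parent precedes its children
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    root = {'span': None, 'desc': 'root', 'children': []}
    stack = [root]
    for start, stop, name in ordered:
        node = {'span': (start, stop), 'desc': name, 'children': []}
        # Pop until the node on top of the stack contains the new span
        while stack[-1]['span'] is not None and not (
                stack[-1]['span'][0] <= start and stop <= stack[-1]['span'][1]):
            stack.pop()
        stack[-1]['children'].append(node)
        stack.append(node)
    return root


def dump(node, text, depth=0):
    # Print one indented line per node: span, descriptor, and substring
    for child in node['children']:
        start, stop = child['span']
        print('  ' * depth + f"{child['span']} {child['desc']!r} : {text[start:stop]!r}")
        dump(child, text, depth + 1)


tree = build_tree(spans)
dump(tree, m.string)
```

The same containment rule extends to repeated groups once every capture span is available, which is what makes a single regular expression sufficient to describe a multi-level segmentation.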

And you can search the tree using Pawpaw's plumule, a powerful XPATH-like structured query language:

>>> print(*doc.find_all('**[d:digit]'), sep=', ')  # all digits
9, 1, 0, 1, 1, 1, 2, 1, 3
>>> print(*doc.find_all('**[d:number]{</*[s:i]}'), sep=', ')  # all numbers with 'i' in their name
9, 13

This example uses regular expressions as a source, but Pawpaw can work with many other input types. For example, you can use libraries such as NLTK to grow Pawpaw trees, or you can use Pawpaw's included parser framework to build your own sophisticated parsers quickly and easily.
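
For readers new to path-style queries, the two searches shown earlier can be approximated in plain Python. This is only a sketch of one reading of those queries, not plumule's implementation: `**[d:digit]` is treated as "all descendants whose descriptor is digit", and the `{</*[s:i]}` filter is assumed to mean "the preceding sibling has a child whose text is i". The `Node` class and the helper names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node (illustration only, not a Pawpaw type)."""
    desc: str
    text: str
    children: list = field(default_factory=list)


def phrase(word: str, number: str) -> Node:
    # Mirrors the phrase > word > char and phrase > number > digit layout
    return Node('phrase', f'{word} {number}', [
        Node('word', word, [Node('char', c) for c in word]),
        Node('number', number, [Node('digit', d) for d in number]),
    ])


pairs = [('nine', '9'), ('ten', '10'), ('eleven', '11'),
         ('TWELVE', '12'), ('thirteen', '13')]
doc = Node('0', ' '.join(f'{w} {n}' for w, n in pairs),
           [phrase(w, n) for w, n in pairs])


def descendants(node):
    # Depth-first, document order: the assumed meaning of the '**' axis
    for child in node.children:
        yield child
        yield from descendants(child)


def numbers_after_word_with(node, char):
    # Keep a number when its preceding sibling has a child with the given text
    for parent in [node, *descendants(node)]:
        for prev, cur in zip(parent.children, parent.children[1:]):
            if cur.desc == 'number' and any(c.text == char for c in prev.children):
                yield cur


digits = [n.text for n in descendants(doc) if n.desc == 'digit']
with_i = [n.text for n in numbers_after_word_with(doc, 'i')]
print(', '.join(digits))
print(', '.join(with_i))
```

Under these assumptions the sketch prints the same two result lines as the `find_all` session. The difference is that a query language expresses each traversal as one short string instead of a hand-written loop.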

Getting Started

Prerequisites

Pawpaw has been written and tested using Python 3.10. The only dependency is regex, which will be fetched and installed automatically if you install Pawpaw with pip or conda.

Installation Options

There are lots of ways to install Pawpaw. Versioned instances that have passed all automated tests are available from PyPI:

Install with pip from PyPI:
```
pip install pawpaw
```

Install with conda from PyPI:

conda activate myenv
conda install git pip
pip install pawpaw

Alternatively, you can pull from the main branch at GitHub. This ensures that you have the latest code, but the main branch can potentially have internal inconsistencies and/or failed tests:

Install with pip from GitHub:

pip install git+https://github.com/rlayers/pawpaw.git

Install with conda from GitHub:

conda activate myenv
conda install git pip
pip install git+https://github.com/rlayers/pawpaw.git

Clone the repo with git from GitHub:

git clone https://github.com/rlayers/pawpaw

Verify Installation

Whichever way you fetch Pawpaw, you can easily verify that it is installed correctly. Just open a Python prompt and type:

>>> from pawpaw import Ito
>>> Ito('Hello, World!')
Ito(span=(0, 13), desc='', substr='Hello, World!')

If your last line looks like this, you are up and running with Pawpaw!
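
If you prefer a check that does not fail with a traceback when Pawpaw is missing, the following sketch uses only the standard library. The function name `pawpaw_status` is hypothetical, and the Python 3.10 threshold is taken from the Prerequisites section rather than from any metadata lookup.

```python
import sys
from importlib import metadata, util


def pawpaw_status() -> str:
    """Report whether the 'pawpaw' package is importable in this interpreter."""
    if sys.version_info < (3, 10):
        return 'Python is older than 3.10, the version Pawpaw is tested with'
    if util.find_spec('pawpaw') is None:
        return 'pawpaw is not installed'
    try:
        return f"pawpaw {metadata.version('pawpaw')} is installed"
    except metadata.PackageNotFoundError:
        # Importable but without distribution metadata, e.g. run from a git clone
        return 'pawpaw is importable (no distribution metadata found)'


print(pawpaw_status())
```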

History & Roadmap

Pawpaw is a rewrite of desponia, a now-deprecated Python 2.x segmentation framework that was itself based on a prior framework called Ito. Pawpaw is currently in release-candidate status, and many components and features are production ready. However, documentation is still being written and some newer features are still in progress. A rough outline of which components are finalized is as follows:

arborform
- itorator
  - Desc
  - Extract
  - Reflect
  - Split
  - ValueFunc
- postorator
  - StackedReduce
  - WindowedJoin
core
- Errors
- Infix
- Ito
- ItoChildren
- nuco
- Span
- Types
documentation & examples
query
- radicle query engine
- plumule
nlp
visualization
- ascibox
- highlighter
- pepo
- sgr
xml
- XmlHelper
- XmlParser

Contributing

Contributions to Pawpaw are greatly appreciated. Please refer to the contributing guidelines for details.

License

Distributed under the MIT License. See LICENSE for more information.

Contacts

Robert L. Ayers: a.nov.guy@gmail.com

Release history

- 1.0.0rc7 (pre-release, this version): Jan 18, 2024
- 1.0.0rc5 (pre-release): Jan 13, 2024
- 1.0.0rc4 (pre-release): Aug 23, 2023
- 1.0.0rc3 (pre-release): Jul 27, 2023
- 1.0.0rc2 (pre-release): Jun 2, 2023
- 1.0.0rc1 (pre-release): Apr 25, 2023
- 1.0.0a10 (pre-release): Apr 10, 2023
- 1.0.0a9 (pre-release): Feb 17, 2023
- 1.0.0a8 (pre-release): Feb 3, 2023
- 1.0.0a7 (pre-release): Jan 13, 2023
- 1.0.0a6 (pre-release): Dec 22, 2022
- 1.0.0a5 (pre-release): Dec 3, 2022
- 1.0.0a4 (pre-release): Nov 17, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pawpaw-1.0.0rc7.tar.gz (1.1 MB)

Uploaded Jan 18, 2024 Source

Built Distribution

pawpaw-1.0.0rc7-py3-none-any.whl (71.6 kB)

Uploaded Jan 18, 2024 Python 3

Hashes for pawpaw-1.0.0rc7.tar.gz
Algorithm	Hash digest
SHA256	`463729d619926d063d3e8d6cbbb0bda0a2423cc96d8c7b105e3f571ba4e2031d`
MD5	`2f40b0dafec2bd287d62e6ab097b92ee`
BLAKE2b-256	`d0741d8bf0c21b2b84e1bbef26d633f719f7bb9b9f51e23e8bcfba6601e001f7`

Hashes for pawpaw-1.0.0rc7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4912d2f304d0f2738c47dce08ea7fcff8a7a90a74f7f893483b343ad779d63ad`
MD5	`de6741069f58fd0611ede0c881f69e72`
BLAKE2b-256	`56068526242b59f7a7263702aea425cd4bf961769e98e79ce822034bb1c37548`
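
A downloaded file can be checked against the published digests with the standard library. The sketch below compares a local copy of the source distribution with the SHA256 value listed above. The local path and the helper names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for pawpaw-1.0.0rc7.tar.gz
EXPECTED_SHA256 = '463729d619926d063d3e8d6cbbb0bda0a2423cc96d8c7b105e3f571ba4e2031d'


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open('rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    # Hex digests are compared case-insensitively
    return sha256_of(path) == expected.lower()


archive = Path('pawpaw-1.0.0rc7.tar.gz')  # assumed download location
if archive.exists():
    print('OK' if matches_published_hash(archive) else 'MISMATCH')
else:
    print(f'{archive} not found; download it first')
```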