Library designed as a Python wrapper that combines Rust text processing power with Python
PyTextRust
Library designed to easily achieve high performance on regex and text processing inside Python, built as a direct wrapper of Rust regex and text crates.
On short texts, matches are typically sparse. This library focuses on algorithms that acknowledge this sparsity and achieve good performance through simple Python API calls into optimized Rust logic.
Regex
`RegexSet` and `Regex` are the main engines.
Configuration of `RegexSetBuilder`: https://rust-lang.github.io/regex/regex/struct.RegexSetBuilder.html
Local build
https://maturin.rs/project_layout.html#mixed-rustpython-project
BIG INFORMATION
- pending logging?
maturin build --release
pip install --force-reinstall target/wheels/pytextrust-0.1.0-cp38-cp38-linux_x86_64.whl
Learning doc
- The Rust Programming Language
- Rust CookBook
- Rust by example
- PyO3
- Maturin
- A comparison of regex engines
Reference Rust pattern matching packages
- https://docs.rs/fst/latest/fst/, particularly https://docs.rs/fst/latest/fst/#example-searching-multiple-sets-efficiently for entities
- https://docs.rs/regex-automata/latest/regex_automata/
- https://docs.rs/aho-corasick/0.7.18/aho_corasick/
- https://docs.rs/regex-syntax/latest/regex_syntax/
Features
RegexOperator
- Rust phase:
  - Try to compile each pattern in the list with `regex::Regex`. Get the list of valid ones and invalid ones.
  - Compile `regex::RegexSet` with the valid patterns and apply it over the list of haystacks.
  - Operate the compiled `regex::Regex`, finding them over all the haystacks, for the subset of pairs that have matched in the `regex::RegexSet`.
  - Try to compile the invalid patterns with `fancy_regex::Regex` and find matches over the haystacks. Reduce the final invalid patterns list to give back to Python.
  - Operate the compiled `fancy_regex::Regex`, finding them over all the haystacks.
  - Give back to Python.
- Python phase:
  - Try to apply all failed patterns, finding them over all the haystacks.
  - Return the final result.
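The flow above can be approximated in pure Python. The following is only a sketch under stated assumptions: it uses the standard `re` module as a stand-in for `regex::Regex` and `regex::RegexSet`, and the function names `split_patterns` and `find_all` are hypothetical, not part of the PyTextRust API. It shows the two ideas of the Rust phase: splitting patterns into valid and invalid ones, and extracting spans only for the (pattern, haystack) pairs that passed a cheap match test.

```python
import re


def split_patterns(patterns):
    """Split patterns into compiled valid ones and indexes of invalid ones."""
    valid, invalid = {}, []
    for idx, pattern in enumerate(patterns):
        try:
            valid[idx] = re.compile(pattern)
        except re.error:
            # In the real pipeline these would be retried with another engine.
            invalid.append(idx)
    return valid, invalid


def find_all(patterns, haystacks):
    """Return spans per (pattern index, haystack index) plus invalid pattern indexes."""
    valid, invalid = split_patterns(patterns)
    # Stand-in for the RegexSet step: a cheap "does it match at all" test per pair.
    candidate_pairs = [
        (p_idx, h_idx)
        for h_idx, text in enumerate(haystacks)
        for p_idx, compiled in valid.items()
        if compiled.search(text)
    ]
    # Spans are only extracted for the pairs that matched in the previous step.
    matches = {
        (p_idx, h_idx): [m.span() for m in valid[p_idx].finditer(haystacks[h_idx])]
        for p_idx, h_idx in candidate_pairs
    }
    return matches, invalid


if __name__ == "__main__":
    found, failed = find_all([r"\d+", r"(unclosed", r"rust"], ["abc 123", "rust and 42"])
    print(found, failed)
```

In this sketch the invalid pattern indexes are simply returned to the caller, which mirrors the step where the remaining failed patterns are handed back to Python.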
Entities
Entities are found by overlapping and have a hierarchical folder structure.
- Literal entities: fast, purely literal-based entities. They are based on literal alternations and are built from a list of strings, which is like matching `(lit_1|...|lit_N)`. They can be:
  - Private: only used by regex entities through composition. Since their only purpose is composition, they are only matched, not found.
  - Public: calculated and reported. These reports enforce that the matched boundaries are `\b`, just as if the literal matching were `\b(lit_1|...|lit_N)\b`. Tech note: positions reported by Aho-Corasick should be mapped from byte to char position.
- Regex entities: a list of regex patterns, possibly containing calls to literal entities through a template language. For example, if `month` is a literal entity, then `\d+ of \d+ of {{month}}` is a possible entity. Regex entities that depend positively on a literal entity (no negative lookbehind or lookahead) are only searched in the texts where the literal entity has been found, minimizing computational weight.
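The template idea can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the library's implementation: the helper names (`expand_template`, `required_entities`, `find_regex_entity`) are hypothetical, the standard `re` module replaces the Rust engines, and a plain substring test replaces Aho-Corasick. It expands each `{{name}}` call into a literal alternation and only runs the regex on texts where every required literal entity appears.

```python
import re

TEMPLATE_CALL = re.compile(r"\{\{(\w+)\}\}")


def expand_template(pattern, literal_entities):
    """Replace each {{name}} call with a non-capturing alternation of its literals."""
    def substitute(match):
        literals = literal_entities[match.group(1)]
        # Longest literals first, so a longer literal wins over its own prefix.
        ordered = sorted(literals, key=len, reverse=True)
        return "(?:" + "|".join(re.escape(lit) for lit in ordered) + ")"
    return TEMPLATE_CALL.sub(substitute, pattern)


def required_entities(pattern):
    """Names of the literal entities a pattern depends on."""
    return set(TEMPLATE_CALL.findall(pattern))


def find_regex_entity(pattern, literal_entities, texts):
    """Search only the texts that contain a literal of every required entity."""
    needed = required_entities(pattern)
    compiled = re.compile(expand_template(pattern, literal_entities))
    results = {}
    for idx, text in enumerate(texts):
        has_literals = all(
            any(lit in text for lit in literal_entities[name]) for name in needed
        )
        if not has_literals:
            continue  # the regex cannot match here, so it is never executed
        spans = [m.span() for m in compiled.finditer(text)]
        if spans:
            results[idx] = spans
    return results


if __name__ == "__main__":
    entities = {"month": ["May", "June"]}
    print(find_regex_entity(r"\d+ of \d+ of {{month}}", entities, ["due 3 of 12 of May", "no dates"]))
```

Because matches are sparse on short texts, most texts fail the literal test and skip the regex entirely, which is where the saving comes from.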
Feeding of entity matches:
- From a Python list of objects, where each object is equivalent to the loaded JSON file. Each object contains a field `kind` with one of two values: `re` or `lit`.
- From a local folder with folders:
  - Structured hierarchically.
Steps of entity recognition:
- Load the entity system:
  - Deserialize all defined entities.
  - Build `LiteralEntityPool`. There are public and private literal entities:
    - Private literal entities will not be reported, only used internally by regex entities.
    - Public literal entities will be reported as entities. NOTE: the boundaries of the public literal entities are calculated afterwards, as Aho-Corasick does not support boundaries.
  - Build `RegexEntityPool` using literals from `LiteralEntityPool`. There are then two kinds of regex entities:
    - The ones that use any literal entity.
    - The ones that do not use any literal entity.
- Process texts and get entities:
  - Get literal entity raw index matches.
  - Literal-based regex entities perform the find only if the ordered set of literal entity matches is satisfied by the literal entity results.
  - Non literal-based regex entities are found using `regex::RegexSet`.
  - Assemble public literal entities, literal-based regex entities and non literal-based regex entities and give the output.
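Two details of the steps above, mapping byte positions to char positions and checking word boundaries after a literal match, can be sketched in Python. This is a minimal illustration under assumptions: the function names are hypothetical, the byte spans are supplied by hand rather than by Aho-Corasick, and a word character is approximated as an alphanumeric character or underscore.

```python
def byte_to_char_offsets(text, byte_spans):
    """Convert (start, end) UTF-8 byte offsets into character offsets."""
    encoded = text.encode("utf-8")
    return [
        (len(encoded[:start].decode("utf-8")), len(encoded[:end].decode("utf-8")))
        for start, end in byte_spans
    ]


def is_word_char(ch):
    """Approximation of the regex \\w class."""
    return ch.isalnum() or ch == "_"


def is_boundary(text, pos):
    """True where \\b would match: word-ness differs on each side of pos."""
    before = pos > 0 and is_word_char(text[pos - 1])
    after = pos < len(text) and is_word_char(text[pos])
    return before != after


def has_word_boundaries(text, start, end):
    """Check a match text[start:end] as if it came from \\b(...)\\b."""
    return is_boundary(text, start) and is_boundary(text, end)


if __name__ == "__main__":
    text = "café mayo, mayonesa"
    # Hand-written byte spans of the literal "mayo"; "é" takes two bytes in UTF-8.
    char_spans = byte_to_char_offsets(text, [(6, 10), (12, 16)])
    print([(span, has_word_boundaries(text, *span)) for span in char_spans])
```

The second occurrence sits inside `mayonesa`, so it fails the boundary check and would not be reported as a public literal entity.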
A pattern in a regex entity falls into one of two categories:
- Pattern that can be compiled by the `regex` crate:
  - Pattern with at least one positive capture group related to a literal entity. The match will be decided by Aho-Corasick and literal entity order. This is a regex where `entities::extract_required_template_structure` returns a non-empty vector.
  - Pattern that does not fit the previous case. This pattern will be matched through `RegexSet`. This is a pattern with `entities::extract_required_template_structure` returning an empty vector.
- Regex that cannot be compiled by the `regex` crate will receive a direct find from the `fancy_regex` crate. This pattern receives an Error from `entities::extract_required_template_structure`.
Naming convention for entity files is:
Literal replacer
This is a very specific function that performs high-performance literal replacement using the Rust `aho_corasick` implementation. It supports parallelization by chunks of text.
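As a rough picture of what chunked literal replacement means, here is a pure Python sketch. It is not the library's function: `build_replacer` and `replace_in_chunks` are hypothetical names, a compiled `re` alternation stands in for `aho_corasick`, and the thread pool only illustrates the chunking scheme (threads in standard CPython give little real speed-up for this CPU-bound work, unlike the Rust implementation).

```python
import re
from concurrent.futures import ThreadPoolExecutor


def build_replacer(replacements):
    """Build a function that replaces every literal key by its mapped value."""
    # Longest literals first, so a longer literal wins over its own prefix.
    ordered = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(lit) for lit in ordered))

    def replace(text):
        return pattern.sub(lambda m: replacements[m.group(0)], text)

    return replace


def replace_in_chunks(texts, replacements, chunk_size=2, workers=2):
    """Apply the replacer over the texts, processed chunk by chunk."""
    replace = build_replacer(replacements)
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves chunk order, so the output order matches the input.
        processed = list(pool.map(lambda chunk: [replace(t) for t in chunk], chunks))
    return [text for chunk in processed for text in chunk]


if __name__ == "__main__":
    mapping = {"NY": "New York", "LA": "Los Angeles"}
    print(replace_in_chunks(["NY and LA", "LA", "none"], mapping))
```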
Performance advice
- Unicode word boundaries may prevent the DFA from being used: https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#unicode-word-boundaries-may-prevent-the-dfa-from-being-used
- There is no problem with using non-greedy matching or having lots of alternations in your regex: https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#resist-the-temptation-to-optimize-regexes
CICD
This repository aims to be a perfect CICD example for a Python+Rust lib based on `pyo3`. For any suggestions (caching, badges, anything, ...), just let me know by opening an issue :)