A Python wrapper library that brings the text processing power of Rust to Python

PyTextRust

A library for easily achieving high performance in regex and text processing from Python, built as a direct wrapper around the Rust regex and text crates.

On short texts, matches are typically sparse. This library focuses on algorithms that acknowledge this sparsity and deliver good performance through simple Python API calls into optimized Rust logic.

Features

Special case

This library gives special treatment to texts that contain only [a-zA-Z0-9ñç ] plus accented vowels, which allows non-Unicode matching over those texts. This is particularly convenient for some Automatic Speech Recognition outputs.

Wherever they can be provided, the following options are available:

unicode: False -> disables Unicode-aware matching, making matching much more efficient (speedups of x6 to x12 are easily achieved).
substitute_bound: True -> substitutes r"\b" in patterns with r"(?-u:\b)", as recommended in the Rust regex performance notes on Unicode word boundaries.
substitute_latin_char: True -> substitutes pkg::constants::LATIN_CHARS_TO_REPLACE in patterns with pkg::constants::LATIN_CHARS_REPLACEMENT, so that the non-Unicode variant can be used without losing the ability to match texts and patterns that contain those Latin chars (take care: it projects them into pkg::constants::LATIN_CHARS_TO_REPLACE both in patterns and texts).
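
As an illustration of what substitute_bound does to a pattern, here is a minimal sketch in plain Python. The helper name substitute_bound and the handling of escaped backslashes are assumptions made for this example, not the library's actual implementation. The resulting pattern uses Rust regex syntax, so it is meant to be handed to the Rust side rather than compiled with Python's re.

```python
import re

# Matches a real \b escape: one that is not preceded by an odd number of
# backslashes (r"\\b" is an escaped backslash followed by a plain "b").
_BOUND = re.compile(r"(?<!\\)((?:\\\\)*)\\b")


def substitute_bound(pattern: str) -> str:
    """Hypothetical helper: swap Unicode word boundaries for ASCII-only ones."""
    # A function is used as replacement so the backslashes are kept verbatim.
    return _BOUND.sub(lambda m: m.group(1) + r"(?-u:\b)", pattern)


print(substitute_bound(r"\buno\b"))  # (?-u:\b)uno(?-u:\b)
```

This sketch does not treat \b inside a character class specially, which a complete implementation would probably need to consider.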

Find

Find patterns in texts, optionally parallelizing by chunks of either patterns or texts.

It uses the efficient regex::RegexSet, which narrows down the set of patterns that need to be evaluated in the matching phase.

The structure of the find function is:

Rust phase:
1. Try to compile each pattern in the list as a regex::Regex. Split the list into valid and invalid patterns.
2. Compile a regex::RegexSet with the valid patterns and apply it over the list of texts. This tells which patterns match in which texts.
3. Run the compiled regex::Regex objects to find matches, only for the subset of (pattern, text) pairs that matched in the regex::RegexSet.
4. Try to compile the invalid patterns with fancy_regex::Regex and find matches over the texts. This reduces the final list of invalid patterns that is handed back to Python.
5. Return the matches found, together with the remaining invalid patterns, to Python.
Python phase:
1. Try to apply all the patterns that failed in Rust, finding them over all the texts. This uses the regex package, which has broader pattern support than the built-in re package.
2. Return the final result.
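
The flow above can be sketched in pure Python. This is only an analogy under stated assumptions: the standard re module stands in for every engine, compile_fast mimics the Rust regex crate by rejecting look-around and backreferences (which that crate does not support), and the function names are hypothetical, not part of the pytextrust API.

```python
import re

# Stand-in for the restrictions of the Rust `regex` crate: it supports
# neither look-around nor backreferences.
_UNSUPPORTED = re.compile(r"\(\?<?[=!]|\\[1-9]")


def compile_fast(pattern):
    """Return a compiled pattern, or None if the 'fast' engine rejects it."""
    if _UNSUPPORTED.search(pattern):
        return None
    try:
        return re.compile(pattern)
    except re.error:
        return None


def find_all(patterns, texts):
    """Return (matches, still_invalid); matches are (pattern, text, start, end)."""
    # Step 1: split patterns into valid and invalid for the fast engine.
    compiled = {i: compile_fast(p) for i, p in enumerate(patterns)}
    valid = {i: c for i, c in compiled.items() if c is not None}
    invalid = [i for i, c in compiled.items() if c is None]

    matches = []
    for t_idx, text in enumerate(texts):
        # Step 2: cheap yes/no prefilter, the role played by RegexSet.
        candidates = [i for i, c in valid.items() if c.search(text)]
        # Step 3: extract positions only for the pairs that passed the filter.
        for i in candidates:
            matches.extend(
                (i, t_idx, m.start(), m.end()) for m in valid[i].finditer(text)
            )

    # Step 4 and Python phase: retry rejected patterns with a richer engine.
    still_invalid = []
    for i in invalid:
        try:
            fallback = re.compile(patterns[i])
        except re.error:
            still_invalid.append(i)
            continue
        for t_idx, text in enumerate(texts):
            matches.extend(
                (i, t_idx, m.start(), m.end()) for m in fallback.finditer(text)
            )
    return sorted(matches), still_invalid


print(find_all([r"\d+", r"(?<=el )dos", r"("], ["uno 21", "el dos"]))
```

The prefilter pays off when matches are sparse: most (pattern, text) pairs are discarded before the more expensive position extraction runs.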

Calling examples

Literal replacer

This is a very specific function that performs high performance literal replacement using the Rust aho_corasick implementation. It supports parallelization by chunks of text.

On top of the aho_corasick replacement, it adds a layer of boundary checking around the literals to replace, controlled by the is_bounded parameter.

If is_bounded is True, then before a found literal is replaced, it is checked that no character of [A-Za-z0-9_] (extended with accents and special word chars, which can be checked in pkg::unicode::check_if_word_bytes) is adjacent to the literal.
The matching type can be chosen from the ones available in aho_corasick::MatchKind, the default being aho_corasick::MatchKind::LeftmostLongest.

More at doc/notebook/doc/literal_replacer.ipynb in the repository.

Calling examples

from pytextrust.replacer import replace_literal_patterns, MatchKind

replace_literal_patterns(
    literal_patterns=["uno", "dos"],
    replacements=["1", "2"],
    text_to_replace=["es el numero uno o el Dos yo soy el veintiuno"],
    is_bounded=True,
    case_insensitive=True,
    match_kind=MatchKind.LeftmostLongest)

This returns the replaced texts and the number of replacements:

(['es el numero 1 o el 2 yo soy el veintiuno'], 2)
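
To make the effect of is_bounded concrete, here is a minimal pure Python sketch of bounded literal replacement. It is not how the library works internally (that is done by the Rust aho_corasick crate); the function names are hypothetical, the word character test is approximated with str.isalnum, and lowercasing is assumed to preserve string length, which holds for the texts in the example.

```python
def is_word_char(ch: str) -> bool:
    # Approximation of a word character: letters (accented ones included),
    # digits and underscore.
    return ch.isalnum() or ch == "_"


def replace_bounded(literals, replacements, text, case_insensitive=True):
    """Replace literals that are not adjacent to a word character."""
    haystack = text.lower() if case_insensitive else text
    needles = [lit.lower() if case_insensitive else lit for lit in literals]
    # Trying longer literals first at each position emulates leftmost-longest.
    order = sorted(range(len(needles)), key=lambda k: -len(needles[k]))

    out, i, count = [], 0, 0
    while i < len(text):
        for k in order:
            needle = needles[k]
            if not needle or not haystack.startswith(needle, i):
                continue
            end = i + len(needle)
            left_ok = i == 0 or not is_word_char(text[i - 1])
            right_ok = end == len(text) or not is_word_char(text[end])
            if left_ok and right_ok:
                out.append(replacements[k])
                i = end
                count += 1
                break
        else:
            # No bounded literal starts here: copy one character and move on.
            out.append(text[i])
            i += 1
    return "".join(out), count


print(replace_bounded(["uno", "dos"], ["1", "2"],
                      "es el numero uno o el Dos yo soy el veintiuno"))
```

The trailing "uno" inside "veintiuno" is left alone because it is preceded by a word character, which matches the output of the library call shown above.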

Entities

Entity matches may overlap, and entities are organized in a hierarchical folder structure.

Literal entities: fast, literal-only entities. They are based on literal alternations and are built from a list of strings, which is like matching (lit_1|...|lit_N). They can be:
- Private: only used by regex entities through composition. Since their only purpose is composition, they are only matched, never reported as found entities.
- Public: calculated and reported. Their reports enforce that the match boundaries are \b, just as if the literal matching were \b(lit_1|...|lit_N)\b. Tech note: positions reported by Aho-Corasick must be mapped from byte positions to char positions.
Regex entities: a list of regex patterns, possibly containing calls to literal entities through a template language. For example, if month is a literal entity, then \d+ of \d+ of {{month}} is a possible entity. Regex entities that depend positively on a literal entity (not through a negative lookbehind or lookahead) are only searched in the texts where that literal entity has been found, minimizing the computational cost.
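
The template idea can be illustrated with a short pure Python sketch. Everything here is an assumption made for the example: the helper names, the expansion into a named group, and the literal values for month. The library's real template handling lives in Rust and may differ.

```python
import re

_TEMPLATE = re.compile(r"\{\{(\w+)\}\}")


def expand_templates(pattern, literal_entities):
    """Hypothetical helper: replace {{name}} with an alternation of literals."""
    def to_alternation(match):
        name = match.group(1)
        # Longer literals first, so the alternation prefers the longest one.
        literals = sorted(literal_entities[name], key=len, reverse=True)
        return "(?P<%s>%s)" % (name, "|".join(re.escape(l) for l in literals))
    return _TEMPLATE.sub(to_alternation, pattern)


def candidate_texts(texts, literals):
    """Keep only the texts where at least one literal occurs."""
    return [t for t in texts if any(lit in t for lit in literals)]


months = {"month": ["may", "march"]}
expanded = expand_templates(r"\d+ of \d+ of {{month}}", months)
for text in candidate_texts(["paid 3 of 10 of march", "nothing here"],
                            months["month"]):
    found = re.search(expanded, text)
    print(found.group(0), "->", found.group("month"))
```

Filtering the texts by literal presence first means the regex is only run where it has a chance to match. In this sketch a template name can only be used once per pattern, because Python does not allow duplicate group names.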

Entity definitions can be fed in two ways:

From a Python list of objects, where each object is equivalent to the loaded JSON of an entity file. Each object contains a field kind with one of two values: re or lit.
From a local folder containing folders:
- Structured hierarchically.

Steps of entity recognition:

Load the entity system:
- Deserialize all defined entities.
- Build the LiteralEntityPool. There are public and private literal entities:
  - Private literal entities are not reported, they are only used internally by regex entities.
  - Public literal entities are reported as entities. NOTE: the bounds of public literal entities are checked afterwards, since Aho-Corasick has no support for boundaries.
- Build the RegexEntityPool using the literals from the LiteralEntityPool. There are two kinds of regex entities:
  - The ones that use any literal entity.
  - The ones that do not use any literal entity.
Process texts and get entities:
- Get the raw index matches of the literal entities.
- Literal-based regex entities are searched only if the ordered sequence of literal entity matches they require is satisfied by the literal entity results.
- Non literal-based regex entities are searched using regex::RegexSet.
Assemble the public literal entities, the literal-based regex entities and the non literal-based regex entities together and return the output.
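
The two post-processing steps that public literal entities need (mapping byte positions to char positions, and checking boundaries after the match) can be sketched in pure Python as follows. The function names are hypothetical and the word character test is an approximation based on str.isalnum. The actual logic is implemented in Rust.

```python
def byte_to_char_table(text):
    """Map the UTF-8 byte offset of each character start to its char index."""
    table, byte_pos = {}, 0
    for char_pos, ch in enumerate(text):
        table[byte_pos] = char_pos
        byte_pos += len(ch.encode("utf-8"))
    table[byte_pos] = len(text)  # offset just past the last character
    return table


def is_word_char(ch):
    return ch.isalnum() or ch == "_"


def bounded_char_span(text, byte_start, byte_end):
    """Return the char span of a byte span, or None if it is not bounded."""
    table = byte_to_char_table(text)
    start, end = table[byte_start], table[byte_end]
    left_ok = start == 0 or not is_word_char(text[start - 1])
    right_ok = end == len(text) or not is_word_char(text[end])
    return (start, end) if left_ok and right_ok else None


text = "año uno veintiuno"
first = text.encode("utf-8").find(b"uno")  # byte offset, as a byte matcher reports
print(first, bounded_char_span(text, first, first + 3))
```

Because "ñ" takes two bytes in UTF-8, the byte offset of the first "uno" is 5 while its char offset is 4, which is why the mapping is needed before reporting positions to Python.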

A pattern in a regex entity falls into one of the following categories:

Patterns that can be compiled by the regex crate:
- Patterns with at least one positive capture group related to a literal entity. The match is decided by Aho-Corasick and the order of the literal entities. This is a regex for which entities::extract_required_template_structure returns a non-empty vector.
- Patterns that do not fit the previous case. These are matched through RegexSet. This is a pattern for which entities::extract_required_template_structure returns an empty vector.
Patterns that cannot be compiled by the regex crate are searched directly with the fancy_regex crate. Such a pattern gets an Error from entities::extract_required_template_structure.
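
A rough Python analogue of this routing, together with the ordered check on literal entity matches, might look like the sketch below. It is a simplification under stated assumptions: required_literals is a hypothetical stand-in for entities::extract_required_template_structure that treats every {{name}} template as required, and the fancy_regex fallback category is left out.

```python
import re

_TEMPLATE = re.compile(r"\{\{(\w+)\}\}")


def required_literals(pattern):
    """Names of the literal entities referenced by the pattern, in order."""
    return _TEMPLATE.findall(pattern)


def route(pattern):
    """Decide which matching strategy a compilable pattern would get."""
    return "literal-gated" if required_literals(pattern) else "regex-set"


def ordered_requirements_met(required, found):
    """Check that the required entity names appear, in order, among the matches.

    `found` is a list of (entity_name, start_position) literal matches.
    """
    names = iter(name for name, _ in sorted(found, key=lambda item: item[1]))
    # `req in names` consumes the iterator, so this is a subsequence test.
    return all(req in names for req in required)


pattern = r"{{day}} of {{month}}"
literal_matches = [("month", 7), ("day", 0)]
print(route(pattern),
      ordered_requirements_met(required_literals(pattern), literal_matches))
```

Only when the ordered check succeeds does it make sense to run the full regex on that text, which is what keeps literal-based regex entities cheap.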

Naming convention for entity files is:

Calling examples

CICD

This repository aims to be a perfect CICD example for a Python+Rust library based on pyo3. If you have any suggestions (caching, badges, anything else), just let me know by opening an issue :)

Useful doc

Learning doc

Reference Rust pattern matching packages

Performance advice

https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#unicode-word-boundaries-may-prevent-the-dfa-from-being-used
there is no problem with using non-greedy matching or having lots of alternations in your regex https://github.com/rust-lang/regex/blob/master/PERFORMANCE.md#resist-the-temptation-to-optimize-regexes

Benchmark by Rust regex author
