a fast regex for objects
Project description
pyrefo: a fast regex for objects
This project is based on refo and the paper Regular Expression Matching: the Virtual Machine Approach. It uses cffi to extend Python with C to speed up processing.
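To make the virtual machine idea concrete, here is a minimal pure-Python sketch of a thread-list VM that runs over a sequence of arbitrary objects. The instruction names (`match_if`, `jmp`, `split`, `accept`), the `vm_match` function and the sample program are hypothetical and chosen for illustration; this is not pyrefo's C implementation, and for simplicity it returns the longest match rather than honouring greedy or non-greedy priorities.

```python
# Hypothetical instruction set, loosely in the spirit of the paper's VM:
#   ("match_if", predicate)  consume one item if predicate(item) is true
#   ("jmp", x)               continue at instruction x
#   ("split", x, y)          continue at both x and y
#   ("accept",)              the pattern has matched


def add_thread(program, threads, pc, seen):
    """Follow jmp/split without consuming input and collect runnable threads."""
    if pc in seen:
        return
    seen.add(pc)
    op = program[pc]
    if op[0] == "jmp":
        add_thread(program, threads, op[1], seen)
    elif op[0] == "split":
        add_thread(program, threads, op[1], seen)
        add_thread(program, threads, op[2], seen)
    else:
        threads.append(pc)


def vm_match(program, items):
    """Return the end index of the longest match anchored at items[0], or None."""
    current = []
    add_thread(program, current, 0, set())
    last_match = None
    for pos in range(len(items) + 1):
        following, seen = [], set()
        for pc in current:
            op = program[pc]
            if op[0] == "accept":
                last_match = pos
            elif pos < len(items) and op[1](items[pos]):
                add_thread(program, following, pc + 1, seen)
        current = following
        if not current:
            break
    return last_match


def literal(value):
    """Build a predicate that accepts exactly one value."""
    return lambda item: item == value


# Hand-compiled program for the pattern "a b* c" over a list of tokens.
PROGRAM = [
    ("match_if", literal("a")),  # 0
    ("split", 2, 4),             # 1
    ("match_if", literal("b")),  # 2
    ("jmp", 1),                  # 3
    ("match_if", literal("c")),  # 4
    ("accept",),                 # 5
]

print(vm_match(PROGRAM, ["a", "b", "b", "c", "d"]))  # 4
```

Because every thread advances in lockstep over the input, the running time stays proportional to the program size times the input length, with no backtracking.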
This project has done the following work:
- fully compatible with the refo API, supporting all patterns and the match, search and finditer methods;
- fixes a bug in the C source included in the paper;
- uses cffi to extend Python with C;
- adds a new feature that supports partial matching;
- adds a new Phrase pattern, which lets 'ab' match the list ['a', 'b', 'c'].
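The idea behind the Phrase pattern can be sketched in plain Python: a phrase string matches when some run of consecutive tokens concatenates to it. The helper `phrase_span` below is hypothetical and only illustrates that idea; pyrefo's actual Phrase semantics and API may differ.

```python
def phrase_span(phrase, tokens):
    """Return (start, end) of the first run of tokens that joins to `phrase`.

    Hypothetical helper for illustration only, not part of pyrefo.
    """
    for start in range(len(tokens)):
        joined = ""
        for end in range(start, len(tokens)):
            joined += tokens[end]
            if joined == phrase:
                return (start, end + 1)
            if not phrase.startswith(joined):
                break  # this run can no longer grow into the phrase
    return None


print(phrase_span("ab", ["a", "b", "c"]))  # (0, 2)
```

This is mainly useful for tokenized text, where a phrase of interest may be split across several tokens by the tokenizer.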
How to use it
"ab" is Literal("a") + Literal("b")
"a*" is Star(Literal("a"))
"aab??" is Literal("a") + Literal("a") + Question(Literal("b"), greedy=False)
"a{3,4}?" is Repetition(Literal("a"), 3, 4, greedy=False)
"(ab)+|(bb)*?" is
from pyrefo import Group, Literal, Plus, Star, match
a = Literal("a")
b = Literal("b")
regex = Plus(a + b) | Star(b + b, greedy=False)
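For readers wondering how `+` and `|` can build such expressions, the following standalone sketch shows one way to do it with operator overloading. The class names mirror the ones above for readability, but these are toy classes written for this example (including the hypothetical `render` method), not pyrefo's own.

```python
class Pattern:
    """Base class: `+` concatenates two patterns, `|` alternates them."""

    def __add__(self, other):
        return Concat(self, other)

    def __or__(self, other):
        return Either(self, other)


class Literal(Pattern):
    def __init__(self, value):
        self.value = value

    def render(self):
        return str(self.value)


class Concat(Pattern):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def render(self):
        return self.left.render() + self.right.render()


class Either(Concat):
    # Naive rendering: no parentheses are added around the alternation.
    def render(self):
        return self.left.render() + "|" + self.right.render()


class Repeat(Pattern):
    suffix = ""

    def __init__(self, inner, greedy=True):
        self.inner, self.greedy = inner, greedy

    def render(self):
        lazy = "" if self.greedy else "?"
        return "(" + self.inner.render() + ")" + self.suffix + lazy


class Star(Repeat):
    suffix = "*"


class Plus(Repeat):
    suffix = "+"


a = Literal("a")
b = Literal("b")
regex = Plus(a + b) | Star(b + b, greedy=False)
print(regex.render())  # (ab)+|(bb)*?
```

Since Python gives `+` higher precedence than `|`, the expression groups the same way the string notation does.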
You can also assign a group to any sub-match and later on retrieve the matched content, for instance:
regex = Group(Plus(a + b), "foobar") | Star(b + b, greedy=False)
m = match(regex, "abab")
print(m.span("foobar"))
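For comparison, the same idea with the standard library's `re` module on a plain string looks like this. It is only a string-regex analogue of the named group above; the output pyrefo prints for its own example is not shown in this description, so none is claimed here.

```python
import re

# String analogue of Group(Plus(a + b), "foobar") | Star(b + b, greedy=False)
pattern = re.compile(r"(?P<foobar>(?:ab)+)|(?:bb)*?")

m = pattern.match("abab")
print(m.span("foobar"))  # (0, 4)

# When the named group does not take part in the match, re reports (-1, -1).
m2 = pattern.match("bb")
print(m2.span("foobar"))  # (-1, -1)
```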
pyrefo offers four search functions: match, search, findall and finditer.
- match: match the pattern starting at the first position
- search: scan forward from the first position until one match is found
- findall: find all matches
- finditer: return an iterator over all matches
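The relationship between the four functions can be sketched on top of a single "match at this position" primitive. Everything below is a hypothetical pure-Python illustration: the `matcher(items, start)` callable, the span tuples and the choice of non-overlapping results are assumptions, not a description of pyrefo's C implementation or its return types.

```python
def match(matcher, items):
    """Try the pattern at position 0 only."""
    end = matcher(items, 0)
    return None if end is None else (0, end)


def search(matcher, items):
    """Try each start position in turn and stop at the first match."""
    for start in range(len(items) + 1):
        end = matcher(items, start)
        if end is not None:
            return (start, end)
    return None


def finditer(matcher, items):
    """Lazily yield non-overlapping match spans from left to right."""
    pos = 0
    while pos <= len(items):
        end = matcher(items, pos)
        if end is None:
            pos += 1
        else:
            yield (pos, end)
            pos = end if end > pos else pos + 1


def findall(matcher, items):
    """Collect every span produced by finditer into a list."""
    return list(finditer(matcher, items))


def run_of_ints(items, start):
    """Example matcher: one or more consecutive ints starting at `start`."""
    end = start
    while end < len(items) and isinstance(items[end], int):
        end += 1
    return end if end > start else None


data = ["x", 1, 2, "y", 3]
print(match(run_of_ints, data))    # None
print(search(run_of_ints, data))   # (1, 3)
print(findall(run_of_ints, data))  # [(1, 3), (4, 5)]
```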
pyrefo offers the following predicates:
- Any
- Literal
- Star
- Plus
- Question
- Group
- Repetition
- Phrase
Performance test
prerequisites
import jieba
text = '为什么在本店买东西?因为物流迅速+品质保证。为什么我购买的每件商品评价都一样呢?因为我买的东西太多了,积累了很多未评价的订单,所以我统一用这段话作为评价内容。如果我用了这段话作为评价,那就说明这款产品非常赞,非常好!'
tokens = list(jieba.cut(text))
CPython
- pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
95.9 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
- refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
1.03 ms ± 7.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- re
import re
%timeit re.search('(物流.*速度)', text)
989 ns ± 4.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
PyPy
- pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
53.4 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
78 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- re
import re
%timeit re.search('(物流.*速度)', text)
347 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
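The `%timeit` lines above require IPython. Outside IPython, a similar measurement can be approximated with the standard `timeit` module, as in the sketch below. The `bench` helper, the shortened sample text and the pattern are assumptions made to keep the example self-contained, so its numbers are not comparable with the figures reported above.

```python
import re
import statistics
import timeit

# Shortened sample text (assumption: only the standard library is available).
text = "为什么在本店买东西?因为物流迅速+品质保证。"


def bench(func, number=10_000, repeat=7):
    """Return (mean, stdev) of the per-call time in seconds over `repeat` runs."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


mean, std = bench(lambda: re.search("物流.*迅速", text))
print(f"{mean * 1e9:.0f} ns ± {std * 1e9:.0f} ns per loop "
      f"(mean ± std. dev. of 7 runs, 10000 loops each)")
```

To time pyrefo or refo the same way, pass a callable that wraps the corresponding `search` call on the token list.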
Download files
File details
Details for the file pyrefo-0.4.tar.gz.
File metadata
- Download URL: pyrefo-0.4.tar.gz
- Upload date:
- Size: 24.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | baa99189e1aec8c392b863b3b2eacd34ca2225b44ddbf0c62136051bae689fba
MD5 | a46bde5a69de5ef06b6a66a569b9bf33
BLAKE2b-256 | a3e6d854c13ba3836cbe17aa989626bddd78c9a3f9065b086ab7e84407088441
File details
Details for the file pyrefo-0.4-cp37-cp37m-macosx_10_13_x86_64.whl.
File metadata
- Download URL: pyrefo-0.4-cp37-cp37m-macosx_10_13_x86_64.whl
- Upload date:
- Size: 30.3 kB
- Tags: CPython 3.7m, macOS 10.13+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.7.1
File hashes
Algorithm | Hash digest
---|---
SHA256 | 75a3ed38ab5a3435b1e3f72fedc662cbfdc8b5b97c62a6c666e8af288831ae91
MD5 | 564faeae4ebd73e9f407576ad122c212
BLAKE2b-256 | a3e6dae36bc33491a4837aea44e7b84861cf9d6faeeea6b04eacb5f01b1d9d7b