pyrefo: a fast regex for objects

This project is based on refo and the paper Regular Expression Matching: the Virtual Machine Approach. It uses cffi to extend Python with C in order to speed up matching.
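To make the virtual machine approach concrete, here is a minimal pure-Python sketch of the thread-list simulation that the paper describes, applied to a sequence of arbitrary items rather than characters. The instruction encoding and the names add_thread, vm_match and PROG are illustrative assumptions for this sketch; this is not pyrefo's C implementation.

```python
# Hypothetical instruction encoding for this sketch (not pyrefo's internals):
#   ("pred", fn)      consume one item if fn(item) is true
#   ("split", x, y)   continue at both x and y
#   ("jmp", x)        continue at x
#   ("match",)        report a match


def add_thread(prog, threads, seen, pc):
    """Follow jmp/split eagerly so `threads` only holds pred/match steps."""
    if pc in seen:
        return
    seen.add(pc)
    op = prog[pc]
    if op[0] == "jmp":
        add_thread(prog, threads, seen, op[1])
    elif op[0] == "split":
        add_thread(prog, threads, seen, op[1])
        add_thread(prog, threads, seen, op[2])
    else:
        threads.append(pc)


def vm_match(prog, seq):
    """Return the end index of the longest match starting at 0, or None."""
    threads = []
    add_thread(prog, threads, set(), 0)
    best = None
    for pos in range(len(seq) + 1):
        nxt, seen = [], set()
        for pc in threads:
            op = prog[pc]
            if op[0] == "match":
                best = pos  # later positions overwrite earlier ones
            elif pos < len(seq) and op[1](seq[pos]):
                add_thread(prog, nxt, seen, pc + 1)
        threads = nxt
        if not threads:
            break
    return best


# Program for the pattern "a b* c" over a list of items.
PROG = [
    ("pred", lambda item: item == "a"),  # 0
    ("split", 2, 4),                     # 1: try another b, or move on to c
    ("pred", lambda item: item == "b"),  # 2
    ("jmp", 1),                          # 3
    ("pred", lambda item: item == "c"),  # 4
    ("match",),                          # 5
]

print(vm_match(PROG, ["a", "b", "b", "c", "d"]))  # 4
```

All threads advance in lockstep over the input, so each item is examined once per live thread and there is no backtracking.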

This project does the following:

  1. is fully compatible with the refo API, supporting all patterns and the match, search, and finditer methods;
  2. fixes a bug in the C source code included in the paper;
  3. uses cffi to extend Python with C;
  4. adds a new feature that supports partial matching;
  5. adds a new Phrase pattern, which lets 'ab' match against the list ['a', 'b', 'c'].
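One possible reading of item 5 is that a phrase matches a run of consecutive list items whose concatenation equals the phrase. The sketch below illustrates that reading in plain Python; phrase_span is a hypothetical helper written for this example, and pyrefo's actual Phrase behaviour may differ.

```python
def phrase_span(phrase, tokens):
    """Return (start, end) of the first run of consecutive tokens whose
    concatenation equals `phrase`, or None if there is no such run.

    Assumption: this is only one interpretation of phrase matching.
    """
    for start in range(len(tokens)):
        joined = ""
        for end in range(start, len(tokens)):
            joined += tokens[end]
            if joined == phrase:
                return (start, end + 1)
            if not phrase.startswith(joined):
                break  # this run can no longer grow into the phrase
    return None


print(phrase_span("ab", ["a", "b", "c"]))  # (0, 2)
```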

How to use it

"ab" is Literal("a")+Literal("b")

"a*" is Star(Literal("a"))

"aab??" is Literal("a")+Literal("a")+Question(Literal("b"), greedy=False)

"a{3,4}?" is Repetition(Literal("a"), 3, 4, greedy=False)

"(ab)+|(bb)*?" is

from pyrefo import Literal, Plus, Star

a = Literal("a")
b = Literal("b")
regex = Plus(a + b) | Star(b + b, greedy=False)

You can also assign a named group to any sub-match and retrieve the matched span later, for instance:

from pyrefo import Group, Plus, Star, match

# a and b are the Literal objects defined above
regex = Group(Plus(a + b), "foobar") | Star(b + b, greedy=False)
m = match(regex, "abab")
print(m.span("foobar"))
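For comparison, the Group example has the same shape as a named group in the standard re module when the input is a plain string. The snippet below uses only re, so it shows what the group span looks like in that module; it makes no claim about pyrefo's exact output.

```python
import re

# A named group around (ab)+, alternated with a lazy (bb)*?.
pattern = re.compile(r"(?P<foobar>(?:ab)+)|(?:bb)*?")

m = pattern.match("abab")
span = m.span("foobar")
print(span)  # (0, 4)
```

The difference is that re works only on strings, while the pattern objects described here can be matched against any sequence of items.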

pyrefo offers four search functions: match, search, findall, and finditer:

  • match: match the pattern starting at the first position
  • search: scan forward from the first position until one match is found
  • findall: find all matches
  • finditer: return an iterator over all matches
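The four names mirror functions in the standard re module. Assuming pyrefo follows the same conventions (the text above suggests this but does not spell it out), the differences can be illustrated with re on a plain string:

```python
import re

text = "xxabxab"
pat = re.compile(r"ab")

anchored = pat.match(text)    # None: the pattern does not match at position 0
first = pat.search(text)      # first match anywhere in the input
all_hits = pat.findall(text)  # every non-overlapping match, as a list
spans = [m.span() for m in pat.finditer(text)]  # lazily produced match objects

print(anchored)      # None
print(first.span())  # (2, 4)
print(all_hits)      # ['ab', 'ab']
print(spans)         # [(2, 4), (5, 7)]
```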

pyrefo offers the following patterns and predicates:

  • Any
  • Literal
  • Star
  • Plus
  • Question
  • Group
  • Repetition
  • Phrase

Performance test

Prerequisites

import jieba
text = '为什么在本店买东西?因为物流迅速+品质保证。为什么我购买的每件商品评价都一样呢?因为我买的东西太多了,积累了很多未评价的订单,所以我统一用这段话作为评价内容。如果我用了这段话作为评价,那就说明这款产品非常赞,非常好!'
tokens = list(jieba.cut(text))

CPython

  • pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
95.9 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
1.03 ms ± 7.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
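  • re
import re
%timeit re.search('(物流.*速度)', text)
989 ns ± 4.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The re pattern here and in the PyPy run below ends in '速度', which does not occur in text (the pyrefo and refo patterns end in '迅速'), so the re figures time a search that finds no match. re also runs on the raw string rather than on the token list.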

PyPy

  • pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
53.4 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
78 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • re
import re
%timeit re.search('(物流.*速度)', text)
347 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
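The %timeit lines above need IPython. As a rough sketch, a similar measurement can be scripted with the standard timeit module, shown here for the re case only so that it runs without extra packages. The helper best_time and the shortened sample string are assumptions for this example, and it reports the best run rather than the mean that %timeit prints. The last lines work out the speed-up ratios implied by the figures reported above.

```python
import re
import timeit


def best_time(func, number=10000, repeat=7):
    """Best per-call time in seconds over `repeat` runs of `number` calls."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


# Shortened sample taken from the benchmark text (assumption: any string works here).
sample = "因为物流迅速+品质保证"
per_call = best_time(lambda: re.search("(物流.*迅速)", sample))
print(f"re.search: {per_call * 1e9:.0f} ns per call")

# Ratios implied by the reported means (refo time divided by pyrefo time).
cpython_speedup = 1.03e-3 / 95.9e-6  # roughly 10.7
pypy_speedup = 78e-6 / 53.4e-6       # roughly 1.46
print(f"CPython: {cpython_speedup:.1f}x, PyPy: {pypy_speedup:.2f}x")
```

On CPython the reported figures put pyrefo at roughly ten times faster than refo. On PyPy the standard deviations (28 µs and 35.8 µs) are large relative to the means, so the smaller ratio there is within measurement noise.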
