pyrefo: a fast regex for objects

This project is based on refo and the paper Regular Expression Matching: the Virtual Machine Approach. It uses cffi to extend Python with C in order to speed up processing.
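
As a rough illustration of the virtual machine approach named above, the sketch below hand-writes a small program for a pattern of the shape Literal + Star(Any) + Literal and runs it over a list of tokens, keeping a prioritized list of threads. The instruction names, the tuple encoding, and the functions add_thread and vm_match are assumptions made for this example; they are not pyrefo's C implementation or its API.

```python
# Each instruction is a tuple (all names here are illustrative assumptions):
#   ("item", x)      consume one input element equal to x
#   ("any",)         consume any one input element
#   ("split", a, b)  continue at both a and b; a has priority
#   ("jmp", a)       continue at a
#   ("match",)       report success


def add_thread(program, threads, pc, seen):
    """Follow jmp/split without consuming input; queue every other instruction."""
    if pc in seen:
        return
    seen.add(pc)
    op = program[pc]
    if op[0] == "jmp":
        add_thread(program, threads, op[1], seen)
    elif op[0] == "split":
        add_thread(program, threads, op[1], seen)
        add_thread(program, threads, op[2], seen)
    else:
        threads.append(pc)


def vm_match(program, sequence):
    """Return the end index of a match anchored at position 0, or None."""
    items = list(sequence)
    current = []
    add_thread(program, current, 0, set())
    matched_end = None
    for pos in range(len(items) + 1):
        following = []
        seen = set()
        for pc in current:
            op = program[pc]
            if op[0] == "match":
                matched_end = pos
                break  # threads after this one have lower priority
            if pos < len(items) and (
                op[0] == "any" or (op[0] == "item" and op[1] == items[pos])
            ):
                add_thread(program, following, pc + 1, seen)
        current = following
        if not current:
            break
    return matched_end


# Hand-compiled program for: "logistics", then any tokens (greedy), then "fast"
PROGRAM = [
    ("item", "logistics"),
    ("split", 2, 4),
    ("any",),
    ("jmp", 1),
    ("item", "fast"),
    ("match",),
]

tokens = ["logistics", "is", "very", "fast", "!"]
print(vm_match(PROGRAM, tokens))    # 4
print(vm_match(PROGRAM, ["slow"]))  # None
```

Because the elements are compared with ==, the same loop works for any sequence of Python objects, not only characters.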

This project does the following:

  1. is fully compatible with the refo API: it supports all patterns and the match, search, and finditer methods;
  2. fixes a bug in the C source included in the paper;
  3. uses cffi to extend Python with C;
  4. adds a new feature that supports partial matching;
  5. adds a new Phrase pattern, which lets 'ab' match the list ['a', 'b', 'c'].
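
The idea behind the Phrase pattern can be sketched in plain Python. This is only a guess at the behaviour, based on the one-line description in item 5: a phrase is taken to match when consecutive tokens concatenate to it. The helper phrase_span is hypothetical and is not part of pyrefo.

```python
def phrase_span(phrase, tokens, start=0):
    """Return (start, end) if tokens[start:end] concatenate to `phrase`, else None.

    Hypothetical helper for illustration; not pyrefo's Phrase implementation.
    """
    built = ""
    for end in range(start, len(tokens)):
        built += tokens[end]
        if built == phrase:
            return (start, end + 1)
        if not phrase.startswith(built):
            # The tokens have diverged from the phrase; no match from `start`.
            return None
    return None


print(phrase_span("ab", ["a", "b", "c"]))  # (0, 2)
print(phrase_span("ac", ["a", "b", "c"]))  # None
```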

How to use it

"ab" is Literal("a")+Literal("b")

"a*" is Star(Literal("a"))

"aab??" is Literal("a") + Literal("a") + Question(Literal("b"), greedy=False)

"a{3,4}?" is Repetition(Literal("a"), 3, 4, greedy=False)

"(ab)+|(bb)*?" is

from pyrefo import Literal, Plus, Star

a = Literal("a")
b = Literal("b")
regex = Plus(a + b) | Star(b + b, greedy=False)

You can also assign a group to any sub-match and retrieve the matched content later, for instance:

from pyrefo import Group, match

regex = Group(Plus(a + b), "foobar") | Star(b + b, greedy=False)
m = match(regex, "abab")
# span of the sub-match captured by the "foobar" group
print(m.span("foobar"))
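
For readers who know the standard re module better, the same pattern can be written as an ordinary character regex with a named group. This is a stdlib analogue only, and it makes no claim about what pyrefo prints for the example above.

```python
import re

# (ab)+ captured as the named group "foobar", or a lazy (bb)*?
pattern = re.compile(r"(?P<foobar>(?:ab)+)|(?:bb)*?")
m = pattern.match("abab")
print(m.span("foobar"))  # (0, 4)
```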

pyrefo offers four search functions: match, search, findall, and finditer.

  • match: match the pattern starting at the first position
  • search: scan from the first position until the first match is found
  • findall: find all matches
  • finditer: return an iterator over all matches
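
The four names mirror functions in Python's standard re module. Assuming pyrefo follows the same semantics (the descriptions above suggest so, but its exact return types are not stated here), the difference between them can be seen with re on a plain string:

```python
import re

text = "xxabab ab"
pattern = re.compile(r"(?:ab)+")

# match only succeeds if the pattern matches at the first position
print(pattern.match(text))  # None

# search scans forward and stops at the first match
first = pattern.search(text)
print(first.span())  # (2, 6)

# findall collects every non-overlapping match
print(pattern.findall(text))  # ['abab', 'ab']

# finditer yields match objects lazily
spans = [m.span() for m in pattern.finditer(text)]
print(spans)  # [(2, 6), (7, 9)]
```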

pyrefo offers the following predicates and patterns:

  • Any
  • Literal
  • Star
  • Plus
  • Question
  • Group
  • Repetition
  • Phrase

Performance test

Prerequisites

import jieba
text = '为什么在本店买东西?因为物流迅速+品质保证。为什么我购买的每件商品评价都一样呢?因为我买的东西太多了,积累了很多未评价的订单,所以我统一用这段话作为评价内容。如果我用了这段话作为评价,那就说明这款产品非常赞,非常好!'
tokens = list(jieba.cut(text))

CPython

  • pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
95.9 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
1.03 ms ± 7.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • re
import re
%timeit re.search('(物流.*速度)', text)
989 ns ± 4.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

PyPy

  • pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
53.4 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
78 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • re
import re
%timeit re.search('(物流.*速度)', text)
347 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
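
The %timeit lines above are IPython magics. Outside IPython, a similar measurement can be taken with the standard timeit module. The sketch below times a re search on a shortened sample string so that it runs without jieba or pyrefo installed; the pattern, the function name run_once, and the loop counts are illustrative choices, and it reports the best of 7 runs rather than the mean and standard deviation that %timeit prints.

```python
import re
import timeit

# Shortened sample text and pattern, for illustration only
text = "为什么在本店买东西?因为物流迅速+品质保证。"
pattern = re.compile("物流.*迅速")


def run_once():
    """The statement being timed."""
    return pattern.search(text)


number = 10000  # loops per run
best = min(timeit.repeat(run_once, number=number, repeat=7))
per_loop_us = best / number * 1e6
print(f"{per_loop_us:.3f} µs per loop (best of 7 runs, {number} loops each)")
```

To time pyrefo or refo the same way, run_once would call their search function on the token list instead.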

Download files

Source Distribution

pyrefo-0.3.tar.gz (24.7 kB)

Uploaded Source

Built Distribution

pyrefo-0.3-cp37-cp37m-macosx_10_13_x86_64.whl (30.3 kB)

Uploaded CPython 3.7m macOS 10.13+ x86-64
