pyrefo: a fast regex for objects

This project is based on refo and the paper Regular Expression Matching: the Virtual Machine Approach. It uses cffi to extend Python with C in order to speed up matching.

This project does the following:

  1. is fully compatible with the refo API, supporting all patterns and the match, search, and finditer methods;
  2. fixes a bug in the C source code included in the paper;
  3. uses cffi to extend Python with C;
  4. adds a new feature: support for partial matches;
  5. adds a new Phrase pattern, which lets 'ab' match against the list ['a', 'b', 'c'].
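
To give a feel for the virtual machine approach the project builds on, here is a minimal sketch of a backtracking VM that runs a tiny instruction set (lit, any, jmp, split, match) over a list of objects instead of a string. It is a toy in pure Python: the instruction names, the run function and the sample program are assumptions made for illustration, and it is not pyrefo's C implementation.

```python
# Toy backtracking VM over a sequence of objects (hypothetical, for illustration only).

def run(program, seq, pc=0, pos=0):
    """Return the end index of a match starting at seq[pos], or None."""
    while True:
        op = program[pc]
        if op[0] == "match":
            return pos
        if op[0] == "lit":        # consume one item equal to op[1]
            if pos >= len(seq) or seq[pos] != op[1]:
                return None
            pc, pos = pc + 1, pos + 1
        elif op[0] == "any":      # consume any single item
            if pos >= len(seq):
                return None
            pc, pos = pc + 1, pos + 1
        elif op[0] == "jmp":      # unconditional jump
            pc = op[1]
        elif op[0] == "split":    # try branch op[1] first, then fall back to op[2]
            end = run(program, seq, op[1], pos)
            if end is not None:
                return end
            pc = op[2]


# Program for: "fast", then any items (non-greedy), then "shipping"
PROGRAM = [
    ("lit", "fast"),       # 0
    ("split", 4, 2),       # 1: prefer leaving the loop, which makes the star non-greedy
    ("any",),              # 2
    ("jmp", 1),            # 3
    ("lit", "shipping"),   # 4
    ("match",),            # 5
]

tokens = ["fast", "and", "cheap", "shipping", "today"]
end = run(PROGRAM, tokens)
print(tokens[:end])
```

Because the items are compared with ordinary equality, the same loop works for any Python objects, which is the sense in which such a regex is "for objects".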

How to use it

"ab" is Literal("a") + Literal("b")

"a*" is Star(Literal("a"))

"aab??" is Literal("a") + Literal("a") + Question(Literal("b"), greedy=False)

"a{3,4}?" is Repetition(Literal("a"), 3, 4, greedy=False)

"(ab)+|(bb)*?" is

from pyrefo import Literal, Plus, Star
a = Literal("a")
b = Literal("b")
regex = Plus(a + b) | Star(b + b, greedy=False)
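
The `+` and `|` syntax above relies on Python operator overloading. As a rough sketch of how such combinators can be built, the toy classes below (Node, Lit, Seq, Alt and Repeat are hypothetical names, not pyrefo's classes) assemble a pattern tree and render it back to conventional regex notation. The rendering is simplified and does not add grouping for nested alternations.

```python
class Node:
    """Base class: '+' concatenates two patterns, '|' alternates between them."""

    def __add__(self, other):
        return Seq(self, other)

    def __or__(self, other):
        return Alt(self, other)


class Lit(Node):
    def __init__(self, value):
        self.value = value

    def render(self):
        return self.value


class Seq(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def render(self):
        return self.left.render() + self.right.render()


class Alt(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def render(self):
        return self.left.render() + "|" + self.right.render()


class Repeat(Node):
    def __init__(self, inner, symbol, greedy=True):
        self.inner, self.symbol, self.greedy = inner, symbol, greedy

    def render(self):
        # A trailing '?' marks the non-greedy form, as in conventional regex syntax.
        suffix = "" if self.greedy else "?"
        return "(" + self.inner.render() + ")" + self.symbol + suffix


a, b = Lit("a"), Lit("b")
tree = Repeat(a + b, "+") | Repeat(b + b, "*", greedy=False)
rendered = tree.render()
print(rendered)
```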

You can also assign a group to any sub-match and later on retrieve the matched content, for instance:

from pyrefo import Group, match
# a, b, Plus and Star are the same as in the previous example
regex = Group(Plus(a + b), "foobar") | Star(b + b, greedy=False)
m = match(regex, "abab")
print(m.span("foobar"))
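
For readers more familiar with the standard library, the closest analogue is a named group in the re module. The sketch below uses only re, not pyrefo, and is meant as a point of comparison rather than a statement about pyrefo's return values.

```python
import re

# Same shape as the pattern above: a named group around (ab)+, or a non-greedy (bb)*
m = re.match(r"(?P<foobar>(?:ab)+)|(?:bb)*?", "abab")
span = m.span("foobar")
print(span)  # (0, 4)
```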

pyrefo offers four matching functions: match, search, findall, and finditer.

  • match: match the pattern starting at the first position
  • search: scan forward from the first position until one match is found
  • findall: find all matches
  • finditer: return an iterator over all matches
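
The four names mirror those of the standard re module. The sketch below shows how the re versions differ from one another on a plain string. It assumes pyrefo's functions behave analogously on sequences, which the descriptions above suggest but do not spell out, and the exact return types in pyrefo may differ.

```python
import re

pattern = re.compile(r"ab")
text = "xxabyyab"

anchored = pattern.match(text)                       # only tries position 0
first = pattern.search(text)                         # first match anywhere
everything = pattern.findall(text)                   # all matches, as a list
spans = [m.span() for m in pattern.finditer(text)]   # all matches, lazily

print(anchored)        # None
print(first.span())    # (2, 4)
print(everything)      # ['ab', 'ab']
print(spans)           # [(2, 4), (6, 8)]
```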

pyrefo offers the following patterns:

  • Any
  • Literal
  • Star
  • Plus
  • Question
  • Group
  • Repetition
  • Phrase

Performance test

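Prerequisites

import jieba
# Sample Chinese product review, tokenized with jieba. The patterns below look for '物流' ("logistics") followed later by '迅速' ("fast").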
text = '为什么在本店买东西?因为物流迅速+品质保证。为什么我购买的每件商品评价都一样呢?因为我买的东西太多了,积累了很多未评价的订单,所以我统一用这段话作为评价内容。如果我用了这段话作为评价,那就说明这款产品非常赞,非常好!'
tokens = list(jieba.cut(text))

CPython

  • pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
95.9 µs ± 472 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  • refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
1.03 ms ± 7.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • re
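import re
# Note: this pattern ends in '速度', which does not occur in text, so this run times a search that finds no match.
%timeit re.search('(物流.*速度)', text)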
989 ns ± 4.69 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

PyPy

  • pyrefo
from pyrefo import search, Group, Star, Any, Literal
%timeit search(Group(Literal('物流') + Star(Any()) + Literal('迅速'), 'a'), tokens)
53.4 µs ± 28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • refo
import refo
%timeit refo.search(refo.Group(refo.Literal('物流') + refo.Star(refo.Any()) + refo.Literal('迅速'), 'a'), tokens)
78 µs ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • re
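import re
# Note: this pattern ends in '速度', which does not occur in text, so this run times a search that finds no match.
%timeit re.search('(物流.*速度)', text)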
347 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
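
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement could be taken with the standard timeit module, as in the sketch below. It times only the re baseline on a shortened sample string (an assumption made to keep the example self-contained), reports the best of 7 runs rather than mean ± standard deviation, and is not expected to reproduce the figures above.

```python
import re
import timeit

# Shortened sample text (assumption: the full review text is not needed to show the method)
text = "因为物流迅速+品质保证"
pattern = re.compile("物流.*迅速")

loops = 10_000
runs = timeit.repeat(lambda: pattern.search(text), number=loops, repeat=7)

# Convert the best total time into nanoseconds per call
per_call_ns = min(runs) / loops * 1e9
print(f"{per_call_ns:.0f} ns per call (best of 7 runs, {loops} loops each)")
```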
