Skip to main content

High performance Trie and Ahocorasick automata (AC automata) for python

Project description

cyac

Python 2.7 Python 3.4+ High performance Trie & Keyword Match & Replace Tool.

It's implemented by cython, and will be compiled to cpp. The trie data structure is cedar, which is an optimized double array trie. it supports Python2.7 and 3.4+. It supports pickle to dump and load.

If you found this useful please give a star!

Quick Start

This module is written in cython. You need cython installed.

pip install cyac

Then create a Trie

>>> from cyac import Trie
>>> trie = Trie()

add/get/remove keyword

>>> trie.insert(u"哈哈") # return keyword id in trie, return -1 if doesn't exist
>>> trie.get(u"哈哈") # return keyword id in trie, return -1 if doesn't exist
>>> trie.remove(u"呵呵") # return keyword in trie
>>> trie[id] # return the word corresponding to the id
>>> trie[u"呵呵"] # similar to get but it will raise exeption if doesn't exist
>>> u"呵呵" in trie # test if the keyword is in trie

get all keywords

>>> for key, id_ in trie.items():
>>>     print(key, id_)

prefix/ predict

>>> # return the string in the trie which starts with given string
>>> for id_ in trie.predict(u"呵呵"):
>>>     print(id_)
>>> # return the prefix of given string which is in the trie.
>>> for id_, len_ in trie.prefix(u"呵呵"):
>>>     print(id_, len_)

trie extract,replace

>>> python_id = trie.insert(u"python")
>>> trie.replace_longest("python", {python_id: u"hahah"}, set([ord(" ")])) # the second parameter is seperator. If you specify seperators. it only matches strings tween seperators. e.g. It won't match 'apython'
>>> for id_, start, end in trie.match_longest(u"python", set([ord(" ")])):
>>>     print(id_, start, end)

Aho Corasick extract

>>> ac = AC.build([u"python", u"ruby"])
>>> for id, start, end in ac.match(u"python ruby"):
>>>     print(id, start, end)

Export to File, then we can use mmap to load file, share data between processes.

>>> ac = AC.build([u"python", u"ruby"])
>>> ac.save("filename")
>>> ac.to_buff(buff_object)

Init from Python Buffer

>>> import mmap
>>> with open("filename", "r+b") as bf:
        buff_object = mmap.mmap(bf.fileno(), 0)
>>> AC.from_buff(buff_object, copy=True) # it allocs new memory
>>> AC.from_buff(buff_object, copy=False) # it shares memory

Multi Process example

import mmap
from multiprocessing import Process
from cyac import AC

def get_mmap():
    with open("random_data", "r+b") as bf:
        buff_object = mmap.mmap(bf.fileno(), 0)
    ac_trie = AC.from_buff(buff_object, copy=False)
    # Do your aho searches here. "match" function is process safe.

processes_list = list()
for x in range(0, 6):
    p = Process(
        target=get_mmap,
    )
    p.start()
    processes_list.append(p)
for p in processes_list:
    p.join()

For more information about multiprocessing and memory analysis in cyac, see this issue.

Thread safety

The function "match" of the AC automaton is thread/process safe. It is possible to find matches in parrallel with a shared AC automaton, but not write/append patterns to it.

Performance

On Ubuntu 14.04.5/Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz.

Trie

Compared With HatTrie, Horizon axis is token num. Vertical axis is used time(seconds).

Insert

insert performance

Get

get performance

Remove

remove performance

KeyWord Extract/Replace

Compared With flashText. Regular Expression is too slow in this task (See flashText's bench mark). Horizon axis is char num to be match. Vertical axis is used time(seconds).

extract performance replace performance

Aho Corasick Algorithm

Compared With pyahocorasick, Horizon axis is char num to be match. Vertical axis is used time(seconds). ac performance

Unicode

>>> len(char.lower()) == len(char) # this is always true in python2, but not in python3
>>> len(u"İstanbul") != len(u"İstanbul".lower()) # in python3

In case insensitive matching, this library take care of the fact, and returns correct offset.

Run Test

python setup.py build

PYTHONPATH=$(pwd)/build/BUILD_DST python3 tests/test_all.py
PYTHONPATH=$(pwd)/build/BUILD_DST python3 bench/bench_*.py

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cyac-1.11.tar.gz (75.9 kB view details)

Uploaded Source

Built Distribution

cyac-1.11-cp312-cp312-macosx_14_0_arm64.whl (270.8 kB view details)

Uploaded CPython 3.12 macOS 14.0+ ARM64

File details

Details for the file cyac-1.11.tar.gz.

File metadata

  • Download URL: cyac-1.11.tar.gz
  • Upload date:
  • Size: 75.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for cyac-1.11.tar.gz
Algorithm Hash digest
SHA256 c0a53f334949872f7d8835553418cca30fc7807e36e1fa1900ce858961060e60
MD5 fc2b94776153a0083a7d6c1cc357acd0
BLAKE2b-256 cf0d2188dcae83786bafd07455e2114baa7acc81825d259e2a0f7233f891a924

See more details on using hashes here.

File details

Details for the file cyac-1.11-cp312-cp312-macosx_14_0_arm64.whl.

File metadata

File hashes

Hashes for cyac-1.11-cp312-cp312-macosx_14_0_arm64.whl
Algorithm Hash digest
SHA256 8c456489c83ecbb92968b3a970c871a7d0aee431b9d31c9c3c7b1bdb8b7aa605
MD5 ec373c46cb43d0bafd4e0b102f85ae2e
BLAKE2b-256 7316b0d5b49c14dd9b0b1cd5dc375a2269d28d6e1116a0495d5c3934332e089a

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page