Skip to main content

suffix automaton by words, to get text common substrings

Project description

SuffixAutomaton 后缀自动机

find LCS (longest common substrings) by suffix automaton

usage

pip install SuffixAutomaton

from SuffixAutomaton import SuffixAutomaton, lcs1, lcs2, logger

raw = """
ASE : International Conference on Automated Software Engineering
ESEC/FSE : ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ICSE : International Conference on Software Engineering
ISSTA : The International Symposium on Software Testing and Analysis
OOPSLA : Conference on Object-Oriented Programming Systems, Languages, and Applications
OSDI : Operating Systems Design and Implementation
PLDI : ACM SIGPLAN conference on Programming Language Design and Implementation
POPL : ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages
SOSP : ACM Symposium on Operating Systems Principles
"""
doc = raw.strip().splitlines()
doc = [x.split() for x in doc]
# for tokens
logger.info(lcs1(doc[1], doc[2], output_lcs=True))  # [(14, 2, 5, ['Software', 'Engineering'])]
logger.info(lcs2(doc[0], doc[1:4], output_lcs=True))  # [(1, 1, [':']), (4, 1, ['on']), (6, 1, ['Software'])]
logger.info(lcs1(doc[1], doc[2], 1, output_lcs=True)) # [(1, 1, 1, [':']), (7, 1, 3, ['Conference']), (10, 1, 4, ['on']), (14, 2, 5, ['Software', 'Engineering'])]
logger.info(lcs2(doc[0], doc[1:4], 1, output_lcs=True)) # [(1, 1, [':']), (4, 1, ['on']), (6, 1, ['Software'])]
logger.info(lcs2(doc[0], doc[1:4], 1, output_lcs=False)) # [(1, 1, None), (4, 1, None), (6, 1, None)]

# for chars
poet = "江天一色无纤尘皎皎空中孤月轮 江畔何人初见月江月何年初照人 人生代代无穷已江月年年望相似 不知江月待何人但见长江送流水"
doc = poet.split()   
logger.info(lcs1(doc[1], doc[3], output_lcs=True))  #  [(2, 2, 5, '何人'), (7, 2, 2, '江月')]
logger.info(lcs1(doc[1], doc[3], 1, output_lcs=True)) # [(0, 1, 10, '江'), (2, 2, 5, '何人'), (5, 1, 8, '见'), (7, 2, 2, '江月')]
# for lcs of doc
logger.info(lcs2(doc[2], doc[2:4], output_lcs=True))  # [(7, 2, '江月')]
logger.info(lcs2(doc[2], doc[2:4], 1 ,output_lcs=True)) # [(0, 1, '人'), (7, 2, '江月')]
# faster when iterally
sam = SuffixAutomaton(doc[0])
for x in doc[1:]:
    print((x, sam.lcs1(x, output_lcs=True)))
"""
('江畔何人初见月江月何年初照人', [(0, 1, 0, '江'), (12, 1, 6, '月')])
('人生代代无穷已江月年年望相似', [(0, 1, 7, '江'), (4, 1, 4, '无'), (12, 1, 8, '月')])
('不知江月待何人但见长江送流水', [(0, 1, 2, '江'), (12, 1, 3, '月')])
"""

# lcs() -> [(str, start, cand_start)], sort in length decending. may overlap. 
logger.info(lcs2("布架 拖把抹布悬挂沥水洁具架 ", ["抹布架"], 1, output_lcs=True))  # [(0, 2, '布架'), (5, 2, '抹布'), (13, 1, '架')]

feature

  • suffix automaton [in words] 可分词后缀自动机
  • [Longest] Common Substring of two lines 两文[最长]共串
  • [Longest] Common Substring of document 多文[最长]共串

inspired by

参照:https://www.cnblogs.com/shld/p/10444808.html
讲解:https://www.cnblogs.com/zjp-shadow/p/9218214.html
详解:https://www.cnblogs.com/1625--H/p/12416198.html
证明:https://oi-wiki.org/string/sam/
题解:https://www.cnblogs.com/Lyush/archive/2013/08/25/3281546.html https://www.cnblogs.com/mollnn/p/13175736.html

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

SuffixAutomaton-0.1.6.tar.gz (6.1 kB view details)

Uploaded Source

Built Distribution

SuffixAutomaton-0.1.6-py3-none-any.whl (6.8 kB view details)

Uploaded Python 3

File details

Details for the file SuffixAutomaton-0.1.6.tar.gz.

File metadata

  • Download URL: SuffixAutomaton-0.1.6.tar.gz
  • Upload date:
  • Size: 6.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.2

File hashes

Hashes for SuffixAutomaton-0.1.6.tar.gz
Algorithm Hash digest
SHA256 7a02e109880e30e0f4aea51b5454991ac22e737dd3a46201f29a2712e8494115
MD5 0c177657827caabfd6687041f9128e73
BLAKE2b-256 0291d2c1e28282cd598db5d40e6ce706e4c0737522e76e322cdb5e3780459c0e

See more details on using hashes here.

File details

Details for the file SuffixAutomaton-0.1.6-py3-none-any.whl.

File metadata

File hashes

Hashes for SuffixAutomaton-0.1.6-py3-none-any.whl
Algorithm Hash digest
SHA256 267e9dcec2bf61652143bed81bddf721422e7e703429fc02df2693fefff07df9
MD5 f27a023c0a1484c38b7ef19831ceaa7d
BLAKE2b-256 7404f869f36a19b9fe167d5464059fde2866bda85afaa720d42cf0f95681909f

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page