Skip to main content

Common Strings of Multiple Strings - A fast Python library written in C++

Project description

GitHub

A fast Python Common substrings of multiple strings library with C++ implementation

Having a bunch of strings, can I print some substrings which appear K times ? Can I know which is longest substrings ?

Installing

Make sure Cython is installed properly !

Building from source:

cd python
python setup.py install

Usuage and examples

Step 1. Import the library

from common_multiple_strings import PyCommon_multiple_strings
tree = PyCommon_multiple_strings()  # init

Step 2. Build the data structure

Build from list of str:

tree.from_strings(list_str)

or build from file:

tree.from_path(<path_to_file>)

It is noted that sentences in file are presentened line-by-line. Each line is a sentence.

Sample code:

>>> from common_multiple_strings import PyCommon_multiple_strings
>>> tree = PyCommon_multiple_strings()  # init
>>> list_str = [
       'abc',
       'abcxa',
       'xamnb',
       'yamnc',
       'abcd'
    ]
>>> tree.from_strings(list_str)

Step 3. Query

This library introduces 4 types of query:

a) List some substrings appear TIMES times:

tree.query(times=TIMES)

Ouput is a dictionary with key is an integer K, value is a list of some substrings which appear exactly K times.

Sample code:

>>> print(tree.query(times=(2, None)))
{2: ['amn', 'xa', 'n', 'mn'], 3: ['bc', 'abc'], 4: ['c', 'b'], 5: ['a']}
>>> print(tree.query(times=(3, 3)))
{3: ['bc', 'abc']}

b) Length of the longest substring appears TIMES times:

tree.length_longest_substring(times=TIMES)

Ouput is an integer.

Sample code:

>>> print(tree.length_longest_substring(times=(2, None)))
3

c) Lengths of the longest substrings appear TIMES times:

tree.lengths_longest_substrings(times=TIMES)

Ouput is a dictionary with key is an integer K, value is the length of the longest substring which appear exactly K times.

Sample code:

>>> print(tree.lengths_longest_substrings(times=(2, None)))
{2: 3, 3: 3, 4: 1, 5: 1}

d) List some substrings which have length of L and appear TIMES times:

tree.filter_substrings_by_length(length_input=L, times=TIMES)

Ouput is a dictionary with key is an integer K, value is a list of some substrings which appear exactly K times and have length of L.

Sample code:

>>> print(tree.filter_substrings_by_length(length_input=2, times=(2, None)))
{2: ['xa', 'mn'], 3: ['bc']}
>>> print(tree.filter_substrings_by_length(length_input=10, times=(2, None)))
{}

Some Notes:

- Params:

times, default (None, None) is a tuple represents the minimum and maximum appearances of the desired output.

Query substrings appear exactly N times, then times=(N, N).

Query substrings appear more than N times, then times=(N, None).

Query substrings appear less than N times, then times=(None, N).

Query substrings appear less than N times and more than M times, then times=(M, N).

length_input is an integer, represents the length of substrings

- This library accepts various utf8 characters including punctuations, numbers, upper-case characters, ... For more details, checkout alphabet.cpp file. Thanks coccoc-tokenizer for providing these available lists

- Query a) and d) does not output all possible results, but you can use it for analytic purposes.

Algorithm and Complexity

Data structure suffix tree is the core of this library. It encourages fast query and efficient storing.

If the total length of all input strings are L, then the average complexity should be L * log(L).

References

Algorithm from the lecture of String Algorithms and Algrorithms in Computational Biology - Gusfield

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

common_multiple_strings-1.0.1.tar.gz (3.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

common_multiple_strings-1.0.1-py3-none-any.whl (3.6 kB view details)

Uploaded Python 3

File details

Details for the file common_multiple_strings-1.0.1.tar.gz.

File metadata

  • Download URL: common_multiple_strings-1.0.1.tar.gz
  • Upload date:
  • Size: 3.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.9

File hashes

Hashes for common_multiple_strings-1.0.1.tar.gz
Algorithm Hash digest
SHA256 883e3b44eb273b66908e51837c643e540f257d6bc4fafa71e2e4fabbcf7e36a2
MD5 44dd3cec9f119cc666fb8998d2a246c6
BLAKE2b-256 ebe6066feba968c8a1eeb32e4b41399db6bfc003193660126654de80ca542895

See more details on using hashes here.

File details

Details for the file common_multiple_strings-1.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for common_multiple_strings-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 01e49c63fea4f018d3f48d92ee7cd52d822f4d6a503af1bfaa120fe627e7145a
MD5 e076d7709b8f5c66ad45dd2370be36ad
BLAKE2b-256 7a017dc0e69c0ae0fd9778851e72a3d8600272126f15e88961371d794d91a720

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page