Implementation of frequent subgraph mining algorithm gSpan
gSpan
For the Chinese README, see README-Chinese.
gSpan is an algorithm for mining frequent subgraphs.
This program implements gSpan in Python. The GitHub repository is https://github.com/betterenvi/gSpan. The implementation borrows some ideas from gboost.
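To illustrate what "frequent" means here, the sketch below counts the support of single labeled edges in a toy graph database, where the support of a pattern is the number of graphs that contain it at least once. It covers only the one-edge case and is not the code used by this program; the function name `frequent_edges` and the in-memory graph representation are assumptions made for the example.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return the labeled edges contained in at least min_support graphs."""
    support = Counter()
    for vertex_labels, edges in graphs:
        patterns = set()
        for u, v, edge_label in edges:
            # Undirected edge: order the endpoint labels so (A, B) == (B, A).
            a, b = sorted((vertex_labels[u], vertex_labels[v]))
            patterns.add((a, edge_label, b))
        # Each graph adds at most 1 to the support of a pattern,
        # no matter how often the pattern occurs inside that graph.
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}


# Toy database (hypothetical data): each graph is
# (vertex id -> vertex label, list of (u, v, edge label)).
graphs = [
    ({0: "A", 1: "B", 2: "C"}, [(0, 1, "x"), (1, 2, "y")]),
    ({0: "B", 1: "A"}, [(0, 1, "x"), (0, 1, "x")]),
    ({0: "A", 1: "C"}, [(0, 1, "y")]),
]
print(frequent_edges(graphs, min_support=2))  # {('A', 'x', 'B'): 2}
```

gSpan itself goes further and grows larger patterns from such frequent edges, which this sketch does not attempt.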
Undirected Graphs
This program supports undirected graphs and produces the same results as gboost on the dataset graphdata/graph.data.
Directed Graphs
As of 2016-10-29, gboost does not support directed graphs. This program implements gSpan for directed graphs. More specifically, it can mine frequent directed subgraphs that have at least one node that can reach the other nodes in the subgraph. Correctness is not guaranteed, however, because the author has not tested this mode extensively. Several runs on the datasets graphdata/graph.data.directed.1 and graph.data.simple.5 have not revealed any faults.
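The restriction on directed patterns can be made concrete with a small reachability check. The sketch below assumes that "can reach other nodes" means "can reach all other nodes" and tests whether a directed graph has such a node. The function name `has_root` and the edge-list representation are hypothetical and are not part of this program.

```python
from collections import defaultdict, deque


def has_root(num_nodes, edges):
    """Return True if some node can reach every other node along directed edges."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    for start in range(num_nodes):
        # Breadth-first search from `start`, following edge direction.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(seen) == num_nodes:
            return True
    return False


# 0 -> 1 -> 2: node 0 reaches every node, so this pattern qualifies.
print(has_root(3, [(0, 1), (1, 2)]))  # True
# 0 -> 2 <- 1: nodes 0 and 1 cannot reach each other, and 2 reaches nothing.
print(has_root(3, [(0, 2), (1, 2)]))  # False
```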
How to install
This program supports both Python 2 and Python 3.
Method 1
Install this project using pip:
pip install gspan-mining
Method 2
First, clone the project:
git clone https://github.com/betterenvi/gSpan.git
cd gSpan
Optionally, install the project as a third-party library so that you can run it from any directory.
python setup.py install
How to run
The command is:
python -m gspan_mining [-s min_support] [-n num_graph] [-l min_num_vertices] [-u max_num_vertices] [-d True/False] [-v True/False] [-p True/False] [-w True/False] [-h] database_file_name
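When the command has to be launched from a script, it may help to assemble the argument list programmatically. The sketch below does only that; `build_gspan_command` is a hypothetical helper, not part of gspan_mining. It uses only flags that appear in the usage line above and leaves out `-v` and `-w`, whose meaning is not described here.

```python
import sys


def build_gspan_command(database_file_name, min_support=None, num_graph=None,
                        min_num_vertices=None, max_num_vertices=None,
                        directed=False, plot=False):
    """Build the argument list for `python -m gspan_mining` (hypothetical helper)."""
    command = [sys.executable, "-m", "gspan_mining"]
    numeric_options = [
        ("-s", min_support),
        ("-n", num_graph),
        ("-l", min_num_vertices),
        ("-u", max_num_vertices),
    ]
    for flag, value in numeric_options:
        # Options left as None are omitted so the program's defaults apply.
        if value is not None:
            command += [flag, str(value)]
    if directed:
        command += ["-d", "True"]
    if plot:
        command += ["-p", "True"]
    # The database file is the final positional argument.
    command.append(database_file_name)
    return command


command = build_gspan_command("./graphdata/graph.data", min_support=5000, directed=True)
print(" ".join(command[1:]))  # -m gspan_mining -s 5000 -d True ./graphdata/graph.data
```

The resulting list can then be passed to `subprocess.call(command)`, which is available in both Python 2 and Python 3.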
Some examples
- Read graph data from ./graphdata/graph.data and mine undirected subgraphs with a minimum support of 5000
python -m gspan_mining -s 5000 ./graphdata/graph.data
- Read graph data from ./graphdata/graph.data, mine undirected subgraphs with a minimum support of 5000, and visualize the frequent subgraphs (matplotlib and networkx are required)
python -m gspan_mining -s 5000 -p True ./graphdata/graph.data
- Read graph data from ./graphdata/graph.data and mine directed subgraphs with a minimum support of 5000
python -m gspan_mining -s 5000 -d True ./graphdata/graph.data
- Print help info
python -m gspan_mining -h
The author also wrote example code in a Jupyter Notebook, which presents mining results and visualizations. For details, see main.ipynb.
Running time
Environment
- OS: Windows 10
- Python version: Python 2.7.12
- Processor: Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz
- RAM: 8.00 GB
On the dataset ./graphdata/graph.data, the running times are listed below:
Min support | Number of frequent subgraphs | Time |
---|---|---|
5000 | 26 | 51.48 s |
3000 | 52 | 69.07 s |
1000 | 455 | 3 m 49 s |
600 | 1235 | 7 m 29 s |
400 | 2710 | 12 m 53 s |
Reference
gSpan: Graph-Based Substructure Pattern Mining, by X. Yan and J. Han. Proc. 2002 Int. Conf. on Data Mining (ICDM'02).
One C++ implementation of gSpan.