Triclustering and association rule mining using an algorithm based on Suffix Forests and the Frequent Closed Itemset framework.
USER GUIDE
Tri-clustering is a popular data mining technique for uncovering interesting patterns, association rules, and relationships in large datasets. However, it is often computationally expensive and can be challenging to apply at scale.
This program implements a novel approach that combines association rule mining, bi-clustering, and tri-clustering using suffix tree and suffix forest data structures. It is based on the frequent closed itemset framework and requires only a single scan of the generated tree/forest. These data structures reduce memory usage while providing more information about the association rules and frequent patterns.
This guide explains how to use the source code of this Python program, as well as how to use it directly as a Python module by installing it with pip.
We will be briefly discussing the following topics:
- Environment & python installation
- Installation of external Libraries
- Using the source code or using the package
- Transforming input dataset to suitable form
- Integrating the dataset with the python program
- Generating outputs
- Results
1. Environment & Python installation:
I have used a Windows PC with a 64-bit operating system and an x64-based processor to run the Python program.
To run the program, a suitable Python installation is required. I have used Python 3.10.6.
To install Python 3 on your machine, visit the official Python website https://www.python.org and navigate to the Downloads section. Download Python 3.10 or later.
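After installation, you can check which Python version is available on your PATH from a terminal:

```
python --version
```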
2. Installation of external libraries
In this project we use two external libraries on top of the default Python installation:
- Pandas
- PyDot
We use pandas for its convenient handling of CSV files and DataFrames. PyDot is an interface to Graphviz that helps create graph-based diagrams from a Python script.
To install Pandas use: python -m pip install pandas
To install PyDot: python -m pip install pydot
You may additionally need to install Graphviz, which provides the dot executable that PyDot uses to render images.
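For example, Graphviz can typically be installed as follows (these commands are illustrative; use whichever matches your platform, and make sure the dot executable ends up on your PATH):

```
# Debian/Ubuntu
sudo apt-get install graphviz

# macOS (Homebrew)
brew install graphviz

# Windows (Chocolatey), or download the installer from https://graphviz.org/download/
choco install graphviz
```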
3. Using the source code or the package
The source code of the program can be cloned from this git repository.
Source Code: https://github.com/damaclab/SuffixForest-Triclusters-Python
The program is also published on pypi.org as a Python package. The latest version at the time of writing is 2023.5.31.1.
Package link: https://pypi.org/project/triclustering/
Instead of cloning the git repository, the package can be directly installed using the following pip install command.
```
python -m pip install triclustering --user
```
This will also install the dependencies, pandas and pydot, automatically.
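To quickly verify that the package can be imported after installation:

```
python -c "from triclustering import Processor; print(Processor)"
```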
4. Transforming Input Dataset into Suitable Form
The input dataset must have the following three attributes:
- An ID attribute that is unique for each row.
- An item_list attribute that contains comma separated names or values, each name or value representing an item.
- A splitting attribute according to which the dataset can be split into multiple sub-datasets, each of which can then be transformed into an SFD.
We assume that the dataset has an item list column containing comma separated items. If the dataset does not have an item list column and only has an item column (one item per row), suitable grouping and aggregation must be performed to create the item list column.
If the dataset has multiple attributes and we want to form clusters from attribute:value pairs, we can add a column of comma separated (attribute:value) pairs, which then acts as the item list column.
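As an illustrative sketch (the file and column names here are hypothetical and not part of the package), pandas can be used to build the item list column from a one-item-per-row dataset:

```python
import pandas as pd

# Hypothetical raw dataset with one row per (transaction_id, year, item).
df = pd.read_csv("raw_transactions.csv")

# Group by the ID and splitting attributes and join each group's items
# into a single comma separated item_list column.
grouped = (
    df.groupby(["transaction_id", "year"])["item"]
    .apply(lambda items: ",".join(str(i) for i in items))
    .reset_index(name="item_list")
)

grouped.to_csv("transactions.csv", index=False)
```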
5. Integrating the dataset with the Python program
The Python implementation has a module named processor.py. This module acts as the interface between the user and the complete algorithmic process.
First, import the Processor class and create a Processor object.
If we are using the source code from the git repository, we import the class from the processor module:
```python
from processor import Processor

processor = Processor()
```
Otherwise, if we are using the package installed via pip, we import from the package name 'triclustering':
```python
from triclustering import Processor

processor = Processor()
```
The next step is to integrate the input CSV file with the processor object. This is done using the Processor.set_input_dataset() method. The prototype and description of this method are given below.
triclustering.Processor.set_input_dataset()

```python
def set_input_dataset(
    input_file_dir: str,
    input_file_name: str,
    oid_attribute: str,
    item_list_attribute: str,
    split_attribute: str
) -> None
```
Parameters

| Parameter | Datatype | Optional | Description |
| --- | --- | --- | --- |
| input_file_dir | String | No | Path to the directory where the input CSV file (dataset) is located. |
| input_file_name | String | No | File name of the input CSV file, including its file extension. |
| oid_attribute | String | No | Name of the ID column of the dataset. The ID column must have a unique value in each row. |
| item_list_attribute | String | No | Name of the item list column of the dataset. This column should contain comma separated item names; these items form the itemsets. |
| split_attribute | String | No | Name of the column used as the third dimension for tri-clustering. The dataset will be partitioned into multiple SFDs according to the values of this column; the number of unique values in this column determines the number of SFDs created. |
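A sketch of a typical call is shown below. The directory, file name, and column names are hypothetical (they match the example dataset sketched in the previous section) and should be replaced with your own:

```python
from triclustering import Processor

processor = Processor()

# Register the input CSV file and tell the processor which columns hold
# the row ID, the comma separated item list, and the splitting attribute.
processor.set_input_dataset(
    input_file_dir="data",               # directory containing the CSV file
    input_file_name="transactions.csv",  # file name with extension
    oid_attribute="transaction_id",      # unique ID column
    item_list_attribute="item_list",     # comma separated items column
    split_attribute="year",              # third dimension for tri-clustering
)
```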
6. Generating Outputs
Now we can call the process() method on the processor object to generate the output files. A directory named 'output' will be created (if it doesn't already exist) in the same directory as the input file's directory. All the output files will be stored in this output directory.
The prototype and description of this method is given below.
triclustering.Processor.process()

```python
def process(
    min_support_percentage_number_table: float = 0.0,
    min_support_count: int = 1,
    min_confidence: float = 0.0,
    produce_intermediate_imgs: bool = False,
    produce_final_img: bool = False,
    custom_name_mapping: Any | None = None,
    dtype_key: Any | None = None
) -> None
```
Parameters

| Parameter | Datatype | Optional | Default | Description |
| --- | --- | --- | --- | --- |
| min_support_percentage_number_table | Float | Yes | 0.0 | If the provided value is greater than 0.0, that percentage of the total number of rows is used as the minimum support count for constructing the number table. If not provided, a minimum support count of 1 is used for constructing the number table. |
| min_support_count | Integer | Yes | 1 | Minimum support count used for generating FCPs and association rules. If min_support_percentage_number_table is not provided, this value is also used for constructing the number table. This value is embedded in the names of the generated files. |
| min_confidence | Float | Yes | 0.0 | Minimum confidence used for generating the association rules. This value is embedded in the names of the association rule files. |
| produce_intermediate_imgs | Boolean | Yes | False | If set to True, intermediate images of the suffix forest (taken during its construction) are written to the output directory. |
| produce_final_img | Boolean | Yes | False | If set to True, an image of the fully constructed suffix forest is written to the output directory. |
| custom_name_mapping | Dictionary | Yes | None | Adds an extra decoding layer on top of the item numbers. The keys are item names as given in the input CSV file; the values are the names to show in the output instead. |
| dtype_key | Type specifier | Yes | None | Required only if custom_name_mapping is provided. The datatype of the keys in custom_name_mapping. |
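Continuing the sketch from the previous section (the support, confidence, and mapping values below are hypothetical), a call might look like this:

```python
# Mine triclusters and association rules with a minimum support count of 2
# and a minimum confidence of 0.6, and render an image of the final forest.
processor.process(
    min_support_count=2,
    min_confidence=0.6,
    produce_final_img=True,
    # Optional: show friendlier labels in the decoded outputs.
    # Keys must match item names as they appear in the input CSV file.
    custom_name_mapping={"101": "bread", "102": "milk"},
    dtype_key=str,  # datatype of the keys in custom_name_mapping
)
```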
7. Results
A directory named ‘output’ is created in the same directory as the input file’s directory. All the output files are stored in this directory.
These are the results:
Here, x is the minimum support count (integer) and y is the minimum confidence value (float).
- Triclusters:
  - Encoded version is stored in:
    - <input file name>.triclusters.ms=x.encoded.csv
    - <input file name>.triclusters.ms=x.encoded.json
  - Decoded version is stored in:
    - <input file name>.triclusters.ms=x.decoded.csv
- Association Rules:
  - Encoded version of exact association rules is stored in:
    - <input file name>.rule.E.ms=x.mc=y.encoded.csv
    - <input file name>.rule.E.ms=x.mc=y.encoded.json
  - Decoded version of exact association rules is stored in:
    - <input file name>.rule.E.ms=x.mc=y.decoded.csv
  - Encoded version of approximate association rules is stored in:
    - <input file name>.rule.SB.ms=x.mc=y.encoded.csv
    - <input file name>.rule.SB.ms=x.mc=y.encoded.json
  - Decoded version of approximate association rules is stored in:
    - <input file name>.rule.SB.ms=x.mc=y.decoded.csv
  - Encoded version of proper base approximate association rules is stored in:
    - <input file name>.rule.PB.ms=x.mc=y.encoded.csv
    - <input file name>.rule.PB.ms=x.mc=y.encoded.json
  - Decoded version of proper base approximate association rules is stored in:
    - <input file name>.rule.PB.ms=x.mc=y.decoded.csv
- Generator Closure Pairs:
  - <input file name>.generators.ms=x.csv
- Suffix Forest:
  - <input file name>.forest.ms=x.encoded.json
  - The intermediate and final images of the forest are also generated if the produce_intermediate_imgs and produce_final_img flags are set. They are stored in:
    - <input file name>.forest.intermediate.png
    - <input file name>.forest.final.png
- Number Table:
  - The number table (the item name to item number mapping) is stored in <input file name>.number_table.ms=x.csv
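As an example of inspecting the results, the decoded tricluster files can be loaded with pandas. This is only a sketch: the exact file names depend on your input file name and the parameters used, and the path should be adjusted to wherever the output directory was created.

```python
import glob
import pandas as pd

# The 'output' directory is created next to the input file's directory;
# adjust this path to match your setup.
for path in glob.glob("output/*.triclusters.ms=*.decoded.csv"):
    triclusters = pd.read_csv(path)
    print(path)
    print(triclusters.head())
```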
8. References:
[1] Kartick Chandra Mondal, Moumita Ghosh, Rohmatul Fajriyah, Anirban Roy (2022). Introducing Suffix Forest for Mining Tri-clusters from Time Series Data. [Link]
[2] Kartick Chandra Mondal, Nicolas Pasquier, Anirban Mukhopadhyay, Ujjwal Maulik, Sanghamitra Bandyopadhyay (2012). A New Approach for Association Rule Mining and Bi-clustering Using Formal Concept Analysis. [Link]
[3] Kartick Chandra Mondal (2016). Algorithms for Data Mining and Bio-informatics. [Link]
[4] Python Software Foundation (2021). Python Language Reference, version 3.9.6. Retrieved from https://docs.python.org/3/reference/