Biclustering and association rule mining using Suffix Tree and Frequent Closed Itemset based algorithm.
Project description
FIST-Biclustering-Python
Implementation of FIST algorithm for generating Biclusters, Frequent Closed Patterns, Association Rules using Python
Abstract
Association rule mining and biclustering are two popular techniques in data mining that can be used to uncover interesting patterns and relationships in large datasets. However, these techniques are often computationally expensive and can be challenging to apply to large datasets. This paper presents a novel approach that combines association rule mining and biclustering using a suffix tree data structure. It is based on the frequent closed itemsets framework and requires a single scan of the database. This data structure is used to reduce memory usage and improve extraction efficiency, allowing parallel processing of the tree branches. Experimental results show that the proposed algorithm (Frequent Itemset Suffix Tree: FIST) is effective in uncovering meaningful patterns and relationships in large datasets and outperforms existing state-of-the-art algorithms in terms of efficiency and scalability.
USER GUIDE
In this section, we discuss how to use this Python implementation of the algorithm. We briefly cover the following topics:
- Environment & python installation
- Installation of external Libraries
- Using the source code or the package
- Transforming input dataset to suitable form
- Integrating the dataset with the python program and generating outputs
- Results
1. Environment & Python installation:
I used a Windows PC with a 64-bit operating system and an x64-based processor to run the Python program.
To run the program, a suitable Python installation is required. I used Python 3.10.6.
To install Python 3 on your machine, visit the official Python website https://www.python.org, navigate to the Downloads section, and download Python 3.10 or later.
2. Installation of external libraries:
In this project we use two external libraries on top of the default Python installation:
- Pandas
- PyDot
Pandas provides convenient functionality for handling CSV files and DataFrames. PyDot is a Python interface to Graphviz and is used to create graph diagrams from the Python script.
To install Pandas use: python -m pip install pandas
To install PyDot: python -m pip install pydot
You may additionally need to install the Graphviz software itself (so that the dot executable is available) for PyDot to work properly.
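As a quick sanity check, the following minimal sketch confirms that both libraries import and that PyDot can reach the Graphviz dot executable. The graph content and output file name are arbitrary examples, not part of the package.

```python
# check_deps.py - verify that pandas and pydot are installed and that
# pydot can invoke the Graphviz "dot" executable.
import pandas as pd
import pydot

print("pandas", pd.__version__)

# Build a tiny two-node graph and render it; this step fails if Graphviz
# itself is not installed or not on the PATH.
graph = pydot.Dot("check", graph_type="digraph")
graph.add_node(pydot.Node("root"))
graph.add_node(pydot.Node("leaf"))
graph.add_edge(pydot.Edge("root", "leaf"))
graph.write_png("graphviz_check.png")  # arbitrary output file name
print("Graphviz rendering OK")
```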
3. Using the source code or the package
The source code of the program can be cloned from this git repository.
Source Code: https://github.com/Mijanur08/biclusturing-using-suffixTree
The program is also published on pypi.org as a Python package. The latest version is 2023.5.31.5
Package link: https://pypi.org/project/biclustering/
Instead of cloning the git repository, the package can be directly installed using the following pip install command.
```
python -m pip install biclustering --user
```
4. Transforming Input Dataset into Suitable Form
The input dataset must be a CSV file with the following two attributes:
- An ID attribute that is unique for each row.
- An itemsets attribute containing comma-separated names or values, each name or value representing an item.
An example of the input dataset is given below; a sketch for converting a wide table into this format follows the example.
Exampledata.csv
```
ID,Itemsets
O1,"P1,P3,A2,A4"
O2,"P1,P3,A1,A2"
O3,"P2,P5,A3"
O4,"P1,P3,P4,A2,A4"
O5,"P1,P2,P3,P5,A2,A4"
```
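If your data is instead in a wide, one-column-per-item form, a small pandas script can produce the required two-column layout. The sketch below is only an illustration; the column names and values are hypothetical and should be adapted to your dataset.

```python
# make_itemsets.py - convert a wide table into the (ID, Itemsets) format
# expected by FIST. Column names and values here are hypothetical examples.
import pandas as pd

# Wide input: one row per object, one boolean column per item.
wide = pd.DataFrame({
    "ID": ["O1", "O2", "O3"],
    "P1": [1, 1, 0],
    "P3": [1, 1, 0],
    "A2": [1, 1, 0],
    "P2": [0, 0, 1],
})

item_cols = [c for c in wide.columns if c != "ID"]

# For each row, join the names of the columns whose value is non-zero.
wide["Itemsets"] = wide[item_cols].apply(
    lambda row: ",".join(col for col in item_cols if row[col]), axis=1
)

# pandas quotes fields containing commas, matching the example above.
wide[["ID", "Itemsets"]].to_csv("Exampledata.csv", index=False)
```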
5. Integrating the dataset with our Python program and generating output
Our Python implementation has a module named fist.py. This file acts as the interface between the user and the complete algorithmic process.
First, import the FIST class from the fist module and create a FIST instance.
If we are using the source code from the git repository, we import the FIST class directly from the fist module:
```python
from fist import FIST

fist = FIST()
```
Otherwise, if we are using the package installed via pip, we import it from the package name ‘biclustering’:
```python
from biclustering import FIST

fist = FIST()
```
The next step is to call the fist.process method. This method has 9 parameters. The function signature and a description of each parameter are given below:
```python
def process(self, input_file_dir: str, input_dataset_name: str,
            id_attribute: str, itemset_attribute: str,
            max_entries=10000, min_supp_percent=1.0, min_conf_percent=0.0,
            min_supp_count_outputs=1, produce_final_img=False) -> None
```
| Parameter | Datatype | Optional | Default value | Description |
|---|---|---|---|---|
| input_file_dir | String | No | – | Path to the directory where the input CSV file (dataset) is located; the outputs are generated in an output folder inside this directory. |
| input_dataset_name | String | No | – | Filename of the input CSV file, including its file extension. |
| id_attribute | String | No | – | Name of the ID column of the dataset. The ID column must have a unique value for each row. |
| itemset_attribute | String | No | – | Name of the item-list column of the dataset. This column should contain comma-separated item names; these item names form the itemsets. |
| max_entries | Integer | Yes | 10000 | Maximum number of rows to read from the dataset. If the dataset has more rows than max_entries, only the first max_entries rows are read during execution. |
| min_supp_percent | Float | Yes | 1.0 | Percentage of the total number of rows used as the minimum support count for constructing the number table. This value is also embedded in the names of the generated files. |
| min_conf_percent | Float | Yes | 0.0 | Minimum confidence used to filter the generated association rules: a rule is written to the output only if its confidence is greater than min_conf_percent. This value is also embedded in the names of the generated files. |
| min_supp_count_outputs | Integer | Yes | 1 | Minimum support count used for extracting the FCPs and bi-clusters; it may differ from the minimum support used above. |
| produce_final_img | Boolean | Yes | False | Whether to generate an image of the suffix tree. |
Here is sample code for running the process in Python:
```python
fist.process("./", "sample.csv", "ID", "Itemsets",
             min_supp_percent=30, min_conf_percent=40, produce_final_img=True)
```
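Putting the pieces together, an end-to-end run over Exampledata.csv from section 4 might look like the following sketch. It also creates the output folder up front, since (as noted in the Results section below) the output folder must already exist; the folder name "output" is an assumption based on that section.

```python
# run_fist.py - end-to-end sketch: install with `pip install biclustering`,
# place Exampledata.csv next to this script, then run it.
import os

from biclustering import FIST

input_dir = "./"

# The Results section notes that an output folder must exist in the input
# directory before process() is called; "output" is the assumed folder name.
os.makedirs(os.path.join(input_dir, "output"), exist_ok=True)

fist = FIST()
fist.process(input_dir, "Exampledata.csv", "ID", "Itemsets",
             min_supp_percent=30, min_conf_percent=40,
             produce_final_img=True)
```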
6. Results
To avoid errors during execution, ensure that an output folder exists in the input file directory; all output files will be generated in this output folder. After execution, a total of 24 files, including the image of the suffix tree, are generated. They are as follows:
| Output | File name pattern |
|---|---|
| Number Table | NumberTable.dataset={dataset name}.minSupport={value}.csv |
| SFD | SFD.dataset={dataset name}.minSupport={value}.csv |
| Suffix Tree | suffixTree.dataset={dataset name}.minSupport={value}.csv |
| | suffixTree.dataset={dataset name}.minSupport={value}.json |
| FCPs | FCP.dataset={dataset name}.minSupport={value}.csv |
| | FCP.dataset={dataset name}.minSupport={value}.json |
| Generators | Generators.dataset={dataset name}.minSupport={value}.csv |
| | Generators.dataset={dataset name}.minSupport={value}.json |
| Bi-clusters | biclusters.dataset={dataset name}.minSupport{value}.minSize={value}.csv |
| | biclusters.dataset={dataset name}.minSupport{value}.minSize={value}.json |
| | Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.csv |
| | Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.json |
| Association Rules (Exact, PB, SB) | rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv |
| | rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json |
| | rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv |
| | rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json |
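Once a run has finished, the generated files can be inspected programmatically. The following sketch simply lists the CSV outputs and previews one of them with pandas; it assumes the output folder is named "output", and the exact file names depend on the dataset name and the chosen parameter values.

```python
# inspect_outputs.py - list generated files and peek at one of the CSVs.
# "output" as the folder name is an assumption based on the Results section.
import glob
import os

import pandas as pd

output_dir = os.path.join("./", "output")

csv_files = sorted(glob.glob(os.path.join(output_dir, "*.csv")))
for path in csv_files:
    print(path)

# Load the first CSV (e.g. the number table or an FCP file) for inspection.
if csv_files:
    df = pd.read_csv(csv_files[0])
    print(df.head())
```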
7. References
1. Kartick Chandra Mondal, Nicolas Pasquier, Anirban Mukhopadhyay, Ujjwal Maulik, and Sanghamitra Bandyopadhyay: A New Approach for Association Rule Mining and Bi-clustering Using Formal Concept Analysis (2012). [Link]
2. Kartick Chandra Mondal: Algorithms for Data Mining and Bio-informatics (2016). [Link]