Skip to main content

Biclustering and association rule mining using Suffix Tree and Frequent Closed Itemset based algorithm.

Project description

FIST-Biclustering-Python

Implementation of FIST algorithm for generating Biclusters, Frequent Closed Patterns, Association Rules using Python

Abstract

Association rule mining and biclustering are two popular techniques in data mining that can be used to uncover interesting patterns and relationships in large datasets. However, these techniques are often computationally expensive and can be challenging to apply to large da-tasets. This paper presents a novel approach that combines association rule mining and bi-clustering using a suffix tree data structure. It is based on the frequent closed itemsets framework and requires a unique scan of the database. This data structure is used to reduce memory usage and improve the extraction efficiency, allowing parallel processing of the tree branches. Experimental results show that the proposed algorithm (Frequent Itemset Suffix Tree: FIST) is effective in uncovering meaningful patterns and relationships in large datasets and outperforms existing state-of-the-art algorithms in terms of efficiency and scalability.

USER GUIDE

In this section of the chapter, we will discuss how to use this python implementation of the algorithm. We will be briefly discussing the following topics:

  • Environment & python installation
  • Installation of external Libraries
  • Using the source code or the package
  • Transforming input dataset to suitable form
  • Integrating the dataset with the python program and generating outputs
  • Results

1. Environment & python installation:

I have used an windows PC with 64-bit operating system, x64-based processor for running the python program.

To run the program, a suitable python installation is required. I have used Python 3.10.6.

To install Python 3 on your machine, Visit the official Python website https://www.python.org and navigate to the Downloads section. Download a version of Python 3.10 or onwards.

2. Installation of external libraries:

In this project we are using 2 external libraries on top of default python installation:

  • Pandas
  • PyDot

We are using pandas for the useful functionalities it provides for handling CSV files and DataFrames. PyDot is an interface to GraphViz which helps in creating graph based diagrams using python script.

To install Pandas use: python -m pip install pandas

To install PyDot: python -m pip install pydot

You may additionally need to install a GraphViz driver for PyDot to work properly.

3. Using the source code or the package

The source code of the program can be cloned from this git repository.

Source Code: https://github.com/Mijanur08/biclusturing-using-suffixTree

The program is also uploaded in pypi.org as a python package. The latest version is 2023.5.31.3

Package link: https://pypi.org/project/biclustering/

Instead of cloning the git repository, the package can be directly installed using the following pip install command.

python -m pip install biclustering --user

4. Transforming Input Dataset into Suitable Form

The input dataset must be in csv file and have the following 2 attributes:

  • An ID attribute which is unique for each of the rows
  • An itemsets attribute which contains comma separated names or values – each name or value representing an item.

An example of the input dataset is –

Exampledata.csv

ID,Itemsets

O1,"P1,P3,A2,A4"

O2,"P1,P3,A1,A2"

O3,"P2,P5,A3"

O4,"P1,P3,P4,A2,A4"

O5,"P1,P2,P3,P5,A2,A4"

5. Integrating the dataset with our python program and generating output

Our python implementation has module named fist.py. This file acts as an interface between the user and the complete algorithmic process.

First import the class FIST from the fist module and initialize a FIST class.

If we are using the source code from git repository, we need to specify the fist.py file while importing FIST class.

from fist import FIST

fist = FIST()

Otherwise, if we are directly using the package after installing it by pip, then we need to specify the package name ‘biclustering’.

from biclustering import FIST

fist = FIST()

Next step is to call the fist.process function. This function has 9 parameters. The function signature and description of all parameters are given below-

def process(self, input_file_dir: str, input_dataset_name:str,

id_attribute: str, itemset_attribute: str, max_entries = 10000,

min_supp_percent=1.0, min_conf_percent=0.0,

min_supp_count_outputs = 1,produce_final_img=False)->none

Parameters
Input_file_dirDatatypeString Literal
OptionalNo
Default ValueNA
DescriptionPath to the directory where the input CSV file (dataset) is located and the outputs will be generated in an output folder in that directory.
input_dataset_nameDatatypeString Literal
OptionalNo
Default ValueNA
DescriptionFilename of the input CSV file including its file extension
id_attributeDatatypeString Literal
OptionalNo
Default ValueNA
DescriptionName of the ID column of the dataset. The ID column must have unique value for each row.
itemset_attributeDatatypeString Literal
OptionalNo
Default ValueNA
DescriptionName of the Item List column of the dataset. This column should contain comma separated item names. These item names will form itemsets.
max_entriesDatatypeInteger
OptionalYes
Default Value10000
DescriptionIt specifies maximum number of rows are to be read in a dataset. If a dataset has total number of rows greater than max_entries, the first max_entries number of rows will be read during the execution.
min_support_percentDatatypeFloat
OptionalYes
Default Value1\.0
Description

The provided percentage of the total number of rows will be used as minimum support count for constructing the number table.

This value will also be embedded within the file name of generated files.

min_conf_percentDatatypeFloat
OptionalYes
Default Value0\.0
Description

Minimum confidence filters the generated association rules. If confidence of an association rule is greater than min_conf_percent, only then it will be written in the output.

This value will also be embedded within the file name of generated files.

min_supp_count_outputsDatatypeInteger
OptionalYes
Default Value1
DescriptionTo extract the FCPs and Bi-clusters we can use a different support count than the previous minimum support. This parameter carries the value of minimum support count which is used for generating the outputs.
produce_final_imgDatatypeBoolean
OptionalYes
Default ValueFalse
DescriptionThis parameter decides whether to generate the image of the suffix tree or not.

Here is the sample code for execution in python :

fist.process(“./”,”sample.csv”,”ID”,”Itemsets”,min_support_percent=30,min_conf_percent=40,produce_final_img=True)

6. Results

To avoid error during the execution, ensure that output folder exists in the input file directory. All the output files will be generated in this output folder. After execution, total 24 files including the image of the suffix tree is generated. They are as follows:

Number TableNumberTable.dataset={dataset name}.minSupport={value}.csv
SFDSFD.dataset={dataset name}.minSupport={value}.csv
Suffix TreesuffixTree.dataset={dataset name}.minSupport={value}.csv
suffixTree.dataset={dataset name}.minSupport={value}.json
FCPsFCP.dataset={dataset name}.minSupport={value}.csv
FCP.dataset={dataset name}.minSupport={value}.json
GeneratorsGenerators.dataset={dataset name}.minSupport={value}.csv
Generators.dataset={dataset name}.minSupport={value}.json
Bi-clustersbiclusters.dataset={dataset name}.minSupport{value}.minSize={value}.csv
biclusters.dataset={dataset name}.minSupport{value}.minSize={value}.json
Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.csv
Biclusters.withNames.dataset={dataset name}.minSupport{value}.minSize={value}.json

Association Rules

- Exact

- PB

- SB

rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv
rule.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json
rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.csv
rule.withNames.{type}.dataset={dataset name}.minSupport={value}.minConf={value}.json

7. References

1. Kartick Chandra Mondal, Nicolas Pasquier, Anirban Mukhopadhyay, Ujjwal Maulik, and Sanghamitra Bandhopadyay: A New Approach for Association Rule Mining and Bi-clustering Using Formal Concept Analysis. (2012) [Link]

2. Katick Chandra Mondal: Algorithms for Data Mining and Bio-informatics. (2016) [Link]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

biclustering-2023.5.31.5.tar.gz (11.9 kB view details)

Uploaded Source

Built Distribution

biclustering-2023.5.31.5-py3-none-any.whl (15.0 kB view details)

Uploaded Python 3

File details

Details for the file biclustering-2023.5.31.5.tar.gz.

File metadata

  • Download URL: biclustering-2023.5.31.5.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.6

File hashes

Hashes for biclustering-2023.5.31.5.tar.gz
Algorithm Hash digest
SHA256 b0c12232119c503d7471a29b16abab2efa8860bd9cb86fc1439a3f199cf672b1
MD5 41daafb622848a1a8be4eb1397c716d4
BLAKE2b-256 f19d29f0968f000162174afd64ca1a2827cd29abd9b76de35cb9e546e8a8fe02

See more details on using hashes here.

File details

Details for the file biclustering-2023.5.31.5-py3-none-any.whl.

File metadata

File hashes

Hashes for biclustering-2023.5.31.5-py3-none-any.whl
Algorithm Hash digest
SHA256 46b849f6890d9e0ffd32f7b573225443b167a2b4a1cc06616364be0b2525c639
MD5 93f52b25d4ac1367862d5fa8ed2e6a4a
BLAKE2b-256 5f381d2aa503c2e0cb8a019b2311ce23ebebf321916248ff524a7508ae6d47a4

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page