A Python package for multimodal entity resolution using the Fusion algorithm.
Fusion: Flexible Unification of Structured Intermodal Object Networks
Fusion is a Python package for entity resolution in multimodal graphs. It implements the Fusion algorithm from the paper "Fusion: Flexible Unification of Structured Intermodal Object Networks" by Yoel Ashkenazi and Yoram Louzoun.
Table of Contents
- Installation
- Quick Tour
- Directory Tree
- Evaluation
- Plotting Graphs
- Configuration File Example
- Main Git Repository
Installation
To install the package and its dependencies, run:
pip install fusion-er
Make sure you have Python 3.8+ installed.
Quick Tour
Directory Tree
project_root/
│
├── Fusion/
│   └── main.py
├── Entity_detection/
│   ├── my_algorithm.py
│   └── Record_linkage/
│       └── RL_test.py
├── evaluate.py
├── utils.py
├── data/
│   └── graph.gpickle
├── output/
│   ├── DatasetName_results.json
│   └── DatasetName_colored_graph.gpickle
├── requirements.txt
└── config.json
- Fusion/main.py: Main entry point.
- Entity_detection/: Contains model and record linkage code.
- evaluate.py: Evaluation functions.
- utils.py: Utility functions (drawing, graph manipulation).
- data/: Place your .gpickle graph files here.
- output/: Results and colored graphs are saved here.
Running the Fusion Model or Record Linkage Test
Use the main script to run the Fusion process or the record linkage test:
python Fusion/main.py --config path/to/config.json --output path/to/output_folder
- --config: Path to your configuration file (see Configuration File Example).
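- --output: Directory where results and colored graphs will be saved.
Evaluating Results
After running the model, the results are saved in your output directory: the metrics as a JSON file (DatasetName_results.json) and the colored graph as a .gpickle file (DatasetName_colored_graph.gpickle).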
Evaluation
To evaluate the partition, use the get_truth_values function from evaluate.py:
from evaluate import get_truth_values
TP, FP, TN, FN = get_truth_values(graph, true_graph, partition, true_entities)
Explanation of Metrics:
- True Positives (TP): Vertices correctly grouped.
- False Positives (FP): Vertices incorrectly grouped (should not be in the partition).
- True Negatives (TN): Vertices correctly not grouped.
- False Negatives (FN): Vertices that should have been grouped but were not.
Calculating Performance Metrics:
Using the above values, you can calculate:
- Precision: TP / (TP + FP) - Measures the accuracy of positive predictions.
- Recall: TP / (TP + FN) - Measures the ability to find all positive instances.
- F1-Score: 2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean of precision and recall.
Example:
precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1_score}")
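As a quick sanity check of the formulas above, the same guarded calculation can be wrapped in a small helper and run on made-up counts. This is only a sketch: the function name `prf1` and the counts are hypothetical and are not part of the package.

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical counts, for illustration only
precision, recall, f1 = prf1(tp=6, fp=2, fn=6)
print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, F1: {f1:.3f}")
# prints "Precision: 0.750, Recall: 0.500, F1: 0.600"
```

Note that TN does not enter any of the three scores, so a partition that leaves most vertices ungrouped can have a large TN and still score poorly.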
Saving Results:
The results, including metrics, are saved as a JSON file in the output directory. For example:
import json
results = {
    'TP': TP,
    'FP': FP,
    'TN': TN,
    'FN': FN,
    'precision': precision,
    'recall': recall,
    'F1_score': f1_score,
}
with open('output/results.json', 'w') as f:
    json.dump(results, f, indent=4)
This ensures you can review and analyze the evaluation metrics later.
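To review the metrics later, the file can be read back with `json.load`. The sketch below writes and reloads a results file in a temporary folder; the metric values are hypothetical stand-ins for the ones computed above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical metrics, standing in for the values computed above
results = {"TP": 6, "FP": 2, "TN": 10, "FN": 6,
           "precision": 0.75, "recall": 0.5, "F1_score": 0.6}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)  # open() does not create missing folders
    path = out_dir / "results.json"

    with open(path, "w") as f:
        json.dump(results, f, indent=4)

    with open(path) as f:
        loaded = json.load(f)

print(loaded["F1_score"])
```

If the counts turn out to be NumPy integers (an assumption; the return types of get_truth_values are not documented here), cast them with `int()` before dumping, since the `json` module cannot serialize NumPy scalar types.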
Plotting Graphs
To visualize the partitioned graph, use the draw function from utils.py. The examples below show how to use it:
Example 1: Plotting a Colored Graph
import utils
import networkx as nx
# Load the true graph and partition
true_graph = utils.load_dataset("data/graph.gpickle")
partition = {"node1": 0, "node2": 1, "node3": 0} # Example partition
# Color the graph by partition
colored_graph = utils.color_by_partition(true_graph, partition)
# Plot the graph
utils.draw(colored_graph)
This will display the graph with nodes colored according to their partition.
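Before plotting, it can help to inspect which nodes share a color. The sketch below assumes the partition is a plain dictionary mapping each node to a cluster id, as in the example above, and inverts it into a cluster id to node list mapping.

```python
from collections import defaultdict

# Example partition in the same node -> cluster-id form used above
partition = {"node1": 0, "node2": 1, "node3": 0}

# Group nodes by the cluster id they were assigned
clusters = defaultdict(list)
for node, cluster_id in partition.items():
    clusters[cluster_id].append(node)

for cluster_id, nodes in sorted(clusters.items()):
    print(cluster_id, nodes)
```

Each printed line lists one cluster, that is, the set of nodes the partition treats as the same entity.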
Example 2: Saving the Colored Graph
import pickle as pkl
# Save the colored graph
with open("output/colored_graph.gpickle", "wb") as f:
    pkl.dump(colored_graph, f, protocol=pkl.HIGHEST_PROTOCOL)
print("Colored graph saved to output/colored_graph.gpickle")
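Reading a saved graph back follows the same pattern in reverse. The sketch below round-trips a small stand-in graph through `pickle`; the graph and the `color` attribute name are hypothetical, since the attributes written by color_by_partition are not documented here. Because `nx.read_gpickle` and `nx.write_gpickle` were removed in NetworkX 3.0, plain `pickle` is used directly.

```python
import os
import pickle as pkl
import tempfile

import networkx as nx

# Small stand-in graph; the "color" attribute name is hypothetical
graph = nx.Graph()
graph.add_node("node1", color=0)
graph.add_node("node2", color=1)
graph.add_edge("node1", "node2")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "colored_graph.gpickle")

    # Save the graph
    with open(path, "wb") as f:
        pkl.dump(graph, f, protocol=pkl.HIGHEST_PROTOCOL)

    # Load it back
    with open(path, "rb") as f:
        loaded = pkl.load(f)

print(loaded.nodes(data=True))
```

Only unpickle files you trust: loading a pickle can execute arbitrary code.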
Example 3: Plotting with Custom Layout
import matplotlib.pyplot as plt
# Use a spring layout for better visualization
pos = nx.spring_layout(colored_graph)
# Draw the graph with the custom layout
utils.draw(colored_graph, pos=pos)
# Show the plot
plt.show()
Configuration File Example
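Below is an example of a configuration file (config.json). The // comments are annotations for this README only: standard JSON does not allow comments, so remove them if the file is read with a strict parser such as Python's json module.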
Note:
- graph_path must point to a .gpickle file.
- Parameters like blue_in, red_out, C, etc., affect the model.
- Parameters like test type, add_num, remove_num are for execution.
{
    "verbosity_level": 1, // int: Logging level (0 = silent, 1 = basic info, 2 = detailed debug info)
    "draw": false, // bool: Whether to plot graphs during execution
    "blue_in": 1.0, // float: Weight for blue intra-cluster edges
    "blue_out": 1.0, // float: Weight for blue inter-cluster edges
    "red_in": 1.0, // float: Weight for red intra-cluster edges
    "red_out": 1.0, // float: Weight for red inter-cluster edges
    "C": 1.0, // float: Regularization parameter for the model
    "epsilon": 1e-6, // float: Convergence threshold for iterative algorithms
    "history": true, // bool: Whether to keep a history of iterations
    "type_dist": null, // null or str: Type of distance metric (e.g., "euclidean", "cosine")
    "quality_type": "adjusted_OOE", // str: Quality metric for evaluating partitions (e.g., "adjusted_OOE", "NMI")
    "amplitude": 5.0, // float: Amplitude parameter for edge weight adjustments
    "update_factor": 0.1, // float: Factor for updating weights during iterations
    "ddelta": 0.1, // float: Step size for parameter updates
    "iterator": false, // bool: Whether to use an iterative approach
    "decompose": false, // bool: Whether to decompose the graph into subgraphs
    "graph_path": "data/graph.gpickle", // str: Path to the input graph file (must be a .gpickle file)
    "name": "DatasetName", // str: Name for the dataset (used in output file naming)
    "test type": "GM", // str: Test type ("GM" for Fusion, "RL" for Record Linkage)
    "add_num": 100, // int: Number of false identity edges to add to the graph
    "remove_num": 100, // int: Number of identity edges to remove from the graph
    "removal_chance": 0.2 // float: Probability of removing an edge during preprocessing
}
Key Notes:
- graph_path: Ensure this points to a valid .gpickle file containing the graph data.
- test type: Use "GM" for running the Fusion model or "RL" for Record Linkage tests.
- Model Parameters: Parameters like blue_in, red_out, C, etc., directly affect the behavior of the Fusion model.
- Execution Parameters: Parameters like add_num, remove_num, and removal_chance control preprocessing and execution behavior.
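The notes above can be turned into a quick check before a run. The following is a minimal sketch, not the package's own loader: the helper `strip_line_comments` is hypothetical, and it is only needed if you keep // annotations in the file, since whether Fusion itself tolerates comments is not stated here. It removes // comments that sit outside JSON strings, parses the result, and validates graph_path and test type.

```python
import json


def strip_line_comments(text):
    """Remove // comments that are not inside a JSON string (hypothetical helper)."""
    cleaned = []
    for line in text.splitlines():
        in_string = False
        escaped = False
        cut = len(line)
        for i, ch in enumerate(line):
            if escaped:
                escaped = False          # character after a backslash is literal
            elif ch == "\\":
                escaped = in_string      # escapes only matter inside strings
            elif ch == '"':
                in_string = not in_string
            elif ch == "/" and not in_string and line[i:i + 2] == "//":
                cut = i                  # comment starts here
                break
        cleaned.append(line[:cut])
    return "\n".join(cleaned)


# A shortened, annotated config in the style shown above
annotated = '''{
    "C": 1.0,                            // float
    "graph_path": "data/graph.gpickle",  // str
    "test type": "GM"                    // str
}'''

config = json.loads(strip_line_comments(annotated))

# Checks taken from the notes above
if not config["graph_path"].endswith(".gpickle"):
    raise ValueError("graph_path must point to a .gpickle file")
if config["test type"] not in ("GM", "RL"):
    raise ValueError('test type must be "GM" or "RL"')

print(config)
```

Tracking whether the scanner is inside a string keeps values that contain // (for example a URL) intact, which a plain regular expression on `//.*` would not.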
Main Git Repository
For further information, updates, and example material, please refer to the main Git repository: GitHub Repository
File details
Details for the file fusion_er-0.0.1.tar.gz.
File metadata
- Download URL: fusion_er-0.0.1.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ea64007d5968be717d1d90c82dd5cfcc61ba656fc9765f1856009dcc2a1ac0a |
| MD5 | 28498a250ac77a03b6acb111f1972687 |
| BLAKE2b-256 | c3a1cb204b275f98e38e2d1d64e2439fa25405929c1d50041f78f048e79adfb7 |
File details
Details for the file fusion_er-0.0.1-py3-none-any.whl.
File metadata
- Download URL: fusion_er-0.0.1-py3-none-any.whl
- Upload date:
- Size: 5.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 73723c771ffcae5d27c6fc548ab48f6e07953cd888ad4f10cc5931a035fa6127 |
| MD5 | 54d175a20722c4dbb2ee7f758faeb6ca |
| BLAKE2b-256 | 561ef3caddb2384e16a1e30b2c63850427dc5b65a07583ce310306009a0364c7 |