
A Python package for multi-modal entity resolution using the Fusion algorithm.


Fusion: Flexible Unification of Structured Intermodal Object Networks


Fusion is a Python package that addresses the problem of entity resolution in multimodal graphs. It implements the Fusion algorithm from the paper "Fusion: Flexible Unification of Structured Intermodal Object Networks" by Yoel Ashkenazi and Yoram Louzoun.

Installation

To install the package and its dependencies, run:

pip install Fusion

Make sure you have Python 3.8+ installed.


Quick Tour

Directory Tree

project_root/
├── Fusion/
│   └── main.py
├── Entity_detection/
│   ├── my_algorithm.py
│   └── Record_linkage/
│       └── RL_test.py
├── evaluate.py
├── utils.py
├── data/
│   └── graph.gpickle
├── output/
│   ├── DatasetName_results.json
│   └── DatasetName_colored_graph.gpickle
├── requirements.txt
└── config.json
  1. Fusion/main.py: Main entry point.
  2. Entity_detection/: Contains model and record linkage code.
  3. evaluate.py: Evaluation functions.
  4. utils.py: Utility functions (drawing, graph manipulation).
  5. data/: Place your .gpickle graph files here.
  6. output/: Results and colored graphs are saved here.

Running the Fusion Model or Record Linkage Test

Use the main script to run the fusion process or the record linkage test:

python Fusion/main.py --config path/to/config.json --output path/to/output_folder
  • --config: Path to your configuration file (see Configuration File Example).
  • --output: Directory where results and colored graphs will be saved.

After running the model, results are saved as a pickle file in your output directory.

Evaluation

To evaluate the partition, use the get_truth_values function from evaluate.py:

from evaluate import get_truth_values

TP, FP, TN, FN = get_truth_values(graph, true_graph, partition, true_entities)

Explanation of Metrics:

  1. True Positives (TP): Vertices correctly grouped.
  2. False Positives (FP): Vertices incorrectly grouped (placed in a group they do not belong to).
  3. True Negatives (TN): Vertices correctly not grouped.
  4. False Negatives (FN): Vertices that should have been grouped but were not.
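These four counts are typically computed over pairs of vertices: a pair counts as positive when both vertices are placed in the same group. As a rough illustration of that pairwise logic (pairwise_confusion is a hypothetical helper written for this sketch, not the package's get_truth_values implementation):

```python
from itertools import combinations

def pairwise_confusion(partition, true_entities):
    """Count pairwise TP/FP/TN/FN between a predicted partition and
    ground-truth entity labels (both dicts mapping node -> cluster id)."""
    TP = FP = TN = FN = 0
    for u, v in combinations(sorted(partition), 2):
        same_pred = partition[u] == partition[v]
        same_true = true_entities[u] == true_entities[v]
        if same_pred and same_true:
            TP += 1          # correctly merged pair
        elif same_pred:
            FP += 1          # merged, but belongs to different entities
        elif same_true:
            FN += 1          # same entity, but left separate
        else:
            TN += 1          # correctly kept separate

    return TP, FP, TN, FN

partition     = {"a": 0, "b": 0, "c": 1, "d": 1}  # predicted groups
true_entities = {"a": 0, "b": 0, "c": 0, "d": 1}  # ground truth
print(pairwise_confusion(partition, true_entities))  # (1, 1, 2, 2)
```

The returned tuple plugs directly into the precision/recall formulas below.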

Calculating Performance Metrics: Using the above values, you can calculate:

  • Precision: TP / (TP + FP) - Measures the accuracy of positive predictions.
  • Recall: TP / (TP + FN) - Measures the ability to find all positive instances.
  • F1-Score: 2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean of precision and recall.

Example:

precision = TP / (TP + FP) if (TP + FP) > 0 else 0
recall = TP / (TP + FN) if (TP + FN) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1_score}")

Saving Results:

The results, including metrics, are saved as a JSON file in the output directory. For example:

import json

results = {
    'TP': TP,
    'FP': FP,
    'TN': TN,
    'FN': FN,
    'precision': precision,
    'recall': recall,
    'F1_score': f1_score,
}

with open('output/results.json', 'w') as f:
    json.dump(results, f, indent=4)

This ensures you can review and analyze the evaluation metrics later.


Plotting Graphs

To visualize the partitioned graph, use the draw function from utils.py. Below are examples of how to use it:

Example 1: Plotting a Colored Graph

import utils
import networkx as nx

# Load the true graph and partition
true_graph = utils.load_dataset("data/graph.gpickle")
partition = {"node1": 0, "node2": 1, "node3": 0}  # Example partition

# Color the graph by partition
colored_graph = utils.color_by_partition(true_graph, partition)

# Plot the graph
utils.draw(colored_graph)

This will display the graph with nodes colored according to their partition.

Example 2: Saving the Colored Graph

import pickle as pkl

# Save the colored graph
with open("output/colored_graph.gpickle", "wb") as f:
    pkl.dump(colored_graph, f, protocol=pkl.HIGHEST_PROTOCOL)

print("Colored graph saved to output/colored_graph.gpickle")

Example 3: Plotting with Custom Layout

import matplotlib.pyplot as plt

# Use a spring layout for better visualization
pos = nx.spring_layout(colored_graph)

# Draw the graph with the custom layout
utils.draw(colored_graph, pos=pos)

# Show the plot
plt.show()

Configuration File Example

Below is an example of a configuration file (config.json).

Note:

  1. graph_path must point to a .gpickle file.
  2. Parameters like blue_in, red_out, C, etc., affect the model.
  3. Parameters like test type, add_num, and remove_num control test execution.
  4. The // comments below are explanatory only; standard JSON does not allow comments, so remove them from your actual config.json.
{
    "verbosity_level": 1,           // int: Logging level (0 = silent, 1 = basic info, 2 = detailed debug info)
    "draw": false,                  // bool: Whether to plot graphs during execution
    "blue_in": 1.0,                 // float: Weight for blue intra-cluster edges
    "blue_out": 1.0,                // float: Weight for blue inter-cluster edges
    "red_in": 1.0,                  // float: Weight for red intra-cluster edges
    "red_out": 1.0,                 // float: Weight for red inter-cluster edges
    "C": 1.0,                       // float: Regularization parameter for the model
    "epsilon": 1e-6,                // float: Convergence threshold for iterative algorithms
    "history": true,                // bool: Whether to keep a history of iterations
    "type_dist": null,              // null or str: Type of distance metric (e.g., "euclidean", "cosine")
    "quality_type": "adjusted_OOE", // str: Quality metric for evaluating partitions (e.g., "adjusted_OOE", "NMI")
    "amplitude": 5.0,               // float: Amplitude parameter for edge weight adjustments
    "update_factor": 0.1,           // float: Factor for updating weights during iterations
    "ddelta": 0.1,                  // float: Step size for parameter updates
    "iterator": false,              // bool: Whether to use an iterative approach
    "decompose": false,             // bool: Whether to decompose the graph into subgraphs
    "graph_path": "data/graph.gpickle", // str: Path to the input graph file (must be a .gpickle file)
    "name": "DatasetName",          // str: Name for the dataset (used in output file naming)
    "test type": "GM",              // str: Test type ("GM" for Fusion, "RL" for Record Linkage)
    "add_num": 100,                 // int: Number of false identity edges to add to the graph
    "remove_num": 100,              // int: Number of identity edges to remove from the graph
    "removal_chance": 0.2           // float: Probability of removing an edge during preprocessing
}

Key Notes:

  • graph_path: Ensure this points to a valid .gpickle file containing the graph data.

  • test type: Use "GM" for running the Fusion model or "RL" for Record Linkage tests.

  • Model Parameters: Parameters like blue_in, red_out, C, etc., directly affect the behavior of the Fusion model.

  • Execution Parameters: Parameters like add_num, remove_num, and removal_chance control preprocessing and execution behavior.
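Because standard JSON parsers reject // comments, an annotated config like the example above must have its comments stripped before parsing. One simple option is sketched below (strip_comments is a hypothetical helper, not part of the package, and assumes "//" never appears inside a string value):

```python
import json
import re

# Annotated config in the style of the example above (// comments).
ANNOTATED = '''{
    "verbosity_level": 1,                 // int: logging level
    "draw": false,                        // bool: plot graphs during execution
    "graph_path": "data/graph.gpickle"    // str: input graph (.gpickle)
}'''

def strip_comments(text):
    # Remove everything from "//" to the end of each line.
    # Assumes "//" never occurs inside a JSON string value.
    return re.sub(r'\s*//.*', '', text)

config = json.loads(strip_comments(ANNOTATED))
print(config["verbosity_level"], config["graph_path"])
```

Alternatively, simply keep the real config.json comment-free and load it with json.load directly.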


Main Git Repository

For further information, updates, and example material, please refer to the main Git repository: GitHub Repository
