A utility to find potential join keys (matching columns) across multiple pandas DataFrames.

FindMyJoint

A Python utility to analyze and compare columns across multiple pandas DataFrames, suggesting potential join keys and visualizing the relationships.

When working with multiple disparate datasets, finding common columns to join them on is a tedious manual task. findmyjoint automates this by:

  1. Profiling each DataFrame's columns (dtype, uniqueness, nulls).
  2. Comparing all possible column pairs across datasets.
  3. Scoring pairs based on name similarity (using rapidfuzz) and content similarity (using the Jaccard index).
  4. Suggesting join confidence levels.
  5. Visualizing the connections as an interactive network graph (using pyvis).
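
The profiling and scoring steps above can be sketched in a few lines. This is a minimal illustration, not findmyjoint's actual implementation: the helper names (profile_column, name_similarity, jaccard) are hypothetical, plain lists stand in for DataFrame columns, and the standard library's difflib is swapped in for rapidfuzz so the snippet runs without extra dependencies.

```python
from difflib import SequenceMatcher


def profile_column(values):
    """Summarize one column: value type names, uniqueness ratio, null count."""
    non_null = [v for v in values if v is not None]
    return {
        "types": sorted({type(v).__name__ for v in non_null}),
        "unique_ratio": len(set(non_null)) / len(non_null) if non_null else 0.0,
        "nulls": len(values) - len(non_null),
    }


def name_similarity(a, b):
    """Case-insensitive name similarity in [0, 1] (difflib stand-in for rapidfuzz)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(left, right):
    """Jaccard index |A & B| / |A | B| of the non-null values, compared as strings."""
    a = {str(v) for v in left if v is not None}
    b = {str(v) for v in right if v is not None}
    return len(a & b) / len(a | b) if (a or b) else 0.0


# Plain lists stand in for DataFrame columns (values taken from the toy datasets).
hr_user_id = ["001", "002", "003", "004"]
finance_client_identifier = ["001", "002", "003", "004"]
crm_customer_id = [1, 2, 3, 4]

print(profile_column(hr_user_id))                       # {'types': ['str'], 'unique_ratio': 1.0, 'nulls': 0}
print(name_similarity("age", "Age"))                    # 1.0
print(jaccard(hr_user_id, finance_client_identifier))   # 1.0
print(jaccard(hr_user_id, crm_customer_id))             # 0.0
```

The last line shows why profiling matters: the zero-padded string '001' and the integer 1 share no values once both are compared as strings, so two columns can describe the same entity and still score 0 on content until their types are normalized.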

Installation

Install the package from PyPI with pip:

pip install findmyjoint

Quickstart

You can get a comparison matrix or an interactive graph with a single line of code.

import pandas as pd
import findmyjoint as fmj  # alias assumed from the fmj.* calls below

# 1. Create toy datasets

df1 = pd.DataFrame({
    'age': [21, 25, 30, 45],
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'user_id': ['001', '002', '003', '004']
})

df2 = pd.DataFrame({
    'Age': ['21', '25', '30', '45'],
    'full_name': ['Alice', 'Bob', 'Charlie', 'David'],
    'customer_id': [1, 2, 3, 4]
})

df3 = pd.DataFrame({
    'client_identifier': ['001', '002', '003', '004'],
    'location': ['USA', 'CAN', 'USA', 'MEX'],
    'years_old': [21, 25, 30, 45]
})

datasets = [df1, df2, df3]
names = ['hr', 'crm', 'finance']

# 2. Get the comparison matrix
print("--- Comparison Matrix ---")
matrix = fmj.compare(datasets, names=names, name_threshold=0.6)
print(matrix.head())

# 3. Generate the interactive network graph
print("\n--- Generating Network Graph ---")

# This will create and automatically open 'joint_graph.html'
fmj.network(datasets, names=names, threshold=0.6)
print("Graph 'joint_graph.html' created.")
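
To make the role of a threshold in a network view more concrete, the sketch below turns scored column pairs into an undirected adjacency map, keeping only pairs at or above a cutoff. It is an assumption about how such a graph could be assembled, not a description of what fmj.network does internally; build_edges and the scores in scored_pairs are hypothetical, illustrative values.

```python
from collections import defaultdict

# Hypothetical scored pairs: (left column, right column, combined score).
# The scores are made up for illustration, not output from findmyjoint.
scored_pairs = [
    ("hr.user_id", "finance.client_identifier", 1.0),
    ("hr.age", "finance.years_old", 0.9),
    ("hr.name", "crm.full_name", 0.8),
    ("hr.name", "finance.location", 0.1),
]


def build_edges(pairs, threshold=0.6):
    """Return an undirected adjacency map of the pairs scoring >= threshold."""
    graph = defaultdict(dict)
    for left, right, score in pairs:
        if score >= threshold:
            graph[left][right] = score
            graph[right][left] = score
    return dict(graph)


graph = build_edges(scored_pairs, threshold=0.6)
for node, neighbours in sorted(graph.items()):
    print(node, "->", neighbours)
```

Each key is a dataset-qualified column and each value maps its neighbours to the score on the connecting edge. A structure like this could then be handed to a graph library such as pyvis for rendering.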

Download files

Source Distribution

findmyjoint-0.0.1.tar.gz (12.8 kB)

Uploaded Source

Built Distribution

findmyjoint-0.0.1-py3-none-any.whl (8.8 kB)

Uploaded Python 3

File details

Details for the file findmyjoint-0.0.1.tar.gz.

File metadata

  • Download URL: findmyjoint-0.0.1.tar.gz
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for findmyjoint-0.0.1.tar.gz
Algorithm Hash digest
SHA256 934c35de577372e7ecde9a79887c9b0f670a3ab41f4fef725a7921ffa11c2de8
MD5 1c9708d0781f808bbb79d11fc2eaa3c3
BLAKE2b-256 502f1b165f0e406f7476be6a01abee5ed4510586090bbd39d50ee57874a26209

File details

Details for the file findmyjoint-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: findmyjoint-0.0.1-py3-none-any.whl
  • Size: 8.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for findmyjoint-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 274a487fcff92aba126a7a88672ace6d292562d9ce0cb03ea0eb8d7545c08184
MD5 4f9667a5dc3d2cf13c9e3feca71fab8f
BLAKE2b-256 792355996270f6be1f5e83bdf643f2e2664da1e2ecc15f5f6c0350a8c1246434
