FuzzyMergeParallel is a Python package for efficient fuzzy merging of two dataframes based on string columns. With FuzzyMergeParallel, users can easily merge datasets, benefiting from enhanced performance through parallel computing with multiprocessing and Dask.


fuzzymerge_parallel


Merge two pandas dataframes using a function that computes the edit distance (Levenshtein distance), with multiprocessing for parallelization on a single node or Dask for distributed computation across multiple nodes.

Efficient Matching and Merging

Matching and merging data can be demanding in both time and memory. This package is specifically designed to address those challenges.

Optimized Execution

To boost execution performance, we've fine-tuned our package to work seamlessly in both single-node and multi-node environments. Whether you're processing data on a local machine or distributing tasks across a cluster, our package is optimized to get the job done efficiently.

Smart Memory Management

Memory efficiency is a priority. Our algorithm estimates memory requirements and divides the workload into manageable batches. This ensures that your data operations fit comfortably within your available memory. Plus, you have the flexibility to customize these settings to match your specific needs.
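To illustrate the batching idea, here is a simplified sketch (not the package's actual estimator; the helper name and the bytes-per-pair figure are assumptions) of deriving a batch count from the size of the pairwise score matrix and a memory budget:

```python
import math

def estimate_num_batches(n_left, n_right, mem_budget_bytes, bytes_per_pair=8):
    """Split the n_left x n_right similarity computation into batches
    whose score matrices fit within mem_budget_bytes (here assuming
    8 bytes per float64 score)."""
    total_bytes = n_left * n_right * bytes_per_pair
    return max(1, math.ceil(total_bytes / mem_budget_bytes))

# e.g. 100k x 100k float64 scores (~80 GB) under a 4 GB budget
batches = estimate_num_batches(100_000, 100_000, 4 * 1024**3)
```

The same logic generalizes to whatever per-pair cost the real workload has; overriding num_batches (see the attribute table below) plays the role of bypassing such an estimate.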

With this package, you can confidently tackle data matching and merging tasks while optimizing both time and memory resources.

Description

fuzzymerge_parallel offers two modes for faster execution:

1. Multiprocessing mode

This mode runs on a single machine and can use multiple CPU cores. It does not require Dask and is ideal for local processing tasks, using NumPy and Python's multiprocessing library to speed things up.
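The approach can be sketched as follows — a simplified illustration (not the package's actual internals, and with a toy character-match score standing in for a real similarity ratio) of scoring row pairs across worker processes with the standard library:

```python
from itertools import product
from multiprocessing import Pool

def pair_score(pair):
    """Toy similarity: fraction of position-wise matching characters
    (a stand-in for a real edit-distance ratio)."""
    a, b = pair
    matches = sum(x == y for x, y in zip(a, b))
    return a, b, matches / max(len(a), len(b))

def parallel_scores(left, right, processes=4):
    """Score every left/right string pair in a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(pair_score, product(left, right))

if __name__ == "__main__":
    scores = parallel_scores(["apple", "pear"], ["appel", "peer"])
    # Keep only pairs above a similarity threshold, as a fuzzy merge does.
    matched = [(a, b) for a, b, s in scores if s >= 0.6]
```

The real package applies the same divide-score-filter pattern with a proper Levenshtein ratio and batched memory management.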

2. Dask mode

In this mode, fuzzymerge_parallel uses a Dask client, which can be configured for single- or multi-node setups. Dask mode is best suited to multi-node use: a multi-node Dask client distributes computations across a cluster of machines, making it suitable for heavy-duty processing. Dask also automates tasks that would otherwise require manual intervention, such as scaling, fault tolerance, resource management, and parallelism.

Important remark: when using the package on a single node, opt for multiprocessing mode. Multiprocessing generally offers faster execution times than Dask on a single node, because Dask introduces overheads (data copying, fault-tolerance mechanisms, resource management) that bring little benefit in single-node scenarios. Reserve Dask mode for deployments on a multi-node cluster.

Features

  • Perform fuzzy merging of dataframes based on string columns
  • Utilize distance functions (e.g., Levenshtein) for intelligent matching
  • Benefit from parallel computing techniques for enhanced performance
  • Easily integrate into your existing data processing pipelines
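For intuition about the matching step, here is a minimal pure-Python sketch of the Levenshtein distance and a normalized similarity ratio. The package itself relies on the optimized Levenshtein library; the normalization shown here (1 - d / max(len)) is one common convention and an assumption, not necessarily the library's exact formula:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; identical strings score 1.0."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

A pair is kept in the merge when this kind of ratio exceeds the configured threshold (0.9 by default).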

Installation

Install from PyPI

To download and install the fuzzymerge_parallel Python package from PyPI, run:

    pip install fuzzymerge-parallel

Install from GitHub

To install FuzzyMergeParallel via pip from its GitHub repository, follow these steps:

  1. Download the Package: Begin by downloading the package from GitHub. You can use git to clone the repository to your local machine:

    git clone https://github.com/ULHPC/fuzzymerge_parallel.git
    
  2. Navigate to the Package Directory: Open a terminal or command prompt and change your current directory to the downloaded package folder:

    cd fuzzymerge_parallel
    
  3. Install the Package: Finally, use pip to install the package in "editable" mode (with the -e flag) to allow for development and updates. There are two options:

    Option 1: If you plan to use the package on a single node and don't need the dask and distributed dependencies, simply run:

    pip install -e .
    

    Option 2: If you intend to use the package in both single and multi-node environments with dask and distributed support, use the following command:

    pip install -e ".[dask]"
    

This command will install the package along with its dependencies. You can now import and use FuzzyMergeParallel in your Python projects.

Dependencies

To use this package, you will need to have the following dependencies installed:

Description

The package exposes the FuzzyMergeParallel class, which is highly configurable. The following parameters and attributes can be set before running the merge operation:

Parameter   Description
left        The left input data to be merged.
right       The right input data to be merged.
left_on     Column(s) in the left DataFrame to use as merge keys.
right_on    Column(s) in the right DataFrame to use as merge keys.

Example: create a FuzzyMergeParallel instance:

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
Attribute       Description
uselower        Whether to convert strings to lowercase before comparison. Default is True.
threshold       The threshold value for fuzzy matching similarity. Default is 0.9.
how             The type of merge to be performed. Default is 'outer'.
on              Column(s) to merge on if not specified in left_on or right_on. Default is False.
left_index      Whether to use the left DataFrame's index as merge key(s). Default is False.
right_index     Whether to use the right DataFrame's index as merge key(s). Default is False.
parallel        Whether to perform the merge operation in parallel. Default is True.
n_threads       The number of threads to use for parallel execution. Default is 'all' (one thread per available core).
hide_progress   Whether to hide the progress bar during the merge operation. Default is False.
num_batches     The number of batches to split the ratio computation into. Default is automatic.
ratio_function  The distance ratio function. Defaults to Levenshtein.ratio().
dask_client     A Dask client object.

Example: set additional attributes by passing the attribute name and its value to set_parameter():

fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('threshold', 0.75)

Usage

Single node

Sequential execution

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('parallel', False)
# Run the merge sequentially
result = fuzzy_merger.merge()

Multiprocessing execution

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('n_threads', 64)
# Run the merge with multiprocessing
result = fuzzy_merger.merge()

Multi-node (dask)

Dask client

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')

# Set parameters for dask
## Create a dask client
from dask.distributed import Client
client = Client(...)  # Connect to distributed cluster and override default
fuzzy_merger.set_parameter('parallel', True)
fuzzy_merger.set_parameter('dask_client', client)
# Run the merge in dask
result = fuzzy_merger.merge()

How to create a Dask client?

There are several ways to create a Dask client; extensive documentation is available on the Dask and Dask-Jobqueue websites.

A couple of examples:

# Example 1: launch Dask on a local cluster (single node)
from dask.distributed import Client, LocalCluster

# Create a local Dask cluster and connect a client to it
cluster = LocalCluster()
client = Client(cluster)

# Example 2: launch Dask on a SLURM cluster
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    account="myaccount",
    cores=128,
    memory="500 GB"
)

cluster.scale(jobs=10)  # ask for 10 jobs

client = Client(cluster)

Contributing

Contributions are welcome! If you encounter any issues, have suggestions, or want to contribute improvements, please submit a pull request or open an issue on the GitHub repository.

Authors

  • Oscar J. Castro Lopez (oscar.castro@uni.lu)
    • Parallel Computing & Optimisation Group (PCOG) - University of Luxembourg

This package is based on the levenpandas package (https://github.com/fangzhou-xie/levenpandas).

License

This project is licensed under the MIT License.

History

1.0.0 (2023-12-06)

0.1.0 (2023-06-21)

  • First release on PyPI.
