FuzzyMergeParallel is a Python package that enables efficient fuzzy merging of two dataframes based on string columns. With FuzzyMergeParallel, users can easily merge datasets, benefitting from enhanced performance through parallel computing with multiprocessing and Dask.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

fuzzymerge_parallel

Merge two pandas dataframes by using a function to calculate the edit distance (Levenshtein Distance) using multiprocessing for parallelization on a single node or Dask for distributed computation across multiple nodes.

Efficient Matching and Merging

Matching and merging data can be demanding in terms of time and memory usage. That's why our package is specifically crafted to address these challenges effectively.

Optimized Execution

To boost execution performance, we've fine-tuned our package to work seamlessly in both single-node and multi-node environments. Whether you're processing data on a local machine or distributing tasks across a cluster, our package is optimized to get the job done efficiently.

Smart Memory Management

Memory efficiency is a priority. Our algorithm estimates memory requirements and divides the workload into manageable batches. This ensures that your data operations fit comfortably within your available memory. Plus, you have the flexibility to customize these settings to match your specific needs.

With this package, you can confidently tackle data matching and merging tasks while optimizing both time and memory resources.

Description

fuzzymerge_parallel offers two modes for faster execution:

1. Multiprocessing mode

This mode runs on a single machine and it is able to use multi-CPU cores. This mode does not require dask and it's ideal for local processing tasks, utilizing Numpy and Python's multiprocessing libraries to speed things up.

2. Dask mode

In this mode, fuzzymerge_parallel utilizes a Dask client, which can be configured for single or multi-node setups. To leverage the dask mode it is suggested to use it with multiple nodes. Multi-node dask clients distribute computations across clusters of machines, making it suitable for heavy-duty processing. Using dask mode offers numerous benefits, automating tasks that would otherwise require manual intervention, such as enhancing performance, expanding scalability, ensuring fault tolerance, optimizing resource utilization, and enabling parallelism.

Important remarks: When using the package on a single node, it is recommended to opt for the multiprocessing mode. This choice is driven by the fact that multiprocessing generally offers faster execution times compared to Dask on a single node. Dask introduces certain overheads, including data copying, fault tolerance mechanisms, and resource management, which may not be as beneficial in single-node scenarios. Therefore, it is strongly advisable to leverage Dask when deploying the package in a multi-node cluster.

Features

Performs fuzzy merging of dataframes based on string columns
Utilize distance functions (e.g., Levenshtein) for intelligent matching
Benefit from parallel computing techniques for enhanced performance
Easily integrate into your existing data processing pipelines

Installation

Install from PyPi

To download and install the fuzzymerge_parallel Python package from PyPi, run the following instruction:

Install from GitHub

To download and install the fuzzymerge_parallel Python package from GitHub, you can follow these improved instructions:

    pip install fuzzymerge-parallel

To install FuzzyMergeParallel via pip from its GitHub repository, follow these steps:

Download the Package: Begin by downloading the package from GitHub. You can use git to clone the repository to your local machine:
```
git clone https://github.com/ULHPC/fuzzymerge_parallel.git
```
Navigate to the Package Directory: Open a terminal or command prompt and change your current directory to the downloaded package folder:
```
cd fuzzymerge_parallel
```
Install the Package: Finally, use pip to install the package in "editable" mode (with the -e flag) to allow for development and updates. There are two options:

Option 1: If you plan to use the package on a single node and don't need the dask and distributed dependencies, simply run:
```
pip install -e .
```
Option 2: If you intend to use the package in both single and multi-node environments with dask and distributed support, use the following command:
```
pip install -e ".[dask]"
```

This command will install the package along with its dependencies. You can now import and use FuzzyMergeParallel in your Python projects.

Dependencies

To use this package, you will need to have the following dependencies installed:

Click >= 7.0
dask[distributed] >= 2023.5.0 (Optional: Only needed for multi-node)
Levenshtein >= 0.21.0 (Optional: Only needed for multi-node)
nltk >= 3.8.1
numpy >= 1.23.5
pandas >= 1.5.3
tqdm >= 4.65.0
psutil == 5.9.5
pytest >= 7.4.1

Description

The FuzzyMergeParallel class is exposed and it is highly configurable. The following parameters and other attributes can be set up before doing the merge operation:

Parameter	Description
left	The left input data to be merged.
right	The right input data to be merged.
left_on	Column(s) in the left DataFrame to use as merge keys.
right_on	Column(s) in the right DataFrame to use as merge keys.

Example create a FuzzyMergeParallel class:

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')

Attribute	Description
uselower	Whether to convert strings to lowercase before comparison. Default is True.
threshold	The threshold value for fuzzy matching similarity. Default is 0.9.
how	The type of merge to be performed. Default is 'outer'.
on	Column(s) to merge on if not specified in left_on or right_on. Default is False.
left_index	Whether to use the left DataFrame's index as merge key(s). Default is False.
right_index	Whether to use the right DataFrame's index as merge key(s). Default is False.
parallel	Whether to perform the merge operation in parallel. Default is True.
n_threads	The number of threads to use for parallel execution. Default is 'all' (a thread per each available core).
hide_progress	Whether to display a progress bar during the merge operation. Default is False.
num_batches	The number of batches to split the ratio computation. Default is automatic.
ratio_function	The distance ratio function. Defaults to `Levenshtein.ratio()`.
dask_client	A dask client object.

Example set extra attributes by stating the name of the attribute and its value with set_parameter():

fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('threshold', 0.75)

Usage

Single node

Sequential execution

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('parallel', False)
# Run the merge sequentially
result = fuzzy_merger.merge()

Multiprocessing execution

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')
fuzzy_merger.set_parameter('n_threads', 64)
# Run the merge multiprocessing
result = fuzzy_merger.merge()

Multi-node (dask)

Local client

fuzzy_merger = FuzzyMergeParallel(left_df, right_df, left_on='left_column_name', right_on='right_column_name')
# Set parameters
fuzzy_merger.set_parameter('how', 'inner')

# Set parameters for dask
## Create a dask client
from dask.distributed import Client
client = Client(...)  # Connect to distributed cluster and override default
fuzzy_merger.set_parameter('parallel', True)
fuzzy_merger.set_parameter('dask_client', client)
# Run the merge in dask
result = fuzzy_merger.merge()

How to create a dask client?

There are different options to create a dask client. Extensive documentation can be found on their websites:

A couple of examples:

# Launch dask on a local cluster (singlenode)
from dask.distributed import Client, LocalCluster
# Create a local Dask cluster
cluster = LocalCluster()
# Create a Dask client to connect to the cluster
client = Client(cluster)

# Launch dask on a SLURM cluster
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue='regular',
    account="myaccount",
    cores=128,
    memory="500 GB"
)

cluster.scale(jobs=10)  # ask for 10 jobs

client = Client(cluster)

Contributing

Contributions are welcome! If you encounter any issues, have suggestions, or want to contribute improvements, please submit a pull request or open an issue on the GitHub repository.

Authors

Oscar J. Castro Lopez (oscar.castro@uni.lu)
- Parallel Computing & Optimisation Group (PCOG) - University of Luxembourg

This package is based on the levenpandas package (https://github.com/fangzhou-xie/levenpandas).

License

This project is licensed under the MIT License.

======= History

1.0.0 (2023-12-06) 0.1.0 (2023-06-21)

First release on PyPI.

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Release history Release notifications | RSS feed

This version

1.0.0

Dec 6, 2023

0.1.0

Dec 6, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fuzzymerge_parallel-1.0.0.tar.gz (46.2 kB view hashes)

Uploaded Dec 6, 2023 Source

Built Distribution

fuzzymerge_parallel-1.0.0-py2.py3-none-any.whl (16.5 kB view hashes)

Uploaded Dec 6, 2023 Python 2 Python 3

Hashes for fuzzymerge_parallel-1.0.0.tar.gz

Hashes for fuzzymerge_parallel-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c328029c1781f6266ddf872f10b81cfc3632412e55b7491c2d12bf7ec36dd1a9`
MD5	`ebb7a48383f6b648a81b7c113c511cf2`
BLAKE2b-256	`6342e277fa80876430126dd47cb7fc6e7c1f04208fdf6d4474b265467a058aad`

Hashes for fuzzymerge_parallel-1.0.0-py2.py3-none-any.whl

Hashes for fuzzymerge_parallel-1.0.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`2bc323b284389891cd8d05c5acfd875ca360a217fd678230977386c13e1b880e`
MD5	`7ccf6ad51eb14e392a821e9592bde21a`
BLAKE2b-256	`bc2c1eaa20c78c03c28b96585be6d9150871f2cafab99d0d17727ac724a57087`

fuzzymerge-parallel 1.0.0

Navigation

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Project description

fuzzymerge_parallel

Description

1. Multiprocessing mode

2. Dask mode

Features

Installation

Install from PyPi

Install from GitHub

Dependencies

Description

Usage

Single node

Sequential execution

Multiprocessing execution

Multi-node (dask)

Local client

Contributing

Authors

License

======= History

1.0.0 (2023-12-06) 0.1.0 (2023-06-21)

Project details

Verified details

Maintainers

Unverified details

Project links

GitHub Statistics

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution