Merges two DataFrames using fuzzy matching on specified columns
Tested against Windows / Python 3.11 / Anaconda
pip install a-pandas-ex-fuzzymerge
This function performs fuzzy matching between two DataFrames `df1` and `df2`
based on the columns specified in `left_on` and `right_on`. Fuzzy matching finds
values in these columns that are similar rather than identical, which makes it
useful for matching data with small variations, such as typos or abbreviations.
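To make the idea of a similarity score concrete, here is a minimal sketch that uses only the standard library. It is an illustration, not the package's code: `difflib.SequenceMatcher` stands in for a rapidfuzz scorer, and the helper name `similarity` and the sample strings are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for a rapidfuzz scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


reference = "Braund, Mr. Owen Harris"

# A typo and an abbreviation should still score close to the reference,
# while an unrelated name should score much lower.
typo = similarity(reference, "Braund, Mr. Owen Haris")
abbreviation = similarity(reference, "Braund, Mr. O. Harris")
unrelated = similarity(reference, "Heikkinen, Miss. Laina")

print(typo, abbreviation, unrelated)
```

A fuzzy merge treats two rows as a match when a score like this is high enough, instead of requiring the keys to be equal.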
Parameters:
- df1 (DataFrame): The first DataFrame to be merged.
- df2 (DataFrame): The second DataFrame to be merged.
- right_on (str): The column name in `df2` to be used for matching.
- left_on (str): The column name in `df1` to be used for matching.
- usedtype (numpy.dtype, optional): The data type to use for the distance matrix. Defaults to `np.uint8`.
- scorer (function, optional): The scoring function to use for fuzzy matching. Defaults to `fuzz.WRatio`.
- concat_value (bool, optional): Whether to add a 'concat_value' column in the result DataFrame, containing the similarity scores. Defaults to `True`.
- **kwargs: Additional keyword arguments to pass to the `pandas.merge` function.
Returns:
- DataFrame: A merged DataFrame with rows that matched based on the specified fuzzy criteria.
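The parameters above suggest a pipeline of score matrix, best match per row, then a regular pandas merge. The sketch below shows one way such a pipeline could look. It is not the package's implementation, which this page does not show: `fuzzy_merge_sketch`, `score`, the `min_score` cutoff, and the sample frames are all hypothetical, and `difflib` replaces rapidfuzz so the block needs only numpy and pandas. It also shows why `np.uint8` is a sensible matrix dtype: scores in the range 0 to 100 fit in one byte per pair.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def score(a: str, b: str) -> int:
    # Hypothetical stand-in scorer returning an integer in 0..100.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_merge_sketch(df1, df2, left_on, right_on, min_score=80, **kwargs):
    left = df1[left_on].astype(str).tolist()
    right = df2[right_on].astype(str).tolist()

    # Scores never exceed 100, so one byte per (left, right) pair is enough.
    scores = np.zeros((len(left), len(right)), dtype=np.uint8)
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            scores[i, j] = score(a, b)

    # Best-scoring df2 row for each df1 row, and the score it reached.
    best = scores.argmax(axis=1)
    best_score = scores[np.arange(len(left)), best]

    matched = df1.assign(_match_idx=best, concat_value=best_score)
    matched = matched[matched["concat_value"] >= min_score]

    # Ordinary pandas merge on the matched row position; extra keyword
    # arguments are forwarded to it.
    merged = matched.merge(
        df2.reset_index(drop=True), left_on="_match_idx", right_index=True, **kwargs
    )
    return merged.drop(columns="_match_idx").reset_index(drop=True)


df1 = pd.DataFrame({"name": ["Jon Smith", "Maria Garcia", "Wei Chen"], "age": [31, 45, 28]})
df2 = pd.DataFrame({"name": ["John Smith", "Maria Garcia", "Zzz Qqq"], "city": ["Leeds", "Madrid", "Oslo"]})

result = fuzzy_merge_sketch(df1, df2, left_on="name", right_on="name")
print(result)
```

The full score matrix has `len(df1) * len(df2)` entries, so memory grows quickly with the size of both frames; a one-byte dtype keeps that cost as low as possible.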
Example:
from a_pandas_ex_fuzzymerge import pd_add_fuzzymerge
import pandas as pd
import numpy as np
from rapidfuzz import fuzz

# Add the d_fuzzy_merge method to pandas DataFrames.
pd_add_fuzzymerge()

# Load the Titanic sample data.
df1 = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)

# Triple both frames and append a random number to every name,
# so that the names no longer match exactly.
df2 = pd.concat([df1] * 3, ignore_index=True)
df2.Name = df2.Name + np.random.uniform(1, 2000, len(df2)).astype("U")
df1 = pd.concat([df1] * 3, ignore_index=True)
df1.Name = df1.Name + np.random.uniform(1, 2000, len(df1)).astype("U")

# Merge the two frames on the perturbed names.
df3 = df1.d_fuzzy_merge(df2, right_on='Name', left_on='Name', usedtype=np.uint8,
                        scorer=fuzz.partial_ratio, concat_value=True)
print(df3)
File details
Details for the file a_pandas_ex_fuzzymerge-0.10.tar.gz.
File metadata
- Download URL: a_pandas_ex_fuzzymerge-0.10.tar.gz
- Upload date:
- Size: 22.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 757b1d8511570adc1be41c3732f9b93e895e318de93a2af6c12c9d148d791a16
MD5 | 3cad1120edd1697734a8e912561f7b1c
BLAKE2b-256 | 6aa46a4f9217e0a30abfb127478c19539b66cff1d4aea6d7170323035ae59be0
File details
Details for the file a_pandas_ex_fuzzymerge-0.10-py3-none-any.whl.
File metadata
- Download URL: a_pandas_ex_fuzzymerge-0.10-py3-none-any.whl
- Upload date:
- Size: 23.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 5701d08ce76cc3a0668f9e1c3c622a2def97671e8dd5cd9df38dce3ecbe10601
MD5 | 144bc03787efce8f448807d70b2d6d1f
BLAKE2b-256 | 7e4a42e8db0a2db08ab7751bf2ae8eed8f3dee494267572b959d22f5f1ad1e96