Skip to main content

Intuitive way of using fuzz matching in pandas

Project description

Intuitive way of using fuzz matching in pandas

Updates

05.10.2022 - Added compare rows

Installation

#Try it first like this: 
#rapidfuzz is a lot faster than fuzzywuzzy, but I had some problems installing it, #even with Visual C++ 2019 redistributable installed   a-pandas-ex-fuzz will try to import this module first
pip install a-pandas-ex-plode-tool
pip install a-pandas-ex-df-to-string
pip install rapidfuzz #https://github.com/maxbachmann/RapidFuzz
pip install --no-deps a-pandas-ex-fuzz

#if rapidfuzz does not work, use:
pip install a-pandas-ex-plode-tool
pip install a-pandas-ex-df-to-string
pip install fuzzywuzzy 
pip install --no-deps a-pandas-ex-fuzz


 #Or if you want to try to install everything:
 pip install a-pandas-ex-fuzz

Compare values in column against each other: pandas.Series.s_fuzz_all_values_in_col()

from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching() #adds three new methods to pd.   
import pandas as pd


df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )  
df11 = df.Name.s_fuzz_all_values_in_column(
    limit=5, merge_with_series=True, partial_full_weighted="weighted"
)
df22 = df.Name.s_fuzz_all_values_in_column(
    limit=2, merge_with_series=False, partial_full_weighted="full"
)
df33 = df.Name.s_fuzz_all_values_in_column(
    limit=1, merge_with_series=True, partial_full_weighted="partial"
)

df22

    0  Braund...     70.833333          477    Cann, ...     63.829787
1  Angle,...     55.445545          518    Astor,...     53.061224
2  Sinkko...     79.069767          747    Honkan...     77.272727
3  Futrel...     77.142857          137    Potter...     52.873563
4  Gilles...     84.615385          722    Saunde...     77.777778
5  Bracke...     77.777778          221    Scanla...     76.470588
6  O'Brie...     65.116279          552    Maisne...     58.536585
7  Goodwi...     68.852459          386    Palsso...     67.857143
8  Rosblo...     62.068966          254    Hockin...      59.52381
9  Nasser...     74.074074          122    Astor,...     58.536585
  fuzz_index_1
0         37
1        700
2        216
3        879
4         12
5        468
6        464
7        374
8        774
9        700

    Parameters:
        df: [pd.Series]
        limit: int
            How many results do you want to have?
            Each result will have 3 columns [string, match, position in column]
            (default=5)
        partial_full_weighted: str
            weighted = fuzz.WRatio
            full = fuzz.ratio
            partial = fuzz.partial_ratio
            (default="weighted")
        merge_with_series: str
            (default=True)
    Returns:
        pd.DataFrame

Compare values in column against list: pandas.Series.s_fuzz_from_list()

from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching() #adds three new methods to pd.   
import pandas as pd   

df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    ) 

df111 = df.Name.s_fuzz_from_list(
    list_to_compare=["Johannes", "Paulo", "Kevin"],
    limit=2,
    merge_with_series=True,
    partial_full_weighted="partial",
)
df222 = df.Name.s_fuzz_from_list(
    list_to_compare=["John", "Johannes", "Paulo", "Kevin"],
    limit=3,
    merge_with_series=False,
    partial_full_weighted="full",
)
df333 = df.Name.s_fuzz_from_list(
    list_to_compare=["Maria", "Anna"],
    limit=1,
    merge_with_series=False,
    partial_full_weighted="partial",
)
df333
        fuzz_string_0 fuzz_match_0 fuzz_index_0
0           Maria         60.0            0
1           Maria    44.444444            0
2            Anna         75.0            1
3           Maria         40.0            0
4           Maria         40.0            0
..            ...          ...          ...
886         Maria         40.0            0
887         Maria         80.0            0
888         Maria         60.0            0
889         Maria         40.0            0
890         Maria         60.0            0
[891 rows x 3 columns]

    Parameters:
        df: [pd.Series]
        list_to_compare: list
            The strings you want to be compared
        limit: int
            How many results do you want to have?
            Each result will have 3 columns [string, match, position in column]
            (default=5)
        partial_full_weighted: str
            weighted = fuzz.WRatio
            full = fuzz.ratio
            partial = fuzz.partial_ratio
            (default="weighted")
        merge_with_series: str
            (default=True)
    Returns:
        pd.DataFrame

Compare values in column against list: pandas.Series.s_fuzz_one_word()

from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching() #adds three new methods to pd.   
import pandas as pd   

df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    ) 

df1 = df.Name.s_fuzz_one_word(
word_to_search="Karolina", partial_full_weighted="weighted"
)
df2 = df.Name.s_fuzz_one_word(word_to_search="Karolina", partial_full_weighted="full")
df3 = df.Name.s_fuzz_one_word(
    word_to_search="Karolina", partial_full_weighted="partial"
)
df1
                                                  Name fuzz_string_0  \
0                              Braund, Mr. Owen Harris      Karolina
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)      Karolina
2                               Heikkinen, Miss. Laina      Karolina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)      Karolina
4                             Allen, Mr. William Henry      Karolina
5                                     Moran, Mr. James      Karolina
6                              McCarthy, Mr. Timothy J      Karolina
7                       Palsson, Master. Gosta Leonard      Karolina
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      Karolina
9                  Nasser, Mrs. Nicholas (Adele Achem)      Karolina
   fuzz_match_0
0     41.538462
1     33.750000
2     60.000000
3     33.750000
4     42.750000
5     30.000000
6     27.692308
7     45.000000
8     45.600000
9     42.750000

df2
                                                  Name fuzz_string_0  \
0                              Braund, Mr. Owen Harris      Karolina
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)      Karolina
2                               Heikkinen, Miss. Laina      Karolina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)      Karolina
4                             Allen, Mr. William Henry      Karolina
5                                     Moran, Mr. James      Karolina
6                              McCarthy, Mr. Timothy J      Karolina
7                       Palsson, Master. Gosta Leonard      Karolina
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      Karolina
9                  Nasser, Mrs. Nicholas (Adele Achem)      Karolina
   fuzz_match_0
0     32.258065
1     17.241379
2     33.333333
3     15.686275
4     31.250000
5     25.000000
6     19.354839
7     31.578947
8     21.428571
9     23.809524

df3
                                                  Name fuzz_string_0  \
0                              Braund, Mr. Owen Harris      Karolina
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)      Karolina
2                               Heikkinen, Miss. Laina      Karolina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)      Karolina
4                             Allen, Mr. William Henry      Karolina
5                                     Moran, Mr. James      Karolina
6                              McCarthy, Mr. Timothy J      Karolina
7                       Palsson, Master. Gosta Leonard      Karolina
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      Karolina
9                  Nasser, Mrs. Nicholas (Adele Achem)      Karolina
   fuzz_match_0
0     46.153846
1     37.500000
2     66.666667
3     37.500000
4     46.153846
5     33.333333
6     30.769231
7     50.000000
8     50.000000
9     40.000000

    Parameters:
        df: [pd.Series]
        word_to_search: str
        partial_full_weighted: str
            weighted = fuzz.WRatio
            full = fuzz.ratio
            partial = fuzz.partial_ratio
            (default="weighted")
    Returns:
        pd.DataFrame

pandas.Series.ds_fuzz_compare_row_to_others/ pandas.DataFrame.ds_fuzz_compare_row_to_others

    from a_pandas_ex_fuzz import pd_add_fuzzy_matching
    pd_add_fuzzy_matching()
    import pandas as pd
    df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")
    df.ds_fuzz_compare_row_to_others(2,loc_or_iloc='iloc', partial_full_weighted='full', sort_values=True)    
    
    
    Out[4]:   
    
         PassengerId  Survived  Pclass  ... Cabin Embarked  aa_fuzz_match
    2              3         1       3  ...   NaN        S     100.000000
    216          217         1       3  ...   NaN        S      90.816327
    816          817         0       3  ...   NaN        S      88.118812
    382          383         0       3  ...   NaN        S      83.769634
    400          401         1       3  ...   NaN        S      83.769634
    ..           ...       ...     ...  ...   ...      ...            ...
    745          746         0       1  ...   B22        S      54.450262
    556          557         1       1  ...   A16        C      53.744493
    581          582         1       1  ...   C68        C      53.456221
    669          670         1       1  ...  C126        S      52.132701
    307          308         1       1  ...   C65        C      51.612903
    [891 rows x 13 columns]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

a_pandas_ex_fuzz-0.12.tar.gz (18.7 kB view details)

Uploaded Source

Built Distribution

a_pandas_ex_fuzz-0.12-py3-none-any.whl (20.0 kB view details)

Uploaded Python 3

File details

Details for the file a_pandas_ex_fuzz-0.12.tar.gz.

File metadata

  • Download URL: a_pandas_ex_fuzz-0.12.tar.gz
  • Upload date:
  • Size: 18.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for a_pandas_ex_fuzz-0.12.tar.gz
Algorithm Hash digest
SHA256 c0a29a686e66ea96f9baff362424e0ecdd5ca5c1cf0080ecb0987a73f64c58fa
MD5 021afea85fd77b7ee9ac30d72608181a
BLAKE2b-256 f7a87317ff2b1f5648f31d2b18af9f1ec4bb1631f57048cfc38820164aef1a5e

See more details on using hashes here.

Provenance

File details

Details for the file a_pandas_ex_fuzz-0.12-py3-none-any.whl.

File metadata

File hashes

Hashes for a_pandas_ex_fuzz-0.12-py3-none-any.whl
Algorithm Hash digest
SHA256 c10e7353cc0db481a94eaf578f5932da156cf1a7991839455c1f0b3428d22630
MD5 28da37a69ac53bed0f1f41630386366d
BLAKE2b-256 2f8c50526c72e5ec103dd89073cc2bd2483d4a4ead35afb0c27dbe263eacffe2

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page