
An intuitive way to use fuzzy matching in pandas


Installation

# Try this first:
# rapidfuzz is much faster than fuzzywuzzy, but I had some problems installing it,
# even with the Visual C++ 2019 redistributable installed.
# a-pandas-ex-fuzz tries to import rapidfuzz first.
pip install a-pandas-ex-plode-tool
pip install a-pandas-ex-df-to-string
pip install rapidfuzz #https://github.com/maxbachmann/RapidFuzz
pip install --no-deps a-pandas-ex-fuzz

# If rapidfuzz does not work, use fuzzywuzzy instead:
pip install a-pandas-ex-plode-tool
pip install a-pandas-ex-df-to-string
pip install fuzzywuzzy 
pip install --no-deps a-pandas-ex-fuzz


# Or, if you want to try installing everything at once:
pip install a-pandas-ex-fuzz
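
To check which of the two backends is actually available after installation, something like the following may help. It is only a sketch: `pick_fuzz_backend` is a hypothetical helper written for this example, not part of a-pandas-ex-fuzz, and it only mirrors the "rapidfuzz first, fuzzywuzzy second" order described above.

```python
from importlib.util import find_spec


def pick_fuzz_backend(candidates=("rapidfuzz", "fuzzywuzzy")):
    """Return the first installed module name from `candidates`, or None.

    Hypothetical helper: it only checks whether a module can be found,
    it does not import it.
    """
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


backend = pick_fuzz_backend()
print(backend or "no fuzzy-matching backend installed")
```

If this prints `fuzzywuzzy`, the slower backend will presumably be used; if it prints the fallback message, neither package is installed in the current environment.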

Compare values in column against each other: Series.s_fuzz_all_values_in_column()

from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching()  # adds three new methods to pd.Series
import pandas as pd


df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
df11 = df.Name.s_fuzz_all_values_in_column(
    limit=5, merge_with_series=True, partial_full_weighted="weighted"
)
df22 = df.Name.s_fuzz_all_values_in_column(
    limit=2, merge_with_series=False, partial_full_weighted="full"
)
df33 = df.Name.s_fuzz_all_values_in_column(
    limit=1, merge_with_series=True, partial_full_weighted="partial"
)

df22

   fuzz_string_0   fuzz_match_0   fuzz_index_0  fuzz_string_1   fuzz_match_1   fuzz_index_1
0      Braund...      70.833333            477      Cann, ...      63.829787             37
1      Angle,...      55.445545            518      Astor,...      53.061224            700
2      Sinkko...      79.069767            747      Honkan...      77.272727            216
3      Futrel...      77.142857            137      Potter...      52.873563            879
4      Gilles...      84.615385            722      Saunde...      77.777778             12
5      Bracke...      77.777778            221      Scanla...      76.470588            468
6      O'Brie...      65.116279            552      Maisne...      58.536585            464
7      Goodwi...      68.852459            386      Palsso...      67.857143            374
8      Rosblo...      62.068966            254      Hockin...       59.52381            774
9      Nasser...      74.074074            122      Astor,...      58.536585            700

	Parameters:
		df: [pd.Series]
		limit: int
			How many results do you want to have?
			Each result will have 3 columns [string, match, position in column]
			(default=5)
		partial_full_weighted: str
			weighted = fuzz.WRatio
			full = fuzz.ratio
			partial = fuzz.partial_ratio
			(default="weighted")
		merge_with_series: bool
			Whether to merge the results with the original Series
			(default=True)
	Returns:
		pd.DataFrame
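
The idea behind this method can be sketched in plain Python: every value is scored against every other value in the column, and the best `limit` results are kept as (string, match, position) triples. This is not the library's implementation. `score` uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the numbers will differ from the output above, and skipping the self-match is an assumption. `top_matches_in_column` is a hypothetical name.

```python
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in scorer on a 0-100 scale; NOT fuzz.ratio or fuzz.WRatio.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def top_matches_in_column(values, limit=5):
    """For each value, return up to `limit` (string, match, position) tuples."""
    results = []
    for i, value in enumerate(values):
        others = [
            (other, score(value, other), j)
            for j, other in enumerate(values)
            if j != i  # skip the value itself (assumption)
        ]
        others.sort(key=lambda t: t[1], reverse=True)  # best match first
        results.append(others[:limit])
    return results


names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Braund, Mr. Lewis Richard",
]
for name, matches in zip(names, top_matches_in_column(names, limit=1)):
    print(name, "->", matches[0])
```

Note that this compares all pairs, so the work grows quadratically with the length of the column.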

Compare values in column against list: Series.s_fuzz_from_list()

from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching()  # adds three new methods to pd.Series
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)

df111 = df.Name.s_fuzz_from_list(
    list_to_compare=["Johannes", "Paulo", "Kevin"],
    limit=2,
    merge_with_series=True,
    partial_full_weighted="partial",
)
df222 = df.Name.s_fuzz_from_list(
    list_to_compare=["John", "Johannes", "Paulo", "Kevin"],
    limit=3,
    merge_with_series=False,
    partial_full_weighted="full",
)
df333 = df.Name.s_fuzz_from_list(
    list_to_compare=["Maria", "Anna"],
    limit=1,
    merge_with_series=False,
    partial_full_weighted="partial",
)
df333
    fuzz_string_0 fuzz_match_0 fuzz_index_0
0           Maria         60.0            0
1           Maria    44.444444            0
2            Anna         75.0            1
3           Maria         40.0            0
4           Maria         40.0            0
..            ...          ...          ...
886         Maria         40.0            0
887         Maria         80.0            0
888         Maria         60.0            0
889         Maria         40.0            0
890         Maria         60.0            0
[891 rows x 3 columns]

	Parameters:
		df: [pd.Series]
		list_to_compare: list
			The strings you want to be compared
		limit: int
			How many results do you want to have?
			Each result will have 3 columns [string, match, position in list]
			(default=5)
		partial_full_weighted: str
			weighted = fuzz.WRatio
			full = fuzz.ratio
			partial = fuzz.partial_ratio
			(default="weighted")
		merge_with_series: bool
			Whether to merge the results with the original Series
			(default=True)
	Returns:
		pd.DataFrame
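
As a rough illustration of what the three result columns mean here, the sketch below ranks the entries of `list_to_compare` for each value and keeps the best `limit` of them. The position in each triple is the entry's position in the list, which matches the `fuzz_index_0` values (0 for "Maria", 1 for "Anna") in the output above. `rank_candidates` and the sample names are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for the library's scorers.

```python
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in scorer on a 0-100 scale; NOT the scorer used by the library.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def rank_candidates(values, list_to_compare, limit=5):
    """For each value, return up to `limit` (candidate, match, position in list)."""
    rows = []
    for value in values:
        ranked = sorted(
            ((cand, score(value, cand), pos) for pos, cand in enumerate(list_to_compare)),
            key=lambda t: t[1],
            reverse=True,  # best match first
        )
        rows.append(ranked[:limit])
    return rows


# Made-up sample values, not taken from the Titanic data set.
rows = rank_candidates(["Anna Smith", "Maria Jones"], ["Maria", "Anna"], limit=1)
for row in rows:
    print(row[0])
```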

Compare values in column against one word: Series.s_fuzz_one_word()

from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching()  # adds three new methods to pd.Series
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)

df1 = df.Name.s_fuzz_one_word(
    word_to_search="Karolina", partial_full_weighted="weighted"
)
df2 = df.Name.s_fuzz_one_word(
    word_to_search="Karolina", partial_full_weighted="full"
)
df3 = df.Name.s_fuzz_one_word(
    word_to_search="Karolina", partial_full_weighted="partial"
)
df1
                                                  Name fuzz_string_0  \
0                              Braund, Mr. Owen Harris      Karolina
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)      Karolina
2                               Heikkinen, Miss. Laina      Karolina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)      Karolina
4                             Allen, Mr. William Henry      Karolina
5                                     Moran, Mr. James      Karolina
6                              McCarthy, Mr. Timothy J      Karolina
7                       Palsson, Master. Gosta Leonard      Karolina
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      Karolina
9                  Nasser, Mrs. Nicholas (Adele Achem)      Karolina
   fuzz_match_0
0     41.538462
1     33.750000
2     60.000000
3     33.750000
4     42.750000
5     30.000000
6     27.692308
7     45.000000
8     45.600000
9     42.750000

df2
                                                  Name fuzz_string_0  \
0                              Braund, Mr. Owen Harris      Karolina
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)      Karolina
2                               Heikkinen, Miss. Laina      Karolina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)      Karolina
4                             Allen, Mr. William Henry      Karolina
5                                     Moran, Mr. James      Karolina
6                              McCarthy, Mr. Timothy J      Karolina
7                       Palsson, Master. Gosta Leonard      Karolina
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      Karolina
9                  Nasser, Mrs. Nicholas (Adele Achem)      Karolina
   fuzz_match_0
0     32.258065
1     17.241379
2     33.333333
3     15.686275
4     31.250000
5     25.000000
6     19.354839
7     31.578947
8     21.428571
9     23.809524

df3
                                                  Name fuzz_string_0  \
0                              Braund, Mr. Owen Harris      Karolina
1  Cumings, Mrs. John Bradley (Florence Briggs Thayer)      Karolina
2                               Heikkinen, Miss. Laina      Karolina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)      Karolina
4                             Allen, Mr. William Henry      Karolina
5                                     Moran, Mr. James      Karolina
6                              McCarthy, Mr. Timothy J      Karolina
7                       Palsson, Master. Gosta Leonard      Karolina
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      Karolina
9                  Nasser, Mrs. Nicholas (Adele Achem)      Karolina
   fuzz_match_0
0     46.153846
1     37.500000
2     66.666667
3     37.500000
4     46.153846
5     33.333333
6     30.769231
7     50.000000
8     50.000000
9     40.000000

	Parameters:
		df: [pd.Series]
		word_to_search: str
		partial_full_weighted: str
			weighted = fuzz.WRatio
			full = fuzz.ratio
			partial = fuzz.partial_ratio
			(default="weighted")
	Returns:
		pd.DataFrame
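
The difference between "full" and "partial" in the outputs above can be illustrated with a small standard-library sketch. It is an approximation only: `full_ratio` uses `difflib.SequenceMatcher` instead of `fuzz.ratio`, and `partial_ratio_sketch` is a simplified sliding-window version of the idea behind `fuzz.partial_ratio`, so neither will reproduce the library's exact scores. Both function names are hypothetical.

```python
from difflib import SequenceMatcher


def full_ratio(a, b):
    # Similarity of the two whole strings on a 0-100 scale (stand-in for fuzz.ratio).
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def partial_ratio_sketch(a, b):
    """Best full_ratio of the shorter string against any same-length
    window of the longer string (simplified idea of a partial ratio)."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0.0
    windows = (long_[i:i + len(short)] for i in range(len(long_) - len(short) + 1))
    return max(full_ratio(short, w) for w in windows)


name = "Heikkinen, Miss. Laina"
print(full_ratio("Laina", name))            # lowered by the length difference
print(partial_ratio_sketch("Laina", name))  # 100.0, "Laina" occurs verbatim in the name
```

In this sketch a full comparison penalises the extra characters around the search word, while the partial comparison only looks at the best-fitting substring, which is why "partial" scores tend to be higher when searching for a short word in long names.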
