Toolkit for performing fuzzy joins with Symspell framework
FuzzyPanda was created to support fuzzy join operations with Pandas DataFrames using Python Ver. 3. These fuzzy joins are a form of approximate string matching to join relational data that contain "errors" or minor modifications that preclude direct string comparison.
FuzzyPanda will match strings that
- Are within a user-specified edit distance (e.g. "test" == "taste" with edit distance 2)
- Are independent of case (e.g. "Test" == "test")
- Are Whitespace-delimited strings are matched regardless of token order (e.g. "dark and stormy night" == "stormy and dark night")
- Are independent of special symbols (e.g. "this-string" == "this string")
The criteria in steps 2-4 can be modified via modification of the
The primary API is the
fuzzypanda.matching.get_fuzzy_columns function that takes two Pandas DataFrames and a set of column names, and creates a new column in the "left" DataFrame that contains the closest entries by string edit distance to the associated values in the "right" DataFrame columns. The Pandas
join functions can later be used to perform full joins on the DataFrames.
FuzzyPanda can be installed using
pip install fuzzypanda
This version of FuzzyPanda currently supports the
fuzzypanda.matching.get_fuzzy_columns function. More functions are expected in future releases.
Create Fuzzy Matched Columns
Main fuzzy joining API for the fuzzy joining of the given
right_dataframe. Given a string or list of strings to the cols argument, this function will add fuzzy columns to the
left_dataframe that best match the columns of the
right_dataframe. This operation can then be followed up with a Pandas
join to perform the actual joining operation.
left_dataframe(pandas.DataFrame): left Pandas dataframe to which columns will be added
right_dataframe(pandas.DataFrame): right Pandas dataframe from which fuzzy values in the
left_dataframewill be compared and suggested
left_cols(List(str)): A list of strings of column names present in
left_dataframethat will be compared to the corresponding columns in
right_cols(List(str)): A list of strings of column names present in
right_dataframeused for comparison to those in given in
left_dataframe. If both dataframes share the column names on which fuzzy columns will be created, this parameter can be set to
Noneand the values given in
left_colswill be used as the names in both dataframes. Default is
null_return(string): The string used if a match isn't found. Can be used to set NULL values if a fuzzy match isn't found in the
right_dataframe. Setting to
Nonewill return the string used to search for the fuzzy value. Default is
preprocesser: an instance of the
fuzzypanda.preprocess.PreProcessorclass containing the
preprocessmethod used to pre-process the input strings. If set to
None, will instantiate the default pre-processor. This option can be used to create a custom pre-processor to pass to the
get_fuzzy_columnsfunction. Default is
max_edit_distance(int): The maximum edit distance that will be considered when comparing columns. The higher the number, the more "incorrect" the
left_dataframecolumns can be to be searched in the
right_dataframecolumns. Increasing this number heavily impacts runtime and should be set as low as possible. Default is 2.
- Returns: Performs an in-place creation of fuzzy columns within
left_dataframe. Each given left column in
left_colswill have a
'fuzzy_' + left_col_namecorresponding to the matched column.
Suppose you wish to join the following two dataframes on columns
col_2, where the columns in
left_df contain entries that are misspelled and/or jumbled tokens of those in
print(left_df) > ID col_1 col_2 > 0 123314 kitten oboe > 1 123213 siting trvmpet > 2 43543 the times of best over te rainbow > 3 35435 the worst times in Symphony C > 4 987 not in there not in there print(right_df) > ID col_1 col_2 > 0 12783314 kitten oboe > 1 12352213 sitting trumpet > 2 43233543 the best of times over the rainbow > 3 23432420 the worst of times Symphony in C#
We can now call
fuzzypanda.matching.get_fuzzy_columns. Notice that the results are columns added to
left_df in-place, rather than returning a new DataFrame.
fuzzypanda.matching.get_fuzzy_columns(left_dataframe=left_df, right_dataframe=right_df, left_cols=['col_1', 'col_2']) print(left_df) > ID col_1 col_2 fuzzy_col_1 \ > 0 123314 kitten oboe kitten > 1 123213 siting trvmpet sitting > 2 43543 the times of best over te rainbow the best of times > 3 35435 the worst times in Symphony C the worst of times > 4 987 not in there not in there not in there > > fuzzy_col_2 > 0 oboe > 1 trumpet > 2 over the rainbow > 3 Symphony in C# > 4 not in there
This package uses the Symspell Python port symspellpy by mammothb of the original C# implementation of Symspell by Wolf Garbe. This fuzzy column creation approach applies a Pandas-friendly wrapper around the Symspell Symmetric Delete spelling correction algorithm to allow substantially faster fuzzy joining. Tools such as fuzzywuzzy will run in Omega(mn) to find the best-matching strings in a column of n values compared to the m values of another column, whereas this model is expected to have a runtime of Omega(m + n) due to the pre-processing of the right DataFrame columns as a spellchecker corpus that searched using the Symmetric Delete spelling correction algorithm.
This method is best suited for fuzzy searches of large DataFrames due to the comparatively large amount of pre-processing but faster search performance.
The algorithm operates as follows:
- A "left" Pandas DataFrame and a "right" Pandas DataFrame are input to
get_fuzzy_columnswith the column names used for comparison.
- Each right DataFrame is copied into a temporary corpus text file.
- Each entry in the corpus text file is preprocessed using either the default
fuzzypanda.preprocess.PreProcessoror a user-supplied object containing a
preprocessmethod and copied to another preprocessed text file. An in-memory index is created to translate processed strings to preprocessed strings.
- A symspellpy object is instantiated and the corpus file is used to create a lookup dictionary.
- Each record from the left DataFrame is preprocessed and queried from the dictionary using the
symspellpy.lookupfunction to find the closest string in terms of edit distance, and the suggested string (or a substitute string if one isn't found) is placed in an intemediate list.
- When all records of the left DataFrame have been processed, a new column containing the results of the fuzzy lookup is added to the left DataFrame in a column labeled 'fuzzy_' + queried column name.
- Directly implement pandas
symspellpywith a C++ implementation of Symspell to speed lookup calculations
- Create option for multiprocessing and multithreading column record queries.
- Add API to directly process CSV files
- Add API to use Pandas DataFrame chunks
- Expand functionality to use SparkSQL DataFrames
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
|Filename, size||File type||Python version||Upload date||Hashes|
|Filename, size fuzzypanda-0.1.1-py3-none-any.whl (12.3 kB)||File type Wheel||Python version py3||Upload date||Hashes View|
|Filename, size fuzzypanda-0.1.1.tar.gz (13.1 kB)||File type Source||Python version None||Upload date||Hashes View|
Hashes for fuzzypanda-0.1.1-py3-none-any.whl