Toolkit for performing fuzzy joins with Symspell framework
Project description
FuzzyPanda
FuzzyPanda was created to support fuzzy join operations with Pandas DataFrames using Python Ver. 3. These fuzzy joins are a form of approximate string matching to join relational data that contain "errors" or minor modifications that preclude direct string comparison.
FuzzyPanda will match strings that
- Are within a user-specified edit distance (e.g. "test" == "taste" with edit distance 2)
- Are independent of case (e.g. "Test" == "test")
- Are Whitespace-delimited strings are matched regardless of token order (e.g. "dark and stormy night" == "stormy and dark night")
- Are independent of special symbols (e.g. "this-string" == "this string")
The criteria in steps 2-4 can be modified via modification of the fuzzypanda.preprocess.PreProcessor
class.
The primary API is the fuzzypanda.matching.get_fuzzy_columns
function that takes two Pandas DataFrames and a set of column names, and creates a new column in the "left" DataFrame that contains the closest entries by string edit distance to the associated values in the "right" DataFrame columns. The Pandas merge
or join
functions can later be used to perform full joins on the DataFrames.
Installation
FuzzyPanda can be installed using pip
:
pip install fuzzypanda
Usage
This version of FuzzyPanda currently supports the fuzzypanda.matching.get_fuzzy_columns
function. More functions are expected in future releases.
Create Fuzzy Matched Columns
Main fuzzy joining API for the fuzzy joining of the given left_dataframe
and right_dataframe
. Given a string or list of strings to the cols argument, this function will add fuzzy columns to the left_dataframe
that best match the columns of the right_dataframe
. This operation can then be followed up with a Pandas merge
or join
to perform the actual joining operation.
fuzzypanda.matching.get_fuzzy_columns
Arguments:left_dataframe
(pandas.DataFrame): left Pandas dataframe to which columns will be addedright_dataframe
(pandas.DataFrame): right Pandas dataframe from which fuzzy values in theleft_dataframe
will be compared and suggestedleft_cols
(List(str)): A list of strings of column names present inleft_dataframe
that will be compared to the corresponding columns inright_dataframe
.right_cols
(List(str)): A list of strings of column names present inright_dataframe
used for comparison to those in given inleft_dataframe
. If both dataframes share the column names on which fuzzy columns will be created, this parameter can be set toNone
and the values given inleft_cols
will be used as the names in both dataframes. Default isNone
.null_return
(string): The string used if a match isn't found. Can be used to set NULL values if a fuzzy match isn't found in theright_dataframe
. Setting toNone
will return the string used to search for the fuzzy value. Default isNone
.preprocesser
: an instance of thefuzzypanda.preprocess.PreProcessor
class containing thepreprocess
method used to pre-process the input strings. If set toNone
, will instantiate the default pre-processor. This option can be used to create a custom pre-processor to pass to theget_fuzzy_columns
function. Default isNone
max_edit_distance
(int): The maximum edit distance that will be considered when comparing columns. The higher the number, the more "incorrect" theleft_dataframe
columns can be to be searched in theright_dataframe
columns. Increasing this number heavily impacts runtime and should be set as low as possible. Default is 2.
- Returns: Performs an in-place creation of fuzzy columns within
left_dataframe
. Each given left column inleft_cols
will have a'fuzzy_' + left_col_name
corresponding to the matched column.
get_fuzzy_columns Example
Suppose you wish to join the following two dataframes on columns col_1
and col_2
, where the columns in left_df
contain entries that are misspelled and/or jumbled tokens of those in right_df
:
print(left_df)
> ID col_1 col_2
> 0 123314 kitten oboe
> 1 123213 siting trvmpet
> 2 43543 the times of best over te rainbow
> 3 35435 the worst times in Symphony C
> 4 987 not in there not in there
print(right_df)
> ID col_1 col_2
> 0 12783314 kitten oboe
> 1 12352213 sitting trumpet
> 2 43233543 the best of times over the rainbow
> 3 23432420 the worst of times Symphony in C#
We can now call fuzzypanda.matching.get_fuzzy_columns
. Notice that the results are columns added to left_df
in-place, rather than returning a new DataFrame.
fuzzypanda.matching.get_fuzzy_columns(left_dataframe=left_df,
right_dataframe=right_df,
left_cols=['col_1', 'col_2'])
print(left_df)
> ID col_1 col_2 fuzzy_col_1 \
> 0 123314 kitten oboe kitten
> 1 123213 siting trvmpet sitting
> 2 43543 the times of best over te rainbow the best of times
> 3 35435 the worst times in Symphony C the worst of times
> 4 987 not in there not in there not in there
>
> fuzzy_col_2
> 0 oboe
> 1 trumpet
> 2 over the rainbow
> 3 Symphony in C#
> 4 not in there
Methodology
This package uses the Symspell Python port symspellpy by mammothb of the original C# implementation of Symspell by Wolf Garbe. This fuzzy column creation approach applies a Pandas-friendly wrapper around the Symspell Symmetric Delete spelling correction algorithm to allow substantially faster fuzzy joining. Tools such as fuzzywuzzy will run in Omega(mn) to find the best-matching strings in a column of n values compared to the m values of another column, whereas this model is expected to have a runtime of Omega(m + n) due to the pre-processing of the right DataFrame columns as a spellchecker corpus that searched using the Symmetric Delete spelling correction algorithm.
This method is best suited for fuzzy searches of large DataFrames due to the comparatively large amount of pre-processing but faster search performance.
The algorithm operates as follows:
- A "left" Pandas DataFrame and a "right" Pandas DataFrame are input to
get_fuzzy_columns
with the column names used for comparison. - Each right DataFrame is copied into a temporary corpus text file.
- Each entry in the corpus text file is preprocessed using either the default
fuzzypanda.preprocess.PreProcessor
or a user-supplied object containing apreprocess
method and copied to another preprocessed text file. An in-memory index is created to translate processed strings to preprocessed strings. - A symspellpy object is instantiated and the corpus file is used to create a lookup dictionary.
- Each record from the left DataFrame is preprocessed and queried from the dictionary using the
symspellpy.lookup
function to find the closest string in terms of edit distance, and the suggested string (or a substitute string if one isn't found) is placed in an intemediate list. - When all records of the left DataFrame have been processed, a new column containing the results of the fuzzy lookup is added to the left DataFrame in a column labeled 'fuzzy_' + queried column name.
Future Work
- Directly implement pandas
merge
andjoin
- Replace
symspellpy
with a C++ implementation of Symspell to speed lookup calculations - Create option for multiprocessing and multithreading column record queries.
- Add API to directly process CSV files
- Add API to use Pandas DataFrame chunks
- Expand functionality to use SparkSQL DataFrames
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file fuzzypanda-0.1.1.tar.gz
.
File metadata
- Download URL: fuzzypanda-0.1.1.tar.gz
- Upload date:
- Size: 13.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 58821907fb53a0758ddd63cbe0253516e4eead370d59799d46748719364a2137 |
|
MD5 | 956d9c50b87c25a99f78cfc93f1ce904 |
|
BLAKE2b-256 | 79c95022f76fa9b6aa7695bfe126ef19740bf70eda0f88cfb4e665df13b9d3a8 |
File details
Details for the file fuzzypanda-0.1.1-py3-none-any.whl
.
File metadata
- Download URL: fuzzypanda-0.1.1-py3-none-any.whl
- Upload date:
- Size: 12.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.7.4
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 99e4eb7776b85c1fc3dac027c95e091225281987f1c129fe4ef3749d92edfdce |
|
MD5 | a47889eb537cb5b5e4a2d7cc1987ed70 |
|
BLAKE2b-256 | befe7863f9566f8df73ac0b666bcac4fef8014581cdf40160dad19a2a62c380f |