FuzzyExact

Perform fuzzy matching against two pandas dataframes with optional exact matches.

Requirements

Python 2.7 or higher
Pandas
FuzzyWuzzy
python-Levenshtein (optional)


Installation

pip install fuzzyexact

Usage

import fuzzyexact

remove_punctuation

fuzzyexact.remove_punctuation(df, 'column_name')

remove_punctuation is a helper function that strips all punctuation from a column of a pandas dataframe and returns the cleaned dataframe.
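
As a rough illustration of the kind of cleaning described above, the sketch below strips punctuation from individual string values using only the standard library. It is not FuzzyExact's implementation: the function name strip_punctuation is hypothetical, and "punctuation" is assumed to mean the ASCII characters in string.punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Assumption: "punctuation" means string.punctuation; the library may differ.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(value):
    """Return value with ASCII punctuation removed (hypothetical helper)."""
    return value.translate(_PUNCT_TABLE)


names = ["O'Brien, Jr.", "Smith-Jones", "ACME Inc."]
cleaned = [strip_punctuation(n) for n in names]
print(cleaned)
```

With pandas, a per-value function like this could be applied to a whole column through Series.map.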

clean_address

fuzzyexact.clean_address(df, 'address_column_name')

clean_address is a helper function that cleans an address column of a pandas dataframe by capitalizing the text, abbreviating road types (street > ST, road > RD), and stripping out building and suite numbers.
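
The three steps named above can be sketched for a single address string as follows. This is an assumption-laden illustration, not the library's code: normalize_address is a hypothetical name, "capitalizing" is read as upper-casing, only the street and road abbreviations come from the description, and the pattern for building and suite designators is a guess at what such text looks like.

```python
import re

# Only these two pairs are named in the description; a real map would be longer.
ROAD_TYPES = {"STREET": "ST", "ROAD": "RD"}

# Assumed shape of unit designators, e.g. "SUITE 200", "STE #4B", "BLDG 7".
UNIT_PATTERN = re.compile(r"\b(?:SUITE|STE|BUILDING|BLDG)\b\.?\s*#?\s*\w+")


def normalize_address(address):
    """Upper-case, drop unit designators, and abbreviate road types."""
    text = address.upper()
    text = UNIT_PATTERN.sub("", text)
    # Keep only alphanumeric tokens, which also discards stray commas.
    tokens = re.findall(r"[A-Z0-9]+", text)
    return " ".join(ROAD_TYPES.get(token, token) for token in tokens)


print(normalize_address("123 Main Street, Suite 200"))
print(normalize_address("45 Oak Road Bldg 7"))
```

The word boundary after the designator keeps names such as "Stevens" from being mistaken for "STE" followed by a unit number.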

fuzzyexact

fuzzyexact.fuzzyexact(df_left, df_right, id_col='Source_ID', key=('first_name', 'address'), block1='state', block2='last_name', threshold=80)

FuzzyExact builds on FuzzyWuzzy's process.extractOne function, integrates it with pandas dataframes, and allows up to two exact matches (or blocks) to significantly speed up matching against large datasets. The function returns all rows from df_left along with the best match in df_right for each of those rows.

The id_col argument is optional and lets the user supply an id column from df_right to make lookups of matched records easier. The fuzzy match is performed against the key supplied by the user. block1 and block2 are optional arguments that specify columns which must match exactly between the two dataframes. The threshold defaults to 80, but can be changed by the user; it is passed as the cutoff that defines a "good" match in FuzzyWuzzy's process.extractOne function.
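
To show why exact-match blocks reduce the work, here is a minimal sketch of blocked fuzzy matching. It makes several assumptions that differ from the library: records are plain dictionaries rather than dataframes, difflib.SequenceMatcher from the standard library stands in for FuzzyWuzzy's process.extractOne so the sketch has no third-party dependencies, and the names similarity and blocked_best_match are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100 (stand-in for a FuzzyWuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def blocked_best_match(left, right, key, blocks, threshold=80):
    """For each left record, find the best right record within the same block."""
    # Index right-hand records by their exact-match (block) values so each
    # left record is compared only with candidates sharing those values.
    index = defaultdict(list)
    for rec in right:
        index[tuple(rec[b] for b in blocks)].append(rec)

    results = []
    for rec in left:
        query = " ".join(rec[k] for k in key)
        best, best_score = None, 0
        for cand in index.get(tuple(rec[b] for b in blocks), []):
            score = similarity(query, " ".join(cand[k] for k in key))
            if score >= threshold and score > best_score:
                best, best_score = cand, score
        # Every left record is kept; unmatched ones carry None.
        results.append((rec, best, best_score))
    return results


left = [
    {"first_name": "Jon", "last_name": "Smith", "state": "NY", "address": "12 Main St"},
    {"first_name": "Ann", "last_name": "Lee", "state": "CA", "address": "9 Oak Rd"},
]
right = [
    {"Source_ID": 1, "first_name": "John", "last_name": "Smith", "state": "NY", "address": "12 Main St"},
    {"Source_ID": 2, "first_name": "John", "last_name": "Smith", "state": "NJ", "address": "12 Main St"},
]

matches = blocked_best_match(
    left, right, key=("first_name", "address"), blocks=("state", "last_name")
)
for rec, best, score in matches:
    print(rec["first_name"], "->", best["Source_ID"] if best else None, score)
```

In this sketch the first left record is scored only against the right record that shares its state and last name, and the second left record has no candidates at all, so it is returned unmatched.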

Contact

Project: https://github.com/eric-tomasi/fuzzyexact
Email: etomasi2323@gmail.com

Acknowledgments

FuzzyWuzzy
