Customisable class to produce a matrix of match scores across multiple dimensions
About
MultiMatch | Multi-Dimensional Match Scores
How to use this package:
- Customise the built-in class to generate a matrix of match scores across multiple dimensions.
- Apply your own rules to the scores in the matrix as per your specific use-case to identify matches.
Key features:
- Methods to cleanse and standardise company names.
- A method to dedupe data sets and add new IDs.
- Functionality to match in subsets to optimise performance.
Dependencies
Required
Package | Version | License |
---|---|---|
pandas | 1.3.5 | BSD License (BSD 3-Clause License) |
numpy | 1.20.3 | BSD License |
rapidfuzz | 1.9.1 | MIT License (MIT) |
unidecode | 1.2.0 | GNU General Public License v2 or later (GPLv2+) |
Optional
Package | Version | License |
---|---|---|
networkx | TBC | BSD License |
Data Preparation
Instructions:
- Create one frame comprising the data to look up and one frame comprising the data to match it to.
- Ensure each frame has (1) a key which is (a) a string and (b) the index; and (2) a primary field on which to match.
- Ideally there should be secondary fields to increase the robustness of the matching.
- The two data sets should have exactly the same column names.
- To de-dupe a record set, the same frame can be used as both the lookup and match frames.
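The preparation steps above can be sketched with pandas as follows. The data, the column names and the `prepare` helper are hypothetical and only illustrate the required shape; dropping the key column once it becomes the index is an assumption, not something the package documents.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
lookup_frame = pd.DataFrame(
    {
        "company_id": ["L001", "L002"],
        "company_name": ["Acme Widgets Limited", "Northern Bakery PLC"],
        "address": ["1 High Street, Leeds", "22 Mill Lane, York"],
        "postcode": ["LS1 4AB", "YO1 7HH"],
    }
)
match_frame = pd.DataFrame(
    {
        "company_id": ["M001", "M002"],
        "company_name": ["ACME Widgets Ltd.", "Southern Bakery Ltd"],
        "address": ["1 High St, Leeds", "5 Dock Road, Hull"],
        "postcode": ["LS14AB", "HU1 1AA"],
    }
)


def prepare(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Cast the key to string and make it the index (hypothetical helper)."""
    out = frame.copy()
    out[key] = out[key].astype(str)
    return out.set_index(key)


lookup_frame = prepare(lookup_frame, "company_id")
match_frame = prepare(match_frame, "company_id")

# Both frames should expose exactly the same column names.
assert list(lookup_frame.columns) == list(match_frame.columns)
```

Here `company_name` would be the primary field, with `address` and `postcode` available as secondary fields.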
Build a Custom Class
Initialise the class
Positional arguments:
- lookup_frame
- lookup_key
- lookup_name
- match_frame
- match_key
- match_name
- threshold
Optional arguments:
- primary_field_type
- company_name: this will do company name specific cleansing and STANDARDISE legal text.
- company_name_stripped: this will do company name specific cleansing and REMOVE legal text.
- address: can be used to cleanse first line of address and city.
- domain_stripped: this will remove the suffixes at the end of the domains.
- match_function
- fuzz.ratio: fuzzy matching using Levenshtein Distance (the default)
- fuzz.token_set_ratio: set intersection match score
- set_intersection_match_score: alternative to the above
- remove_spaces: True or False (useful for postcodes)
- log_path: self explanatory
Example code below.
# import objects
from matchmatrix import MatchManager
# optional: create a company suffixes dictionary to integrate with default one
csd = {'ltd': ['ltd', 'ltd.', 'limited', 'l.i.m.i.t.e.d.', 'limmyted']
, 'plc': ['plc', 'plc.', 'public limited company', 'pee ell see']
, 'gmbh': ['gmbh', 'gmbh.']}
# create class
MyMatch = MatchManager()
# build class
MyMatch.build_class(
# positional arguments
lookup_frame, lookup_key, lookup_name
, match_frame, match_key, match_name
, 70
# optional arguments
, primary_field_type='company_name'
# , match_function = 'fuzz.ratio'
# , remove_spaces = False
, log_path=log_path
)
# optional: integrate company suffixes dictionary with default one (if defined above)
# set replace=True to replace the default one completely
MyMatch.integrate_company_suffixes_dictionary(csd)
# if necessary re-instate the default dictionary
# MyMatch.reset_company_dictionary()
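To give a feel for how the match_function options differ, here is a minimal pure-Python sketch of an edit-distance score and a token-overlap score. The function names are hypothetical, and the formulas are simplified stand-ins: they are not the rapidfuzz implementations and will not reproduce the scores the package returns.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Scale the edit distance to a 0-100 similarity score (illustrative only)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def token_overlap(a: str, b: str) -> float:
    """Share of distinct tokens common to both strings, scaled to 0-100."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 100.0
    return 100.0 * len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

A character-based score is sensitive to word order, whereas a token-based score is not: "acme widgets ltd" and "widgets acme ltd" get full marks from `token_overlap` but not from `edit_similarity`.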
Optional (but recommended): define subsetting fields.
This performs the matching within subsets, e.g. by county (Yorkshire to Yorkshire only), which SIGNIFICANTLY speeds up processing time.
MyMatch.subset_fields = ('county_name', 'county_name')
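The reason subsetting helps can be shown with a small pure-Python sketch: only records sharing the same subset value are compared, so far fewer pairs need scoring. The data and the `candidate_pairs` helper are hypothetical; this illustrates the general idea rather than the package's internals.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (key, county) records.
lookup = [("L1", "Yorkshire"), ("L2", "Yorkshire"), ("L3", "Kent")]
match = [("M1", "Yorkshire"), ("M2", "Kent"), ("M3", "Kent"), ("M4", "Devon")]


def candidate_pairs(lookup_rows, match_rows):
    """Yield only the key pairs whose subset value (here, county) is equal."""
    by_subset = defaultdict(list)
    for key, subset in match_rows:
        by_subset[subset].append(key)
    for key, subset in lookup_rows:
        for match_key in by_subset.get(subset, []):
            yield key, match_key


# Without subsetting every lookup record is compared with every match record.
all_pairs = list(product([k for k, _ in lookup], [k for k, _ in match]))
# With subsetting only same-county pairs remain.
blocked_pairs = list(candidate_pairs(lookup, match))

print(len(all_pairs), len(blocked_pairs))
```

The trade-off is that records whose subset values disagree (for example, a mistyped county) are never compared at all.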
Optional (but recommended): cleanse the primary matching fields
MyMatch.cleanse_primary_fields()
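As a rough illustration of the difference between the company_name and company_name_stripped options, the sketch below standardises or removes a trailing legal suffix using a dictionary in the same shape as csd. The `cleanse_company_name` function and its rules are assumptions made for illustration; the package's own cleansing is likely more thorough.

```python
import re

# Hypothetical suffix dictionary: standard form -> list of variants.
suffixes = {
    "ltd": ["ltd", "ltd.", "limited"],
    "plc": ["plc", "plc.", "public limited company"],
}


def cleanse_company_name(name: str, strip_legal_text: bool = False) -> str:
    """Lower-case the name, then standardise or strip a trailing legal suffix."""
    cleaned = " ".join(name.lower().split())
    # Check longer variants first so the most specific suffix wins.
    variants = sorted(
        ((variant, standard) for standard, group in suffixes.items() for variant in group),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for variant, standard in variants:
        if cleaned == variant or cleaned.endswith(" " + variant):
            stem = cleaned[: len(cleaned) - len(variant)].rstrip()
            cleaned = stem if strip_legal_text else f"{stem} {standard}".strip()
            break
    # Replace remaining punctuation with spaces and collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", cleaned)
    return " ".join(cleaned.split())
```

With this sketch, "Acme Widgets Limited" and "ACME Widgets Ltd." both become "acme widgets ltd", or "acme widgets" when `strip_legal_text=True`.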
Get the primary matches
MyMatch.get_primary_matches()
Have a look at the results
initial_results = MyMatch.initial_results
Optional (but recommended): add secondary matches to the class
Positional arguments:
- lookup_name: name of column to match from in lookup frame.
- match_name: name of column to match to in match frame.
Optional arguments:
- field_type: same options as primary_field_type above.
- remove_spaces: as defined above.
- match_function: as defined above.
- duplicate_suffix: text to add to a duplicate column e.g. '_stripped'.
- cleanse_data: True or False.
- secondary_match_name: a name to refer to later e.g. if doing additional join matches.
Example code below.
MyMatch.clear_secondary_matches()
MyMatch.add_secondary_match('address', 'address', field_type='address', cleanse_data=True, secondary_match_name='address')
MyMatch.add_secondary_match('postcode', 'postcode', field_type='address', remove_spaces=True, cleanse_data=True, secondary_match_name='postcode')
Optional: get additional primary matches by joining a pair of secondary match fields
Positional arguments:
- secondary_match_name: the name defined above
Optional arguments:
- explode_column: True or False (if a column contains multiple values separated by a comma).
Example code below.
# get additional join matches
MyMatch.get_additional_join_matches('postcode')
Optional: create and integrate a separate class e.g. a primary match on address.
# to be developed
Optional: Get secondary matches (if defined above)
MyMatch.get_secondary_matches()
Have a look at the results
initial_results = MyMatch.initial_results
Identify matches
Instructions:
- Apply a series of rules across the different scores in MyMatch.initial_results to identify matches.
- Create the frame MyMatch.matched_results in the class comprising matches ONLY.
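One way to apply such rules is with boolean masks in pandas. In the sketch below the frame is a stand-in for MyMatch.initial_results; its column names (`name_score`, `address_score`, `postcode_score`) and the thresholds are hypothetical, so inspect the real frame and choose rules to suit your use-case.

```python
import pandas as pd

# Stand-in for MyMatch.initial_results; real column names may differ.
initial_results = pd.DataFrame(
    {
        "lookup_key": ["L001", "L001", "L002"],
        "match_key": ["M001", "M002", "M002"],
        "name_score": [96.0, 72.0, 81.0],
        "address_score": [88.0, 35.0, 40.0],
        "postcode_score": [100.0, 0.0, 100.0],
    }
)

# Rule 1: a very strong name score is enough on its own.
strong_name = initial_results["name_score"] >= 95
# Rule 2: a good name score backed by an exact postcode match.
name_and_postcode = (initial_results["name_score"] >= 80) & (
    initial_results["postcode_score"] == 100
)
# Rule 3: a moderate name score backed by a strong address score.
name_and_address = (initial_results["name_score"] >= 70) & (
    initial_results["address_score"] >= 85
)

# Keep only the rows that satisfy at least one rule.
matched_results = initial_results[
    strong_name | name_and_postcode | name_and_address
].copy()

# Then store the result on the class, as described above:
# MyMatch.matched_results = matched_results
```

Keeping each rule as a named mask makes it easy to audit which rule admitted a given pair.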
Dedupe matched results and add match ids
# add deduped ids to original keys
MyMatch.create_deduped_match_ids()
# get xref frame that maps original keys to deduped ids
final_key_map = MyMatch.final_key_map
# get dimension table at a deduped id level
match_ref_data = MyMatch.match_ref_data
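Conceptually, assigning deduped ids means grouping every key that is linked to another by a chain of matches. The union-find sketch below shows that idea in pure Python; the `assign_deduped_ids` function and the id format are hypothetical and are not taken from the package, which may use a different approach.

```python
def assign_deduped_ids(pairs):
    """Group keys linked by any chain of matched pairs and give each group one id."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    # Merge the groups of every matched pair.
    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Number the groups in sorted key order so the ids are reproducible.
    group_ids = {}
    key_map = {}
    for key in sorted(parent):
        root = find(key)
        group_ids.setdefault(root, f"D{len(group_ids) + 1:03d}")
        key_map[key] = group_ids[root]
    return key_map


# Hypothetical matched pairs: L001-M001 and M001-L004 chain into one group.
pairs = [("L001", "M001"), ("M001", "L004"), ("L002", "M002")]
key_map = assign_deduped_ids(pairs)
print(key_map)
```

Note that chaining is transitive: L001 and L004 share an id even though they were never matched to each other directly.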