Customisable class to produce a matrix of match scores across multiple dimensions
multimatch | multi-dimensional match scores
About
How to use this package:
- Customise the built-in class to generate a matrix of match scores across multiple dimensions.
- Apply your own rules to the scores in the matrix as per your specific use-case to identify matches.
Key features:
- Methods to cleanse and standardise company names.
- A method to dedupe data sets and add new IDs.
- Functionality to match subsets to optimise performance.
Dependencies
Python 3.8 is required for this package.
Required Packages
Package | Version | License |
---|---|---|
pandas | 1.3.5 | BSD License (BSD 3-Clause License) |
numpy | 1.20.3 | BSD License |
rapidfuzz | 1.9.1 | MIT License (MIT) |
unidecode | 1.2.0 | GNU General Public License v2 or later (GPLv2+) |
Optional Packages
Package | Version | License |
---|---|---|
networkx | TBC | BSD License |
Data Preparation
Instructions:
- Create one frame comprising the data to lookup and one frame comprising the data to match it to.
- Ensure each frame has (1) a key which is (a) a string and (b) the index; and (2) a primary field on which to match.
- Ideally there should be secondary fields to increase the robustness of the matching.
- The two data sets should have exactly the same column names.
- To de-dupe a record set the same frame can be used as both the lookup and match frames.
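The requirements above can be sketched with pandas as follows. The column names (`company_id`, `company_name`, `post_code`), the helper `prepare` and the sample rows are illustrative assumptions, not part of the package. The README does not say whether the key should also remain as a regular column; here it is kept as the index only.

```python
import pandas as pd

# Hypothetical sample data: column names and values are illustrative only.
lookup_frame = pd.DataFrame(
    {
        "company_id": [101, 102],
        "company_name": ["Acme Limited", "Bolt Engineering PLC"],
        "post_code": ["LS1 4AP", "YO1 7HH"],
    }
)
match_frame = pd.DataFrame(
    {
        "company_id": [9001, 9002, 9003],
        "company_name": ["ACME Ltd.", "Bolt Engineering", "Crest GmbH"],
        "post_code": ["LS14AP", "YO1 7HH", "HU1 1AA"],
    }
)


def prepare(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return a copy with the key cast to string and set as the index."""
    prepared = frame.copy()
    prepared[key] = prepared[key].astype(str)
    return prepared.set_index(key)


lookup_frame = prepare(lookup_frame, "company_id")
match_frame = prepare(match_frame, "company_id")

# Both frames should expose exactly the same column names.
assert list(lookup_frame.columns) == list(match_frame.columns)
```

To de-dupe a single record set, the same prepared frame would be passed as both the lookup frame and the match frame.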
Build a Custom Class
Initialise the class
Positional arguments:
- lookup_frame
- lookup_key
- lookup_name
- match_frame
- match_key
- match_name
- threshold
Optional arguments:
- primary_field_type
- company_name: applies company-name-specific cleansing and STANDARDISES legal text.
- company_name_stripped: applies company-name-specific cleansing and REMOVES legal text.
- address: cleanses the first line of an address and the city.
- domain_stripped: removes the suffixes at the end of domains.
- match_function
- fuzz.ratio: fuzzy matching using Levenshtein distance (the default)
- fuzz.token_set_ratio: set intersection match score
- set_intersection_match_score: alternative to the above
- remove_spaces: True or False (useful for postcodes)
- log_path: path used for log output.
Example code below.
# import match manager class
from matchmatrix import MatchManager

# optional: create a company suffixes dictionary to integrate with the default one
csd = {
    'ltd': ['ltd', 'ltd.', 'limited', 'l.i.m.i.t.e.d.', 'limmyted'],
    'plc': ['plc', 'plc.', 'public limited company', 'pee ell see'],
    'gmbh': ['gmbh', 'gmbh.'],
}

# initialise class
MyMatch = MatchManager()

# build class
MyMatch.build_class(
    # positional arguments
    lookup_frame, lookup_key, lookup_name,
    match_frame, match_key, match_name,
    70,  # threshold
    # optional arguments
    primary_field_type='company_name',
    # match_function='fuzz.ratio',
    # remove_spaces=False,
    log_path=log_path,
)

# optional: integrate company suffixes dictionary with default one (if defined above)
# set replace=True to replace the default one completely
MyMatch.integrate_company_suffixes_dictionary(csd)
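As an illustration of what integrating a custom suffix dictionary could mean, the sketch below merges a user dictionary into a default one, or replaces it outright. The function `merge_suffixes` and the contents of `DEFAULT_SUFFIXES` are hypothetical; the package's own defaults and merge logic are not shown in this README and may differ.

```python
# Hypothetical default dictionary: standard form -> accepted variants.
DEFAULT_SUFFIXES = {
    "ltd": ["ltd", "limited"],
    "inc": ["inc", "incorporated"],
}


def merge_suffixes(default, custom, replace=False):
    """Combine two {standard_form: [variants]} dictionaries.

    With replace=True the custom dictionary is used on its own.
    Otherwise variants are unioned per key, keeping first-seen order.
    """
    if replace:
        return {key: list(variants) for key, variants in custom.items()}
    merged = {key: list(variants) for key, variants in default.items()}
    for key, variants in custom.items():
        existing = merged.setdefault(key, [])
        for variant in variants:
            if variant not in existing:
                existing.append(variant)
    return merged


csd = {"ltd": ["ltd", "ltd.", "limited"], "gmbh": ["gmbh", "gmbh."]}
combined = merge_suffixes(DEFAULT_SUFFIXES, csd)
print(combined["ltd"])   # ['ltd', 'limited', 'ltd.']
print(sorted(combined))  # ['gmbh', 'inc', 'ltd']
```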
Optional (but recommended): define subsetting fields.
This performs the matching within subsets, e.g. by county (Yorkshire to Yorkshire etc.), which SIGNIFICANTLY speeds up processing.
MyMatch.subset_fields = ('county_name', 'county_name')
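To illustrate why subsetting helps, the sketch below counts the candidate comparisons with and without grouping records by a shared field. It is plain Python with made-up records and a hypothetical helper `candidate_pairs`; it is not the package's implementation.

```python
from collections import defaultdict

# Made-up records: (key, company name, county).
lookup = [("L1", "acme ltd", "Yorkshire"), ("L2", "bolt plc", "Kent")]
match = [
    ("M1", "acme limited", "Yorkshire"),
    ("M2", "crest gmbh", "Yorkshire"),
    ("M3", "bolt plc", "Kent"),
    ("M4", "dune ltd", "Devon"),
]


def candidate_pairs(lookup_rows, match_rows, use_subsets):
    """Return the (lookup key, match key) pairs that would be scored."""
    if not use_subsets:
        return [(lk[0], mt[0]) for lk in lookup_rows for mt in match_rows]
    # Group the match rows by county so each lookup row only meets its own county.
    by_county = defaultdict(list)
    for row in match_rows:
        by_county[row[2]].append(row)
    return [
        (lk[0], mt[0])
        for lk in lookup_rows
        for mt in by_county.get(lk[2], [])
    ]


print(len(candidate_pairs(lookup, match, use_subsets=False)))  # 8
print(candidate_pairs(lookup, match, use_subsets=True))
# [('L1', 'M1'), ('L1', 'M2'), ('L2', 'M3')]
```

The trade-off is that records whose subset field differs, for example because of a mistyped county, are never compared with each other.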
Optional (but recommended): cleanse the primary matching fields
MyMatch.cleanse_primary_fields()
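The difference between the `company_name` and `company_name_stripped` field types can be pictured with the sketch below. The helper `cleanse_company_name` and its small suffix dictionary are hypothetical stand-ins; the package's actual cleansing rules are likely more extensive.

```python
import re

# Hypothetical suffix dictionary: standard form -> accepted variants.
SUFFIXES = {
    "ltd": ["ltd", "ltd.", "limited"],
    "plc": ["plc", "plc.", "public limited company"],
}


def cleanse_company_name(name, strip_legal_text=False):
    """Lower-case a name, then standardise or remove a trailing legal suffix."""
    cleaned = re.sub(r"\s+", " ", name.lower()).strip()
    # Try longer variants first so multi-word suffixes win over shorter ones.
    variants = sorted(
        ((variant, standard) for standard, vs in SUFFIXES.items() for variant in vs),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    for variant, standard in variants:
        if cleaned == variant or cleaned.endswith(" " + variant):
            stem = cleaned[: len(cleaned) - len(variant)].strip()
            cleaned = stem if strip_legal_text else f"{stem} {standard}".strip()
            break
    return cleaned


print(cleanse_company_name("Acme  Limited"))                     # acme ltd
print(cleanse_company_name("Acme Ltd.", strip_legal_text=True))  # acme
```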
Get the primary matches
MyMatch.get_primary_matches()
Have a look at the results
initial_results = MyMatch.initial_results
Optional (but recommended): add secondary matches to the class
Positional arguments:
- lookup_column: name of column to match from in lookup frame.
- match_column: name of column to match to in match frame.
Optional arguments:
- field_type: same options as primary_field_type above.
- remove_spaces: as defined above.
- match_function: as defined above.
- duplicate_suffix: text to add to a duplicate column e.g. '_stripped'.
- cleanse_data: True or False.
- secondary_match_name: a name to refer to later e.g. if doing additional join matches.
Example code below.
MyMatch.clear_secondary_matches()
MyMatch.add_secondary_match('first_line_of_address', 'first_line_of_address', field_type='address', cleanse_data=True, secondary_match_name='first_line_of_address')
MyMatch.add_secondary_match('post_code', 'post_code', field_type='address', remove_spaces=True, cleanse_data=True, secondary_match_name='post_code')
Optional: get additional primary matches by joining a pair of secondary match fields
Positional arguments:
- secondary_match_name: the name defined above
Optional arguments:
- explode_column: True or False (set to True if the column contains multiple comma-separated values).
Example code below.
# get additional join matches
MyMatch.get_additional_join_matches('post_code')
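One way to picture a join match: records that share an exact value in a secondary field (here the post code) become candidate pairs regardless of their primary score. The sketch below is plain Python with made-up data. The helpers `normalise` and `join_matches` are hypothetical, and the `explode` flag reflects an assumption about what `explode_column` does, namely splitting comma-separated values before joining.

```python
from collections import defaultdict

# Made-up data: key -> post code field (possibly several, comma-separated).
lookup_post_codes = {"L1": "LS1 4AP", "L2": "YO1 7HH, HU1 1AA"}
match_post_codes = {"M1": "ls14ap", "M2": "HU1 1AA", "M3": "CT1 2EH"}


def normalise(value):
    """Upper-case and drop spaces, similar in spirit to remove_spaces=True."""
    return value.upper().replace(" ", "")


def join_matches(lookup, match, explode=False):
    """Return sorted (lookup key, match key) pairs sharing a normalised value."""

    def values(raw):
        parts = raw.split(",") if explode else [raw]
        return {normalise(part) for part in parts if part.strip()}

    # Index the match side by value so the join is a dictionary lookup.
    index = defaultdict(set)
    for key, raw in match.items():
        for value in values(raw):
            index[value].add(key)
    pairs = set()
    for key, raw in lookup.items():
        for value in values(raw):
            pairs.update((key, match_key) for match_key in index.get(value, ()))
    return sorted(pairs)


print(join_matches(lookup_post_codes, match_post_codes))
# [('L1', 'M1')]
print(join_matches(lookup_post_codes, match_post_codes, explode=True))
# [('L1', 'M1'), ('L2', 'M2')]
```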
Optional: create and integrate a separate class e.g. a primary match on first_line_of_address.
# to be developed
Optional: Get secondary matches (if defined above)
MyMatch.get_secondary_matches()
Have a look at the results
initial_results = MyMatch.initial_results
Identify matches
Instructions:
- Apply a series of rules across the different scores in MyMatch.initial_results to identify matches.
- Create the frame MyMatch.matched_results in the class comprising matches ONLY.
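A rule set might look like the sketch below. The layout of `initial_results` is not documented in this README, so the stand-in frame and its column names (`primary_score`, `post_code_score`, `address_score`) are assumptions chosen for illustration, as are the thresholds.

```python
import pandas as pd

# Stand-in for MyMatch.initial_results; the real column names may differ.
initial_results = pd.DataFrame(
    {
        "lookup_key": ["L1", "L1", "L2", "L3"],
        "match_key": ["M1", "M2", "M3", "M4"],
        "primary_score": [96.0, 74.0, 82.0, 71.0],
        "post_code_score": [100.0, 40.0, 100.0, 55.0],
        "address_score": [88.0, 35.0, 60.0, 90.0],
    }
)

# Rule 1: a very strong primary score is accepted on its own.
strong_primary = initial_results["primary_score"] >= 95
# Rule 2: a moderate primary score needs an exact post code match.
backed_by_post_code = (initial_results["primary_score"] >= 80) & (
    initial_results["post_code_score"] == 100
)

# Keep only the rows that satisfy at least one rule.
matched_results = initial_results[strong_primary | backed_by_post_code].copy()
print(matched_results[["lookup_key", "match_key"]].values.tolist())
# [['L1', 'M1'], ['L2', 'M3']]
```

In real use the filtered frame would presumably be assigned to `MyMatch.matched_results` before the dedupe step.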
Dedupe matched results and add match ids
# add deduped ids to original keys
MyMatch.create_deduped_match_ids()
# get xref frame that maps original keys to deduped ids
final_key_map = MyMatch.final_key_map
# get dimension table at a deduped id level
match_ref_data = MyMatch.match_ref_data
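Conceptually, deduping matched pairs amounts to grouping keys that are linked directly or indirectly and giving each group a single id. The sketch below does this with a small union-find in plain Python. It illustrates the idea only; the function `assign_deduped_ids` and the `dedup_N` id format are made up and are not the package's algorithm.

```python
def assign_deduped_ids(pairs):
    """Map every key appearing in `pairs` to an id shared by its linked group."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    # Union the two sides of every matched pair.
    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    # Number the groups in sorted key order so the output is deterministic.
    ids = {}
    key_map = {}
    for key in sorted(parent):
        root = find(key)
        if root not in ids:
            ids[root] = f"dedup_{len(ids) + 1}"
        key_map[key] = ids[root]
    return key_map


matched_pairs = [("A1", "B7"), ("B7", "C3"), ("D4", "E9")]
print(assign_deduped_ids(matched_pairs))
# {'A1': 'dedup_1', 'B7': 'dedup_1', 'C3': 'dedup_1', 'D4': 'dedup_2', 'E9': 'dedup_2'}
```

Here A1 and C3 end up in the same group even though they were never matched directly, because both are linked to B7.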