String grouper contains functions to do string matching using TF-IDF and the cossine similarity. Based on https://bergvca.github.io/2017/10/14/super-fast-string-matching.html
Project description
String Grouper
string_grouper is a library that makes finding groups of similar strings within a single, or multiple, lists of strings easy — and fast. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog Super Fast String Matching in Python.
The library contains three high level functions that can be used directly, and one class that allows for a more iterative approach. The three functions are:
-
match_strings(master: pd.Series, duplicates: Optional[pd.Series] = None, master_id: Optional[pd.Series] = None, duplicates_id: Optional[pd.Series] = None, **kwargs) -> pd.DataFrame: Returns all highly similar strings. If only the
master
parameter is given an argument, it will return highly similar strings within master. This can be seen as a self-join. If bothmaster
andduplicates
are given, it will return highly similar strings between master and duplicates. This can be seen as an inner-join.The function also supports optionally supplying IDs for the rows of the text values being matched. If an argument is given to the
master_id
parameter the value from its column (master_id) in the same row as the value in the master column will be returned. If bothmaster_id
andduplicates_id
are given, the respective values from their columns in the same row of the relevant master and duplicates values will be returned. -
match_most_similar(master: pd.Series, duplicates: pd.Series, **kwargs) -> pd.Series: Returns a series of strings of the same length as the parameter
duplicates
where for each string in duplicates the most similar string inmaster
is returned. If there are no similar strings in master for a given string in duplicates (there is no potential match where the cosine similarity is above the threshold (default: 0.8)) the original string in duplicates is returned.For example if the input series
[foooo, bar, baz]
is passed as the argument tomaster
, and[foooob, bar, new]
as the argument toduplicates
, the function will return:[foooo, bar, new]
-
group_similar_strings(strings_to_group: pandas.Series, **kwargs) -> pandas.Series: Takes a single series of strings and groups these together by picking a single string in each group of similar strings, and return this as output.
For example the input series:
[foooo, foooob, bar]
will return[foooo, foooo, bar]
. Herefoooo
andfoooob
are grouped together into groupfoooo
because they are found to be similar.
All functions are built using a class StringGrouper. This class can be used through pre-defined functions, for example the three high level functions above, as well as using a more iterative approach where matches can be added or removed if needed by calling the StringGrouper
class directly.
kwargs
All keyword arguments not mentioned in the function definition are used to update the default settings. The following optional arguments can be used:
- ngram_size: The amount of characters in each n-gram. Optional. Default is
3
- regex: The regex string used to cleanup the input string. Optional. Default is
[,-./]|\s
- max_n_matches: The maximum number of matches allowed per string. Default is
20
. - min_similarity: The minimum cosine similarity for two strings to be considered a match.
Defaults to
0.8
- number_of_processes: The number of processes used by the cosine similarity calculation. Defaults to
number of cores on a machine - 1.
Installing
pip install string-grouper
Examples
In this section we will cover a few use cases for which string_grouper may be used. We will use the same dataset of company names as used in: Super Fast String Matching in Python.
Find all matches within a single dataset
import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper
company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches.left_side != matches.right_side].head()
left_side | right_side | similarity | |
---|---|---|---|
15 | 0210, LLC | 90210 LLC | 0.870291 |
167 | 1 800 MUTUALS ADVISOR SERIES | 1 800 MUTUALS ADVISORS SERIES | 0.931616 |
169 | 1 800 MUTUALS ADVISORS SERIES | 1 800 MUTUALS ADVISOR SERIES | 0.931616 |
171 | 1 800 RADIATOR FRANCHISE INC | 1-800-RADIATOR FRANCHISE INC. | 1.000000 |
178 | 1 FINANCIAL MARKETPLACE SECURITIES LLC ... | 1 FINANCIAL MARKETPLACE SECURITIES, LLC | 0.949364 |
Find all matches in between two datasets.
The match_strings
function finds similar items between two datasets as well. This can be seen as an inner join between two datasets:
# Create a small set of artificial company names
duplicates = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_strings(companies['Company Name'], duplicates)
matches
left_side | right_side | similarity | |
---|---|---|---|
0 | 012 SMILE.COMMUNICATIONS LTD | 012 SMILE.COMMUNICATIONS | 0.944092 |
1 | B.A.S. MEDIA GROUP | S MEDIA GROUP | 0.854383 |
2 | B4UTRADE COM CORP | B4UTRADE COM CORP | 1.000000 |
3 | B4UTRADE COM INC | B4UTRADE COM CORP | 0.810217 |
4 | B4UTRADE CORP | B4UTRADE COM CORP | 0.878276 |
Out of the 4 company names in duplicates
, 3 companies are found in the original company dataset. One company is found 3 times.
Finding duplicates from a (database extract to) pandas DataFrame where IDs for rows are supplied.
A very common scenario is the case where duplicate records for an entity have been entered into a database. That is, there are two or more records where a name field has slightly different spelling. For example, "A.B. Corporation" and "AB Corporation". Using the optional 'ID' parameter in the match_strings function duplicates can be found easily. A tutorial that steps though the process with an example data set is available.
For a second dataset, find only the most similar match
In the example above, it's possible that multiple matches are found for a single string. Sometimes we just want a string to match with a single most similar string. If there are no similar strings found, the original string should be returned:
# Create a small set of artificial company names
new_companies = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_most_similar(companies['Company Name'], new_companies)
# Display the results:
pd.DataFrame({'new_companies': new_companies, 'duplicates': matches})
new_companies | duplicates | |
---|---|---|
0 | S MEDIA GROUP | B.A.S. MEDIA GROUP |
1 | 012 SMILE.COMMUNICATIONS | 012 SMILE.COMMUNICATIONS LTD |
2 | foo bar | foo bar |
3 | B4UTRADE COM CORP | B4UTRADE COM CORP |
Deduplicate a single dataset and show items with most duplicates
The group_similar_strings
function groups strings that are similar using a single linkage clustering algorithm. That is, if item A and item B are similar, and item B and item C are similar, but the similarity between A and C is below the threshold, all three items are grouped together.
# Add the grouped strings
companies['deduplicated_name'] = group_similar_strings(companies['Company Name'])
# Show items with most duplicates
companies.groupby('deduplicated_name').count().sort_values('Line Number', ascending=False).head(10)['Line Number']
deduplicated_name
ADVISORS DISCIPLINED TRUST 1100 188
ACE SECURITIES CORP HOME EQUITY LOAN TRUST 2005-HE4 32
AMERCREDIT AUTOMOBILE RECEIVABLES TRUST 2010-1 28
ADVENT LATIN AMERICAN PRIVATE EQUITY FUND II-A CV 25
ALLSTATE LIFE GLOBAL FUNDING TRUST 2004-1 24
ADVENT INTERNATIONAL GPE VII LIMITED PARTNERSHIP 24
7ADVISORS DISCIPLINED TRUST 1197 23
AMERICREDIT AUTOMOBILE RECEIVABLES TRUST 2002 - D 23
ALLY AUTO RECEIVABLES TRUST 2010-1 23
ANDERSON DAVID A 23
Name: Line Number, dtype: int64
The StringGrouper class
The three functions mentioned above all create a StringGrouper
object behind the scenes and call different functions on it. The StringGrouper
class keeps track of all tuples of similar strings and creates the groups out of these. Since matches are often not perfect, a common workflow is to:
- Create matches
- Manually inspect the results
- Add and remove matches where necessary
- Create groups of similar strings
The StringGrouper
class allows for this without having to re-calculate the cosine similarity matrix. See below for an example.
company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example
companies = pd.read_csv(company_names)
- Create matches
# Create a new StringGrouper
string_grouper = StringGrouper(companies['Company Name'])
# Check if the ngram function does what we expect:
string_grouper.n_grams('McDonalds')
['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
# Now fit the StringGrouper - this will take a while since we are calculating cosine similarities on 600k strings
string_grouper = string_grouper.fit()
# Add the grouped strings
companies['deduplicated_name'] = string_grouper.get_groups()
Suppose we know that PWC HOLDING CORP and PRICEWATERHOUSECOOPERS LLP are the same company. StringGrouper will not match these since they are not similar enough.
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number | Company Name | Company CIK Key | deduplicated_name | |
---|---|---|---|---|
478441 | 478442 | PRICEWATERHOUSECOOPERS LLP /TA | 1064284 | PRICEWATERHOUSECOOPERS LLP /TA |
478442 | 478443 | PRICEWATERHOUSECOOPERS LLP | 1186612 | PRICEWATERHOUSECOOPERS LLP /TA |
478443 | 478444 | PRICEWATERHOUSECOOPERS SECURITIES LLC | 1018444 | PRICEWATERHOUSECOOPERS LLP /TA |
companies[companies.deduplicated_name.str.contains('PWC')]
Line Number | Company Name | Company CIK Key | deduplicated_name | |
---|---|---|---|---|
485535 | 485536 | PWC CAPITAL INC. | 1690640 | PWC CAPITAL INC. |
485536 | 485537 | PWC HOLDING CORP | 1456450 | PWC HOLDING CORP |
485537 | 485538 | PWC INVESTORS, LLC | 1480311 | PWC INVESTORS, LLC |
485538 | 485539 | PWC REAL ESTATE VALUE FUND I LLC | 1668928 | PWC REAL ESTATE VALUE FUND I LLC |
485539 | 485540 | PWC SECURITIES CORP /BD | 1023989 | PWC SECURITIES CORP /BD |
485540 | 485541 | PWC SECURITIES CORPORATION | 1023989 | PWC SECURITIES CORPORATION |
485541 | 485542 | PWCC LTD | 1172241 | PWCC LTD |
485542 | 485543 | PWCG BROKERAGE, INC. | 67301 | PWCG BROKERAGE, INC. |
We can add these with the add function:
string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now lets check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number | Company Name | Company CIK Key | deduplicated_name | |
---|---|---|---|---|
478441 | 478442 | PRICEWATERHOUSECOOPERS LLP /TA | 1064284 | PRICEWATERHOUSECOOPERS LLP /TA |
478442 | 478443 | PRICEWATERHOUSECOOPERS LLP | 1186612 | PRICEWATERHOUSECOOPERS LLP /TA |
478443 | 478444 | PRICEWATERHOUSECOOPERS SECURITIES LLC | 1018444 | PRICEWATERHOUSECOOPERS LLP /TA |
485536 | 485537 | PWC HOLDING CORP | 1456450 | PRICEWATERHOUSECOOPERS LLP /TA |
This can also be used to merge two groups:
string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now lets check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number | Company Name | Company CIK Key | deduplicated_name | |
---|---|---|---|---|
478441 | 478442 | PRICEWATERHOUSECOOPERS LLP /TA | 1064284 | PRICEWATERHOUSECOOPERS LLP /TA |
478442 | 478443 | PRICEWATERHOUSECOOPERS LLP | 1186612 | PRICEWATERHOUSECOOPERS LLP /TA |
478443 | 478444 | PRICEWATERHOUSECOOPERS SECURITIES LLC | 1018444 | PRICEWATERHOUSECOOPERS LLP /TA |
485536 | 485537 | PWC HOLDING CORP | 1456450 | PRICEWATERHOUSECOOPERS LLP /TA |
662585 | 662586 | ZUCKER MICHAEL | 1629018 | PRICEWATERHOUSECOOPERS LLP /TA |
662604 | 662605 | ZUCKERMAN MICHAEL | 1303321 | PRICEWATERHOUSECOOPERS LLP /TA |
662605 | 662606 | ZUCKERMAN MICHAEL | 1496366 | PRICEWATERHOUSECOOPERS LLP /TA |
We can remove strings from groups in the same way:
string_grouper = string_grouper.remove_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now lets check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
Line Number | Company Name | Company CIK Key | deduplicated_name | |
---|---|---|---|---|
478441 | 478442 | PRICEWATERHOUSECOOPERS LLP /TA | 1064284 | PRICEWATERHOUSECOOPERS LLP /TA |
478442 | 478443 | PRICEWATERHOUSECOOPERS LLP | 1186612 | PRICEWATERHOUSECOOPERS LLP /TA |
478443 | 478444 | PRICEWATERHOUSECOOPERS SECURITIES LLC | 1018444 | PRICEWATERHOUSECOOPERS LLP /TA |
485536 | 485537 | PWC HOLDING CORP | 1456450 | PRICEWATERHOUSECOOPERS LLP /TA |
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Hashes for string_grouper-0.3.2-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 70b6ad543544e0eb068f5ffd1413bd76537f6d18eac4b3f466e1032f140e4695 |
|
MD5 | c228854301781f08ddcdf7d6d5d22b35 |
|
BLAKE2b-256 | 202841528826bf7a9d89e2c720c489f25887725270d1c5e8164ad6aa3a290dab |