Skip to main content

String grouper contains functions to do string matching using TF-IDF and the cossine similarity. Based on https://bergvca.github.io/2017/10/14/super-fast-string-matching.html

Project description

string_grouper

string_grouper is a library that makes finding groups of similar strings within a single or within multiple lists of strings easy. string_grouper uses tf-idf to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog Super Fast String Matching in Python.

The library contains 3 high level functions that can be used directly, and 1 class that allows for a more iterative approach. The three functions are:

  • match_strings(master: pd.Series, duplicates: Optional[pd.Series] = None, **kwargs) -> pd.DataFrame: Returns all highly similar strings. If only 'master' is given, it will return highly similar strings within master. This can be seen as an self-join. If both master and duplicates is given, it will return highly similar strings between master and duplicates. This can be seen as an inner-join.

  • match_most_similar(master: pd.Series, duplicates: pd.Series, **kwargs) -> pd.Series: Returns a series of strings of the same length as 'duplicates' where for each string in duplicates the most similar string in 'master' is returned. If there are no similar strings in master for a given string in duplicates (there is no potential match where the cosine similarity is above the threshold (default: 0.8)) the original string in duplicates is returned.

    For example the input series [foooo, bar, baz] (master) and [foooob, bar, new] will return: [foooo, bar, new]

  • group_similar_strings(strings_to_group: pandas.Series, **kwargs) -> pandas.Series: Takes a single series of strings and groups these together by picking a single string in each group of similar strings, and return this as output.

    For example the input series: [foooo, foooob, bar] will return [foooo, foooo, bar]. Here foooo and foooob are grouped together into group foooo because they are found to be similar.

All functions are build using a class StringGrouper. This class can be used directly as well to allow for a more an more iterative approach where matches can be added or removed if needed.

kwargs

All keyword arguments not mentioned in the function definition are used to update the default settings. The following optional arguments can be used:

  • ngram_size: The amount of characters in each n-gram. Optional. Default is 3
  • regex: The regex string used to cleanup the input string. Optional. Default is [,-./]|\s
  • max_n_matches: The maximum number of matches allowed per string. Default is 20.
  • min_similarity: The minium cossine similarity for two strings to be considered a match. Defaults to 0.8
  • number_of_processes: The number of processes used by the cosine similarity calculation. Defaults to 1 - number of cores on a machine.

Installing

todo

Examples

In this section we will cover a few use cases for which string_grouper may be used. We will use the same dataset of company names as used in: Super Fast String Matching in Python.

Find all matches within a single dataset

import pandas as pd
import numpy as np
from string_grouper import match_strings, match_most_similar, group_similar_strings, StringGrouper
company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches.left_side != matches.right_side].head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
left_side right_side similarity
15 0210, LLC 90210 LLC 0.870291
167 1 800 MUTUALS ADVISOR SERIES 1 800 MUTUALS ADVISORS SERIES 0.931616
169 1 800 MUTUALS ADVISORS SERIES 1 800 MUTUALS ADVISOR SERIES 0.931616
171 1 800 RADIATOR FRANCHISE INC 1-800-RADIATOR FRANCHISE INC. 1.000000
178 1 FINANCIAL MARKETPLACE SECURITIES LLC ... 1 FINANCIAL MARKETPLACE SECURITIES, LLC 0.949364

Find all matches in between two datasets.

The match_string function allows to find similar items between two datasets as well. This can be seen as an inner join between two datasets:

# Create a small set of artifical company names
duplicates = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_strings(companies['Company Name'], duplicates)
matches
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
left_side right_side similarity
0 012 SMILE.COMMUNICATIONS LTD 012 SMILE.COMMUNICATIONS 0.944092
1 B.A.S. MEDIA GROUP S MEDIA GROUP 0.854383
2 B4UTRADE COM CORP B4UTRADE COM CORP 1.000000
3 B4UTRADE COM INC B4UTRADE COM CORP 0.810217
4 B4UTRADE CORP B4UTRADE COM CORP 0.878276

Out ouf the 4 company names in duplicates, 3 companies are found in the original company dataset. One company is found 3 times.

For a second dataset, find only the most similar match

In the example above, it's possible that multiple matches are found for a single string. Sometimes we just want a string to match with a single most similar string. If there are no similar strings found, the original string should be returned:

# Create a small set of artifical company names
new_companies = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_most_similar(companies['Company Name'], new_companies)
# Display the results:
pd.DataFrame({'new_companies': new_companies, 'duplicates': matches})
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
new_companies duplicates
0 S MEDIA GROUP B.A.S. MEDIA GROUP
1 012 SMILE.COMMUNICATIONS 012 SMILE.COMMUNICATIONS LTD
2 foo bar foo bar
3 B4UTRADE COM CORP B4UTRADE COM CORP

Deduplicate a single dataset and show items with most duplicates

The group_similar_strings functions groups strings that are similar using a single linkage clustering algorithm. That is, if item A and item B are similar, and item B and item C are similar but the similarity between A and C is below the threshold, all three items are grouped together.

# Add the grouped strings
companies['deduplicated_name'] = group_similar_strings(companies['Company Name'])
# Show items with most duplicates
companies.groupby('deduplicated_name').count().sort_values('Line Number', ascending=False).head(10)['Line Number']
deduplicated_name
ADVISORS DISCIPLINED TRUST 1100                        188
ACE SECURITIES CORP HOME EQUITY LOAN TRUST 2005-HE4     32
AMERCREDIT AUTOMOBILE RECEIVABLES TRUST 2010-1          28
ADVENT LATIN AMERICAN PRIVATE EQUITY FUND II-A CV       25
ALLSTATE LIFE GLOBAL FUNDING TRUST 2004-1               24
ADVENT INTERNATIONAL GPE VII LIMITED PARTNERSHIP        24
7ADVISORS DISCIPLINED TRUST 1197                        23
AMERICREDIT AUTOMOBILE RECEIVABLES TRUST  2002 - D      23
ALLY AUTO RECEIVABLES TRUST 2010-1                      23
ANDERSON DAVID  A                                       23
Name: Line Number, dtype: int64

The StringGrouper class

The three functions mentioned above all create a StringGrouper object behind the scenes and call different functions on it. The StringGrouper class keeps track of all tuples of similar strings and creates the groups out of these. Since matches are often not perfect, a common workflow is to:

  1. Create matches
  2. Manually inspect the results
  3. Add and remove matches were neccesary
  4. Create groups of similar Strings

The StringGrouper allows for this without having to re-calculate the cosine similarity matrix. See below for an example.

company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example
companies = pd.read_csv(company_names)
  1. Create matches
# Create a new StringGrouper
string_grouper = StringGrouper(companies['Company Name'])
# Check if the ngram function does what we expect:
string_grouper.n_grams('McDonalds')
['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']
# Now fit the StringGrouper - this will take a while since we are calculating cosine similarities on 600k strings
string_grouper = string_grouper.fit()
# Add the grouped strings
companies['deduplicated_name'] = string_grouper.get_groups()
pd.set_option('display.max_colwidth', -1)

Suppose we know that PWC HOLDING CORP and PRICEWATERHOUSECOOPERS LLP are the same company. The StringGrouper will not match these, since they are not similar enough.

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
companies[companies.deduplicated_name.str.contains('PWC')]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Line Number Company Name Company CIK Key deduplicated_name
485535 485536 PWC CAPITAL INC. 1690640 PWC CAPITAL INC.
485536 485537 PWC HOLDING CORP 1456450 PWC HOLDING CORP
485537 485538 PWC INVESTORS, LLC 1480311 PWC INVESTORS, LLC
485538 485539 PWC REAL ESTATE VALUE FUND I LLC 1668928 PWC REAL ESTATE VALUE FUND I LLC
485539 485540 PWC SECURITIES CORP /BD 1023989 PWC SECURITIES CORP /BD
485540 485541 PWC SECURITIES CORPORATION 1023989 PWC SECURITIES CORPORATION
485541 485542 PWCC LTD 1172241 PWCC LTD
485542 485543 PWCG BROKERAGE, INC. 67301 PWCG BROKERAGE, INC.

We can add these with the add function:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now lets check again:

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA

This can also be used to merge two groups:

string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now lets check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA
662585 662586 ZUCKER MICHAEL 1629018 PRICEWATERHOUSECOOPERS LLP /TA
662604 662605 ZUCKERMAN MICHAEL 1303321 PRICEWATERHOUSECOOPERS LLP /TA
662605 662606 ZUCKERMAN MICHAEL 1496366 PRICEWATERHOUSECOOPERS LLP /TA

We can remove strings from groups in the same way:

string_grouper = string_grouper.remove_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now lets check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Line Number Company Name Company CIK Key deduplicated_name
478441 478442 PRICEWATERHOUSECOOPERS LLP /TA 1064284 PRICEWATERHOUSECOOPERS LLP /TA
478442 478443 PRICEWATERHOUSECOOPERS LLP 1186612 PRICEWATERHOUSECOOPERS LLP /TA
478443 478444 PRICEWATERHOUSECOOPERS SECURITIES LLC 1018444 PRICEWATERHOUSECOOPERS LLP /TA
485536 485537 PWC HOLDING CORP 1456450 PRICEWATERHOUSECOOPERS LLP /TA

Project details


Release history Release notifications

This version

0.1.0

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for string-grouper, version 0.1.0
Filename, size File type Python version Upload date Hashes
Filename, size string_grouper-0.1.0-py3-none-any.whl (11.3 kB) File type Wheel Python version py3 Upload date Hashes View hashes
Filename, size string_grouper-0.1.0.tar.gz (14.2 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN DigiCert DigiCert EV certificate StatusPage StatusPage Status page