Row Equivalence Discoverer (red) based on string_grouper. This package finds similarities between rows of a table.
Project description
Red String Grouper
Record Equivalence Discoverer based on
String Grouper
(Red String Grouper) is a python package that finds similarities between
rows/records of a table with multiple fields.
It is an extension of
String Grouper, a library that
makes finding groups of similar strings within one or two lists of strings
easy — and fast.
Usage
How do we achieve matching for multiple fields and priorities?
Import the function record_linkage()
from red_string_grouper
and specify
the fields of the table over which the comparison will be made.
import pandas as pd
import numpy as np
from red_string_grouper import record_linkage
matches = record_linkage(data_frame, fields_2b_matched_fuzzily,
fields_2b_matched_exactly=None,
hierarchical=True, max_n_matches=None,
similarity_dtype=np.float32, force_symmetries=True,
n_blocks=None)
This is a function that combines similarity-matching results of several fields of a
DataFrame (data_frame
) and returns them in another DataFrame (matches
).
Parameter | Status | Description |
---|---|---|
data_frame |
Required | pandas.DataFrame of strings which is the table over which the comparisons will be made. |
fields_2b_matched_fuzzily |
Required | List of tuples. Each tuple is a quadruple: (<field name>, <threshold>, <ngram_size>, <weight>). <field name> is the name of a field in data_frame which is to be matched using a threshold similarity score of <threshold> and an ngram size of <ngram_size>. <weight> is a number that defines the relative importance of the field to other fields -- the field's contribution to the mean similarity will be weighted by this number. <weighted mean similarity score> = (Σfield <weight>field × <similarity>field) / (Σfieldweightfield), where Σfield means "sum over fields". |
fields_2b_matched_exactly |
Optional | List of tuples. Each tuple is a pair: (<field name>, <weight>). <field name> is the name of a field in data_frame which is to be matched exactly. <weight> has the same meaning as in parameter fields_2b_matched_fuzzily . Defaults to None . |
hierarchical |
Optional | bool . Determines if the output DataFrame will have a hierarchical column-structure (True ) or not (False ). Defaults to True . |
max_n_matches |
Optional | int . Maximum number of matches allowed per string. Defaults to the total number of rows. |
similarity_dtype |
Optional | numpy type. Either np.float32 (the default) or np.float64 . A value of np.float32 allows for less memory overhead during computation but less numerical precision, while np.float64 allows for greater numerical precision but a larger memory overhead. |
force_symmetries |
Optional | bool . Specifies whether corrections should be made to the results to account for symmetry thus circumventing some errors which result from loss of numerical significance. Defaults to True . |
n_blocks |
Optional | (int, int) . This parameter is provided to boost performance, if possible, by splitting the dataset into n_blocks[0] blocks for the left operand (of the "comparison operator") and into n_blocks[1] blocks for the right operand before performing the string-comparisons blockwise. |
Examples
import pandas as pd
from red_string_grouper import record_linkage
Prepare the Input Data:
Here's some sample data:
inputfilename = 'data/us-cities-real-estate-sample-zenrows.csv'
df = pd.read_csv(inputfilename, dtype=str)
Note that the data has been read into memory as strings (dtype=str
), since
red_string_grouper
is based on string comparison.
The dataset is a table of 10 000 records:
len(df)
10000
Let us examine the data to determine which fields to compare (that is, use only columns without null or NaN data). At the same time, we will also check how many unique values each field has:
for field in df.columns:
if df[field].nunique() > 1 and not df[field].isna().values.any():
print(f'{field} : {df[field].nunique()}')
zpid : 10000
id : 10000
imgSrc : 9940
detailUrl : 10000
statusText : 24
address : 10000
addressState : 51
addressZipcode : 6446
isUndisclosedAddress : 2
isZillowOwned : 2
has3DModel : 2
hasVideo : 2
isFeaturedListing : 2
list : 2
We may set field 'zpid'
as the index, since it has exactly the same number
of unique values as the number of rows. zpid
will thus be used to identify
each row.
df.set_index('zpid', inplace=True)
Call record_linkage()
:
There is more than one way to achieve the same matching result. But some ways are faster than others, depending on the data.
Plot comparing Runtimes of record_linkage()
calls with and without grouping on a test field having a varying number of unique values in a 10 000-row DataFrame
1. Grouping by fields that are to be matched exactly
Note that those fields that have very few unique values distributed among a
large number of rows, such as 'hasVideo' (2 unique values) and 'addressState'
(51 unique values), can be specified as "fields that are to be matched exactly"
(that is, in parameter fields_2b_matched_exactly
) which can lead to a significant
performance boost.
Behind the scenes, this allows record_linkage()
to avoid using cosine-
similarity matching on these fields (which is time-consuming, since many
matches are likely to be found), and instead group by these fields.
In this way, cosine-similarity matching can be performed only on the other
fields (in parameter fields_2b_matched_fuzzily
) for each group.
On the other hand, grouping by 'addressZipcode' (which has 6446 unique values) degrades performance, since the groups by this field are so many.
To illustrate, the following call took ≈ 5 minutes to run:
matches = record_linkage(
df,
fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
('address', 0.8, 3, 1)],
fields_2b_matched_exactly=[('addressZipcode', 2),
('addressState', 4),
('hasVideo', 1)],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=False
)
whereas, the following call (which produces the same result) took ≈8 seconds to run:
matches = record_linkage(
df,
fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
('address', 0.8, 3, 1),
('addressZipcode', 0.999999, 3, 2)],
fields_2b_matched_exactly=[('addressState', 4),
('hasVideo', 1)],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_symmetries=False
)
Let's display the results:
matches
Exactly Matched Fields | Fuzzily Matched Fields | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
addressState | hasVideo | statusText | address | addressZipcode | |||||||||
Weighted Mean Similarity Score | left | similarity | right | left | similarity | right | left | similarity | right | ||||
left_zpid | right_zpid | ||||||||||||
2077667803 | 2077679643 | 1.000000 | WY | false | Lot / Land for sale | 1.0 | Lot / Land for sale | Jlda Minor Sub Division LOT C, Buffalo, WY 82834 | 1.000000 | Jlda Minor Subdivision LOT C, Buffalo, WY 82834 | 82834 | 1.0 | 82834 |
2075244057 | 2075358943 | 0.997100 | OH | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 0 Township Road 118, Kimbolton, OH 43749 | 0.973904 | Township Road 118, Kimbolton, OH 43749 | 43749 | 1.0 | 43749 |
2077676622 | 2077676809 | 0.993867 | ND | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 4 55th St SE, Christine, ND 58015 | 0.944802 | 2 55th St SE, Christine, ND 58015 | 58015 | 1.0 | 58015 |
2077093064 | 2078843498 | 0.993328 | SD | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 17 Sidney Park Rd, Custer, SD 57730 | 0.939948 | Sidney Park Rd, Custer, SD 57730 | 57730 | 1.0 | 57730 |
150690392 | 2076123604 | 0.992909 | NJ | false | Lot / Land for sale | 1.0 | Lot / Land for sale | 5 Windsor Ln, Gladstone, NJ 07934 | 0.936180 | 0 Windsor Ln, Gladstone, NJ 07934 | 7934 | 1.0 | 7934 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2070837516 | 2072047318 | 0.978032 | HI | false | New construction | 1.0 | New construction | D12C Plan, Kaikoi at Hoopili | 0.802290 | D12B Plan, Kaikoi at Hoopili | 96706 | 1.0 | 96706 |
305578084 | 90035758 | 0.977991 | MO | false | Condo for sale | 1.0 | Condo for sale | 210 N 17th St UNIT 203, Saint Louis, MO 63103 | 0.801920 | 210 N 17th St UNIT 1202, Saint Louis, MO 63103 | 63103 | 1.0 | 63103 |
2071195670 | 88086529 | 0.977983 | MI | false | Condo for sale | 1.0 | Condo for sale | 6533 E Jefferson Ave APT 426, Detroit, MI 48207 | 0.801844 | 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 | 48207 | 1.0 | 48207 |
247263033 | 247263136 | 0.977941 | IA | false | New construction | 1.0 | New construction | 1 University Way #511, Iowa City, IA 52246 | 0.801474 | 1 University Way #503, Iowa City, IA 52246 | 52246 | 1.0 | 52246 |
2083656138 | 2083656146 | 0.977873 | IN | false | Condo for sale | 1.0 | Condo for sale | 3789 S Anderson Dr, Terre Haute, IN 47803 | 0.800855 | 3776 S Anderson Dr, Terre Haute, IN 47803 | 47803 | 1.0 | 47803 |
94 rows × 12 columns
2. No grouping
The results above can be obtained in yet another way. However, as mentioned above, it can take much longer to compute in cases where some fuzzily matched fields have very few uniques values.
The following call took ≈3 minutes 30 seconds to run:
record_linkage(
df,
fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
('address', 0.8, 3, 1),
('addressZipcode', 0.999999, 3, 2),
('hasVideo', 0.999999, 3, 1),
('addressState', 0.999999, 2, 4)],
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_symmetries=False
)
statusText | address | addressZipcode | hasVideo | addressState | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Weighted Mean Similarity Score | left | similarity | right | left | similarity | right | left | similarity | right | left | similarity | right | left | similarity | right | ||
left_zpid | right_zpid | ||||||||||||||||
2077667803 | 2077679643 | 1.000000 | Lot / Land for sale | 1.0 | Lot / Land for sale | Jlda Minor Sub Division LOT C, Buffalo, WY 82834 | 1.000000 | Jlda Minor Subdivision LOT C, Buffalo, WY 82834 | 82834 | 1.0 | 82834 | false | 1.0 | false | WY | 1.0 | WY |
2075244057 | 2075358943 | 0.997100 | Lot / Land for sale | 1.0 | Lot / Land for sale | 0 Township Road 118, Kimbolton, OH 43749 | 0.973904 | Township Road 118, Kimbolton, OH 43749 | 43749 | 1.0 | 43749 | false | 1.0 | false | OH | 1.0 | OH |
2077676622 | 2077676809 | 0.993867 | Lot / Land for sale | 1.0 | Lot / Land for sale | 4 55th St SE, Christine, ND 58015 | 0.944802 | 2 55th St SE, Christine, ND 58015 | 58015 | 1.0 | 58015 | false | 1.0 | false | ND | 1.0 | ND |
2077093064 | 2078843498 | 0.993328 | Lot / Land for sale | 1.0 | Lot / Land for sale | 17 Sidney Park Rd, Custer, SD 57730 | 0.939948 | Sidney Park Rd, Custer, SD 57730 | 57730 | 1.0 | 57730 | false | 1.0 | false | SD | 1.0 | SD |
150690392 | 2076123604 | 0.992909 | Lot / Land for sale | 1.0 | Lot / Land for sale | 5 Windsor Ln, Gladstone, NJ 07934 | 0.936180 | 0 Windsor Ln, Gladstone, NJ 07934 | 7934 | 1.0 | 7934 | false | 1.0 | false | NJ | 1.0 | NJ |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2070837516 | 2072047318 | 0.978032 | New construction | 1.0 | New construction | D12C Plan, Kaikoi at Hoopili | 0.802290 | D12B Plan, Kaikoi at Hoopili | 96706 | 1.0 | 96706 | false | 1.0 | false | HI | 1.0 | HI |
305578084 | 90035758 | 0.977991 | Condo for sale | 1.0 | Condo for sale | 210 N 17th St UNIT 203, Saint Louis, MO 63103 | 0.801920 | 210 N 17th St UNIT 1202, Saint Louis, MO 63103 | 63103 | 1.0 | 63103 | false | 1.0 | false | MO | 1.0 | MO |
2071195670 | 88086529 | 0.977983 | Condo for sale | 1.0 | Condo for sale | 6533 E Jefferson Ave APT 426, Detroit, MI 48207 | 0.801844 | 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 | 48207 | 1.0 | 48207 | false | 1.0 | false | MI | 1.0 | MI |
247263033 | 247263136 | 0.977941 | New construction | 1.0 | New construction | 1 University Way #511, Iowa City, IA 52246 | 0.801474 | 1 University Way #503, Iowa City, IA 52246 | 52246 | 1.0 | 52246 | false | 1.0 | false | IA | 1.0 | IA |
2083656138 | 2083656146 | 0.977873 | Condo for sale | 1.0 | Condo for sale | 3789 S Anderson Dr, Terre Haute, IN 47803 | 0.800855 | 3776 S Anderson Dr, Terre Haute, IN 47803 | 47803 | 1.0 | 47803 | false | 1.0 | false | IN | 1.0 | IN |
94 rows × 16 columns
One may choose to remove the field-values and output single-level column-
headings by setting hierarchical to False
:
record_linkage(
df,
fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
('address', 0.8, 3, 1),
('addressZipcode', 0.999999, 3, 2),
('hasVideo', 0.999999, 3, 1),
('addressState', 0.999999, 2, 4)],
hierarchical=False,
max_n_matches=10000,
similarity_dtype=np.float32,
force_symmetries=False
)
Weighted Mean Similarity Score | statusText | address | addressZipcode | hasVideo | addressState | ||
---|---|---|---|---|---|---|---|
left_zpid | right_zpid | ||||||
2077667803 | 2077679643 | 1.000000 | 1.0 | 1.000000 | 1.0 | 1.0 | 1.0 |
2075244057 | 2075358943 | 0.997100 | 1.0 | 0.973904 | 1.0 | 1.0 | 1.0 |
2077676622 | 2077676809 | 0.993867 | 1.0 | 0.944802 | 1.0 | 1.0 | 1.0 |
2077093064 | 2078843498 | 0.993328 | 1.0 | 0.939948 | 1.0 | 1.0 | 1.0 |
150690392 | 2076123604 | 0.992909 | 1.0 | 0.936180 | 1.0 | 1.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... |
2070837516 | 2072047318 | 0.978032 | 1.0 | 0.802290 | 1.0 | 1.0 | 1.0 |
305578084 | 90035758 | 0.977991 | 1.0 | 0.801920 | 1.0 | 1.0 | 1.0 |
2071195670 | 88086529 | 0.977983 | 1.0 | 0.801844 | 1.0 | 1.0 | 1.0 |
247263033 | 247263136 | 0.977941 | 1.0 | 0.801474 | 1.0 | 1.0 | 1.0 |
2083656138 | 2083656146 | 0.977873 | 1.0 | 0.800855 | 1.0 | 1.0 | 1.0 |
94 rows × 6 columns
Performance
Plots of Runtimes of record_linkage()
vs the number of blocks (#blocks
) into which the left matrix-operand of the dataset (380 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number of blocks into which the left matrix-operand was split.
String comparison, as implemented by string_grouper
, is essentially matrix
multiplication. A DataFrame of strings is converted (tokenized) into a
matrix. Then that matrix is multiplied by itself.
Here is an illustration of multiplication of two matrices M and D:
It turns out that when the matrix (or DataFrame) is very large, the computer
proceeds quite slowly with the multiplication (apparently due to the RAM being
too full). Some computers give up with an OverflowError
.
To circumvent this issue, red_string_grouper
allows to divide the DataFrame
into smaller chunks (or blocks) and multiply the chunks one pair at a time
instead to get the same result:
But surprise ... the run-time of the process is sometimes drastically reduced
as a result. For example, the speed-up of the following call is about 200%
(here, the DataFrame is divided into 4 blocks, that is,
4 blocks on the left × 4 on the right) compared to the same call with
n_blocks=(1, 1)
(the default) which is equivalent to string_grouper
's
match_strings()
:
companies = pd.read_csv('data/sec__edgar_company_info.csv')
# the following call produces the same result as
# string_grouper using
# match_strings(companies['Company Name'])
record_linkage(
companies,
fields_2b_matched_fuzzily=[('Company Name', 0.8, 3, 1)],
fields_2b_matched_exactly=None,
hierarchical=True,
max_n_matches=10000,
similarity_dtype=np.float32,
force_corrections=True,
n_blocks=(4, 4)
)
Further exploration of the block number space shows that if the left operand is not split but the right operand is, then even more gains in speed can be made:
Here are some plots of the results of some experiments performed:
From the plot above, it can be seen that the optimum split-configuration (run-time ≈3 minutes) is when the left operand is not split (#blocks = 1) and the right operand is split into six blocks (#nblocks = 6).
So what are the optimum block number values for a given DataFrame? That is anyone's guess, and the answer may vary from computer to computer.
We however encourage the user to make judicious use of the n_blocks
parameter.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.