
Red String Grouper

Record Equivalence Discoverer based on String Grouper (Red String Grouper) is a Python package that finds similarities between the rows/records of one or two tables with multiple fields.
It is an extension of String Grouper, a library that makes finding groups of similar strings within one or two lists of strings easy — and fast.

Installation

pip install red_string_grouper

Usage

How do we achieve matching over multiple fields with different priorities?

Import the function record_linkage() from red_string_grouper and specify the fields of the table over which the comparisons will be made.

import pandas as pd 
import numpy as np 
from red_string_grouper import record_linkage, field, field_pair

matches = record_linkage(data_frames, fields_2b_matched_fuzzily,
                         fields_2b_matched_exactly=None,
                         hierarchical=True, **kwargs)

This is a function that combines similarity-matching results of several fields of one or two DataFrames (data_frames) and returns them in another DataFrame (matches).

Examples are given below.
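Before the full parameter description and the detailed examples below, here is a minimal sketch of a call on a small hypothetical DataFrame (the data and column names are made up purely for illustration):

toy = pd.DataFrame({
    'name':  ['John Smith', 'Jon Smith', 'Mary Jones'],
    'state': ['NY', 'NY', 'CA']
})

# fuzzily match 'name' (with twice the default weight) and
# require 'state' to match exactly
toy_matches = record_linkage(
    toy,
    fields_2b_matched_fuzzily=[field('name', weight=2)],
    fields_2b_matched_exactly=[field('state')]
)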

Parameter Status Description
data_frames Required Either a pandas.DataFrame or a list of two pandas.DataFrames of strings which are the tables over which the comparisons will be made.
fields_2b_matched_fuzzily Required List of tuples. When data_frames is a DataFrame, each tuple must be a triple:
(〈field name〉, 〈weight〉, 〈field_kwargs〉)
which is best entered using the provided auxiliary function field() in the following way:
field(<field name>, weight=<weight>, **kwargs).
〈field name〉 is a string denoting the name of a field of the given DataFrame whose values are to be matched.

When data_frames is a list of two DataFrames, then each tuple must be a quadruple:
(〈field name 1〉, 〈field name 2〉, 〈weight〉, 〈field_kwargs〉)
which is best entered using the provided auxiliary function field_pair() in the following way:
field_pair(<field name 1>, <field name 2>, weight=<weight>, **kwargs).
〈field name 1〉 is a string denoting the name of a field in data_frames[0]. Similarly, 〈field name 2〉 is a string denoting the name of a field in data_frames[1] whose values are to be matched with those of 〈field name 1〉.

〈weight〉 is a number (default: 1.0) that defines the relative importance of the field (or field-pair) to other fields (or field-pairs) -- the field's (or field-pair's) contribution to the mean similarity will be weighted by this number:
〈weighted mean similarity score〉 =
        (Σ_field 〈weight〉_field × 〈similarity〉_field) / (Σ_field 〈weight〉_field),
where Σ_field means "sum over fields".

〈field_kwargs〉 is a dict capturing any keyword arguments to be passed to StringGrouper for this field. Any StringGrouper keyword arguments specified for the **kwargs of the field() (or field_pair()) function will be captured by 〈field_kwargs〉.
fields_2b_matched_exactly Optional List of tuples. When data_frames is a DataFrame, each tuple must be a pair:
(〈field name〉, 〈weight〉).

〈field name〉 is the name of a field of the given DataFrame which is to be matched exactly. The auxiliary function field() may be used to enter each tuple in the following way:
field(<field name>, weight=<weight>).

When data_frames is a list of two DataFrames, then each tuple must be a triple:
(〈field name 1〉, 〈field name 2〉, 〈weight〉)
which is best entered using the provided auxiliary function field_pair() in the following way:
field_pair(<field name 1>, <field name 2>, weight=<weight>).

〈field name 1〉 is a string denoting the name of a field in data_frames[0]. Similarly, 〈field name 2〉 is a string denoting the name of a field in data_frames[1] whose values are to be matched exactly with those of 〈field name 1〉.

〈weight〉 has the same meaning as in parameter fields_2b_matched_fuzzily. The fields_2b_matched_exactly parameter itself defaults to None.
hierarchical Optional bool. Determines if the output DataFrame will have a hierarchical column-structure (True) or not (False). Defaults to True.
**kwargs Optional Any string_grouper keyword arguments may be specified here. These will apply to all fields (or field-pairs) in fields_2b_matched_fuzzily. However, any keyword arguments already given in fields_2b_matched_fuzzily take precedence over those specified here.
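To make the weighting scheme concrete, here is the weighted-mean formula above restated as a small Python function (an illustration only, not code from the package; exactly matched fields effectively contribute a similarity of 1.0 for every record pair they admit):

def weighted_mean_similarity(weights, similarities):
    # weighted mean of the per-field similarity scores
    return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)

# fields weighted 1, 1, 2, 4, 1 with per-field similarities
# 1.0, 0.973904, 1.0, 1.0, 1.0 give a score of about 0.9971
# (compare the second row of the example output further below)
weighted_mean_similarity([1, 1, 2, 4, 1], [1.0, 0.973904, 1.0, 1.0, 1.0])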

Examples

import pandas as pd 
import numpy as np
from red_string_grouper import record_linkage, field

Prepare the Input Data:

Here's some sample data:

file = 'data/us-cities-real-estate-sample-zenrows.csv'
df = pd.read_csv(file, dtype=str, keep_default_na=False)   # all strings
dfna = pd.read_csv(file, dtype=str, keep_default_na=True)  # has null-values

Note that the data has been read into memory as strings (dtype=str), since red_string_grouper is based on string comparison.

The dataset is a table of 10 000 records:

len(df)
10000

Let us examine the data to determine which fields are good to compare (that is, we will use only columns without null or NaN data). At the same time, we will also check how many unique values each field has and its maximum string-length (to inform our choice of its n-gram size):

# For each column: count unique values, check for nulls, and find the maximum string length.
column_info = pd.concat(
    [
        df.nunique().rename('#unique'),
        dfna.isna().any().rename('has null?'),
        df.applymap(len).max().rename('max strlen')
    ],
    axis=1
).rename_axis("column").reset_index()

column_info[
    (column_info['#unique'] > 1) & (~column_info['has null?'])
]
column #unique has null? max strlen
0 zpid 10000 False 10
1 id 10000 False 10
3 imgSrc 9940 False 231
5 detailUrl 10000 False 121
7 statusText 24 False 26
11 address 10000 False 72
14 addressState 51 False 2
15 addressZipcode 6446 False 5
16 isUndisclosedAddress 2 False 5
22 isZillowOwned 2 False 5
31 has3DModel 2 False 5
32 hasVideo 2 False 5
38 isFeaturedListing 2 False 5
39 list 2 False 5

Thus we may set field 'zpid' as the index, since it has exactly as many unique values as there are rows. zpid will then be used to identify each row in the matching results.

df.set_index('zpid', inplace=True)

Call record_linkage():

There is more than one way to achieve the same matching result. But some ways are faster than others, depending on the data.

[Plot comparing run-times of record_linkage() calls with and without grouping on a test field having a varying number of unique values in a 10 000-row DataFrame]

1. Grouping by fields that are to be matched exactly

Note that those fields that have very few unique values distributed among a large number of rows, such as 'hasVideo' (2 unique values) and 'addressState' (51 unique values), can be specified as "fields that are to be matched exactly" (that is, in parameter fields_2b_matched_exactly) which can lead to a significant performance boost.

Behind the scenes, this allows record_linkage() to avoid using cosine-similarity matching on these fields (which is time-consuming, since many matches are likely to be found), and instead group the data by these fields.

In this way, cosine-similarity matching can be performed only on the other fields (in parameter fields_2b_matched_fuzzily) for each group.
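Conceptually, the grouping strategy amounts to something like the following sketch (a rough illustration only, not the package's internal code), which uses string_grouper's match_strings within each group and assumes a single fuzzily matched field:

import pandas as pd
from string_grouper import match_strings

def fuzzy_match_within_groups(df, exact_cols, fuzzy_col, **sg_kwargs):
    # restrict fuzzy matching to rows that agree on all of exact_cols
    results = []
    for _, group in df.groupby(exact_cols):
        if len(group) < 2:
            continue  # nothing to compare within a singleton group
        # compare the group's strings with themselves
        results.append(match_strings(group[fuzzy_col], **sg_kwargs))
    return pd.concat(results, ignore_index=True) if results else pd.DataFrame()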

On the other hand, grouping by 'addressZipcode' (which has 6446 unique values) degrades performance, since this field produces a very large number of groups.

To illustrate, the following call took ≈ 5 minutes to run:

matches = record_linkage(
    df,
    fields_2b_matched_fuzzily=[
        field('statusText'),
        field('address')
    ],
    fields_2b_matched_exactly=[
        field('addressZipcode', weight=2),
        field('addressState', weight=4),
        field('hasVideo')
    ]
)

whereas the following call (which produces the same result) took ≈ 8 seconds to run:

matches = record_linkage(
    df,
    fields_2b_matched_fuzzily=[
        field('statusText'),
        field('address'),
        field('addressZipcode', weight=2, min_similarity=0.9999)
    ],
    fields_2b_matched_exactly=[
        field('addressState', weight=4),
        field('hasVideo')
    ]
)

Note that in the above calls, unspecified options (such as weight, and string_grouper keyword arguments such as min_similarity and ngram_size) take their default values. The default value for weight is 1.0, while the default values of the string_grouper keyword arguments can be found in the string_grouper documentation. (A sketch showing how such keyword arguments may be supplied globally follows the results table below.) Let's display the results:

matches
Exactly Matched Fields Fuzzily Matched Fields
addressState hasVideo statusText address addressZipcode
Weighted Mean Similarity Score left similarity right left similarity right left similarity right
left_zpid right_zpid
2077667803 2077679643 1.000000 WY false Lot / Land for sale 1.0 Lot / Land for sale Jlda Minor Sub Division LOT C, Buffalo, WY 82834 1.000000 Jlda Minor Subdivision LOT C, Buffalo, WY 82834 82834 1.0 82834
2075244057 2075358943 0.997100 OH false Lot / Land for sale 1.0 Lot / Land for sale 0 Township Road 118, Kimbolton, OH 43749 0.973904 Township Road 118, Kimbolton, OH 43749 43749 1.0 43749
2077676622 2077676809 0.993867 ND false Lot / Land for sale 1.0 Lot / Land for sale 4 55th St SE, Christine, ND 58015 0.944802 2 55th St SE, Christine, ND 58015 58015 1.0 58015
2077093064 2078843498 0.993328 SD false Lot / Land for sale 1.0 Lot / Land for sale 17 Sidney Park Rd, Custer, SD 57730 0.939948 Sidney Park Rd, Custer, SD 57730 57730 1.0 57730
150690392 2076123604 0.992909 NJ false Lot / Land for sale 1.0 Lot / Land for sale 5 Windsor Ln, Gladstone, NJ 07934 0.936180 0 Windsor Ln, Gladstone, NJ 07934 7934 1.0 7934
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 HI false New construction 1.0 New construction D12C Plan, Kaikoi at Hoopili 0.802290 D12B Plan, Kaikoi at Hoopili 96706 1.0 96706
305578084 90035758 0.977991 MO false Condo for sale 1.0 Condo for sale 210 N 17th St UNIT 203, Saint Louis, MO 63103 0.801920 210 N 17th St UNIT 1202, Saint Louis, MO 63103 63103 1.0 63103
2071195670 88086529 0.977983 MI false Condo for sale 1.0 Condo for sale 6533 E Jefferson Ave APT 426, Detroit, MI 48207 0.801844 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 48207 1.0 48207
247263033 247263136 0.977941 IA false New construction 1.0 New construction 1 University Way #511, Iowa City, IA 52246 0.801474 1 University Way #503, Iowa City, IA 52246 52246 1.0 52246
2083656138 2083656146 0.977873 IN false Condo for sale 1.0 Condo for sale 3789 S Anderson Dr, Terre Haute, IN 47803 0.800855 3776 S Anderson Dr, Terre Haute, IN 47803 47803 1.0 47803

94 rows × 12 columns
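As noted above, string_grouper keyword arguments can also be supplied globally through the **kwargs of record_linkage(); any per-field settings then take precedence. A sketch of this (the global values shown are simply string_grouper's documented defaults, so this call should reproduce the result above):

matches = record_linkage(
    df,
    fields_2b_matched_fuzzily=[
        field('statusText'),
        field('address'),
        # the per-field min_similarity overrides the global value below
        field('addressZipcode', weight=2, min_similarity=0.9999)
    ],
    fields_2b_matched_exactly=[
        field('addressState', weight=4),
        field('hasVideo')
    ],
    # global string_grouper keyword arguments for all fuzzily matched fields
    ngram_size=3,
    min_similarity=0.8
)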

2. No grouping

The results above can be obtained in yet another way. However, as mentioned above, it can take much longer to compute in cases where some fuzzily matched fields have very few unique values.

The following call took ≈3 minutes 30 seconds to run:

record_linkage(
    df,
    fields_2b_matched_fuzzily=[
        field('statusText'),
        field('address'),
        field('addressZipcode', weight=2, min_similarity=0.9999),
        field('addressState', weight=4, ngram_size=2, min_similarity=0.9999),
        field('hasVideo', min_similarity=0.9999)
    ]
)
statusText address addressZipcode hasVideo addressState
Weighted Mean Similarity Score left similarity right left similarity right left similarity right left similarity right left similarity right
left_zpid right_zpid
2077667803 2077679643 1.000000 Lot / Land for sale 1.0 Lot / Land for sale Jlda Minor Sub Division LOT C, Buffalo, WY 82834 1.000000 Jlda Minor Subdivision LOT C, Buffalo, WY 82834 82834 1.0 82834 false 1.0 false WY 1.0 WY
2075244057 2075358943 0.997100 Lot / Land for sale 1.0 Lot / Land for sale 0 Township Road 118, Kimbolton, OH 43749 0.973904 Township Road 118, Kimbolton, OH 43749 43749 1.0 43749 false 1.0 false OH 1.0 OH
2077676622 2077676809 0.993867 Lot / Land for sale 1.0 Lot / Land for sale 4 55th St SE, Christine, ND 58015 0.944802 2 55th St SE, Christine, ND 58015 58015 1.0 58015 false 1.0 false ND 1.0 ND
2077093064 2078843498 0.993328 Lot / Land for sale 1.0 Lot / Land for sale 17 Sidney Park Rd, Custer, SD 57730 0.939948 Sidney Park Rd, Custer, SD 57730 57730 1.0 57730 false 1.0 false SD 1.0 SD
150690392 2076123604 0.992909 Lot / Land for sale 1.0 Lot / Land for sale 5 Windsor Ln, Gladstone, NJ 07934 0.936180 0 Windsor Ln, Gladstone, NJ 07934 7934 1.0 7934 false 1.0 false NJ 1.0 NJ
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 New construction 1.0 New construction D12C Plan, Kaikoi at Hoopili 0.802290 D12B Plan, Kaikoi at Hoopili 96706 1.0 96706 false 1.0 false HI 1.0 HI
305578084 90035758 0.977991 Condo for sale 1.0 Condo for sale 210 N 17th St UNIT 203, Saint Louis, MO 63103 0.801920 210 N 17th St UNIT 1202, Saint Louis, MO 63103 63103 1.0 63103 false 1.0 false MO 1.0 MO
2071195670 88086529 0.977983 Condo for sale 1.0 Condo for sale 6533 E Jefferson Ave APT 426, Detroit, MI 48207 0.801844 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 48207 1.0 48207 false 1.0 false MI 1.0 MI
247263033 247263136 0.977941 New construction 1.0 New construction 1 University Way #511, Iowa City, IA 52246 0.801474 1 University Way #503, Iowa City, IA 52246 52246 1.0 52246 false 1.0 false IA 1.0 IA
2083656138 2083656146 0.977873 Condo for sale 1.0 Condo for sale 3789 S Anderson Dr, Terre Haute, IN 47803 0.800855 3776 S Anderson Dr, Terre Haute, IN 47803 47803 1.0 47803 false 1.0 false IN 1.0 IN

94 rows × 16 columns

One may choose to remove the field-values and output single-level column headings by setting hierarchical to False:

record_linkage(
    df,
    fields_2b_matched_fuzzily=[
        field('statusText'),
        field('address'),
        field('addressZipcode', weight=2, min_similarity=0.9999),
        field('addressState', weight=4, ngram_size=2, min_similarity=0.9999),
        field('hasVideo', min_similarity=0.9999)
    ],
    hierarchical=False
)
Weighted Mean Similarity Score statusText address addressZipcode hasVideo addressState
left_zpid right_zpid
2077667803 2077679643 1.000000 1.0 1.000000 1.0 1.0 1.0
2075244057 2075358943 0.997100 1.0 0.973904 1.0 1.0 1.0
2077676622 2077676809 0.993867 1.0 0.944802 1.0 1.0 1.0
2077093064 2078843498 0.993328 1.0 0.939948 1.0 1.0 1.0
150690392 2076123604 0.992909 1.0 0.936180 1.0 1.0 1.0
... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 1.0 0.802290 1.0 1.0 1.0
305578084 90035758 0.977991 1.0 0.801920 1.0 1.0 1.0
2071195670 88086529 0.977983 1.0 0.801844 1.0 1.0 1.0
247263033 247263136 0.977941 1.0 0.801474 1.0 1.0 1.0
2083656138 2083656146 0.977873 1.0 0.800855 1.0 1.0 1.0

94 rows × 6 columns

Performance

[Plots of run-times of record_linkage() vs the number n_blocks[1] of blocks into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number n_blocks[0] of blocks into which the left matrix-operand was split.]

String comparison, as implemented by string_grouper, is essentially matrix multiplication. A DataFrame of strings is converted (tokenized) into a matrix. Then that matrix is multiplied by its own transpose (or by the transpose of the other table's matrix).

Here is an illustration of the multiplication of two matrices D and Mᵀ:

[figure: the matrix product D × Mᵀ]
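The core idea can be sketched with scikit-learn (string_grouper's actual implementation differs in detail and is far more optimized, but the principle is the same):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

strings = pd.Series(['1 University Way #511', '1 University Way #503', 'something else'])

# tokenize each string into character n-grams and build a sparse TF-IDF matrix
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
D = vectorizer.fit_transform(strings)   # shape: (number of strings, number of n-grams)

# rows are L2-normalized, so D times its own transpose gives all
# pairwise cosine similarities in one sparse matrix multiplication
similarity_matrix = D @ D.T             # shape: (number of strings, number of strings)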

It turns out that when the matrix (or DataFrame) is very large, the multiplication becomes very slow (apparently because RAM fills up), and on some computers it fails altogether with an OverflowError.

To circumvent this issue, red_string_grouper can divide the DataFrame into smaller chunks (or blocks) and multiply the chunks one pair at a time instead, which yields the same result:

[figure: the same product computed block by block]
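That the blockwise product equals the unsplit product is plain linear algebra; here is a small numpy sketch (hypothetical, not the package's code) that splits only the right operand, the way n_blocks=(1, k) does:

import numpy as np

def multiply_in_blocks(D, M, n_right_blocks):
    # split M into row-blocks (i.e. M.T into column-blocks), multiply
    # block by block, and stitch the partial products back together
    blocks = np.array_split(M, n_right_blocks, axis=0)
    return np.hstack([D @ block.T for block in blocks])

# sanity check on random data: the result is identical to D @ M.T
rng = np.random.default_rng(0)
D, M = rng.random((4, 6)), rng.random((5, 6))
assert np.allclose(multiply_in_blocks(D, M, 3), D @ M.T)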

But, surprisingly, the run-time of the process is sometimes drastically reduced as a result. For example, the following call enjoys a speed-up of about 500% (here the DataFrame is divided into 200 blocks on the right operand only, that is, 1 block on the left × 200 on the right) compared to the same call with no splitting [n_blocks=(1, 1), the default, which is equivalent to a match_strings call of string_grouper (versions 0.5.0 and earlier)]:

# 663000 records:
companies = pd.read_csv('data/sec__edgar_company_info.csv')

# the following call produces the same result as 
# string_grouper using 
# match_strings(companies['Company Name'])
# but is more than 3 times faster!
record_linkage(
    companies,
    fields_2b_matched_fuzzily=[field('Company Name')],
    n_blocks=(1, 200)
)

Further exploration of the block number space (see plot above) has revealed that for any fixed number of right blocks, the run-time gets longer the larger the number of left blocks specified. For this reason, it is recommended not to split the left matrix.

[figure: block-matrix multiplication with the left operand kept whole and the right operand split]

So what are the optimum block number values for any given DataFrame? That is anyone's guess, and the answer may vary from computer to computer.

We however encourage the user to make judicious use of the n_blocks parameter to boost performance of record_linkage().
