
Red String Grouper

Record Equivalence Discoverer based on String Grouper (Red String Grouper) is a python package that finds similarities between rows/records of a table with multiple fields.
It is an extension of String Grouper, a library that makes finding groups of similar strings within one or two lists of strings easy and fast.

Installation

pip install red_string_grouper

Usage

How do we match over multiple fields with different priorities?

Import the function record_linkage() from red_string_grouper and specify the fields of the table over which the comparison will be made.

import pandas as pd 
import numpy as np 
from red_string_grouper import record_linkage

matches = record_linkage(data_frame, fields_2b_matched_fuzzily,
                         fields_2b_matched_exactly=None,
                         hierarchical=True, max_n_matches=None,
                         similarity_dtype=np.float32, force_symmetries=True,
                         n_blocks=None)

This is a function that combines similarity-matching results of several fields of a DataFrame (data_frame) and returns them in another DataFrame (matches).

Parameter (Status): Description

data_frame (Required):
pandas.DataFrame of strings; the table over which the comparisons will be made.

fields_2b_matched_fuzzily (Required):
List of tuples. Each tuple is a quadruple (<field name>, <threshold>, <ngram_size>, <weight>). <field name> is the name of a field in data_frame which is to be matched using a threshold similarity score of <threshold> and an ngram size of <ngram_size>. <weight> is a number that defines the relative importance of the field to the other fields: the field's contribution to the mean similarity is weighted by this number, so that

<weighted mean similarity score> = (Σ_field <weight>_field × <similarity>_field) / (Σ_field <weight>_field),

where Σ_field means "sum over fields". (A worked example follows this table.)

fields_2b_matched_exactly (Optional):
List of tuples. Each tuple is a pair (<field name>, <weight>). <field name> is the name of a field in data_frame which is to be matched exactly; <weight> has the same meaning as in fields_2b_matched_fuzzily. Defaults to None.

hierarchical (Optional):
bool. Determines whether the output DataFrame has a hierarchical column-structure (True) or not (False). Defaults to True.

max_n_matches (Optional):
int. Maximum number of matches allowed per string. Defaults to the total number of rows.

similarity_dtype (Optional):
numpy type, either np.float32 (the default) or np.float64. np.float32 incurs less memory overhead during computation but gives less numerical precision; np.float64 gives greater numerical precision at the cost of a larger memory overhead.

force_symmetries (Optional):
bool. Specifies whether corrections should be made to the results to account for symmetry, thus circumventing some errors which result from loss of numerical significance. Defaults to True.

n_blocks (Optional):
(int, int). Provided to boost performance, if possible, by splitting the dataset into n_blocks[0] blocks for the left operand (of the "comparison operator") and n_blocks[1] blocks for the right operand before performing the string comparisons blockwise.
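
To make the weighting concrete, here is a small worked example that reproduces the Weighted Mean Similarity Score (0.997100) of one record-pair from the Examples section below; note that an exactly matched field contributes a similarity of 1.0 to the mean (consistent with the results shown below):

# field weights as specified in the example calls below
weights      = {'statusText': 1, 'address': 1, 'addressZipcode': 2,
                'addressState': 4, 'hasVideo': 1}

# per-field similarity scores of the record-pair (2075244057, 2075358943);
# the exactly matched fields addressState and hasVideo count as 1.0
similarities = {'statusText': 1.0, 'address': 0.973904,
                'addressZipcode': 1.0, 'addressState': 1.0, 'hasVideo': 1.0}

score = (sum(weights[f] * similarities[f] for f in weights)
         / sum(weights.values()))
print(round(score, 6))   # 0.9971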

Examples

import pandas as pd
import numpy as np
from red_string_grouper import record_linkage

Prepare the Input Data:

Here's some sample data:

inputfilename = 'data/us-cities-real-estate-sample-zenrows.csv'
df = pd.read_csv(inputfilename, dtype=str)

Note that the data has been read into memory as strings (dtype=str), since red_string_grouper is based on string comparison.

The dataset is a table of 10 000 records:

len(df)
10000

Let us examine the data to determine which fields to compare (that is, which columns are free of null or NaN data). At the same time, we will check how many unique values each field has:

for field in df.columns:
    if df[field].nunique() > 1 and not df[field].isna().values.any():
        print(f'{field} : {df[field].nunique()}')
zpid : 10000
id : 10000
imgSrc : 9940
detailUrl : 10000
statusText : 24
address : 10000
addressState : 51
addressZipcode : 6446
isUndisclosedAddress : 2
isZillowOwned : 2
has3DModel : 2
hasVideo : 2
isFeaturedListing : 2
list : 2

We may set field 'zpid' as the index, since it has exactly the same number of unique values as the number of rows. zpid will thus be used to identify each row.

df.set_index('zpid', inplace=True)

Call record_linkage():

There is more than one way to achieve the same matching result. But some ways are faster than others, depending on the data.

[Plot: comparing runtimes of record_linkage() calls with and without grouping on a test field having a varying number of unique values in a 10 000-row DataFrame]

1. Grouping by fields that are to be matched exactly

Note that fields that have very few unique values distributed among a large number of rows, such as 'hasVideo' (2 unique values) and 'addressState' (51 unique values), can be specified as "fields that are to be matched exactly" (that is, in parameter fields_2b_matched_exactly), which can lead to a significant performance boost.

Behind the scenes, this allows record_linkage() to avoid cosine-similarity matching on these fields (which is time-consuming, since many matches are likely to be found) and to simply group by them instead.

In this way, cosine-similarity matching can be performed only on the other fields (in parameter fields_2b_matched_fuzzily) for each group.
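
To illustrate the group-then-match idea, here is a rough sketch (this is not red_string_grouper's actual implementation; the helper match_within_groups is made up, and string_grouper's match_strings() stands in for the per-group fuzzy matching):

import pandas as pd
from string_grouper import match_strings

def match_within_groups(df, exact_fields, fuzzy_field, min_similarity=0.8):
    # fuzzy-match fuzzy_field only within each group of rows that
    # agree exactly on exact_fields
    pieces = []
    for _, group in df.groupby(exact_fields):
        if len(group) > 1:  # a singleton group cannot contain a pair
            pieces.append(match_strings(group[fuzzy_field],
                                        min_similarity=min_similarity))
    return pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()

# e.g., fuzzy-match addresses only among rows in the same state:
# matches = match_within_groups(df, ['addressState'], 'address')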

On the other hand, grouping by 'addressZipcode' (which has 6446 unique values) degrades performance, since it produces far too many groups.

To illustrate, the following call took ≈5 minutes to run:

matches = record_linkage(
	df,
	fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
	                           ('address', 0.8, 3, 1)],
	fields_2b_matched_exactly=[('addressZipcode', 2),
	                           ('addressState', 4),
	                           ('hasVideo', 1)],
	hierarchical=True,
	max_n_matches=10000,
	similarity_dtype=np.float32,
	force_symmetries=False
)

whereas the following call (which produces the same result) took ≈8 seconds to run:

matches = record_linkage(
	df,
	fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
	                           ('address', 0.8, 3, 1),
	                           ('addressZipcode', 0.999999, 3, 2)],
	fields_2b_matched_exactly=[('addressState', 4),
	                           ('hasVideo', 1)],
	hierarchical=True,
	max_n_matches=10000,
	similarity_dtype=np.float32,
	force_symmetries=False
)

Let's display the results:

matches
Columns, left to right: left_zpid | right_zpid | Weighted Mean Similarity Score |
Exactly Matched Fields: addressState | hasVideo |
Fuzzily Matched Fields: statusText, address, addressZipcode (each as a left | similarity | right triple of columns)
2077667803 2077679643 1.000000 WY false Lot / Land for sale 1.0 Lot / Land for sale Jlda Minor Sub Division LOT C, Buffalo, WY 82834 1.000000 Jlda Minor Subdivision LOT C, Buffalo, WY 82834 82834 1.0 82834
2075244057 2075358943 0.997100 OH false Lot / Land for sale 1.0 Lot / Land for sale 0 Township Road 118, Kimbolton, OH 43749 0.973904 Township Road 118, Kimbolton, OH 43749 43749 1.0 43749
2077676622 2077676809 0.993867 ND false Lot / Land for sale 1.0 Lot / Land for sale 4 55th St SE, Christine, ND 58015 0.944802 2 55th St SE, Christine, ND 58015 58015 1.0 58015
2077093064 2078843498 0.993328 SD false Lot / Land for sale 1.0 Lot / Land for sale 17 Sidney Park Rd, Custer, SD 57730 0.939948 Sidney Park Rd, Custer, SD 57730 57730 1.0 57730
150690392 2076123604 0.992909 NJ false Lot / Land for sale 1.0 Lot / Land for sale 5 Windsor Ln, Gladstone, NJ 07934 0.936180 0 Windsor Ln, Gladstone, NJ 07934 7934 1.0 7934
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 HI false New construction 1.0 New construction D12C Plan, Kaikoi at Hoopili 0.802290 D12B Plan, Kaikoi at Hoopili 96706 1.0 96706
305578084 90035758 0.977991 MO false Condo for sale 1.0 Condo for sale 210 N 17th St UNIT 203, Saint Louis, MO 63103 0.801920 210 N 17th St UNIT 1202, Saint Louis, MO 63103 63103 1.0 63103
2071195670 88086529 0.977983 MI false Condo for sale 1.0 Condo for sale 6533 E Jefferson Ave APT 426, Detroit, MI 48207 0.801844 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 48207 1.0 48207
247263033 247263136 0.977941 IA false New construction 1.0 New construction 1 University Way #511, Iowa City, IA 52246 0.801474 1 University Way #503, Iowa City, IA 52246 52246 1.0 52246
2083656138 2083656146 0.977873 IN false Condo for sale 1.0 Condo for sale 3789 S Anderson Dr, Terre Haute, IN 47803 0.800855 3776 S Anderson Dr, Terre Haute, IN 47803 47803 1.0 47803

94 rows × 12 columns

2. No grouping

The results above can be obtained in yet another way. However, as mentioned above, it can take much longer to compute in cases where some fuzzily matched fields have very few unique values.

The following call took ≈3 minutes 30 seconds to run:

record_linkage(
	df,
	fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
	                           ('address', 0.8, 3, 1),
	                           ('addressZipcode', 0.999999, 3, 2),
	                           ('hasVideo', 0.999999, 3, 1),
	                           ('addressState', 0.999999, 2, 4)],
	hierarchical=True,
	max_n_matches=10000,
	similarity_dtype=np.float32,
	force_symmetries=False
)
Columns, left to right: left_zpid | right_zpid | Weighted Mean Similarity Score |
statusText, address, addressZipcode, hasVideo, addressState (each as a left | similarity | right triple of columns)
2077667803 2077679643 1.000000 Lot / Land for sale 1.0 Lot / Land for sale Jlda Minor Sub Division LOT C, Buffalo, WY 82834 1.000000 Jlda Minor Subdivision LOT C, Buffalo, WY 82834 82834 1.0 82834 false 1.0 false WY 1.0 WY
2075244057 2075358943 0.997100 Lot / Land for sale 1.0 Lot / Land for sale 0 Township Road 118, Kimbolton, OH 43749 0.973904 Township Road 118, Kimbolton, OH 43749 43749 1.0 43749 false 1.0 false OH 1.0 OH
2077676622 2077676809 0.993867 Lot / Land for sale 1.0 Lot / Land for sale 4 55th St SE, Christine, ND 58015 0.944802 2 55th St SE, Christine, ND 58015 58015 1.0 58015 false 1.0 false ND 1.0 ND
2077093064 2078843498 0.993328 Lot / Land for sale 1.0 Lot / Land for sale 17 Sidney Park Rd, Custer, SD 57730 0.939948 Sidney Park Rd, Custer, SD 57730 57730 1.0 57730 false 1.0 false SD 1.0 SD
150690392 2076123604 0.992909 Lot / Land for sale 1.0 Lot / Land for sale 5 Windsor Ln, Gladstone, NJ 07934 0.936180 0 Windsor Ln, Gladstone, NJ 07934 7934 1.0 7934 false 1.0 false NJ 1.0 NJ
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 New construction 1.0 New construction D12C Plan, Kaikoi at Hoopili 0.802290 D12B Plan, Kaikoi at Hoopili 96706 1.0 96706 false 1.0 false HI 1.0 HI
305578084 90035758 0.977991 Condo for sale 1.0 Condo for sale 210 N 17th St UNIT 203, Saint Louis, MO 63103 0.801920 210 N 17th St UNIT 1202, Saint Louis, MO 63103 63103 1.0 63103 false 1.0 false MO 1.0 MO
2071195670 88086529 0.977983 Condo for sale 1.0 Condo for sale 6533 E Jefferson Ave APT 426, Detroit, MI 48207 0.801844 6533 E Jefferson Ave APT 102E, Detroit, MI 48207 48207 1.0 48207 false 1.0 false MI 1.0 MI
247263033 247263136 0.977941 New construction 1.0 New construction 1 University Way #511, Iowa City, IA 52246 0.801474 1 University Way #503, Iowa City, IA 52246 52246 1.0 52246 false 1.0 false IA 1.0 IA
2083656138 2083656146 0.977873 Condo for sale 1.0 Condo for sale 3789 S Anderson Dr, Terre Haute, IN 47803 0.800855 3776 S Anderson Dr, Terre Haute, IN 47803 47803 1.0 47803 false 1.0 false IN 1.0 IN

94 rows × 16 columns

One may choose to omit the field values and output single-level column headings by setting hierarchical to False:

record_linkage(
	df,
	fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
	                           ('address', 0.8, 3, 1),
	                           ('addressZipcode', 0.999999, 3, 2),
	                           ('hasVideo', 0.999999, 3, 1),
	                           ('addressState', 0.999999, 2, 4)],
	hierarchical=False,
	max_n_matches=10000,
	similarity_dtype=np.float32,
	force_symmetries=False
)
Columns, left to right: left_zpid | right_zpid | Weighted Mean Similarity Score | statusText | address | addressZipcode | hasVideo | addressState
2077667803 2077679643 1.000000 1.0 1.000000 1.0 1.0 1.0
2075244057 2075358943 0.997100 1.0 0.973904 1.0 1.0 1.0
2077676622 2077676809 0.993867 1.0 0.944802 1.0 1.0 1.0
2077093064 2078843498 0.993328 1.0 0.939948 1.0 1.0 1.0
150690392 2076123604 0.992909 1.0 0.936180 1.0 1.0 1.0
... ... ... ... ... ... ... ...
2070837516 2072047318 0.978032 1.0 0.802290 1.0 1.0 1.0
305578084 90035758 0.977991 1.0 0.801920 1.0 1.0 1.0
2071195670 88086529 0.977983 1.0 0.801844 1.0 1.0 1.0
247263033 247263136 0.977941 1.0 0.801474 1.0 1.0 1.0
2083656138 2083656146 0.977873 1.0 0.800855 1.0 1.0 1.0

94 rows × 6 columns
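
Single-level headings make the output easy to post-process with ordinary pandas operations; for example (the 0.99 cut-off here is an arbitrary choice for illustration):

flat = record_linkage(
	df,
	fields_2b_matched_fuzzily=[('statusText', 0.8, 3, 1),
	                           ('address', 0.8, 3, 1),
	                           ('addressZipcode', 0.999999, 3, 2),
	                           ('hasVideo', 0.999999, 3, 1),
	                           ('addressState', 0.999999, 2, 4)],
	hierarchical=False,
	max_n_matches=10000,
	similarity_dtype=np.float32,
	force_symmetries=False
)

# keep only the near-certain pairs, most similar first
strong = (flat[flat['Weighted Mean Similarity Score'] > 0.99]
          .sort_values('Weighted Mean Similarity Score', ascending=False))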

Performance

[Plots: runtimes of record_linkage() vs the number of blocks (n_blocks[1]) into which the right matrix-operand of the dataset (380 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison. As shown in the legend, each plot corresponds to the number of blocks (n_blocks[0]) into which the left matrix-operand was split.]

String comparison, as implemented by string_grouper, is essentially matrix multiplication. A DataFrame of strings is converted (tokenized) into a matrix; that matrix is then multiplied by its own transpose.
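
As a minimal sketch of that idea (using scikit-learn's TfidfVectorizer here for illustration, not string_grouper's exact internals):

from sklearn.feature_extraction.text import TfidfVectorizer

strings = ['123 Main St', '123 Maine St', '45 Elm Ave']

# tokenize each string into character 3-grams and build a sparse
# TF-IDF matrix with one row per string
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3))
M = vectorizer.fit_transform(strings)

# since the rows are L2-normalized, multiplying the matrix by its
# transpose yields the pairwise cosine similarities of the strings
print((M @ M.T).toarray().round(2))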

Here is an illustration of the multiplication of two matrices M and D:

[Figure: M and D multiplied as whole (unsplit) matrices]

It turns out that when the matrix (or DataFrame) is very large, the multiplication proceeds quite slowly (apparently because RAM fills up). Some computers give up altogether with an OverflowError.

To circumvent this issue, red_string_grouper makes it possible to divide the DataFrame into smaller chunks (or blocks) and multiply the chunks one pair at a time instead, obtaining the same result:

[Figure: the same product computed blockwise, with each operand split into 2 blocks]
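
The following is a rough sketch of the blockwise computation, written directly against scipy (illustration only; red_string_grouper handles this internally via the n_blocks parameter):

import numpy as np
from scipy.sparse import hstack, vstack

def blockwise_product(M, D, n_left_blocks, n_right_blocks):
    # compute M @ D one block-pair at a time and stitch the partial
    # products back together into the full product
    left_parts  = np.array_split(np.arange(M.shape[0]), n_left_blocks)
    right_parts = np.array_split(np.arange(D.shape[1]), n_right_blocks)
    rows = []
    for r in left_parts:
        # in practice each partial product is also thresholded
        # immediately, which is what keeps peak memory use low
        rows.append(hstack([M[r, :] @ D[:, c] for c in right_parts]))
    return vstack(rows)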

But surprise: the run-time of the process is sometimes drastically reduced as a result. For example, the speed-up of the following call (here, the DataFrame is divided into 4 blocks on the left × 4 blocks on the right) is about 200% compared to the same call with n_blocks=(1, 1) (the default), which is equivalent to string_grouper's match_strings():

companies = pd.read_csv('data/sec__edgar_company_info.csv')

# the following call produces the same result as string_grouper's
# match_strings(companies['Company Name'])
record_linkage(
	companies,
	fields_2b_matched_fuzzily=[('Company Name', 0.8, 3, 1)],
	fields_2b_matched_exactly=None,
	hierarchical=True,
	max_n_matches=10000,
	similarity_dtype=np.float32,
	force_symmetries=True,
	n_blocks=(4, 4)
)

Further exploration of the block number space shows that if the left operand is not split but the right operand is, then even more gains in speed can be made:

[Figure: the left operand unsplit, the right operand split into 2 blocks]

[Plots of the results of some of the experiments performed]

From these plots, it can be seen that the optimum split-configuration (run-time ≈3 minutes) is when the left operand is not split (#blocks = 1) and the right operand is split into six blocks (#blocks = 6).

So what are the optimum block number values for a given DataFrame? That is anyone's guess, and the answer may vary from computer to computer.

We nevertheless encourage the user to make judicious use of the n_blocks parameter; one way to do so is sketched below.
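
Since the best configuration is data- and machine-dependent, one simple approach is to time a few candidate splits on your own data (an illustrative scaffold; the configurations chosen here are arbitrary):

import time

for n_blocks in [(1, 1), (1, 6), (4, 4)]:
	start = time.perf_counter()
	record_linkage(
		companies,
		fields_2b_matched_fuzzily=[('Company Name', 0.8, 3, 1)],
		max_n_matches=10000,
		force_symmetries=True,
		n_blocks=n_blocks
	)
	print(n_blocks, f'{time.perf_counter() - start:.1f} s')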
