String Grouper contains functions for string matching using TF-IDF and cosine similarity.

These details have not been verified by PyPI

Project description

String Grouper

[Image: graph visualization of one of the groups of strings found by string_grouper]

The image displayed above is a visualization of the graph structure of one of the groups of strings found by string_grouper. Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold (here 0.8).

The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node, which also has the most edges originating from it. A thick line in the image denotes strong similarity between the nodes at its ends, while a faint, thin line denotes weak similarity.

The power of string_grouper is discernible from this image: in large datasets, string_grouper can often resolve indirect associations between strings even when memory constraints, for example, prevent conventional methods from computing direct matches between those strings at a lower similarity threshold.

———

This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.
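As a rough illustration of how indirect associations can arise, the sketch below merges pairwise matches into groups with a union-find structure, so two strings land in the same group whenever a chain of matches links them. The function name `group_from_matches` and the sample pairs are hypothetical, and this is not necessarily how string_grouper builds its groups internally.

```python
def group_from_matches(n_items, matched_pairs):
    """Return a group label for each item, given index pairs that matched."""
    parent = list(range(n_items))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the root so labels are deterministic.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return [find(i) for i in range(n_items)]


# Hypothetical matches above the threshold: 0-1 and 1-2 matched, 0-2 did not.
labels = group_from_matches(4, [(0, 1), (1, 2)])
print(labels)  # items 0, 1 and 2 share a label; item 3 stays alone
```

Items 0 and 2 were never matched directly, yet they share a group because both matched item 1.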

string_grouper is a library that makes finding groups of similar strings, within a single list or across multiple lists, easy and fast. string_grouper uses TF-IDF to calculate cosine similarities within a single list or between two lists of strings. The full process is described in the blog Super Fast String Matching in Python.
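To make the idea concrete, here is a minimal pure-Python sketch of TF-IDF over character n-grams followed by cosine similarity. The helper names, the n-gram size of 3, the smoothed idf formula, and the sample names are all assumptions chosen for illustration. The library itself works on sparse matrices and may tokenize and weight differently.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build L2-normalised tf-idf vectors (as dicts) for a list of strings."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(strings)
    vectors = []
    for c in counts:
        # Smoothed idf, so n-grams found in every string still carry weight.
        vec = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ["Mega Enterprises Corp", "Mega Enterprises Corporation", "Tiny Widgets Ltd"]
vecs = tfidf_vectors(names)
print(round(cosine(vecs[0], vecs[1]), 3))  # close variants score high
print(round(cosine(vecs[0], vecs[2]), 3))  # names sharing no n-grams score 0
```

Because the vectors are normalised to unit length, the cosine similarity reduces to a dot product, which is what makes a fast sparse matrix multiplication suitable for this task.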

Installing

pip install string-grouper

Speed

string_grouper leverages the blazingly fast sparse_dot_topn library to calculate cosine similarities.

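import datetime
from string_grouper import match_strings

# `names` is a DataFrame of company names loaded beforehand
start = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes=4)
end = datetime.datetime.now()

# Elapsed wall-clock time as a string
str(end - start)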

Results in:

00:05:34.65 On an Intel i7-6500U CPU @ 2.50GHz, where len(names) = 663 000

In other words, the library is able to perform fuzzy matching of 663 000 names in five and a half minutes on a 2015 consumer CPU using 4 processes.
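For repeated measurements, a small timing helper can be handy. The sketch below uses `time.perf_counter` from the standard library; the `timed` helper and the placeholder workload are illustrative assumptions, not part of string_grouper.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock seconds spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("sort", timings):
    # Placeholder workload; a match_strings(...) call would go here instead.
    sorted(range(100_000), reverse=True)

print(f"sort took {timings['sort']:.4f} s")
```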

Simple Match

import pandas as pd
from string_grouper import match_strings

company_names = 'sec__edgar_company_info.csv'
companies = pd.read_csv(company_names)
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()

	left_index	left_Company Name	similarity	right_Company Name	right_index
15	14	0210, LLC	0.870291	90210 LLC	4211
167	165	1 800 MUTUALS ADVISOR SERIES	0.931615	1 800 MUTUALS ADVISORS SERIES	166
168	166	1 800 MUTUALS ADVISORS SERIES	0.931615	1 800 MUTUALS ADVISOR SERIES	165
172	168	1 800 RADIATOR FRANCHISE INC	1	1-800-RADIATOR FRANCHISE INC.	201
178	173	1 FINANCIAL MARKETPLACE SECURITIES LLC /BD	0.949364	1 FINANCIAL MARKETPLACE SECURITIES, LLC	174
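The sample output lists the pair 165/166 twice, once in each direction. The following sketch shows one way to keep each pair only once, written in plain Python on rows copied from the table above. The tuple layout and the name `unique_pairs` are illustrative assumptions, not part of the string_grouper API.

```python
# (left_index, similarity, right_index) rows copied from the sample output.
rows = [
    (14, 0.870291, 4211),
    (165, 0.931615, 166),
    (166, 0.931615, 165),
    (168, 1.0, 201),
    (173, 0.949364, 174),
]


def unique_pairs(match_rows):
    """Keep the first row seen for each unordered pair; drop self-matches."""
    seen = set()
    kept = []
    for left, similarity, right in match_rows:
        key = (min(left, right), max(left, right))
        if left == right or key in seen:
            continue
        seen.add(key)
        kept.append((left, similarity, right))
    return kept


for left, similarity, right in unique_pairs(rows):
    print(left, right, similarity)
```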

Group Similar Strings and Find Most Common

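from string_grouper import group_similar_strings
# Assign each company a group id and a representative (deduplicated) name:
companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)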

name_deduped	Line Number
ADVISORS DISCIPLINED TRUST	1747
NUVEEN TAX EXEMPT UNIT TRUST SERIES 1	916
GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200	652
U S TECHNOLOGIES INC	632
CAPITAL MANAGEMENT LLC	628
CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200	611
E ACQUISITION CORP	561
CAPITAL PARTNERS LP	561
FIRST TRUST COMBINED SERIES 1	560
PRINCIPAL LIFE INCOME FUNDINGS TRUST 20	544
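The counting step above amounts to tallying how many records share each deduplicated name. A minimal standard-library sketch of the same idea follows; the record pairs are hypothetical sample data, not taken from the dataset.

```python
from collections import Counter

# Hypothetical (original name, deduplicated name) pairs for illustration.
records = [
    ("ADVISORS DISCIPLINED TRUST 1", "ADVISORS DISCIPLINED TRUST"),
    ("ADVISORS DISCIPLINED TRUST 2", "ADVISORS DISCIPLINED TRUST"),
    ("CAPITAL PARTNERS L.P.", "CAPITAL PARTNERS LP"),
]

# Tally records per deduplicated name and list the largest groups first.
group_sizes = Counter(deduped for _, deduped in records)
for name, size in group_sizes.most_common(10):
    print(name, size)
```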

Documentation

The documentation can be found here

Project details

Release history

Version	Release date
0.7.2 (this version)	Jun 3, 2026
0.7.1	Feb 26, 2025
0.7.0.7	Jan 27, 2025
0.6.1	Nov 14, 2021
0.6.0	Oct 15, 2021
0.5.0	Jul 2, 2021
0.4.0	Apr 11, 2021
0.3.2	Feb 21, 2021
0.2.2	Feb 8, 2021
0.1.2	Oct 12, 2020
0.1.1	Jul 15, 2020
0.1.0	Jan 2, 2020

Download files

Source Distribution

string_grouper-0.7.2.tar.gz (2.4 MB)

Uploaded Jun 3, 2026 Source

Built Distribution

string_grouper-0.7.2-py3-none-any.whl (23.6 kB)

Uploaded Jun 3, 2026 Python 3

File details

Details for the file string_grouper-0.7.2.tar.gz.

File metadata

Download URL: string_grouper-0.7.2.tar.gz
Upload date: Jun 3, 2026
Size: 2.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for string_grouper-0.7.2.tar.gz
Algorithm	Hash digest
SHA256	`c97dcda79779e4fa473892d1b382de4646a751307986de20e17edc4b90834f6f`
MD5	`84f95fa9f2ce6874416ac58047151f23`
BLAKE2b-256	`70af9c045913febbf13425ee7a6cb679430ca7b6c60483dceb6a609c9bbe5eff`

See more details on using hashes here.
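A downloaded archive can be checked against a published digest with Python's `hashlib`. The sketch below is one possible approach. The helper names are hypothetical, and the self-check runs on a throwaway file containing `abc`, whose SHA256 digest is well known, rather than on the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with the published one, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Self-check on a throwaway file. For a real download, pass the path of the
# archive and the SHA256 value listed in the table above.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_ok = matches_published_hash(
    tmp.name,
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
)
os.remove(tmp.name)
print(demo_ok)
```

Reading in chunks keeps memory use flat regardless of the archive size.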

File details

Details for the file string_grouper-0.7.2-py3-none-any.whl.

File metadata

Download URL: string_grouper-0.7.2-py3-none-any.whl
Upload date: Jun 3, 2026
Size: 23.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for string_grouper-0.7.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`90e1e30dda020bba5a5988e02285a79aa824f98fa2395e147259fd9643c22a9c`
MD5	`cde4be090197d8126cd0ff7230852e09`
BLAKE2b-256	`4c8e37b8f9a9d01f2560246daec91717de2f846684afa1db0dc99bdae48e73f0`
