
A package that implements fuzzy matching between the items of two lists: an input list and a reference list.

Project description

Description : This package can be used to compute similarity scores between items in two different lists.

Example Use Case : Data load : Before loading a file into a database table, compare the file's column names with the table's column names to catch possible column name changes. Where the names differ, match each file column to its closest table column and then load the data.

Credits : To the authors of the fuzzywuzzy package, which has been used in the development of this package.

1. Installation

pip install two_lists_similarity  # Use underscores as the separators, not hyphens.

2. Usage


2.1: Import the Calculate_Similarity class from the above installed package.

from two_lists_similarity import Calculate_Similarity as cs

2.2: Create an object of this class with the arguments below.

  • inp_list : An input list of items.
  • ref_list : A reference list of items against which the input list items are compared.

Both arguments must already hold your input and reference lists when the object is created. Below is an example.

inp_list = ["Messi", "Superstar", "Soccer", "Ronaldo", "Mbappe"]

ref_list = ["Lionel Messi", "Cristiano Ronaldo", "Virgil Van Dikj", "are", "in", "the", "top", "three", "this","year" ,"OF", "BallonDor"]

# Create an instance (object) of the class.
csObj = cs(inp_list, ref_list)
# csObj is now an object of the Calculate_Similarity class.

2.3: Use the csObj object to call the fuzzy_match_output function of the Calculate_Similarity class, which calculates the similarity between the input list items and the reference list items.

csObj.fuzzy_match_output(output_csv_name = 'pkg_sim_test_vsc.csv', output_csv_path = r'C:\two-lists-similarity')

A brief overview of the function fuzzy_match_output can be found below.

Inputs :

  • output_csv_name : (Optional) Name of the output file that is to be generated.
  • output_csv_path : (Optional) Path where the output file is to be stored at.

If output_csv_name is given a filename, the file is written to your current working directory unless you specify a path explicitly with output_csv_path.
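
The path rule above can be sketched in a few lines. This is only an illustration and not the package's own code: resolve_output_file is a hypothetical helper, and returning None when no filename is given is an assumption.

```python
import os


def resolve_output_file(output_csv_name=None, output_csv_path=None):
    """Hypothetical helper: return the full path of the CSV, or None."""
    if output_csv_name is None:
        # Assumption: nothing is written when no filename is supplied.
        return None
    # Fall back to the current working directory when no path is given.
    directory = output_csv_path if output_csv_path is not None else os.getcwd()
    return os.path.join(directory, output_csv_name)


print(resolve_output_file("pkg_sim_test_vsc.csv"))
print(resolve_output_file("pkg_sim_test_vsc.csv", "two-lists-similarity"))
```

The first call resolves against the current working directory, the second against the explicit folder.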

Functionality :

  • Step 1: Compares every item in the input list against all the items in the reference list.
  • Step 2: Calculates a similarity score for each of these comparisons.
  • Step 3: Matches each input list item with the reference list item that has the highest similarity score.
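
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: it swaps in difflib.SequenceMatcher for fuzzywuzzy, so its scores will differ from the ones the package reports, and similarity and best_matches are hypothetical names.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0.0-1.0 similarity ratio, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(inp_list, ref_list):
    """Return (input item, best reference item, score) for each input item."""
    results = []
    for item in inp_list:
        # Steps 1 and 2: score the item against every reference item.
        scored = [(ref, similarity(item, ref)) for ref in ref_list]
        # Step 3: keep the reference item with the highest score.
        best_ref, best_score = max(scored, key=lambda pair: pair[1])
        results.append((item, best_ref, round(best_score, 2)))
    return results


matches = best_matches(
    ["Messi", "Ronaldo"],
    ["Lionel Messi", "Cristiano Ronaldo", "the"],
)
for row in matches:
    print(row)
```

With these short lists, "Messi" pairs with "Lionel Messi" and "Ronaldo" with "Cristiano Ronaldo", the same pairings the package reports in its illustration.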

An illustration of the above steps can be found below :

Initiating fuzzy matching.......
------------------------------------------------
Input column name : Messi
Similarity Ratios when compared with the similar reference list items are as below :  [('Lionel Messi', 90), ('in', 45), ('Cristiano Ronaldo', 36), ('are', 25), ('the', 25)]
Associated Reference list item with highest similarity : 
('Lionel Messi', 90)
------------------------------------------------
Input column name : Superstar
Similarity Ratios when compared with the similar reference list items are as below :  [('are', 60), ('year', 46), ('Cristiano Ronaldo', 40), ('three', 36), ('the', 30)]
Associated Reference list item with highest similarity : 
('are', 60)
------------------------------------------------
Input column name : Soccer
Similarity Ratios when compared with the similar reference list items are as below :  [('year', 45), ('OF', 45), ('Lionel Messi', 30), ('Cristiano Ronaldo', 30), ('are', 30)]
Associated Reference list item with highest similarity : 
('year', 45)
------------------------------------------------
Input column name : Ronaldo
Similarity Ratios when compared with the similar reference list items are as below :  [('Cristiano Ronaldo', 90), ('BallonDor', 50), ('in', 45), ('OF', 45), ('Lionel Messi', 39)]
Associated Reference list item with highest similarity : 
('Cristiano Ronaldo', 90)
------------------------------------------------
Input column name : Mbappe
Similarity Ratios when compared with the similar reference list items are as below :  [('are', 44), ('Lionel Messi', 30), ('the', 30), ('top', 30), ('BallonDor', 30)]
Associated Reference list item with highest similarity : 
('are', 44)
------------------------------------------------

Outputs :

  • Returns a dataframe in which each row contains the relation below.
    (Input List Item, Most similar Reference List Item, Similarity score)
  • Generates a CSV file from this dataframe at your desired path.

Below is the output of the sample input and reference lists used above.

Output Data Frame looks like : 
  input_list_item similar_ref_list_item  similarity_score
0           Messi          Lionel Messi              0.90
1       Superstar                   are              0.60
2          Soccer                  year              0.45
3         Ronaldo     Cristiano Ronaldo              0.90
4          Mbappe                   are              0.44

2.4: Use the csObj object to call the dissimilar_input_items function of the Calculate_Similarity class to find the input list items that differ too much from all the reference list items.

csObj.dissimilar_input_items(similarity_threshold = 0.65)

A brief overview of the function dissimilar_input_items can be found below.

Inputs :

  • similarity_threshold : A float between 0.00 and 1.00 that separates similar items from dissimilar ones. Recommended value : 0.65, which is also the default value for this variable.

Functionality :

  • Applies the threshold to the dataframe returned by the function fuzzy_match_output and selects the records that have similarity_score <= similarity_threshold.

Output :

  • List of items from the input list that have similarity scores <= threshold when compared against all the reference list items

Below is the output of the function dissimilar_input_items for the input and reference lists used above.

ALERT : Input list items that are way too different from the reference list items are :  ['Superstar', 'Soccer', 'Mbappe']
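
The same list can be reproduced by hand from the sample scores shown in section 2.3. The sketch below is not the package's code: dissimilar_items is a hypothetical stand-in that works on plain tuples instead of a dataframe.

```python
# Sample rows copied from the output data frame in section 2.3:
# (input_list_item, similar_ref_list_item, similarity_score)
rows = [
    ("Messi", "Lionel Messi", 0.90),
    ("Superstar", "are", 0.60),
    ("Soccer", "year", 0.45),
    ("Ronaldo", "Cristiano Ronaldo", 0.90),
    ("Mbappe", "are", 0.44),
]


def dissimilar_items(rows, similarity_threshold=0.65):
    """Return the input items whose best score is <= the threshold."""
    return [inp for inp, _ref, score in rows if score <= similarity_threshold]


result = dissimilar_items(rows)
print(result)
```

Because the comparison is <=, an item whose score equals the threshold is also reported as dissimilar.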

Thank You. Will try to add more functions to this package whenever possible.

