two-lists-similarity

A package for fuzzy matching between the items of two lists: an input list and a reference list.

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Description: This package computes similarity scores between the items in two different lists.

Example use case (data load): Before loading a file into a database table, compare the file's column names against the table's column names to catch possible column name changes. If any names have changed, match them to their counterparts and then load the data.

Credits: To the authors of the fuzzywuzzy package, which was used in developing this package.

1. Installation

pip install two_lists_similarity  # Use underscores as the separators, not hyphens.

2. Usage

2.1: Import the Calculate_Similarity class from the package installed above.

from two_lists_similarity import Calculate_Similarity as cs

2.2: Create an object of this class with the following arguments.

inp_list : An input list of items.
ref_list : A reference list of items against which the input list items are compared.

Both arguments are required: pass your input and reference lists when you create the object. Below is an example.

inp_list = ["Messi", "Superstar", "Soccer", "Ronaldo", "Mbappe"]

ref_list = ["Lionel Messi", "Cristiano Ronaldo", "Virgil Van Dikj", "are", "in", "the", "top", "three", "this","year" ,"OF", "BallonDor"]

# Create an instance (object) of the class.
csObj = cs(inp_list, ref_list)
# csObj is now an object of the Calculate_Similarity class.

2.3: Call the fuzzy_match_output method on csObj to calculate the similarity between the input list items and the reference list items.

csObj.fuzzy_match_output(output_csv_name = 'pkg_sim_test_vsc.csv', output_csv_path = r'C:\two-lists-similarity')

A brief overview of fuzzy_match_output follows.

Inputs :

output_csv_name : (Optional) Name of the output file to generate.
output_csv_path : (Optional) Directory in which to store the output file.

If output_csv_name is given without output_csv_path, the file is written to your current working directory.
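
The fallback to the current working directory can be pictured with a small sketch. The helper name resolve_output_path is hypothetical and not part of the package, and returning None when no file name is given is an assumption, since the description above does not say what happens in that case.

```python
import os


def resolve_output_path(output_csv_name=None, output_csv_path=None):
    """Hypothetical helper: build the CSV location from the two optional arguments."""
    if output_csv_name is None:
        # Assumption: without a file name there is nothing to write.
        return None
    # Fall back to the current working directory when no path is given.
    directory = output_csv_path if output_csv_path is not None else os.getcwd()
    return os.path.join(directory, output_csv_name)


print(resolve_output_path("pkg_sim_test_vsc.csv"))
print(resolve_output_path("pkg_sim_test_vsc.csv", "reports"))
```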

Functionality :

Step 1: Compares every item in the input list against all the items in the reference list.
Step 2: Calculates a similarity score for each of these comparisons.
Step 3: Matches each input list item with the reference list item that has the highest similarity score.
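
The three steps above can be sketched roughly as follows. This is not the package's implementation: the package builds on fuzzywuzzy, whereas this sketch swaps in difflib.SequenceMatcher from the standard library so that it runs without extra installs, which means the scores will differ from the ones the package reports. The function names similarity and best_matches are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (difflib stand-in, not fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_matches(inp_list, ref_list):
    """Pair each input item with its highest-scoring reference item."""
    results = []
    for item in inp_list:
        # Steps 1 and 2: compare against every reference item and score each pair.
        scored = [(ref, similarity(item, ref)) for ref in ref_list]
        # Step 3: keep the reference item with the highest score.
        best_ref, best_score = max(scored, key=lambda pair: pair[1])
        results.append((item, best_ref, best_score))
    return results


inp_list = ["Messi", "Ronaldo"]
ref_list = ["Lionel Messi", "Cristiano Ronaldo", "the", "top"]
matches = best_matches(inp_list, ref_list)
for row in matches:
    print(row)
```

Each input item is scored against the whole reference list, so the work grows with the product of the two list lengths.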

An illustration of the above steps can be found below :

Initiating fuzzy matching.......
------------------------------------------------
Input column name : Messi
Similarity Ratios when compared with the similar reference list items are as below :  [('Lionel Messi', 90), ('in', 45), ('Cristiano Ronaldo', 36), ('are', 25), ('the', 25)]
Associated Reference list item with highest similarity : 
('Lionel Messi', 90)
------------------------------------------------
Input column name : Superstar
Similarity Ratios when compared with the similar reference list items are as below :  [('are', 60), ('year', 46), ('Cristiano Ronaldo', 40), ('three', 36), ('the', 30)]
Associated Reference list item with highest similarity : 
('are', 60)
------------------------------------------------
Input column name : Soccer
Similarity Ratios when compared with the similar reference list items are as below :  [('year', 45), ('OF', 45), ('Lionel Messi', 30), ('Cristiano Ronaldo', 30), ('are', 30)]
Associated Reference list item with highest similarity : 
('year', 45)
------------------------------------------------
Input column name : Ronaldo
Similarity Ratios when compared with the similar reference list items are as below :  [('Cristiano Ronaldo', 90), ('BallonDor', 50), ('in', 45), ('OF', 45), ('Lionel Messi', 39)]
Associated Reference list item with highest similarity : 
('Cristiano Ronaldo', 90)
------------------------------------------------
Input column name : Mbappe
Similarity Ratios when compared with the similar reference list items are as below :  [('are', 44), ('Lionel Messi', 30), ('the', 30), ('top', 30), ('BallonDor', 30)]
Associated Reference list item with highest similarity : 
('are', 44)
------------------------------------------------

Outputs :

Returns a dataframe in which each row holds the following relation:
(Input List Item, Most similar Reference List item, Similarity score)
Also writes a CSV file built from this dataframe to your desired path.

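Below is the output for the sample input and reference lists used above. Note that the dataframe stores scores on a 0.00-1.00 scale, whereas the log above prints them on a 0-100 scale.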

Output Data Frame looks like : 
  input_list_item similar_ref_list_item  similarity_score
0           Messi          Lionel Messi              0.90
1       Superstar                   are              0.60
2          Soccer                  year              0.45
3         Ronaldo     Cristiano Ronaldo              0.90
4          Mbappe                   are              0.44
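
As a rough illustration of how the best matches in the log relate to the rows above, the sketch below divides each 0-100 score by 100 and writes the rows as CSV text. The column names are taken from the data frame shown above; the exact layout of the CSV file the package writes is an assumption, and the standard csv module stands in for the dataframe export.

```python
import csv
import io

# Best matches copied from the log above: input item -> (reference item, 0-100 score).
best = {
    "Messi": ("Lionel Messi", 90),
    "Superstar": ("are", 60),
    "Soccer": ("year", 45),
    "Ronaldo": ("Cristiano Ronaldo", 90),
    "Mbappe": ("are", 44),
}

fieldnames = ["input_list_item", "similar_ref_list_item", "similarity_score"]
rows = [
    {
        "input_list_item": item,
        "similar_ref_list_item": ref,
        "similarity_score": score / 100,  # rescale 0-100 to 0-1
    }
    for item, (ref, score) in best.items()
]

# Write to an in-memory buffer instead of a file to keep the sketch side-effect free.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```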

2.4: Call the dissimilar_input_items method on csObj to find the input list items that differ too much from every reference list item.

csObj.dissimilar_input_items(similarity_threshold = 0.65)

A brief overview of the function dissimilar_input_items can be found below.

Inputs :

similarity_threshold : A float between 0.00 and 1.00 that separates similar items from dissimilar ones. Recommended value: 0.65, which is also the default.

Functionality :

Applies the threshold to the dataframe returned by fuzzy_match_output and selects the records with similarity_score <= similarity_threshold.

Output :

List of input list items whose best similarity score against the reference list is <= the threshold.

Below is the output of the function dissimilar_input_items when applied on the input, reference list items used above.

ALERT : Input list items that are way too different from the reference list items are :  ['Superstar', 'Soccer', 'Mbappe']
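
The threshold filter can be sketched in plain Python using the rows from the data frame shown earlier. The function name dissimilar_items and the tuple layout are illustrative only; the package itself works on the dataframe returned by fuzzy_match_output.

```python
def dissimilar_items(rows, similarity_threshold=0.65):
    """Return the input items whose best score is at or below the threshold."""
    return [item for item, _ref, score in rows if score <= similarity_threshold]


# (input_list_item, similar_ref_list_item, similarity_score) from the sample output.
rows = [
    ("Messi", "Lionel Messi", 0.90),
    ("Superstar", "are", 0.60),
    ("Soccer", "year", 0.45),
    ("Ronaldo", "Cristiano Ronaldo", 0.90),
    ("Mbappe", "are", 0.44),
]

flagged = dissimilar_items(rows)
print(flagged)  # ['Superstar', 'Soccer', 'Mbappe']
```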

Thank you. I will try to add more functions to this package whenever possible.

Release history

- 0.0.5 (this version): Dec 17, 2019
- 0.0.4: Dec 16, 2019
- 0.0.3: Dec 16, 2019
- 0.0.2: Dec 16, 2019
- 0.0.1: Dec 16, 2019

Download files

Source Distribution

two_lists_similarity-0.0.5.tar.gz (5.0 kB)

Uploaded Dec 17, 2019 Source

Built Distribution

two_lists_similarity-0.0.5-py3-none-any.whl (6.3 kB)

Uploaded Dec 17, 2019 Python 3

File details

Details for the file two_lists_similarity-0.0.5.tar.gz.

File metadata

Download URL: two_lists_similarity-0.0.5.tar.gz
Upload date: Dec 17, 2019
Size: 5.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.7.3

File hashes

Hashes for two_lists_similarity-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`f2a72793848c5e55469382631797e6bf0a6a2fa40a9f693510366c6437a5e421`
MD5	`5c4bed1777d6f0692a741a1e430eb159`
BLAKE2b-256	`7f294902ba3d139deae6f0b59f36f3be9505c6a86f91371b981581ea4dcb0ef0`

File details

Details for the file two_lists_similarity-0.0.5-py3-none-any.whl.

File metadata

Download URL: two_lists_similarity-0.0.5-py3-none-any.whl
Upload date: Dec 17, 2019
Size: 6.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/42.0.2 requests-toolbelt/0.9.1 tqdm/4.40.2 CPython/3.7.3

File hashes

Hashes for two_lists_similarity-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`af47c8e7df894a0ed119bb5d03e14bfd033c35e99946d2bcd15134b94758e5df`
MD5	`96fb97e514e371bebf215d66df73211c`
BLAKE2b-256	`3be3d84af65c4081aa679285ce38b54525ab7a17bba1da2d6a9660553fc73e52`
