Library for Standardizing names from a Pandas dataframe

Similar Names

Description

Similar Names is a package for name manipulation. If you have a pandas dataframe in which the same name is written in different ways (e.g., John Doe, John E. Doe and John Edson Doe), the "close_matches" function looks for all similar names in that column and adds two columns: "close_matches" (a list of all close matches) and "standard_name" (the shortest name in that list).
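
To make the "shortest name" rule concrete, here is a minimal sketch in plain Python. It only illustrates the rule described above and is not the package's own code; the helper name `pick_standard_name` is hypothetical.

```python
def pick_standard_name(matches):
    """Return the shortest name in a list of close matches.

    Hypothetical helper: it mirrors the rule described in this README,
    not the actual implementation inside similarnames.
    """
    # min() keeps the first item when several names share the shortest length.
    return min(matches, key=len)


matches = ["John Doe", "John E. Doe", "John Edson Doe", "John Edson D."]
print(pick_standard_name(matches))  # John Doe
```

How ties between equally short names are resolved by the package itself is not documented here, so the "first one wins" behaviour in this sketch is an assumption.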

Installation

Similar Names can be installed directly through pip:

pip install similarnames

How to use?

If you have a pandas dataframe with names that you want to standardize, or search for close matches, follow the steps described next. The "close_matches" function takes six parameters:

close_matches(obj, names, sep = ',', connectors = ['and','e','y'], languages = ['english', 'portuguese', 'spanish'], custom_words = None)
  • obj (dataframe): The pandas dataframe
  • names (str): The name of the dataframe column with the names that you want to analyze
  • sep (str or None): The separator to be used to split multiple names
  • connectors (str, list or None): Words to also be used as separators (e.g.: "and")
  • languages (str, list or None): Languages for the default stopwords config (stopwords are never considered names)
  • custom_words (str, list or None): Additional words that should not be considered as names (e.g.: "jr")
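
The effect of "languages" and "custom_words" can be pictured as a filter over the words of each name. The sketch below is an assumption about the general idea, not the package's implementation: the helper `name_tokens` is hypothetical, and the ignored words are passed in by hand instead of being loaded from a per-language stopword list.

```python
import re


def name_tokens(name, ignored_words=None):
    """Split a name into lowercase words and drop the ignored ones.

    Hypothetical helper for illustration only. `ignored_words` stands in
    for the stopwords plus "custom_words" and accepts a str, a list or None,
    like the parameters described above.
    """
    if ignored_words is None:
        ignored_words = []
    elif isinstance(ignored_words, str):
        ignored_words = [ignored_words]
    ignored = {word.lower() for word in ignored_words}

    # Keep runs of letters only, so "E." becomes "e" and "Jr." becomes "jr".
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    return [token for token in tokens if token not in ignored]


print(name_tokens("John Doe Jr.", ignored_words="jr"))      # ['john', 'doe']
print(name_tokens("Maria da Silva", ignored_words=["da"]))  # ['maria', 'silva']
```

With a filter like this, "John Doe Jr." and "John Doe" reduce to the same words, which is why adding "jr" to "custom_words" can change which names are reported as close matches.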

1st Scenario: 1 name per row

If your dataframe already has one name per row, run the following command with the "sep" parameter set to None. If the results are not what you expect, use the "languages" and "custom_words" parameters to control which words count as names (by default, stopwords in English, Portuguese and Spanish are not considered names).

'''
Input (df): df and the name of the column with the names to check

| Names          | ... |
|----------------|-----|
| John Doe       |     |
| John Edson Doe |     |
| John E. Doe    |     |
| John Edson D.  |     |
'''
from similarnames import close_matches

# Default config: sep = ',', connectors = ['and','e','y'], languages = ['english', 'portuguese', 'spanish'], custom_words = None
df_standard = close_matches(df, 'Names', sep = None)

'''
Output (df_standard)

| Names          | ... | close_matches                                                   | standard_name |
|----------------|-----|----------------------------------------------------------------|--------------|
| John Doe       |     | ['John Doe', 'John E. Doe', 'John Edson Doe', 'John Edson D.'] | John Doe     |
| John Edson Doe |     | ['John Doe', 'John E. Doe', 'John Edson Doe', 'John Edson D.'] | John Doe     |
| John E. Doe    |     | ['John Doe', 'John E. Doe', 'John Edson Doe', 'John Edson D.'] | John Doe     |
| John Edson D.  |     | ['John Doe', 'John E. Doe', 'John Edson Doe', 'John Edson D.'] | John Doe     |

'''

2nd Scenario: Multiple names per row

If a row holds multiple names, the "explode" is done for you automatically: just provide the "sep" parameter to indicate where the names should be split. By default, the connectors "and", "e" and "y" are also treated as separators. If you are working in a different language, set the "connectors" and "languages" parameters accordingly.

'''
Input (df): df and the name of the column with the names to check

| Names                                        | Other columns           | ... |
|----------------------------------------------|-------------------------|-----|
| John Doe, Jane Doe                           | Two names (sep = ',')   |     |
| John E. Doe and Michael Johnson              | Two names (without sep) |     |
| Jane A. Doe, Michael M. Johnson and John Doe | Three names (sep = ',') |     |
'''
from similarnames import close_matches

# Default config: sep = ',', connectors = ['and','e','y'], languages = ['english', 'portuguese', 'spanish'], custom_words = None
df_standard = close_matches(df, 'Names', sep = ',')

'''
Output (df_standard)

| Names              | Other columns           | ... | close_matches                              | standard_name    |
|--------------------|-------------------------|-----|-------------------------------------------|-----------------|
| John Doe           | Two names (sep = ',')   |     | ['John Doe', 'John E. Doe']               | John Doe        |
| Jane Doe           | Two names (sep = ',')   |     | ['Jane Doe', 'Jane A. Doe']               | Jane Doe        |
| John E. Doe        | Two names (without sep) |     | ['John Doe', 'John E. Doe']               | John Doe        |
| Michael Johnson    | Two names (without sep) |     | ['Michael Johnson', 'Michael M. Johnson'] | Michael Johnson |
| Jane A. Doe        | Three names (sep = ',') |     | ['Jane Doe', 'Jane A. Doe']               | Jane Doe        |
| Michael M. Johnson | Three names (sep = ',') |     | ['Michael Johnson', 'Michael M. Johnson'] | Michael Johnson |
| John Doe           | Three names (sep = ',') |     | ['John Doe', 'John E. Doe']               | John Doe        |

'''
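
The splitting step in this scenario can be sketched with the standard library alone. This is a rough illustration under stated assumptions, not the package's code: the helper `split_names` is hypothetical, connectors are matched only as whole lowercase words surrounded by whitespace, and the real function may treat case and punctuation differently.

```python
import re


def split_names(cell, sep=",", connectors=("and", "e", "y")):
    """Split one cell into individual names.

    Hypothetical helper: splits on `sep` and on each connector word when it
    appears between whitespace, then strips the pieces.
    """
    patterns = [re.escape(sep)] if sep else []
    # Require whitespace on both sides so the initial "E." is not a connector.
    patterns += [r"\s+" + re.escape(word) + r"\s+" for word in (connectors or ())]
    if not patterns:
        return [cell.strip()]
    pieces = re.split("|".join(patterns), cell)
    return [piece.strip() for piece in pieces if piece.strip()]


rows = [
    "John Doe, Jane Doe",
    "John E. Doe and Michael Johnson",
    "Jane A. Doe, Michael M. Johnson and John Doe",
]
for row in rows:
    print(split_names(row))
# ['John Doe', 'Jane Doe']
# ['John E. Doe', 'Michael Johnson']
# ['Jane A. Doe', 'Michael M. Johnson', 'John Doe']
```

Each resulting name would then become its own row, with the other columns repeated, as in the output table above.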
