
Entity Matching Model package


Entity Matching model


Entity Matching Model (EMM) solves the problem of matching company names between two possibly very large datasets. EMM can match millions against millions of names with a distributed approach. It uses well-established candidate selection techniques from string matching, namely tf-idf vectorization combined with cosine similarity (with significant optimization), both word-based and character-based, and sorted neighbourhood indexing. These so-called indexers complement each other in selecting realistic name-pair candidates. On top of the indexers, EMM has a classifier with optimized string-based, rank-based, and legal-entity-based features to estimate how confident a company name match is.
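To make the character-based cosine similarity idea concrete, here is a minimal pure-Python sketch. It is not EMM's implementation (which is optimized and can run distributed); the helper names `char_ngrams`, `fit_idf`, `tfidf_vector`, `cosine` and `top_candidates` are hypothetical and exist only for this illustration, as is the choice of a smoothed idf.

```python
import math
from collections import Counter


def char_ngrams(name, n=2):
    """Split a lower-cased name into overlapping character n-grams."""
    text = name.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(names, n=2):
    """Smoothed inverse document frequency per n-gram, learned from the ground truth."""
    doc_freq = Counter()
    for name in names:
        doc_freq.update(set(char_ngrams(name, n)))
    total = len(names)
    return {gram: math.log((1 + total) / (1 + df)) + 1.0 for gram, df in doc_freq.items()}


def tfidf_vector(name, idf, n=2):
    """L2-normalised sparse tf-idf vector; n-grams unseen in the ground truth are dropped."""
    counts = Counter(g for g in char_ngrams(name, n) if g in idf)
    vec = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a plain dot product)."""
    return sum(w * v[g] for g, w in u.items() if g in v)


def top_candidates(name, ground_truth, idf, num_candidates=5, lower_bound=0.2):
    """Keep at most num_candidates ground-truth names scoring at least lower_bound."""
    query = tfidf_vector(name, idf)
    scored = [(cosine(query, tfidf_vector(gt, idf)), gt) for gt in ground_truth]
    scored = [pair for pair in scored if pair[0] >= lower_bound]
    return sorted(scored, reverse=True)[:num_candidates]


ground_truth = ["Apple Inc", "Alphabet Inc", "Tesla Motors Ltd"]
idf = fit_idf(ground_truth)
candidates = top_candidates("Tesla Motor Limited", ground_truth, idf)
```

In this toy example only "Tesla Motors Ltd" shares bigrams with the query, so it is the single candidate that survives the lower bound. A brute-force loop like this scales quadratically; the point of an indexer is to get the same kind of shortlist efficiently.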

The classifier can be trained to give a string similarity score or a probability of match. Both types of score are useful, in particular when there are many good-looking matches to choose between. Optionally, the EMM package can also be used to match a group of company names that belong together to a common company name in the ground truth, for example all the different names used to address an external bank account. This step aggregates the name-matching scores from the supervised layer into a single match.
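As a rough illustration of the aggregation idea, the sketch below averages name-pair scores per group and per ground-truth id and keeps the best id for each group. This is an assumption about how such a step could look, not EMM's actual aggregation layer; the function `aggregate_group_scores` and the tuple layout `(group_id, name, gt_id, score)` are hypothetical.

```python
from collections import defaultdict


def aggregate_group_scores(scored_pairs):
    """Return {group_id: (best_gt_id, mean_score)}.

    The mean is taken over all distinct names in the group, so a ground-truth id
    that is a candidate for only some of the names is penalised.
    """
    totals = defaultdict(float)  # (group_id, gt_id) -> summed score
    names = defaultdict(set)     # group_id -> distinct names seen in the group
    for group_id, name, gt_id, score in scored_pairs:
        totals[(group_id, gt_id)] += score
        names[group_id].add(name)

    best = {}
    for (group_id, gt_id), total in totals.items():
        mean_score = total / len(names[group_id])
        if group_id not in best or mean_score > best[group_id][1]:
            best[group_id] = (gt_id, mean_score)
    return best


# Three different names used for one account, plus a second single-name account.
pairs = [
    ("account_1", "ACME Corp", "gt_7", 0.90),
    ("account_1", "ACME Corp", "gt_9", 0.40),
    ("account_1", "Acme Corporation BV", "gt_7", 0.80),
    ("account_1", "A.C.M.E.", "gt_9", 0.70),
    ("account_2", "Globex", "gt_3", 0.60),
]
result = aggregate_group_scores(pairs)
```

Here `account_1` resolves to `gt_7`, because two of its three names score highly for that ground-truth entry.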

The package is modular in design and works with both Pandas and Spark. A classifier trained with the former can be used with the latter and vice versa.

For release history see GitHub Releases.

Notebooks

For detailed examples of the code please see the notebooks under notebooks/.

  • 01-entity-matching-pandas-version.ipynb: Using the Pandas version of EMM for name-matching.
  • 02-entity-matching-spark-version.ipynb: Using the Spark version of EMM for name-matching.
  • 03-entity-matching-training-pandas-version.ipynb: Fitting the supervised model and setting a discrimination threshold (Pandas).
  • 04-entity-matching-aggregation-pandas-version.ipynb: Using the aggregation layer and setting a discrimination threshold (Pandas).
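For readers unfamiliar with the term, a discrimination threshold is a cut on the match score below which candidates are rejected. The sketch below is a generic, hypothetical helper (`threshold_for_precision` is not part of EMM; the notebooks show the package's own approach) that picks the lowest threshold that still reaches a target precision on labelled name-pairs.

```python
def threshold_for_precision(scores, labels, min_precision=0.95):
    """Lowest score threshold whose accepted pairs (score >= threshold) reach min_precision.

    Returns None when no threshold reaches the requested precision.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None
    true_pos = 0
    for i, (score, is_match) in enumerate(ranked, start=1):
        true_pos += int(is_match)
        # Only evaluate after the last pair of a run of tied scores,
        # since a threshold accepts all pairs with the same score together.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_pos / i >= min_precision:
            best = score
    return best


scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
labels = [True, True, True, False, True, False]
threshold = threshold_for_precision(scores, labels, min_precision=0.9)
```

With these toy labels, accepting the top three pairs is the largest selection that keeps precision at or above 0.9, so the threshold lands on the third score.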

Documentation

For documentation, design, and API see the documentation. Or read our Medium blog Entity Matching at Scale!

Check it out

The Entity matching model library requires Python >= 3.7 and is pip friendly. To get started, simply do:

pip install emm

or check out the code from our repository:

git clone https://github.com/ing-bank/EntityMatchingModel.git
pip install -e EntityMatchingModel/

where in this example the code is installed in editable mode (option -e).

Additional dependencies can be installed with, e.g.:

pip install "emm[spark,dev,test]"

You can now use the package in Python with:

import emm

Congratulations, you are now ready to use the Entity Matching model!

Quick run

As a quick example, you can do:

from emm import PandasEntityMatching
from emm.data.create_data import create_example_noised_names

# generate example ground-truth names and matching noised names, with typos and missing words.
ground_truth, noised_names = create_example_noised_names(random_seed=42)
train_names, test_names = noised_names[:5000], noised_names[5000:]

# two example name-pair candidate generators: character-based cosine similarity and sorted neighbourhood indexing
indexers = [
  {
      'type': 'cosine_similarity',
      'tokenizer': 'characters',   # character-based cosine similarity. alternative: 'words'
      'ngram': 2,                  # 2-character tokens only
      'num_candidates': 5,         # max 5 candidates per name-to-match
      'cos_sim_lower_bound': 0.2,  # lower bound on cosine similarity
  },
  {'type': 'sni', 'window_length': 3}  # sorted neighbourhood indexing with a window of size 3.
]
em_params = {
  'name_only': True,         # only consider name information for matching
  'entity_id_col': 'Index',  # important to set both index and name columns to pick up
  'name_col': 'Name',
  'indexers': indexers,
  'supervised_on': False,    # no supervised model (yet) to select best candidates
  'with_legal_entity_forms_match': True,   # add feature that indicates match of legal entity forms (e.g. ltd != co)
}
# 1. initialize the entity matcher
p = PandasEntityMatching(em_params)

# 2. fitting: prepare the indexers based on the ground truth names, e.g. fit the tfidf matrix of the first indexer.
p.fit(ground_truth)

# 3. create and fit a supervised model for the PandasEntityMatching object, to pick the best match (this takes a while)
#    input is "positive" names column 'Name' that are all supposed to match to the ground truth,
#    and an id column 'Index' to check which candidate name-pairs match and which do not.
#    A fraction of these names may be turned into negative names (= no match to the ground truth).
#    (internally, candidate name-pairs are automatically generated, these are the input to the classification)
p.fit_classifier(train_names, create_negative_sample_fraction=0.5)

# 4. scoring: generate pandas dataframe of all name-pair candidates.
#    The classifier-based probability of match is provided in the column 'nm_score'.
#    Note: can also call p.transform() without training the classifier first.
candidates_scored_pd = p.transform(test_names)

# 5. scoring: for each name-to-match, select the best ground-truth candidate.
best_candidates = candidates_scored_pd[candidates_scored_pd.best_match]
best_candidates.head()

For Spark, you can use the class SparkEntityMatching instead, with the same API as the Pandas version. For all available examples, please see the tutorial notebooks under notebooks/.
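The quick run above also configures an 'sni' (sorted neighbourhood indexing) indexer. The pure-Python sketch below illustrates the general technique only: ground-truth names and names-to-match are sorted into one list, and each name-to-match is paired with the ground-truth names that fall inside a small window around it. The function `sorted_neighbourhood_candidates` is hypothetical, and centring the window on the name is an assumption that may differ from how EMM defines window_length.

```python
def sorted_neighbourhood_candidates(ground_truth, names_to_match, window_length=3):
    """Pair each name-to-match with ground-truth names near it in one shared sorted list."""
    half = window_length // 2  # neighbours considered on each side (assumed centred window)
    tagged = sorted(
        [(name.lower(), "gt", name) for name in ground_truth]
        + [(name.lower(), "query", name) for name in names_to_match]
    )
    candidates = {name: [] for name in names_to_match}
    for pos, (_, kind, name) in enumerate(tagged):
        if kind != "query":
            continue
        lo, hi = max(0, pos - half), min(len(tagged), pos + half + 1)
        for _, other_kind, other in tagged[lo:hi]:
            if other_kind == "gt":
                candidates[name].append(other)
    return candidates


ground_truth = ["Apple Inc", "Alphabet Inc", "Tesla Motors Ltd"]
names_to_match = ["Tesla Motor Limited", "Aple Inc"]
sni_candidates = sorted_neighbourhood_candidates(ground_truth, names_to_match)
```

Sorting catches names that agree at the start but differ later on, which is why it complements the tf-idf based indexers: the latter are insensitive to where in the name the overlap occurs.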

Project contributors

This package was authored by ING Analytics Wholesale Banking.

Contact and support

Contact the WBAA team via GitHub issues. Please note that INGA-WB provides support only on a best-effort basis.

License

Copyright ING WBAA 2023. Entity Matching Model is completely free, open-source and licensed under the MIT license.

