Record matching and entity resolution at scale in Spark

Spark-Matcher

Spark-Matcher is a scalable entity-matching algorithm implemented in PySpark. With Spark-Matcher the user can easily train an algorithm to solve a custom matching problem. Spark-Matcher uses active learning (via modAL) to train a classifier (Scikit-learn) to match entities. To deal with the quadratic (N^2) complexity of matching large tables, blocking is applied to reduce the number of candidate pairs. Since the implementation is done in PySpark, Spark-Matcher can handle extremely large tables.
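
To illustrate the idea behind blocking (a schematic sketch of the general technique, not Spark-Matcher's internal implementation; it assumes two DataFrames a and b that both contain a name column):

from pyspark.sql import functions as F

# Instead of comparing all N x M pairs, only records that share a
# blocking key (here: the first three characters of 'name') are paired
a_blocked = a.withColumn('block_key', F.substring('name', 1, 3))
b_blocked = b.withColumn('block_key', F.substring('name', 1, 3))
candidate_pairs = a_blocked.join(b_blocked, on='block_key')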

Documentation with examples can be found here.

Developed by data scientists at ING Analytics, www.ing.com.

Installation

Normal installation

As Spark-Matcher is intended to be used with large datasets on a Spark cluster, it is assumed that Spark is already installed. If that is not the case, first install PySpark and PyArrow:

pip install pyspark pyarrow

Install Spark-Matcher from PyPI:

pip install spark-matcher

Install with possibility to create documentation

Pandoc, the general markup converter, needs to be available. You may follow the official Pandoc installation instructions or use conda:

conda install -c conda-forge pandoc

Then clone the Spark-Matcher repository and install it with the [doc] extra:

pip install ".[doc]"

Install to contribute

Clone this repo and install it in editable mode with the [dev] extra. This also installs PySpark and JupyterLab:

python -m pip install -e ".[dev]"

Documentation

Documentation can be created using the following command:

make create_documentation

Dependencies

The usage examples in the examples directory contain notebooks that run in local mode. Using SparkMatcher in cluster mode requires shipping the SparkMatcher package and several other Python packages (see spark_requirements.txt) to the executors. How to ship these dependencies depends on the cluster; please read the Apache Spark instructions and examples on how to do this: https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html.
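
One common approach is to attach a zipped archive of the dependencies to the Spark session. A minimal sketch (the archive name spark_matcher_deps.zip is illustrative; build it from spark_requirements.txt for your own cluster):

from pyspark.sql import SparkSession

# Ship a zipped archive of Python dependencies to the executors;
# 'spark_matcher_deps.zip' is an illustrative name
spark = (SparkSession.builder
         .config("spark.submit.pyFiles", "spark_matcher_deps.zip")
         .getOrCreate())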

SparkMatcher uses graphframes under the hood. Therefore, depending on the Spark version, the correct version of graphframes needs to be added to the external_dependencies directory and to the configuration of the Spark session.
By default, graphframes for Spark 3.0 is used in the Spark sessions in the notebooks in the examples directory. For a different version, see: https://spark-packages.org/package/graphframes/graphframes.
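
For example, the package can be added when the Spark session is built. A sketch (the Maven coordinates below are for graphframes 0.8.1 on Spark 3.0 with Scala 2.12 and are illustrative; pick the build that matches your cluster from the link above):

from pyspark.sql import SparkSession

# Illustrative coordinates: graphframes 0.8.1 for Spark 3.0 / Scala 2.12
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.8.1-spark3.0-s_2.12")
         .getOrCreate())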

Usage

Example notebooks are provided in the examples directory. Using SparkMatcher to find matches between Spark DataFrames a and b goes as follows:
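
First, a Spark session and the two DataFrames a and b are needed. A minimal setup sketch (the CSV file names are illustrative):

from pyspark.sql import SparkSession

# Minimal setup sketch; the CSV file names are illustrative
spark_session = SparkSession.builder.getOrCreate()
a = spark_session.read.csv('table_a.csv', header=True)
b = spark_session.read.csv('table_b.csv', header=True)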

from spark_matcher.matcher import Matcher

myMatcher = Matcher(spark_session, col_names=['name', 'suburb', 'postcode'])

Now we are ready to fit the Matcher object using 'active learning'; this means that the user has to indicate whether a pair is a match or not. You enter 'y' if a pair is a match or 'n' if it is not. You will be notified when the model has converged, and you can stop training by pressing 'f'.

myMatcher.fit(a, b)

The Matcher is now trained and can be used to predict on all data. This can be the data used for training or new data that was not seen by the model yet.

result = myMatcher.predict(a, b)
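
The result is a Spark DataFrame containing the candidate pairs and their predicted scores, so the most likely matches can be inspected directly. A sketch (the 'score' column name is an assumption here; check the documentation for the exact output schema):

# Show the ten highest-scoring pairs; the 'score' column name is assumed
result.orderBy('score', ascending=False).show(10)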
