
The package of DADER (Domain Adaptation for Deep Entity Resolution).


DADER: Domain Adaptation for Deep Entity Resolution


Entity resolution (ER) is a core problem in data integration. State-of-the-art (SOTA) results on ER are achieved by deep learning (DL) based methods trained on large numbers of labeled matching/non-matching entity pairs. This may not be a problem when using well-prepared benchmark datasets. For many real-world ER applications, however, collecting large-scale labeled datasets is painful. In this paper, we seek to answer: given a well-labeled source ER dataset, can we train a DL-based ER model for a target dataset with no labels or only a few labels? This problem is known as domain adaptation (DA). DA has achieved great success in computer vision and natural language processing, but has not been systematically studied for ER. Our goal is to systematically explore the benefits and limitations of a wide range of DA methods for ER. To this end, we develop DADER (Domain Adaptation for Deep Entity Resolution), a framework that significantly advances ER in applying DA. We define a space of design solutions for the three modules of DADER, namely Feature Extractor, Matcher, and Feature Aligner. We conduct the most comprehensive experimental study so far to explore this design space and compare different choices of DA for ER, and we provide guidance for selecting appropriate design solutions based on extensive experiments.

This repository contains implementations of six representative DADER methods: MMD, K-order, GRL, InvGAN, InvGAN+KD, and ED.
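To give a feel for what a Feature Aligner measures, here is a minimal sketch of maximum mean discrepancy (MMD), the first method listed above. It is written in plain Python and is not taken from the package. The Gaussian kernel, the fixed bandwidth, and the synthetic feature vectors are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    return (mean_kernel(source, source, bandwidth)
            + mean_kernel(target, target, bandwidth)
            - 2.0 * mean_kernel(source, target, bandwidth))


# Synthetic stand-ins for extracted features (assumption: 4-dimensional vectors).
rng = random.Random(0)
src = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
near = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]
far = [[rng.gauss(3.0, 1.0) for _ in range(4)] for _ in range(50)]

mmd_near = mmd_squared(src, near)  # target drawn from the same distribution
mmd_far = mmd_squared(src, far)    # target drawn from a shifted distribution
print(f"same distribution:    {mmd_near:.4f}")
print(f"shifted distribution: {mmd_far:.4f}")
```

The shifted target gives a larger value than the matching one. An MMD-based aligner would add a term like this to the training loss so that source and target features are pulled toward the same distribution.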

Datasets

The dataset format is <entity1,entity2,label>. See Hugging Face for details.
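As a rough illustration of the `<entity1,entity2,label>` layout, the sketch below writes and re-reads a two-row CSV using only the standard library. The example records, the way attributes are serialized into a single string, the absence of a header row, and the use of 1/0 for match/non-match are all assumptions. Check the published datasets for the exact conventions.

```python
import csv
import io

# Hypothetical records: each entity is serialized into one string,
# and the label marks a match (1) or a non-match (0).
pairs = [
    ("title: iphone 13 128gb, brand: apple", "apple iphone 13 (128 gb) blue", 1),
    ("title: galaxy s21, brand: samsung", "title: pixel 6, brand: google", 0),
]

# Write the triples as CSV; fields containing commas are quoted automatically.
buffer = io.StringIO()
csv.writer(buffer).writerows(pairs)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back into (entity1, entity2, label) triples.
rows = [(e1, e2, int(label)) for e1, e2, label in csv.reader(io.StringIO(csv_text))]
```

Quoting matters here because serialized entities often contain commas themselves. Splitting lines on `,` by hand would break such rows.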

Quick Start

Step 1: Requirements

  • Before running the code, make sure your Python version is 3.6.5 and your CUDA version is 11.1. Then install the package:

  • pip install dader

  • If PyTorch is not installed automatically, install it with the following command:

  • pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
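To confirm what is actually installed before running the example, a small check like the following may help. It is a convenience sketch rather than part of the package. The helper name `describe_environment` is hypothetical, and the script only reports versions without enforcing the ones listed above.

```python
import importlib.util
import sys


def describe_environment():
    """Collect the interpreter version and, if installed, PyTorch/CUDA details."""
    info = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    if importlib.util.find_spec("torch") is None:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info
    import torch  # imported lazily so the check also runs without PyTorch
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version of the build, or None
    info["cuda_available"] = torch.cuda.is_available()
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```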

Step 2: Run Example

```python
#!/usr/bin/env python3
from dader import data, model

# load datasets: labeled source data, plus target data with a 10% validation split
X_src, y_src = data.load_data(path='source.csv')
X_tgt, X_tgt_val, y_tgt, y_tgt_val = data.load_data(path='target.csv', valid_rate=0.1)

# load model
aligner = model.Model(method='invgankd', architecture='Bert')
# train & adapt
aligner.fit(X_src, y_src, X_tgt, X_tgt_val, y_tgt_val, batch_size=16, ada_max_epoch=20)
# predict
y_prd = aligner.predict(X_tgt)
# evaluate
eval_result = aligner.eval(X_tgt, y_prd, y_tgt)
```
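The return value of `aligner.eval` is not described here. To cross-check predictions independently, precision, recall, and F1 on the matching class can be computed by hand, as in the sketch below. It assumes that labels and predictions are plain sequences of 0/1 values, which may not match the types the package returns. The helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (matching) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With real data, `y_true` and `y_pred` would be `y_tgt` and `y_prd` from the example above, converted to lists if necessary.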

