Skip to main content
Join the official 2019 Python Developers SurveyStart the survey!

Facilitating the design, comparison and sharing of deep text matching models. Based on MatchZoo

Project description

MatchZoo-Lite

基于 MatchZoo 2.0.0 开发,并做了简化

主要修改

增加数据加载器 dataloader 可方便进行数据的加载,训练数据和测试数据的文件格式统一为 json 文件,格式为:

{"text_left": "xxx xxx xx", "text_right": "xxx xxx xxx", "label": 1}

其中 text_lefttext_right 为空格分割的分词文本

删改

  • 去除 nltk 相关语料库的调用(如停用词)
  • 去除预提供的 datasets
  • 更换了测试 tests
  • 去除部分模型,只保留以下模型:
    • arci
    • arcii
    • dssm
    • cdssm
    • conv_highway
    • duet
    • match_pyramid
    • mvlstm

Install

MatchZoo is dependent on Keras, please install one of its backend engines: TensorFlow, Theano, or CNTK. We recommend the TensorFlow backend. Two ways to install MatchZoo:

Install matchzoo-lite from the Github source

git clone http://gitlab.alipay-inc.com/niming.lxm/matchzoo-lite.git
cd matchzoo-lite
python setup.py install

Docker

docker pull seanlee97/matchzoo-lite:latest

Train your model

Get Started in 60 Seconds

To train a Deep Semantic Structured Model, import matchzoo and prepare input data.

import matchzoo as mz

train_pack = mz.datasets.wiki_qa.load_data('train', task='ranking')
valid_pack = mz.datasets.wiki_qa.load_data('dev', task='ranking')
predict_pack = mz.datasets.wiki_qa.load_data('test', task='ranking')

Preprocess your input data in three lines of code, keep track parameters to be passed into the model.

preprocessor = mz.preprocessors.DSSMPreprocessor()
train_pack_processed = preprocessor.fit_transform(train_pack)
valid_pack_processed = preprocessor.transform(valid_pack)
predict_pack_processed = preprocessor.transform(predict_pack)

Make use of MatchZoo customized loss functions and evaluation metrics:

ranking_task = mz.tasks.Ranking(loss=mz.losses.RankCrossEntropyLoss(num_neg=4))
ranking_task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=3),
    mz.metrics.NormalizedDiscountedCumulativeGain(k=5),
    mz.metrics.MeanAveragePrecision()
]

Initialize the model, fine-tune the hyper-parameters.

model = mz.models.DSSM()
model.params['input_shapes'] = preprocessor.context['input_shapes']
model.params['task'] = ranking_task
model.params['mlp_num_layers'] = 3
model.params['mlp_num_units'] = 300
model.params['mlp_num_fan_out'] = 128
model.params['mlp_activation_func'] = 'relu'
model.guess_and_fill_missing_params()
model.build()
model.compile()

Generate pair-wise training data on-the-fly, evaluate model performance using customized callbacks on prediction data.

train_generator = mz.PairDataGenerator(train_pack_processed, num_dup=1, num_neg=4, batch_size=64, shuffle=True)

pred_x, pred_y = predict_pack_processed.unpack()
evaluate = mz.callbacks.EvaluateAllMetrics(model, x=pred_x, y=pred_y, batch_size=len(pred_x))

history = model.fit_generator(train_generator, epochs=20, callbacks=[evaluate], workers=5, use_multiprocessing=False)

References

MatchZoo

Tutorials

English Documentation

中文文档

If you're interested in the cutting-edge research progress, please take a look at awaresome neural models for semantic match.

License

Apache-2.0

MatchZoo License

Apache-2.0 Copyright (c) 2015-present, Yixing Fan (faneshion)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for matchzoo-lite, version 0.1.4
Filename, size File type Python version Upload date Hashes
Filename, size matchzoo-lite-0.1.4.tar.gz (52.2 kB) File type Source Python version None Upload date Hashes View hashes

Supported by

Elastic Elastic Search Pingdom Pingdom Monitoring Google Google BigQuery Sentry Sentry Error logging AWS AWS Cloud computing DataDog DataDog Monitoring Fastly Fastly CDN SignalFx SignalFx Supporter DigiCert DigiCert EV certificate StatusPage StatusPage Status page