Facilitating the design, comparison and sharing of deep text matching models. Based on MatchZoo
MatchZoo-Lite
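Developed on top of MatchZoo 2.0.0, with simplifications.
Main changes
Additions
Added a data loader (dataloader) that makes loading data convenient. Training and test data files share a single JSON format: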
{"text_left": "xxx xxx xx", "text_right": "xxx xxx xxx", "label": 1}
where text_left and text_right are tokenized text with tokens separated by spaces.
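The record format above can be produced and checked with the Python standard library alone. The following is a minimal sketch, not the matchzoo-lite dataloader itself: it assumes the file holds one JSON object per line (the source shows only a single record, so this layout is an assumption), and the helper names `write_records` and `read_records` are hypothetical.

```python
import json
import os
import tempfile

# The three keys shown in the record format above.
REQUIRED_KEYS = ("text_left", "text_right", "label")


def write_records(path, records):
    """Write one JSON object per line (assumed layout, see lead-in)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Read the records back and check that every documented key is present."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_KEYS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            records.append(record)
    return records


# Illustrative data only; both texts are already tokenized and space-separated.
sample = [
    {"text_left": "how old is the earth",
     "text_right": "the earth is about 4.5 billion years old", "label": 1},
    {"text_left": "how old is the earth",
     "text_right": "the moon orbits the earth", "label": 0},
]

path = os.path.join(tempfile.mkdtemp(), "train.json")
write_records(path, sample)
loaded = read_records(path)
```

Because the texts are pre-tokenized, `record["text_left"].split()` recovers the token list directly, with no further tokenizer needed.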
Removals and changes
- Removed calls to nltk corpora (such as stop words)
- Removed the bundled datasets
- Replaced the tests
- Removed some models, keeping only the following:
  - arci
  - arcii
  - dssm
  - cdssm
  - conv_highway
  - duet
  - match_pyramid
  - mvlstm
Install
MatchZoo depends on Keras, so install one of its backend engines first: TensorFlow, Theano, or CNTK. We recommend the TensorFlow backend. There are two ways to install MatchZoo-Lite:
Install matchzoo-lite from source
git clone http://gitlab.alipay-inc.com/niming.lxm/matchzoo-lite.git
cd matchzoo-lite
python setup.py install
Docker
docker pull seanlee97/matchzoo-lite:latest
Train your model
Get Started in 60 Seconds
To train a Deep Structured Semantic Model (DSSM), import matchzoo and prepare the input data.
import matchzoo as mz
train_pack = mz.datasets.wiki_qa.load_data('train', task='ranking')
valid_pack = mz.datasets.wiki_qa.load_data('dev', task='ranking')
predict_pack = mz.datasets.wiki_qa.load_data('test', task='ranking')
Preprocess your input data in three lines of code, and keep track of the parameters to be passed to the model.
preprocessor = mz.preprocessors.DSSMPreprocessor()
train_pack_processed = preprocessor.fit_transform(train_pack)
valid_pack_processed = preprocessor.transform(valid_pack)
predict_pack_processed = preprocessor.transform(predict_pack)
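For intuition about what a DSSM-style preprocessor produces, the sketch below shows the letter-trigram "word hashing" described in the original DSSM paper. It assumes that `DSSMPreprocessor` follows that scheme; it is not the library's code, and the function names `letter_trigrams` and `hash_text` are hypothetical.

```python
from collections import Counter


def letter_trigrams(word):
    """Split one word into letter trigrams, using '#' as a boundary marker."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def hash_text(text):
    """Count the letter trigrams of a space-separated, tokenized text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(letter_trigrams(word))
    return counts


trigrams = letter_trigrams("good")   # ['#go', 'goo', 'ood', 'od#']
bag = hash_text("good food")         # 'ood' and 'od#' each occur twice
```

The resulting bag of trigrams is much smaller than a word vocabulary and handles unseen words, which is why the input shape is something the preprocessor has to report to the model.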
Make use of MatchZoo's customized loss functions and evaluation metrics:
ranking_task = mz.tasks.Ranking(loss=mz.losses.RankCrossEntropyLoss(num_neg=4))
ranking_task.metrics = [
    mz.metrics.NormalizedDiscountedCumulativeGain(k=3),
    mz.metrics.NormalizedDiscountedCumulativeGain(k=5),
    mz.metrics.MeanAveragePrecision()
]
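To make the two metric families above concrete, here is a small pure-Python sketch of NDCG@k and average precision for a single query, using the common textbook definitions. It is an illustration only: MatchZoo's own implementations may differ in details such as thresholds or tie handling, and the function names are hypothetical. Mean average precision is the mean of `average_precision` over all queries.

```python
import math


def rank_labels(y_true, y_pred):
    """Return the true labels ordered by descending predicted score."""
    pairs = sorted(zip(y_pred, y_true), key=lambda p: p[0], reverse=True)
    return [label for _, label in pairs]


def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain with gain 2^label - 1 and a log2 discount."""
    return sum((2 ** label - 1) / math.log2(i + 2)
               for i, label in enumerate(ranked_labels[:k]))


def ndcg_at_k(y_true, y_pred, k):
    """DCG of the predicted ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(y_true, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item for this query
    return dcg_at_k(rank_labels(y_true, y_pred), k) / ideal_dcg


def average_precision(y_true, y_pred, threshold=0):
    """Mean of precision@rank over the ranks that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(rank_labels(y_true, y_pred), start=1):
        if label > threshold:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# One query with four candidates; the relevant ones end up at ranks 2 and 4.
y_true = [0, 1, 0, 1]
y_pred = [0.9, 0.8, 0.3, 0.1]
ndcg3 = ndcg_at_k(y_true, y_pred, k=3)
ap = average_precision(y_true, y_pred)
```

In this example the average precision is (1/2 + 2/4) / 2 = 0.5, and NDCG@3 is below 1 because a non-relevant candidate is ranked first.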
Initialize the model and tune the hyperparameters.
model = mz.models.DSSM()
model.params['input_shapes'] = preprocessor.context['input_shapes']
model.params['task'] = ranking_task
model.params['mlp_num_layers'] = 3
model.params['mlp_num_units'] = 300
model.params['mlp_num_fan_out'] = 128
model.params['mlp_activation_func'] = 'relu'
model.guess_and_fill_missing_params()
model.build()
model.compile()
Generate pairwise training data on the fly, and evaluate model performance on the prediction data using customized callbacks.
train_generator = mz.PairDataGenerator(train_pack_processed, num_dup=1, num_neg=4, batch_size=64, shuffle=True)
pred_x, pred_y = predict_pack_processed.unpack()
evaluate = mz.callbacks.EvaluateAllMetrics(model, x=pred_x, y=pred_y, batch_size=len(pred_x))
history = model.fit_generator(train_generator, epochs=20, callbacks=[evaluate], workers=5, use_multiprocessing=False)
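The idea behind pairwise generation with `num_dup` and `num_neg` can be sketched in plain Python: each positive record is grouped with `num_neg` negatives sampled for the same `text_left`, and this is repeated `num_dup` times. This is a simplified illustration under stated assumptions, not the implementation of `mz.PairDataGenerator`: the function name `make_pairwise_groups` is hypothetical, labels are treated as binary, and sampling with replacement is a choice made here so that queries with fewer than `num_neg` negatives still work.

```python
import random
from collections import defaultdict


def make_pairwise_groups(records, num_dup=1, num_neg=4, seed=0):
    """Group each positive record with num_neg sampled negatives."""
    rng = random.Random(seed)
    by_left = defaultdict(lambda: {"pos": [], "neg": []})
    for record in records:
        kind = "pos" if record["label"] > 0 else "neg"
        by_left[record["text_left"]][kind].append(record)

    groups = []
    for bucket in by_left.values():
        if not bucket["neg"]:
            continue  # a query without negatives cannot form a group
        for pos in bucket["pos"]:
            for _ in range(num_dup):
                # Sample with replacement (an assumption of this sketch).
                negatives = rng.choices(bucket["neg"], k=num_neg)
                groups.append([pos] + negatives)
    return groups


# Illustrative records in the text_left / text_right / label format.
records = [
    {"text_left": "q1", "text_right": "relevant answer", "label": 1},
    {"text_left": "q1", "text_right": "unrelated a", "label": 0},
    {"text_left": "q1", "text_right": "unrelated b", "label": 0},
    {"text_left": "q2", "text_right": "only positive", "label": 1},
]
groups = make_pairwise_groups(records, num_dup=1, num_neg=4)
```

Each group holds 1 + `num_neg` records with the positive first, which lines up with the `num_neg=4` passed to `RankCrossEntropyLoss` earlier: the loss and the generator need to agree on the group size.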
References
If you're interested in cutting-edge research progress, please take a look at awesome neural models for semantic match.
License
MatchZoo License
Apache-2.0 Copyright (c) 2015-present, Yixing Fan (faneshion)