Pointwise Hilbert–Schmidt Independence Criterion (PHSIC)
Compute a co-occurrence score for a pair of objects, using kernel similarities to previously observed pairs.
For example, given consistent sentence pairs:
X | Y |
---|---|
They had breakfast at the hotel. | They are full now. |
They had breakfast at ten. | I'm full. |
She had breakfast with her friends. | She felt happy. |
They had breakfast with their friends at the Japanese restaurant. | They felt happy. |
He have trouble with his homework. | He cries. |
I have trouble associating with others. | I cry. |
PHSIC gives high scores to new pairs that are consistent with the given pairs:
X | Y | score |
---|---|---|
They had breakfast at the hotel. | They are full now. | 0.1134 |
They had breakfast at an Italian restaurant. | They are stuffed now. | 0.0023 |
I have dinner. | I have dinner again. | 0.0023 |
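To make the scoring idea concrete, here is a minimal sketch of a kernel-based pointwise score in plain Python. It is an illustration under stated assumptions, not the code of the phsic package: the function names (gaussian_kernel, centered_similarities, phsic_score), the toy vectors standing in for sentence encodings, and the Gaussian kernel parameterization exp(-||a - b||^2 / (2 * sigma^2)) are all chosen here for the example. The sketch computes a Gram-matrix form, (1/n) * sum_i k~(x, x_i) * l~(y, y_i) with centered kernels k~ and l~, and leaves out any approximation the package may apply (the --dim1 and --dim2 options suggest one).

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def gaussian_kernel(sigma: float) -> Kernel:
    """Return k(a, b) = exp(-||a - b||^2 / (2 * sigma^2)) (assumed form)."""
    def k(a: Vector, b: Vector) -> float:
        sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))
    return k


def centered_similarities(kernel: Kernel, train: List[Vector],
                          query: Vector) -> List[float]:
    """Centered kernel values between the query and each training item."""
    n = len(train)
    gram = [[kernel(a, b) for b in train] for a in train]
    col_means = [sum(gram[j][i] for j in range(n)) / n for i in range(n)]
    total_mean = sum(col_means) / n
    k_query = [kernel(query, item) for item in train]
    query_mean = sum(k_query) / n
    return [k_query[i] - query_mean - col_means[i] + total_mean
            for i in range(n)]


def phsic_score(x: Vector, y: Vector,
                train_x: List[Vector], train_y: List[Vector],
                kernel_x: Kernel, kernel_y: Kernel) -> float:
    """Average product of centered similarities over the training pairs."""
    n = len(train_x)
    kx = centered_similarities(kernel_x, train_x, x)
    ly = centered_similarities(kernel_y, train_y, y)
    return sum(a * b for a, b in zip(kx, ly)) / n


# Hypothetical toy data: two clusters in X that line up with two clusters in Y.
train_x = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0)]
train_y = [(0.0,), (0.1,), (1.0,), (0.9,)]
k = gaussian_kernel(1.0)

# x lies in the first X cluster; pair it with a y from each Y cluster.
consistent = phsic_score((0.05, 0.0), (0.05,), train_x, train_y, k, k)
mismatched = phsic_score((0.05, 0.0), (0.95,), train_x, train_y, k, k)
print(consistent, mismatched)
```

With this toy data, the pair that matches the pattern in the training pairs receives a positive score and the mismatched pair a negative one, which mirrors the ranking behaviour shown in the table above.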
Installation
$ pip install phsic
This installs the phsic command in your environment:
$ phsic --help
Basic Usage
Download pre-trained word vectors (e.g., fastText):
$ wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip
$ unzip crawl-300d-2M.vec.zip
Prepare dataset:
$ TAB="$(printf '\t')"
$ cat << EOF > train.txt
They had breakfast at the hotel.${TAB}They are full now.
They had breakfast at ten.${TAB}I'm full.
She had breakfast with her friends.${TAB}She felt happy.
They had breakfast with their friends at the Japanese restaurant.${TAB}They felt happy.
He have trouble with his homework.${TAB}He cries.
I have trouble associating with others.${TAB}I cry.
EOF
$ cut -f 1 train.txt > train_X.txt
$ cut -f 2 train.txt > train_Y.txt
$ cat << EOF > test.txt
They had breakfast at the hotel.${TAB}They are full now.
They had breakfast at an Italian restaurant.${TAB}They are stuffed now.
I have dinner.${TAB}I have dinner again.
EOF
$ cut -f 1 test.txt > test_X.txt
$ cut -f 2 test.txt > test_Y.txt
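If the dataset is prepared from Python rather than the shell, the same split can be done with a small helper that also checks the tab-separated format. This is a sketch only; split_pairs is a hypothetical name and is not part of the phsic package.

```python
from typing import Iterable, List, Tuple


def split_pairs(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split 'X<TAB>Y' lines into two parallel lists, like `cut -f 1` and `cut -f 2`."""
    xs: List[str] = []
    ys: List[str] = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Each line must contain exactly one tab and two non-empty fields.
        if len(fields) != 2 or not all(field.strip() for field in fields):
            raise ValueError(f"line {number}: expected 'X<TAB>Y', got {line!r}")
        xs.append(fields[0])
        ys.append(fields[1])
    return xs, ys
```

For example, calling split_pairs on the lines of train.txt and writing each returned list to its own file, one item per line, reproduces train_X.txt and train_Y.txt.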
Then, train and predict:
$ phsic train_X.txt train_Y.txt \
    --kernel1 Gaussian 1.0 --encoder1 SumBov FasttextEn --emb1 crawl-300d-2M.vec \
    --kernel2 Gaussian 1.0 --encoder2 SumBov FasttextEn --emb2 crawl-300d-2M.vec \
    --limit_words1 10000 --limit_words2 10000 \
    --dim1 3 --dim2 3 \
    --out_prefix toy --out_dir out \
    --X_test test_X.txt --Y_test test_Y.txt
$ cat toy.Gaussian-1.0-SumBov-FasttextEn.Gaussian-1.0-SumBov-FasttextEn.3.3.phsic
1.134489336180434238e-01
2.320408776101631244e-03
2.321869174772554344e-03
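The output above contains one score per line, in the same order as the test pairs. Assuming that layout holds, a short Python helper can attach each score to its sentence pair and rank the pairs. The function name load_scored_pairs is hypothetical and not part of the phsic package.

```python
from pathlib import Path
from typing import List, Tuple


def load_scored_pairs(x_path: str, y_path: str,
                      score_path: str) -> List[Tuple[float, str, str]]:
    """Return (score, x, y) tuples sorted from highest to lowest score."""
    xs = Path(x_path).read_text(encoding="utf-8").splitlines()
    ys = Path(y_path).read_text(encoding="utf-8").splitlines()
    # Assumption: the score file holds one float per line, aligned with the pairs.
    scores = [float(token)
              for token in Path(score_path).read_text(encoding="utf-8").split()]
    if not (len(xs) == len(ys) == len(scores)):
        raise ValueError(
            f"length mismatch: {len(xs)} X, {len(ys)} Y, {len(scores)} scores")
    return sorted(zip(scores, xs, ys), reverse=True)
```

Called with test_X.txt, test_Y.txt, and the .phsic file shown above, this returns the three test pairs ordered by score.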
Citation
@InProceedings{D18-1203,
author = "Yokoi, Sho
and Kobayashi, Sosuke
and Fukumizu, Kenji
and Suzuki, Jun
and Inui, Kentaro",
title = "Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions",
booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "1763--1775",
location = "Brussels, Belgium",
url = "http://aclweb.org/anthology/D18-1203"
}