Skip to main content

Pointwise Hilbert–Schmidt Independence Criterion (PHSIC)

Project description

Pointwise Hilbert窶鉄chmidt Independence Criterion (PHSIC)

Compute co-occurrence between two objects utilizing similarities.

For example, given consistent sentence pairs:

X Y
They had breakfast at the hotel. They are full now.
They had breakfast at ten. I'm full.
She had breakfast with her friends. She felt happy.
They had breakfast with their friends at the Japanese restaurant. They felt happy.
He have trouble with his homework. He cries.
I have trouble associating with others. I cry.

PHSIC can give high scores to consistent pairs in terms of the given pairs:

X Y score
They had breakfast at the hotel. They are full now. 0.1134
They had breakfast at an Italian restaurant. They are stuffed now. 0.0023
I have dinner. I have dinner again. 0.0023

Installation

$ pip install phsic

This will install phsic command to your environment:

$ phsic --help

Basic Usage

Download pre-trained wordvecs (e.g. fasttext):

$ wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip
$ unzip crawl-300d-2M.vec.zip

Prepare dataset:

$ TAB="$(printf '\t')"
$ cat << EOF > train.txt
They had breakfast at the hotel.${TAB}They are full now.
They had breakfast at ten.${TAB}I'm full.
She had breakfast with her friends.${TAB}She felt happy.
They had breakfast with their friends at the Japanese restaurant.${TAB}They felt happy.
He have trouble with his homework.${TAB}He cries.
I have trouble associating with others.${TAB}I cry.
EOF
$ cut -f 1 train.txt > train_X.txt
$ cut -f 2 train.txt > train_Y.txt
$ cat << EOF > test.txt
They had breakfast at the hotel.${TAB}They are full now.
They had breakfast at an Italian restaurant.${TAB}They are stuffed now.
I have dinner.${TAB}I have dinner again.
EOF
$ cut -f 1 test.txt > test_X.txt
$ cut -f 2 test.txt > test_Y.txt

Then, train and predict:

$ phsic train_X.txt train_Y.txt --kernel1 Gaussian 1.0 --encoder1 SumBov FasttextEn --emb1 crawl-300d-2M.vec --kernel2 Gaussian 1.0 --encoder2 SumBov FasttextEn --emb2 crawl-300d-2M.vec --limit_words1 10000 --limit_words2 10000 --dim1 3 --dim2 3 --out_prefix toy --out_dir out --X_test test_X.txt --Y_test test_Y.txt
$ cat toy.Gaussian-1.0-SumBov-FasttextEn.Gaussian-1.0-SumBov-FasttextEn.3.3.phsic
1.134489336180434238e-01
2.320408776101631244e-03
2.321869174772554344e-03

Citation

@InProceedings{D18-1203,
  author = 	"Yokoi, Sho
        and Kobayashi, Sosuke
        and Fukumizu, Kenji
        and Suzuki, Jun
        and Inui, Kentaro",
  title = 	"Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions",
  booktitle = 	"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"1763--1775",
  location = 	"Brussels, Belgium",
  url = 	"http://aclweb.org/anthology/D18-1203"
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Files for phsic-cli, version 0.1.0
Filename, size File type Python version Upload date Hashes
Filename, size phsic_cli-0.1.0-py3-none-any.whl (39.2 kB) File type Wheel Python version py3 Upload date Hashes View
Filename, size phsic-cli-0.1.0.tar.gz (11.6 kB) File type Source Python version None Upload date Hashes View

Supported by

AWS AWS Cloud computing Datadog Datadog Monitoring DigiCert DigiCert EV certificate Facebook / Instagram Facebook / Instagram PSF Sponsor Fastly Fastly CDN Google Google Object Storage and Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Salesforce Salesforce PSF Sponsor Sentry Sentry Error logging StatusPage StatusPage Status page