ClickModels for Search Engines Implemented on top of Cython.

pyClickModels

A Cython implementation of ClickModels that uses Probabilistic Graphical Models to infer user behavior when interacting with search results pages (rankings).

How It Works

ClickModels use Probabilistic Graphical Models to describe the interactions between users and a list of items ranked by a set of retrieval rules.

These models are useful for understanding whether a given document is a good match for a given search query, which is known in the literature as judgment grades. This is done by evaluating past observed clicks and the positions at which the document appeared on the results pages for each query.

There are several proposed approaches to handle this problem. This repository implements a Dynamic Bayesian Network, similar to previous works also done in Python:

dbn

Main differences are:

  1. Implemented on top of Cython: publicly available solutions rely on CPython integrated with PyPy for additional speed-ups. Unfortunately, this still might not be good enough in terms of performance. To address that, this implementation relies 100% on C/C++ for further speed optimization. Although there is no official benchmark, an improvement of 15x to 18x over CPython is expected (the same data led to an increase of ~3x when using PyPy).
  2. Memory friendly: input data is expected in a JSON format in which each row already holds all of its clickstream sessions. This saves memory and allows the library to process larger amounts of data.
  3. Purchase variable: as businesses such as e-commerce sites can greatly benefit from better understanding their search engine, this repository adds the variable Purchase to further describe customer behavior.

The file notebooks/DBN.ipynb has a complete description of how the model has been implemented along with all the mathematics involved.
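
For intuition only, the sketch below shows how clicks and positions can be turned into a relevance score using the counting rules of the simplified DBN estimator known from the click-model literature. It is not the algorithm implemented by pyClickModels, which is described in the notebook and also models purchases. The function name `sdbn_judgments` is hypothetical, the `purchase` field is ignored, and treating a session without clicks as fully examined is an assumption of this sketch.

```python
from collections import defaultdict


def sdbn_judgments(sessions):
    """Estimate per-document relevance with simplified DBN counting rules.

    Each session is a list of {"doc": str, "click": 0 or 1, ...} dicts
    ordered by rank position (top result first).
    """
    attr_num = defaultdict(int)  # times a document was examined and clicked
    attr_den = defaultdict(int)  # times a document was examined
    sat_num = defaultdict(int)   # times a document was the last click
    sat_den = defaultdict(int)   # times a document was clicked

    for session in sessions:
        clicked = [i for i, item in enumerate(session) if item["click"]]
        # Assumption: with no click, the user examined every result.
        last = clicked[-1] if clicked else len(session) - 1
        # Only results at or above the last click count as examined.
        for i, item in enumerate(session[: last + 1]):
            doc = item["doc"]
            attr_den[doc] += 1
            if item["click"]:
                attr_num[doc] += 1
                sat_den[doc] += 1
                if i == last:
                    sat_num[doc] += 1

    judgments = {}
    for doc, examined in attr_den.items():
        attractiveness = attr_num[doc] / examined
        satisfaction = sat_num[doc] / sat_den[doc] if sat_den[doc] else 0.0
        judgments[doc] = attractiveness * satisfaction
    return judgments


sessions = [
    [
        {"click": 0, "purchase": 0, "doc": "doc0"},
        {"click": 1, "purchase": 0, "doc": "doc1"},
        {"click": 1, "purchase": 1, "doc": "doc2"},
    ],
    [
        {"click": 1, "purchase": 0, "doc": "doc0"},
        {"click": 0, "purchase": 0, "doc": "doc1"},
        {"click": 0, "purchase": 0, "doc": "doc2"},
    ],
]
print(sdbn_judgments(sessions))  # {'doc0': 0.5, 'doc1': 0.0, 'doc2': 1.0}
```

Here doc1 scores 0.0 because the user kept browsing after clicking it, while doc2 scores 1.0 because it was always the last click when examined.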

Installation

As this project relies on binaries compiled by Cython, currently only the Linux (manylinux) platform is supported. It can be installed with:

pip install pyClickModels

Getting Started

Input Data

pyClickModels expects input data to be stored in a set of compressed gz files located in the same folder. Their names should all start with the string "judgments", for instance, judgments0.gz. Each file should contain line-separated JSON objects, one per line. The following is an example of one such JSON line (pretty-printed here for readability):

{
    "search_keys": {
        "search_term": "blue shoes",
        "region": "south",
        "favorite_brand": "super brand",
        "user_size": "L",
        "avg_ticket": 10
    },
    "judgment_keys": [
        {
            "session": [
                {"click": 0, "purchase": 0, "doc": "doc0"},
                {"click": 1, "purchase": 0, "doc": "doc1"},
                {"click": 1, "purchase": 1, "doc": "doc2"}
            ]
        },
        {
            "session": [
                {"click": 1, "purchase": 0, "doc": "doc0"},
                {"click": 0, "purchase": 0, "doc": "doc1"},
                {"click": 0, "purchase": 0, "doc": "doc2"}
            ]
        }
    ]
}

The key search_keys sets the context for the search. In the above example, a given customer (or cluster of customers with the same context) searched for blue shoes. Their region is south (it could be any chosen value), favorite brand is super brand and so on.

These keys set the context in which the search happened. When pyClickModels runs its optimization, it considers the whole context at once. This means that the judgments obtained also apply to the whole context setting.

If no context is desired, just use {"search_keys": {"search_term": "user search"}}.

There's no required schema here, which means the library loops through all keys available in search_keys and builds the optimization process considering the whole context as a single query.
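
To make the idea of "the whole context as a single query" concrete, here is a minimal sketch of how every key in search_keys could be folded into one query identifier. The helper `context_key` is hypothetical, and the `key:value|key:value` layout is only inferred from the sample output shown further down in this README; the library's internal representation may differ.

```python
def context_key(search_keys):
    """Join every key/value pair of `search_keys` into one query identifier.

    Hypothetical helper: the separator layout mirrors the sample output in
    this README and is not taken from the library's source code.
    """
    return "|".join(f"{key}:{value}" for key, value in search_keys.items())


key = context_key(
    {"search_term": "blue shoes", "region": "south", "brand": "super brand"}
)
print(key)  # search_term:blue shoes|region:south|brand:super brand
```

Under this view, two searches for the same term but with a different region end up as two distinct queries, each with its own judgments.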

As for judgment_keys, this is a list of sessions. The key session is mandatory. Each session contains the clickstream of users (if the variable purchase is not needed, set it to 0).

To run DBN from pyClickModels, here's a simple example:
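
As a sketch of how such input could be produced with the Python standard library, the snippet below writes rows in the layout described above to a gzip file whose name starts with "judgments". The function `write_judgments_file` and its parameters are hypothetical names used for illustration; only the file naming and the line-separated JSON layout come from this README.

```python
import gzip
import json
import os
import tempfile


def write_judgments_file(rows, folder, index=0):
    """Write `rows` as line-separated JSON into `<folder>/judgments<index>.gz`."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, f"judgments{index}.gz")
    # "wt" opens the gzip stream in text mode so JSON strings can be written.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


rows = [
    {
        "search_keys": {"search_term": "blue shoes", "region": "south"},
        "judgment_keys": [
            {
                "session": [
                    {"click": 0, "purchase": 0, "doc": "doc0"},
                    {"click": 1, "purchase": 1, "doc": "doc1"},
                ]
            }
        ],
    }
]

# A temporary folder is used here; any folder path works.
folder = tempfile.mkdtemp()
path = write_judgments_file(rows, folder)
print(path)
```

The folder written this way is what would be passed as input_folder when fitting the model.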

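from pyClickModels.DBN import DBN

model = DBN()
# Fit on the judgments*.gz files stored in the input folder.
model.fit(input_folder="/tmp/clicks_data/", iters=10)
# Write the resulting judgments for each query and document.
model.export_judgments("/tmp/output.gz")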

The output file will contain newline-delimited JSON with the judgments for each query and each document observed for that query, e.g.:

{"search_term:blue shoes|region:south|brand:super brand": {"doc0": 0.2, "doc1": 0.3, "doc2": 0.4}}
{"search_term:query|region:north|brand:other_brand": {"doc0": 0.0, "doc1": 0.0, "doc2": 0.1}}

Judgments here vary between 0 and 1. Some libraries require them to be integers ranging from 0 to 4. In that case, choose a transformation that best suits your data.
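
One possible transformation is plain linear scaling with rounding, sketched below. This is only an illustration: the function names `to_grade` and `read_graded_judgments` are hypothetical, the sketch assumes the exported file is gzip-compressed newline-delimited JSON as suggested by the example above, and quantile-based or threshold-based binning may fit skewed data better.

```python
import gzip
import json
import os
import tempfile


def to_grade(judgment, max_grade=4):
    """Map a judgment in [0, 1] to an integer grade in [0, max_grade]."""
    clipped = min(max(float(judgment), 0.0), 1.0)
    # Adding 0.5 before truncating rounds to the nearest integer grade.
    return int(clipped * max_grade + 0.5)


def read_graded_judgments(path, max_grade=4):
    """Read exported judgments and convert every value to an integer grade."""
    graded = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line holds one {query: {doc: judgment}} object.
            for query, docs in json.loads(line).items():
                graded[query] = {
                    doc: to_grade(value, max_grade) for doc, value in docs.items()
                }
    return graded


# Stand-in for an exported file, built from the sample lines in this README.
sample_lines = [
    '{"search_term:blue shoes|region:south|brand:super brand": '
    '{"doc0": 0.2, "doc1": 0.3, "doc2": 0.4}}',
    '{"search_term:query|region:north|brand:other_brand": '
    '{"doc0": 0.0, "doc1": 0.0, "doc2": 0.1}}',
]
sample_path = os.path.join(tempfile.mkdtemp(), "output.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("\n".join(sample_lines) + "\n")

graded = read_graded_judgments(sample_path)
print(graded)
```

With the sample lines, the first query maps to grades 1, 1 and 2, and the second query maps to 0 for every document.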

Warnings

This library is still alpha! Use it with caution. It's been fully unit tested, but parts of it still use pure C, whose exceptions might not have been fully considered yet. Before using this library in production environments, it's recommended to test it thoroughly with different datasets and sizes to evaluate how it performs.

Contributing

Contributions are very welcome! Also, if you find bugs, please report them :).

