A Plone portlet that uses the catalog internals to find 'similar' content to the page you are looking at

Introduction

This portlet uses some deep, dark data structures within the ZCatalog and ZCTextIndex, so it could be brittle in the future if those structures change. Then again, they have been the same for the past 8 years or so ;)

This portlet also runs in linear time relative to the number of documents in your site, so it could well slow things down. That said, I’ve tried to make it pretty efficient.

How it Works

In a nutshell, this portlet compares the text content of an object with all other objects on the site to find those with similar content. The steps are as follows:

  1. Find the path of this document

  2. Look up the record_id (docid) of this path in the catalog

  3. Look in the SearchableText index to find all word ids (wids) in this document

  4. Work out the top 20 most ‘important’ words in this document [*]

  5. Find all documents containing any of the top 20 words

  6. Use a vector space model to measure similarity of each candidate document to our top 20 words

  7. Return the top 10 most similar documents.

[*] We work out the top 20 words using a TF*IDF algorithm (the same one used in ZCTextIndex.OkapiIndex) to find the words that appear disproportionately often in this document compared to documents in general.
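
The steps above can be sketched in plain Python. This is only an illustration, not the portlet's actual code: the `DOC_WORDS` dictionary is a hypothetical stand-in for the docid-to-wids data held in the SearchableText index, all function names are invented for the example, and the weighting is a simple textbook TF*IDF rather than the exact formula used by ZCTextIndex.

```python
import math
from collections import Counter

# Hypothetical stand-in for the catalog: docid -> list of word ids (wids).
# In the real portlet this data comes from the SearchableText index.
DOC_WORDS = {
    1: [10, 10, 11, 12],
    2: [10, 11, 11, 13],
    3: [14, 15, 15, 16],
    4: [10, 12, 17, 17],
}


def idf(wid, doc_words):
    """Inverse document frequency: words found in fewer documents score higher."""
    n_docs = len(doc_words)
    doc_freq = sum(1 for wids in doc_words.values() if wid in wids)
    return math.log(1 + n_docs / doc_freq)


def tfidf_vector(wids, doc_words):
    """Weight each distinct wid by (count in this document) * IDF."""
    counts = Counter(wids)
    return {wid: tf * idf(wid, doc_words) for wid, tf in counts.items()}


def top_words(docid, doc_words, limit=20):
    """Step 4: keep the `limit` highest-weighted words of the document."""
    weights = tfidf_vector(doc_words[docid], doc_words)
    best = sorted(weights, key=weights.get, reverse=True)[:limit]
    return {wid: weights[wid] for wid in best}


def candidates(query, docid, doc_words):
    """Step 5: every other document containing any of the top words."""
    return [
        other for other, wids in doc_words.items()
        if other != docid and query.keys() & set(wids)
    ]


def cosine(query, doc_vector):
    """Step 6: cosine similarity between two sparse weight vectors."""
    dot = sum(w * doc_vector.get(wid, 0.0) for wid, w in query.items())
    norm = math.sqrt(sum(w * w for w in query.values())) * math.sqrt(
        sum(w * w for w in doc_vector.values())
    )
    return dot / norm if norm else 0.0


def similar_documents(docid, doc_words, n_words=20, n_results=10):
    """Steps 4 to 7: return the docids most similar to `docid`, best first."""
    query = top_words(docid, doc_words, n_words)
    scored = [
        (cosine(query, tfidf_vector(doc_words[other], doc_words)), other)
        for other in candidates(query, docid, doc_words)
    ]
    scored.sort(reverse=True)
    return [other for _score, other in scored[:n_results]]
```

With the toy data above, `similar_documents(1, DOC_WORDS)` returns `[2, 4]`: document 3 shares no words with document 1, so it never becomes a candidate. Note that this sketch scans every document to compute document frequencies, whereas a real text index already stores those counts.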

TODO

Add some caching ;)

Changelog

1.4

  • Added checks for security and language on results [Alessio Siniscalchi]

1.3

  • Fixed broken 1.2 release egg

1.2

  • Added ability to only search certain types [matth]

  • Do not display portlet if no similar items found [matth]

1.1

  • Bug fix in the important word selection code [matth]

1.0

  • Initial release


