wikirec

Open-source recommendation engines based on Wikipedia data
Jump to: Data • Methods • Usage • To-Do
wikirec is a framework that allows users to parse Wikipedia for entries of a given type and then seamlessly create recommendation engines based on unsupervised natural language processing. The goal is for wikirec to both refine and deploy models that provide accurate content recommendations based solely on open-source data.
Installation via PyPI
wikirec can be downloaded from PyPI via pip or sourced directly from this repository:
pip install wikirec
git clone https://github.com/andrewtavis/wikirec.git
cd wikirec
python setup.py install
import wikirec
Data ↩
wikirec allows a user to download Wikipedia texts of a given document type, including movies, TV shows, books, music, and countless other classes of information. These texts then serve as the basis for recommending content similar to a given input of what the user is interested in.
wikirec derives article classes from infobox types found on Wikipedia articles. The article on infoboxes contains all of the allowed arguments to subset the data by. Simply passing "Infobox chosen_type" to the topic argument in the following example will subset all Wikipedia articles for the given type. wikirec also provides shortcuts for types of data that commonly serve as recommendation inputs: books, songs, albums, movies, tv_series, video_games, and various categories of people such as athletes, musicians, and authors.
Downloading and parsing Wikipedia for the needed data is as simple as:
from wikirec import data_utils
# This downloads the most recent stable bz2 compressed Wikipedia dump
files = data_utils.download_wiki()
# Produces an ndjson of all book articles on Wikipedia
data_utils.parse_to_json(
topic="books",
output_path="wiki_book_articles.ndjson",
multicore=True,
verbose=True,
)
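Once parsed, the resulting ndjson file can be read back with standard JSON tooling. The following is a minimal sketch; the exact structure of each entry depends on the wikirec version, so the title and text positions below are assumptions:

import json

# Load the parsed articles (one JSON entry per line).
with open("wiki_book_articles.ndjson", "r") as f:
    books = [json.loads(line) for line in f]

# Assumption: each entry holds the article title first and its text second.
titles = [b[0] for b in books]
texts = [b[1] for b in books]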
Methods ↩
Current NLP modeling methods implemented include:
LDA
Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of wikirec, documents or text entries are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics.
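As an illustration of this method, here is a minimal sketch of fitting LDA with gensim; this is not wikirec's internal implementation, and the tokenization and topic count are placeholder assumptions:

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Assumption: texts is the list of article texts loaded above;
# whitespace tokenization stands in for real preprocessing.
tokenized = [text.lower().split() for text in texts]

dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# num_topics is a placeholder value that would be tuned in practice.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=15, passes=5)

# Dense per-document topic distributions, usable as similarity vectors.
lda_vectors = np.array(
    [
        [prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
        for bow in corpus
    ]
)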
BERT
Bidirectional Encoder Representations from Transformers derives representations of words by running NLP models over open-source Wikipedia data. These representations can then be leveraged to derive topics.
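For illustration, document embeddings of this kind can be produced with the sentence-transformers package; the model name here is an assumption, not necessarily the one wikirec uses:

from sentence_transformers import SentenceTransformer

# Assumption: texts is the list of article texts loaded above,
# and the model choice is for illustration only.
model = SentenceTransformer("bert-base-nli-mean-tokens")
bert_vectors = model.encode(texts, show_progress_bar=True)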
LDA with BERT embeddings
The combination of the two methods above via wikirec.autoencoder.
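Conceptually, the two representations can be concatenated and compressed into a joint latent space with an autoencoder. The sketch below uses Keras and reflects an assumption about the general technique, not wikirec.autoencoder's exact design:

import numpy as np
from tensorflow.keras import Model, layers

# Assumption: lda_vectors and bert_vectors are the per-document
# representations from the two sketches above.
combined = np.concatenate([lda_vectors, bert_vectors], axis=1)

inputs = layers.Input(shape=(combined.shape[1],))
encoded = layers.Dense(32, activation="relu")(inputs)  # assumed latent size
decoded = layers.Dense(combined.shape[1], activation="linear")(encoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(combined, combined, epochs=25, verbose=0)

# The encoder half yields compact joint vectors for computing similarities.
encoder = Model(inputs, encoded)
joint_vectors = encoder.predict(combined)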
Usage ↩
The following is an example of deriving recommendations using wikirec:
import wikirec
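The recommendation step itself can be sketched as ranking titles by cosine similarity over the document vectors derived above; the recommend helper below is a hypothetical illustration, not a wikirec function:

from sklearn.metrics.pairwise import cosine_similarity

# Assumption: doc_vectors is any of the representations derived above
# (lda_vectors, bert_vectors, or joint_vectors), aligned with titles.
doc_vectors = bert_vectors
sims = cosine_similarity(doc_vectors)

def recommend(title, n=10):
    # Hypothetical helper: rank all other titles by similarity to the input.
    idx = titles.index(title)
    ranked = sims[idx].argsort()[::-1]
    return [titles[i] for i in ranked if i != idx][:n]

print(recommend("Harry Potter and the Philosopher's Stone"))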
To-Do ↩
- Adding further methods for recommendations
- Allowing a user to specify multiple articles of interest
- Allowing a user to input their preference for something and then update their recommendations
- Adding support for non-English versions of Wikipedia
- Compiling other sources of open-source data that can be used to augment input data
  - Potentially writing scripts to load this data for significant topics
- Updating and refining the documentation