# redditquery

An offline information retrieval system for full-text search on reddit comments.

## Getting Started

Once redditquery is set up on your system (see Installation and Prerequisites), you can call the package from the command line as follows (see Parameters):

user@host:~ redditquery mode [-h] [-s [START]] [-e [END]] [-d [DIR]] [-n [NUM]]
[-c [CORES]] [-m [MINFREQ]] [-p [PROGRESS]] [-f [FULLTEXT]]
[-l [LEMMA]] [-a [ALL]]


Alternatively, you can use it from inside Python to interact with it dynamically (see Examples).

## Parameters

redditquery's behaviour can be changed with various parameters. Specifying mode is obligatory:

mode: 1 Build Inverted Index (requires specifying -f and -l)
2 Query existing Inverted Index
3 Build Inverted Index and Query (requires specifying -f and -l)

If the index is built (modes 1 and 3), you must specify the range of months to build it on by giving the first and last month to be processed:

-s or --start: first month to be downloaded as YYYY/MM
-e or --end: last month to be downloaded as YYYY/MM
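
To make the `YYYY/MM` format concrete, here is a minimal sketch of how an inclusive month range could be expanded into individual months. The helper name `month_range` is hypothetical and not part of redditquery; it only illustrates the expected format and the inclusive reading of start and end.

```python
from datetime import datetime


def month_range(start, end):
    """Return every month from start to end (inclusive) as YYYY/MM strings.

    Hypothetical helper for illustration; not part of redditquery.
    """
    first = datetime.strptime(start, "%Y/%m")
    last = datetime.strptime(end, "%Y/%m")
    if first > last:
        raise ValueError("start month must not be after end month")

    months = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        months.append(f"{year:04d}/{month:02d}")
        # Advance by one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


print(month_range("2005/12", "2006/03"))
# ['2005/12', '2006/01', '2006/02', '2006/03']
```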

All other parameters are optional. Here is what they do, along with their defaults:

-d or --dir: directory path to store data in (defaults to working dir)
-c or --cores: number of cores to use for downloading/decompressing monthly data (defaults to single-core)
-m or --minfreq: minimum frequency to keep terms in index (defaults to 5)
-n or --num: number of results to show for each query (defaults to 10)
-f or --fulltext: store/retrieve full text of reddit comments (defaults to only storing/retrieving comment ids)
-a or --all: return documents containing all query terms (defaults to documents containing any of the query terms)
-l or --lemma: lemmatize documents/queries
-p or --progress: output progress information for download/processing (only single core, defaults to no progress shown)
-h or --help: show help message
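
As a rough illustration of how the options above fit together, the following sketch builds an `argparse` parser that mirrors them. It is not redditquery's actual parser: the types, the defaults for `--dir` and `--cores`, and the modelling of `-p`, `-f`, `-l`, and `-a` as simple on/off switches are assumptions made for this example.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options documented above.

    Illustrative only: redditquery's real parser may differ in types,
    defaults, and validation.
    """
    parser = argparse.ArgumentParser(prog="redditquery")
    parser.add_argument("mode", type=int, choices=[1, 2, 3],
                        help="1: build, 2: query, 3: build and query")
    parser.add_argument("-s", "--start", nargs="?", help="first month as YYYY/MM")
    parser.add_argument("-e", "--end", nargs="?", help="last month as YYYY/MM")
    parser.add_argument("-d", "--dir", nargs="?", default=".", help="data directory")
    parser.add_argument("-n", "--num", nargs="?", type=int, default=10)
    parser.add_argument("-c", "--cores", nargs="?", type=int, default=1)
    parser.add_argument("-m", "--minfreq", nargs="?", type=int, default=5)
    # Boolean switches are modelled as store_true flags here (an assumption).
    for short_flag, long_flag in [("-p", "--progress"), ("-f", "--fulltext"),
                                  ("-l", "--lemma"), ("-a", "--all")]:
        parser.add_argument(short_flag, long_flag, action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["1", "-s", "2005/12", "-e", "2006/03"])
print(args.mode, args.start, args.end, args.minfreq, args.num)
# 1 2005/12 2006/03 5 10
```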

## Examples

Build an inverted index from reddit comments between December 2005 and March 2006 from the command line:

user@host:~ redditquery 1 -s 2005/12 -e 2006/03

Query inverted index that already exists in myDirectory with queries from myQueries.txt in the same directory:

user@host:~ redditquery 2 -d path/to/myDirectory path/to/myDirectory/myQueries.txt
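
The Python example in this README reads the query file line by line and strips each line, which suggests a plain-text file with one query per line. A minimal sketch of creating and reading such a file follows; the example queries and the temporary directory are placeholders chosen for illustration.

```python
import os
import tempfile

# Hypothetical example queries; any free-text search terms would do.
example_queries = ["python tutorial", "information retrieval"]

directory = tempfile.mkdtemp()  # stand-in for path/to/myDirectory
query_path = os.path.join(directory, "myQueries.txt")

# Write one query per line.
with open(query_path, "w", encoding="utf-8") as query_file:
    query_file.write("\n".join(example_queries) + "\n")

# Read the queries back line by line, skipping blank lines.
with open(query_path, "r", encoding="utf-8") as query_file:
    loaded = [line.strip() for line in query_file if line.strip()]

print(loaded)
```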

Build and query the same index as above in one go from inside Python:

>>> import os
>>> from redditquery.database import DataBase
>>> from redditquery.index import InvertedIndex, QueryProcessor
>>> from redditquery.reddit import RedditDownloader, DocumentGenerator

>>> # Settings: data directory, query file, month range, and thresholds
>>> directory = "myDirectory"
>>> query_path = "myDirectory/myQueries.txt"
>>> start = "2005/12"
>>> end = "2006/03"
>>> minimum_freq = 5
>>> num_results = 10
>>> # Download and decompress the monthly comment dumps
>>> downloader = RedditDownloader(start=start, end=end, directory=directory, keep_compressed=False)
>>> downloader.process_all_parallel()
>>> # Build the inverted index from the downloaded documents
>>> documents = DocumentGenerator(directory=os.path.join(directory, "monthly_data"), fulltext=False, lemmatize=False)
>>> database = DataBase(database_file=os.path.join(directory, "database.sql"))
>>> inverted_index = InvertedIndex(documents=documents, database=database, frequency_threshold=minimum_freq)
>>> # Run every query in the file (one query per line) against the index
>>> processor = QueryProcessor(inverted_index=inverted_index, lemmatize=False)
>>> with open(query_path, "r") as query_file:
...     for query in query_file:
...         processor.query_index(query.strip(), num_results=num_results, fulltext=False, conjunctive=False)
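
For readers unfamiliar with the underlying idea, the following toy sketch illustrates what a frequency threshold and conjunctive versus disjunctive querying mean for an inverted index. It is a self-contained illustration, not redditquery's implementation: the names `build_index` and `query` and the sample comments are made up, and treating the threshold as a count of total occurrences across the collection is an assumption.

```python
from collections import Counter, defaultdict


def build_index(documents, frequency_threshold=1):
    """Map each term to the set of ids of the documents containing it.

    Terms occurring fewer than frequency_threshold times in the whole
    collection are dropped (an assumed reading of the minfreq option).
    """
    counts = Counter()
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            counts[term] += 1
            postings[term].add(doc_id)
    return {t: ids for t, ids in postings.items() if counts[t] >= frequency_threshold}


def query(index, text, conjunctive=False, num_results=10):
    """Return ids matching all terms (conjunctive) or any term (default)."""
    term_sets = [index.get(term, set()) for term in text.lower().split()]
    if not term_sets:
        return []
    combine = set.intersection if conjunctive else set.union
    return sorted(combine(*term_sets))[:num_results]


comments = {
    "c1": "python is a great language",
    "c2": "reddit comments about python",
    "c3": "reddit is a great site",
}
index = build_index(comments, frequency_threshold=2)
print(query(index, "python reddit"))                    # any term: ['c1', 'c2', 'c3']
print(query(index, "python reddit", conjunctive=True))  # all terms: ['c2']
```

With a threshold of 2, terms such as "language" that occur only once never enter the index, so a query for them returns nothing.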

### Prerequisites

redditquery has two dependencies that are not part of the Python standard library: pandas and spaCy. If you install this package using pip, the dependencies should be installed automatically. On Unix systems, you should also be able to install them separately using pip:

user@host:~ pip install pandas
user@host:~ pip install spacy

An alternative, especially for Windows users, is to use a conda distribution, which should ship with pandas. Activate your environment, add spaCy with conda, and then install redditquery with pip:

user@host:~ [source] activate <environment>
(environment)user@host:~ conda install spacy
(environment)user@host:~ pip install redditquery

Lastly, you can clone the repository and install the package manually:

user@host:~ git clone
user@host:~ python install

If you encounter any problems installing the dependencies, please consult the installation instructions for pandas and spaCy.

### Installation

This package is pip-installable:

user@host:~ pip install redditquery

If you're using conda, then first activate the target environment and then install. Alternatively, clone this repository to your local directory and install manually:

user@host:~ git clone <path_to_destination_folder>
user@host:~ [source] activate <environment>
(environment)user@host:~ python install

## Author

**Christian Adam**

## License

This project is licensed under the MIT License. See the [LICENSE.txt](LICENSE.txt) file for details.

Files for redditquery, version 0.1.1: redditquery-0.1.1.tar.gz (14.9 kB, source distribution)
