Saved webpage index and search

WebHist indexes a collection of saved webpages and provides an interface to search the index.

WebHist can handle the following archive file types:

  • MAFF files generated by Mozilla Archive Format, with MHT and Faithful Save
  • HTML files generated by Save Page WE

Installation

The package is available on PyPI.

You can install it with pip:

$ pip install webhist

Usage

Create an index of archived webpages

import webhist
i = webhist.Index("/path/to/index")

Index a single file

i.add("/path/to/file")

A file will not be re-indexed unless explicitly requested. Files are tracked by the path string passed to the add() function, so an absolute path and a relative path will be considered two different files.

To re-index a file that is already in the index, pass update=True:

i.add("/path/to/file", update=True)
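Because files are tracked by the exact path string, it can help to convert every path to a single canonical form before passing it to add(). The sketch below is one way to do that using only the standard library. The helper name normalize_path is hypothetical and is not part of WebHist.

```python
import os


def normalize_path(path):
    """Return an absolute, normalized form of `path`.

    Hypothetical helper, not part of WebHist. It expands "~" and resolves
    "." and ".." segments, so that different spellings of the same location
    produce the same string.
    """
    return os.path.abspath(os.path.expanduser(path))


# Intended use with an existing index `i` (WebHist is not imported here):
# i.add(normalize_path("archive/page.maff"))
```

Note that os.path.abspath does not resolve symbolic links, so two paths that reach the same file through different links would still be treated as different files.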

Add all files in a specified directory. Subdirectories are not searched.

i.add_path("/path/to/directory")
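Since add_path() only looks at the directory it is given, an archive that is organized into nested folders needs one call per folder. A minimal sketch of how the folders could be enumerated with os.walk is shown here. The function iter_directories and its exclude parameter are assumptions made for illustration and are not part of WebHist.

```python
import os


def iter_directories(root, exclude=()):
    """Yield `root` and every directory below it, top-down, in sorted order.

    Hypothetical helper, not part of WebHist. Directories listed in
    `exclude` are skipped together with everything beneath them.
    """
    excluded = {os.path.abspath(p) for p in exclude}
    for dirpath, dirnames, _filenames in os.walk(root):
        # Editing dirnames in place tells os.walk which subdirectories to visit.
        dirnames[:] = sorted(
            d for d in dirnames
            if os.path.abspath(os.path.join(dirpath, d)) not in excluded
        )
        yield dirpath


# Intended use with an existing index `i` (WebHist is not imported here):
# for directory in iter_directories("/path/to/directory"):
#     i.add_path(directory)
```

If the index itself is stored inside the archive directory, passing its location in exclude keeps the index files from being offered to add_path().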

As with add(), you can specify update=True to re-index files. You can also specify verbose=True to print whether each file was indexed.

i.add_path("/path/to/directory", verbose=True)

The output will look something like:

file1
- file2 (already in index)
- file3 (exception type: error message)

In the example output above:

  • file1 was indexed correctly
  • file2 was already in the index, and was not re-indexed
  • file3 had a problem and was not indexed (python exception message shown)
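If the verbose output is captured (for example, by redirecting standard output to a file), its lines could be sorted into the three cases above. The sketch below assumes the lines look exactly like the example, which the text only describes as "something like" the real output, so the pattern may need adjusting. The function classify_line is hypothetical and is not part of WebHist.

```python
import re

# Assumed shape of a line for a file that was not indexed: "- name (reason)".
_NOT_INDEXED = re.compile(r"^- (?P<name>.+?) \((?P<reason>.*)\)$")


def classify_line(line):
    """Return (status, filename, detail) for one line of verbose output.

    status is "indexed", "skipped" (already in index) or "failed".
    detail carries the exception text for failures and is None otherwise.
    """
    line = line.strip()
    match = _NOT_INDEXED.match(line)
    if match is None:
        # A bare filename means the file was indexed.
        return ("indexed", line, None)
    if match.group("reason") == "already in index":
        return ("skipped", match.group("name"), None)
    return ("failed", match.group("name"), match.group("reason"))
```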

After adding files, commit the changes to the index:

i.commit()

You can also cancel the changes

i.cancel()
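The commit() and cancel() calls pair naturally with a context manager, so that changes are committed when a batch of additions succeeds and cancelled when it fails. The following is a sketch of that pattern, not something WebHist provides. It assumes only that the object passed in has the commit() and cancel() methods described above, and the name index_transaction is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def index_transaction(index):
    """Commit pending changes on success; cancel them if an exception is raised.

    Hypothetical helper, not part of WebHist. `index` is any object with
    commit() and cancel() methods.
    """
    try:
        yield index
    except BaseException:
        index.cancel()
        raise  # let the caller see the original error
    else:
        index.commit()


# Intended use with an existing index `i` (WebHist is not imported here):
# with index_transaction(i) as idx:
#     idx.add("/path/to/file")
#     idx.add_path("/path/to/directory")
```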

Once an index has been populated, you can run search queries against it. The syntax follows the Whoosh default query language, which is described in the Whoosh documentation.

The code below searches for webpage archives that contain “webhist” and “installation”

results = i.search("webhist installation")

The field searched by default is the content field. The following fields are indexed and searchable:

  • title (title of page)
  • content (content of page)
  • url (full URL of page)
  • fqdn (fully qualified domain name, e.g. packaging.python.org)
  • dn (domain name, e.g. python.org)
  • date (the date the webpage archive was saved)

For example, you can search for webpages saved from example.com whose title contains “webhist”

results = i.search("title:webhist dn:example.com")
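When queries are assembled from user input, a small helper can build the field:value string and reject field names that are not in the list above. This is an illustrative sketch only. build_query and SEARCHABLE_FIELDS are hypothetical names, not part of WebHist, and the helper does no quoting or escaping of values.

```python
# The field names listed in this document.
SEARCHABLE_FIELDS = ("title", "content", "url", "fqdn", "dn", "date")


def build_query(terms):
    """Join (field, value) pairs into a space-separated query string.

    A field of None leaves the term unqualified, so it is matched against
    the default field. Unknown field names raise ValueError.
    """
    parts = []
    for field, value in terms:
        if field is None:
            parts.append(value)
        elif field in SEARCHABLE_FIELDS:
            parts.append("%s:%s" % (field, value))
        else:
            raise ValueError("unknown field: %r" % (field,))
    return " ".join(parts)


# Intended use with an existing index `i` (WebHist is not imported here):
# results = i.search(build_query([("title", "webhist"), ("dn", "example.com")]))
```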

Shell Interface

A simple shell interface to a WebHist index is provided in examples/shell.py. You can clone the webhist repo and run it from the repo root:

$ python examples/shell.py /path/to/archive -i /path/to/index

The -i parameter is optional. The default index location is /path/to/archive/index.

Run a search query:

webhist> search title:webhist dn:example.com

The output will look something like:

0: [2010-01-02 12:30:01] Title of page
1: [2011-02-03 16:20:25] Another page
2: [2013-06-12 00:00:01] Yet another page
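If the shell output is processed by another script, each result line can be split into its number, save date, and title. The sketch below assumes the lines match the example exactly (number, colon, bracketed timestamp, title); the real format may differ. The function parse_result_line is hypothetical and is not part of WebHist.

```python
import re
from datetime import datetime

# Assumed line shape: "<number>: [<YYYY-MM-DD HH:MM:SS>] <title>"
_RESULT_LINE = re.compile(r"^(?P<number>\d+): \[(?P<saved>[^\]]+)\] (?P<title>.*)$")


def parse_result_line(line):
    """Return (number, saved_datetime, title) parsed from one result line."""
    match = _RESULT_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized result line: %r" % (line,))
    saved = datetime.strptime(match.group("saved"), "%Y-%m-%d %H:%M:%S")
    return int(match.group("number")), saved, match.group("title")
```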

To open page #2 from the search results:

webhist> open 2

To get more help:

webhist> help

To exit the shell:

webhist> exit

License

WebHist is released under the GNU Lesser General Public License, Version 3.
