Saved webpage index and search

Project description

WebHist indexes a collection of saved webpages and provides an interface to search the index.

WebHist can handle the following archive file types:

  • MAFF files generated by Mozilla Archive Format, with MHT and Faithful Save

  • HTML files generated by Save Page WE

Installation

The package is available on PyPI.

You can install it with pip:

$ pip install webhist

Usage

Create an index of archived webpages

import webhist
i = webhist.Index("/path/to/index")

Index a single file

i.add("/path/to/file")

A file will not be re-indexed unless explicitly requested. Files are tracked by the path string passed to the add() function, so an absolute path and a relative path will be considered two different files.

To re-index a file that is already in the index, pass update=True

i.add("/path/to/file", update=True)
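Because files are tracked by the exact path string, one way to avoid indexing the same file twice is to normalize every path before passing it to add(). The sketch below is only an illustration: add_normalized is a hypothetical helper, not part of WebHist, and it assumes nothing about the index object beyond the add() call shown above.

```python
import os


def add_normalized(index, path, update=False):
    """Add `path` to `index` under its absolute, normalized form.

    `index` is expected to be a webhist.Index; only its add() method is
    used. This helper is hypothetical and not part of WebHist.
    """
    # abspath() resolves relative paths against the current working
    # directory and collapses components such as "." and "..".
    normalized = os.path.abspath(path)
    index.add(normalized, update=update)
    return normalized
```

With this helper, "page.maff" and "./page.maff" are both recorded under the same absolute path. Symbolic links are not resolved; os.path.realpath could be swapped in if that matters for your archive.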

Add all files in a specified directory (note that it does not search within subdirectories)

i.add_path("/path/to/directory")

Again, you can specify update=True to re-index files. You can also specify verbose=True to print whether each file was indexed

i.add_path("/path/to/directory", verbose=True)

The output will look something like:

file1
- file2 (already in index)
- file3 (exception type: error message)

In the example output above:

  • file1 was indexed correctly

  • file2 was already in the index, and was not re-indexed

  • file3 had a problem and was not indexed (the Python exception type and message are shown)
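Since add_path() does not search within subdirectories, a nested archive has to be added one directory at a time. A minimal sketch using os.walk follows. add_tree is a hypothetical helper, not part of WebHist, and it assumes that add_path() accepts update and verbose together as keyword arguments.

```python
import os


def add_tree(index, root, update=False, verbose=False):
    """Call index.add_path() on `root` and on every directory beneath it.

    Returns the list of directories visited, in os.walk order.
    This helper is hypothetical and not part of WebHist.
    """
    visited = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        # add_path() only looks at files directly inside dirpath,
        # so each directory is handed over separately.
        index.add_path(dirpath, update=update, verbose=verbose)
        visited.append(dirpath)
    return visited
```

If the index directory itself lives under root, you will probably want to skip it while walking.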

After adding files, the changes to the index need to be committed

i.commit()

You can also cancel the changes

i.cancel()
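The commit/cancel pair fits naturally into a context manager that commits when the block finishes and cancels when an exception escapes. The sketch below is one way to wrap it. index_transaction is a hypothetical name, and the code relies only on the commit() and cancel() calls shown above.

```python
from contextlib import contextmanager


@contextmanager
def index_transaction(index):
    """Commit pending changes on success, cancel them on any exception.

    This wrapper is hypothetical and not part of WebHist.
    """
    try:
        yield index
    except BaseException:
        # Something went wrong while adding files: discard the changes
        # and let the exception propagate to the caller.
        index.cancel()
        raise
    else:
        index.commit()


# Example use (assumes `i` is an existing webhist.Index):
#
# with index_transaction(i) as idx:
#     idx.add_path("/path/to/directory")
```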

Once an index has been populated, you can run search queries against it. The syntax follows the Whoosh default query language; see the Whoosh documentation for details.

The code below searches for webpage archives that contain “webhist” and “installation”

results = i.search("webhist installation")

The field searched by default is the content field. The following fields are indexed and searchable:

  • title (title of page)

  • content (content of page)

  • url (full URL of page)

  • fqdn (fully qualified domain name, e.g. packaging.python.org)

  • dn (domain name, e.g. python.org)

  • date (the date the webpage archive was saved)

For example, you can search for webpages saved from example.com that have “webhist” in the title

results = i.search("title:webhist dn:example.com")
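When queries are assembled from user input, it can help to build the field:term string programmatically. The helper below is a sketch only: build_query is a hypothetical function, not part of WebHist. It wraps values containing whitespace in double quotes, which the Whoosh default query language treats as a phrase. Values that themselves contain double quotes are not escaped.

```python
def build_query(**fields):
    """Join field=value pairs into a 'field:value ...' query string.

    This helper is hypothetical and not part of WebHist.
    """
    parts = []
    for name, value in fields.items():
        text = str(value)
        # Quote multi-word values so they are searched as a phrase.
        if any(ch.isspace() for ch in text):
            text = '"{}"'.format(text)
        parts.append("{}:{}".format(name, text))
    return " ".join(parts)


query = build_query(title="webhist", dn="example.com")
# query == "title:webhist dn:example.com"
```

The resulting string can be passed directly to i.search().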

Shell Interface

A simple shell interface to a WebHist index is provided in examples/shell.py. You can clone the webhist repo and run it from the repo root:

$ python examples/shell.py /path/to/archive -i /path/to/index

The -i parameter is optional. The default index location is /path/to/archive/index.
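The default described above can be pictured with a small argparse sketch. This is not the actual code in examples/shell.py. parse_shell_args is a hypothetical function that only illustrates how an omitted -i falls back to an index directory inside the archive.

```python
import argparse
import os


def parse_shell_args(argv):
    """Parse '<archive> [-i <index>]' and fill in the default index path.

    Illustrative only; the real examples/shell.py may be written differently.
    """
    parser = argparse.ArgumentParser(prog="shell.py")
    parser.add_argument("archive", help="directory of saved webpages")
    parser.add_argument("-i", dest="index", default=None,
                        help="index directory (optional)")
    args = parser.parse_args(argv)
    if args.index is None:
        # No -i given: place the index inside the archive directory.
        args.index = os.path.join(args.archive, "index")
    return args
```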

Run a search query:

webhist> search title:webhist dn:example.com

The output will look something like:

0: [2010-01-02 12:30:01] Title of page
1: [2011-02-03 16:20:25] Another page
2: [2013-06-12 00:00:01] Yet another page

To open page #2 from the search results:

webhist> open 2

To get more help:

webhist> help

To exit the shell:

webhist> exit

License

WebHist is released under the GNU Lesser General Public License, Version 3.
