
A file system cache which saves URL metadata and summarizes content

This is currently very alpha and in development, so expect changes to the API/interface. It aims to walk the line between extracting enough text/data for it to be useful, but not so much that it takes an enormous amount of space.

As it stands, I'm somewhat pessimistic this will ever be a silver bullet -- getting useful info out of arbitrary HTML is hard, so you're mostly stuck writing parsers for each website you're interested in. However, I still use this frequently, especially as a cache for API information, as described below.

Current TODOs:

  • Add more sites using the abstract interface, to get more info from sites I use commonly. Ideally, this should be able to re-use common scraper/parser/API interface libraries in python instead of recreating everything from scratch
  • Create a (separate repo/project) daemon which handles configuring this and slowly requests things in the background as they become available through given sources; allow the user to provide generators/inputs and define include/exclude lists/regexes. Probably just integrate with promnesia to avoid duplicating the work of searching for URLs on disk

Installation

Requires python3.7+

To install with pip, run:

python3 -m pip install url_cache

As this is still in development, install from git for the latest changes:

python3 -m pip install git+https://github.com/seanbreckenridge/url_cache

Rationale

This is meant to provide more context to any of my tools which use URLs. If I watched some Youtube video and I have the URL, I'd like to have the subtitles for it, so I can do a text search over all the videos I've watched. If I read an article, I want the article text! This requests, parses, and abstracts away that data for me locally, so I can do something like:

>>> from url_cache.core import URLCache
>>> u = URLCache()
>>> data = u.get("https://sean.fish/")
>>> data.metadata["images"][-1]
{'src': 'https://raw.githubusercontent.com/seanbreckenridge/glue/master/assets/screenshot.png', 'alt': 'screenshot', 'type': 'body_image', 'width': 600}
>>> data.metadata["description"]
"sean.fish; Sean Breckenridge's Home Page"

If I ever request that URL again, the information is grabbed from a local cache instead.
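For instance, a rough sketch to illustrate the caching (timings will vary, and the second call makes no network requests):

import time
from url_cache.core import URLCache

u = URLCache()

start = time.time()
u.get("https://sean.fish/")  # first call: requests and parses the page
print(f"first call: {time.time() - start:.2f}s")

start = time.time()
u.get("https://sean.fish/")  # second call: read back from the local cache
print(f"second call: {time.time() - start:.2f}s")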

Generally, this uses:

  • lassie to get generic metadata; the title, description, opengraph information, links to images/videos on the page.
  • readability to convert/compress the HTML into a summary of the page content.

Site-Specific Extractors:

  • Youtube: to get manual/auto-generated captions (converted to a .srt file) from Youtube URLs
  • Stackoverflow (Just a basic URL preprocessor to reduce the possibility of conflicts/duplicate data)
  • MyAnimeList (using Jikan v4)

This is meant to be extensible -- so it's possible for you to write your own extractors/file loaders/dumpers (for new formats, e.g. srt) for sites you use commonly and pass those to url_cache.core.URLCache to extract richer data for those sites. Otherwise, it saves the information from lassie and the summarized HTML from readability for each URL.

To avoid scope creep, this probably won't support:

  • Converting the HTML summary to text (use something like the lynx command below)
  • Minimizing HTML - run something like find ~/.local/share/url_cache/ -name '*.html' -exec <some tool/script that minimizes in place> \; instead -- the data is just stored in individual files in the data directory

Usage:

In Python, this can be configured using the url_cache.core.URLCache class. For example:

import logging
from url_cache.core import URLCache

# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/Documents/urldata")
c = cache.get("https://github.com/seanbreckenridge")
# just request information, don't read/save to cache
data = cache.request_data("https://www.wikipedia.org/")

For more information, see the docs

The CLI interface provides some utility commands to get/list information from the cache.

Usage: url_cache [OPTIONS] COMMAND [ARGS]...

Options:
  --cache-dir PATH                Override default cache directory location
  --debug / --no-debug            Increase log verbosity
  --sleep-time INTEGER            How long to sleep between requests
  --summarize-html / --no-summarize-html
                                  Use readability to summarize html. Otherwise
                                  saves the entire HTML document

  --skip-subtitles / --no-skip-subtitles
                                  Skip downloading Youtube Subtitles
  --subtitle-language TEXT        Subtitle language for Youtube Subtitles
  --help                          Show this message and exit.

Commands:
  cachedir  Prints the location of the local cache directory
  export    Print all cached information as JSON
  get       Get information for one or more URLs; prints results as JSON
  in-cache  Prints if a URL is already cached
  list      List all cached URLs

An environment variable URL_CACHE_DIR can be set, which changes the default cache directory.
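For example, in a shell (the directory here is just an illustration):

# use a custom cache directory for this session
$ export URL_CACHE_DIR="$HOME/Documents/urldata"
# confirm where data will be stored
$ url_cache cachedir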

API Cache Examples

I've also successfully used this to cache responses from APIs in some of my projects, by subclassing URLCache and overriding the request_data function. I just make a request and return a summary, and it transparently caches the rest.
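A rough sketch of that pattern -- this assumes a Summary model importable from url_cache.summary whose constructor accepts url and metadata keyword arguments (check the source for the exact interface); the API endpoint and fields below are made up:

import requests

from url_cache.core import URLCache
from url_cache.summary import Summary  # assumption: adjust the import to the actual package layout


class APICache(URLCache):
    """Hypothetical subclass which caches JSON API responses instead of scraped HTML."""

    def request_data(self, url: str) -> Summary:  # keep the same signature as the parent method
        # hit the API directly and keep only the fields I care about
        resp = requests.get(url)
        resp.raise_for_status()
        info = resp.json()
        # assumption: Summary accepts the url and a metadata mapping; check the real constructor
        return Summary(url=url, metadata={"title": info.get("title")})


# get() still handles the caching: on a miss it calls request_data and writes the result to disk
api_cache = APICache(cache_dir="~/.cache/my_api_data", sleep_time=2)
summary = api_cache.get("https://api.example.com/item/1")  # made-up API endpoint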

CLI Examples

The get command emits JSON, so it can be combined with other tools (e.g. jq), like:

$ url_cache get "https://click.palletsprojects.com/en/7.x/arguments/" | \
  jq -r '.[] | .html_summary' | lynx -stdin -dump | head -n 5
Arguments

   Arguments work similarly to [1]options but are positional. They also
   only support a subset of the features of options due to their
   syntactical nature. Click will also not attempt to document arguments
$ url_cache export | jq -r '.[] | .metadata | .title'
seanbreckenridge - Overview
Arguments  Click Documentation (7.x)
$ url_cache list --location
/home/sean/.local/share/url_cache/data/2/c/7/6284b2f664f381372fab3276449b2/000
/home/sean/.local/share/url_cache/data/7/5/1/70fc230cd88f32e475ff4087f81d9/000
# to make a backup of the cache directory
$ tar -cvzf url_cache.tar.gz "$(url_cache cachedir)"

Accessible through the url_cache script and python3 -m url_cache

Implementation Notes

This stores all of this information as individual files in a cache directory. In particular, it MD5 hashes the URL and stores information like:

.
└── a
    └── a
        └── e
            └── cf0118bb22340e18fff20f2db8abd
                └── 000
                    ├── data
                    │   └── subtitles.srt
                    ├── key
                    ├── metadata.json
                    └── timestamp.datetime.txt

In other words, this is a file system hash table which implements separate chaining.

You're free to delete any of the directories in the cache if you want; this doesn't maintain a strict index -- it uses a hash of the URL and then searches for a matching key file.
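To make that concrete, here's a rough sketch of how a URL could map to a directory under this scheme. It mirrors the layout shown above and assumes the key file stores the full URL -- it is not the library's actual code:

import hashlib
from pathlib import Path


def find_bucket(base: Path, url: str) -> Path:
    """Illustration of the on-disk layout: nested hash directories plus 000/001/... chain buckets."""
    digest = hashlib.md5(url.encode()).hexdigest()
    # the first three hex characters become nested directories, the remainder names the parent
    prefix = base / digest[0] / digest[1] / digest[2] / digest[3:]
    # each numbered bucket holds a 'key' file with the full URL; scan for the one that matches
    for bucket in sorted(prefix.glob("[0-9][0-9][0-9]")):
        keyfile = bucket / "key"
        if keyfile.exists() and keyfile.read_text().strip() == url:
            return bucket
    # no matching bucket -- a new entry would be written to the next numbered bucket
    existing = len(list(prefix.glob("[0-9][0-9][0-9]")))
    return prefix / f"{existing:03d}"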

By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I run that same loop, it doesn't have to make any requests; it just grabs all the info from the local cache.
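That loop is nothing fancy; a sketch (read_watched_urls here is a stand-in for whatever source you're pulling URLs from):

from url_cache.core import URLCache

cache = URLCache()  # default: wait 5 seconds between uncached requests


def read_watched_urls():
    # hypothetical source -- replace with your bookmarks, watch history, etc.
    yield "https://sean.fish/"
    yield "https://www.wikipedia.org/"


for url in read_watched_urls():
    # already-cached URLs are read straight from disk; new ones are requested and saved
    cache.get(url)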

Originally created for HPI.


Testing

git clone 'https://github.com/seanbreckenridge/url_cache'
cd ./url_cache
pip install '.[testing]'
mypy ./src/url_cache
flake8 ./src/url_cache
pytest
