A filesystem cache which saves URL metadata and summarizes content
This is currently not perfect and still in development, so expect changes to the API/interface. It aims to walk the line between extracting enough text/data to be useful, but not so much that it takes an enormous amount of space.
Current TODOs:
- Add more sites using the abstract interface, to get more info from sites I use commonly
- Create a (separate repo/project) daemon which handles configuring this and slowly requests things in the background as they become available through given sources; allow the user to provide generators/inputs and define include/exclude lists/regexes. Probably just integrate with promnesia, to avoid duplicating the work of searching for URLs on disk
This is meant to provide more context to any of my tools which use URLs. If I watched some YouTube video and I have a URL, I'd like to have the subtitles for it, so I can do a text-search over all the videos I've watched. If I read an article, I want the article text! This requests, parses, and abstracts away that data for me locally.
Installation
Requires python3.7+
To install with pip, run:
pip install url_summary
Generally, this uses:
- lassie to get generic metadata; the title, description, opengraph information, links to images/videos on the page
- readability/lxml to convert/compress HTML to a summary of the HTML content
Site-Specific Extractors:
- Youtube: to get manual/autogenerated captions (converted to a .srt file) from Youtube URLs
This is meant to be extensible -- so it's possible for you to write your own extractors/file loaders/dumpers (for new formats (e.g. srt)) for sites you use commonly and pass those to url_summary.core.URLSummaryCache to cache richer data for those sites. Otherwise, it saves the information from lassie and the summarized HTML using readability for each URL.
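The dumper interface itself isn't documented here, but to illustrate the kind of format a custom dumper for srt would have to produce, here is a minimal, hypothetical sketch that serializes (start, end, text) caption tuples into SRT text; none of these names come from the url_summary API:

```python
from typing import List, Tuple

def dump_srt(captions: List[Tuple[float, float, str]]) -> str:
    """Serialize (start_seconds, end_seconds, text) tuples into SRT format."""
    def ts(seconds: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, (start, end, text) in enumerate(captions, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text}\n")
    return "\n".join(blocks)
```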
Usage:
The CLI interface provides some utility commands to get/list information from the cache.
$ url_summary --help
Usage: url_summary [OPTIONS] COMMAND [ARGS]...
Options:
--cache-dir PATH Override default cache directory location
--debug / --no-debug Increase log verbosity
--sleep-time INTEGER How long to sleep between requests
--skip-subtitles Don't attempt to download subtitles
--subtitle-language TEXT Subtitle language for Youtube captions
--help Show this message and exit.
Commands:
cachedir Prints the location of the local cache directory
export Print all cached information as JSON
get       Get information for one or more URLs; prints results as JSON
list List all cached URLs
An environment variable URL_METADATA_DIR can be set, which changes the default metadata cache directory.
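The precedence is straightforward to sketch: the environment variable wins, otherwise a default user data directory is used. The fallback path below is an assumption modeled on the example cache paths shown later in this README (the library itself resolves it via appdirs):

```python
import os
from pathlib import Path

def default_cache_dir() -> Path:
    """Resolve the metadata cache directory, preferring URL_METADATA_DIR."""
    env = os.environ.get("URL_METADATA_DIR")
    if env is not None:
        return Path(env).expanduser()
    # assumed fallback; the real library computes this with appdirs
    return Path("~/.local/share/url_summary").expanduser()
```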
In Python, this can be configured by using the url_summary.URLMetadataCache
class:
url_summary.URLMetadataCache(loglevel: int = 30,
subtitle_language: str = 'en',
sleep_time: int = 5,
skip_subtitles: bool = False,
cache_dir: Optional[Union[str, pathlib.Path]] = None):
"""
Main interface to the library
subtitle_language: for youtube subtitle requests
sleep_time: time to wait between HTTP requests
skip_subtitles: don't attempt to download youtube subtitles
cache_dir: location to store cached data
uses default user cache directory if not provided
"""
get(self, url: str) -> url_summary.model.Metadata
"""
Gets metadata/summary for a URL
Save the parsed information in a local data directory
If the URL already has cached data locally, returns that instead
"""
get_cache_dir(self, url: str) -> Optional[str]
"""
If this URL is in cache, returns the location of the cache directory
Returns None if it couldn't find a matching directory
"""
in_cache(self, url: str) -> bool
"""
Returns True if the URL already has cached information
"""
request_data(self, url: str) -> url_summary.model.Metadata
"""
Given a URL:
If this is a youtube URL, this requests youtube subtitles
Uses lassie to grab metadata
Parses/minifies the HTML text with readability/lxml
"""
For example:
import logging
from url_summary import URLMetadataCache
# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLMetadataCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/mydata")
c = cache.get("https://github.com/seanbreckenridge/url_summary")
# just request information, don't read/save to cache
data = cache.request_data("https://www.wikipedia.org/")
CLI Examples
The get command emits JSON, so it could be used with other tools (e.g. jq) like:
$ url_summary get "https://click.palletsprojects.com/en/7.x/arguments/" \
| jq -r '.[] | .html_summary' | lynx -stdin -dump | head -n 5
Arguments
Arguments work similarly to options but are positional. They also only
support a subset of the features of options due to their syntactical
nature. Click will also not attempt to document arguments for you and
$ url_summary export | jq -r '.[] | .info | .title'
seanbreckenridge/youtube_subtitles_downloader
Arguments — Click Documentation (7.x)
$ url_summary list --location
/home/sean/.local/share/url_summary/data/b/a/a/c8e05501857a3c7d2d1a94071c68e/000
/home/sean/.local/share/url_summary/data/9/4/4/1c380792a3d62302e1137850d177b/000
# to make a backup of the cache directory
$ tar -cvzf url_summary.tar.gz "$(url_summary cachedir)"
Accessible through the url_summary script and python3 -m url_summary
Implementation Notes
This stores all of this information as individual files in a cache directory (using appdirs
). In particular, it MD5
hashes the URL and stores information like:
.
└── 7
└── b
└── 0
└── d952fd7265e8e4cf10b351b6f8932
└── 000
├── epoch_timestamp.txt
├── key
├── metadata.json
├── subtitles.srt
├── summary.html
└── summary.txt
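Given that layout, the hash-to-path mapping can be sketched as follows. The single-character sharding of the first three hex digits is an assumption inferred from the example directory tree above (sharding keeps any one directory from accumulating too many entries), and `cache_path` is a hypothetical name, not part of the library's API:

```python
import hashlib
from pathlib import Path

def cache_path(url: str, base: Path) -> Path:
    # MD5-hash the URL, then nest the first three hex characters as
    # single-character directories, with the remaining 29 as the leaf
    h = hashlib.md5(url.encode()).hexdigest()
    return base / h[0] / h[1] / h[2] / h[3:] / "000"
```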
You're free to delete any of the directories in the cache if you want; this doesn't maintain a strict index, it uses a hash of the URL and then searches for a matching key file. See comments here for implementation details.
By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I do that same loop, it doesn't have to make any requests and it just grabs all the info from local cache.
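That background loop can be sketched generically. Here `fetch` and `in_cache` are hypothetical stand-ins for the library's `request_data`/`in_cache` methods; the point is only that already-cached URLs cost nothing, so the sleep applies solely to real network requests:

```python
import time
from typing import Callable, Dict, Iterable

def cache_all(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    in_cache: Callable[[str], bool],
    sleep_time: float = 5.0,
) -> Dict[str, object]:
    """Fetch every uncached URL once, sleeping between network requests."""
    results: Dict[str, object] = {}
    for url in urls:
        if in_cache(url):
            # already cached locally -- no request, no sleep
            continue
        results[url] = fetch(url)
        time.sleep(sleep_time)
    return results
```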
Originally created for HPI.
Testing
git clone 'https://github.com/seanbreckenridge/url_summary'
cd ./url_summary
git submodule update --init
pip install '.[testing]'
mypy ./src/url_summary/
pytest