Project description

A cache which saves URL metadata and content

This is meant to provide more context to any of my tools which use URLs. If I watched some YouTube video and I have a URL, I'd like to have the subtitles for it, so I can do a text search over all the videos I've watched. If I read an article, I want the article text! This requests, parses, and abstracts away that data for me locally, so I can just do:

>>> from url_metadata import metadata
>>> m = metadata("https://pypi.org/project/beautifulsoup4/")
>>> len(m.info["images"])
46
>>> m.info["title"]
'beautifulsoup4'
>>> m.text_summary[:57]
"Beautiful Soup is a library that makes it easy to scrape"

If I ever request the same URL again, that info is grabbed from a local directory cache instead.
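
For instance, calling metadata twice with the same URL only makes network requests the first time:

from url_metadata import metadata

# the first call makes network requests and writes to the local cache
m = metadata("https://pypi.org/project/beautifulsoup4/")
# a second call for the same URL reads from the cache instead
m_again = metadata("https://pypi.org/project/beautifulsoup4/")
assert m_again.info["title"] == m.info["title"]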


Installation

Requires Python 3.7+

To install with pip, run:

pip install url_metadata

This uses (a sketch of how these pieces fit together follows this list):

  • lassie to get generic metadata: the title, description, opengraph information, and links to images/videos on the page
  • readability to extract a cleaned-up summary of the HTML content
  • bs4 to convert the parsed HTML to plaintext (to allow for nicer text searching)
  • youtube_subtitles_downloader to get manual/autogenerated captions (converted to a .srt file) from YouTube URLs
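
Roughly, those pieces combine like this; a minimal sketch, not the library's actual code (fetch_summary is a hypothetical helper, and YouTube subtitle handling is omitted):

import requests
import lassie
from bs4 import BeautifulSoup
from readability import Document  # the readability-lxml package

def fetch_summary(url: str) -> dict:
    info = lassie.fetch(url)  # title, description, opengraph info, image/video links
    html = requests.get(url).text
    summary_html = Document(html).summary()  # readability extracts the main content
    # bs4 flattens the summarized HTML into searchable plaintext
    text = BeautifulSoup(summary_html, "html.parser").get_text(separator="\n")
    return {"info": info, "summary_html": summary_html, "summary_text": text}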

Usage:

In Python, this can be configured by using the url_metadata.URLMetadataCache class:

   URLMetadataCache(loglevel: int = logging.WARNING,
                    subtitle_language: str = 'en',
                    sleep_time: int = 5,
                    cache_dir: Union[str, pathlib.Path, NoneType] = None)
       """
       Main interface to the library

       Supply 'cache_dir' to overwrite the default location.
       """

   get(self, url: str) -> url_metadata.model.Metadata
       """
       Gets metadata/summary for a URL.
        Saves the parsed information in a local data directory.
       If the URL already has cached data locally, returns that instead.
       """

   in_cache(self, url: str) -> bool
       """
       Returns True if the URL already has cached information
       """

   request_data(self, url: str) -> url_metadata.model.Metadata
       """
       Given a URL:

        If this is a YouTube URL, requests the YouTube subtitles
        Uses lassie to grab metadata
        Parses the HTML text with readability
        Uses bs4 to parse that text into a plaintext summary
       """

For example:

import logging
from url_metadata import URLMetadataCache

# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLMetadataCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/mydata")
c = cache.get("https://github.com/seanbreckenridge/url_metadata")
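
The returned Metadata object is the same model the top-level metadata helper returns; in_cache lets you check for cached data without making any requests:

print(c.info["title"])       # generic metadata via lassie
print(c.text_summary[:50])   # plaintext summary via readability/bs4
# True now, since the get() call above cached the result
print(cache.in_cache("https://github.com/seanbreckenridge/url_metadata"))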

The CLI lets you specify much the same options, and provides some utility commands to get/list information from the cache.

$ url_metadata
Usage: url_metadata [OPTIONS] COMMAND [ARGS]...

Options:
  --cache-dir PATH          Override default directory cache location
  --debug / --no-debug      Increase log verbosity
  --sleep-time INTEGER      How long to sleep between requests
  --subtitle-language TEXT  Subtitle language for Youtube captions
  --help                    Show this message and exit.

Commands:
  cachedir  Prints the location of the local cache directory
  export    Print all cached information as JSON
  get       Get information for one or more URLs.
  list      List all cached URLs

CLI Examples

The get command emits JSON, so it can be used with other tools (e.g. jq) like:

$ url_metadata get "https://click.palletsprojects.com/en/7.x/arguments/" \
    | jq -r '.[] | .text_summary' | head -n5
Arguments
Arguments work similarly to options but are positional.
They also only support a subset of the features of options due to their
syntactical nature. Click will also not attempt to document arguments for
you and wants you to document them manually
$ url_metadata export | jq -r '.[] | .info | .title'
seanbreckenridge/youtube_subtitles_downloader
Arguments — Click Documentation (7.x)
$ url_metadata list --location
/home/sean/.local/share/url_metadata/data/b/a/a/c8e05501857a3c7d2d1a94071c68e/000
/home/sean/.local/share/url_metadata/data/9/4/4/1c380792a3d62302e1137850d177b/000
# to make a backup of the cache directory
$ tar -cvzf url_metadata.tar.gz "$(url_metadata cachedir)"

This is accessible through the url_metadata script and python3 -m url_metadata.


This stores all of that information as individual files in a cache directory (located using appdirs). In particular, it MD5-hashes the URL and stores files like:

.
└── 7
    └── b
        └── 0
            └── d952fd7265e8e4cf10b351b6f8932
                └── 000
                    ├── epoch_timestamp.txt
                    ├── key
                    ├── metadata.json
                    ├── subtitles.srt
                    ├── summary.html
                    └── summary.txt

You're free to delete any of the directories in the cache if you want; this doesn't maintain a strict index. It hashes the URL and then searches for a matching key file. See the comments here for implementation details.
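
For illustration, the path scheme implied by the tree above looks roughly like this (a sketch, not the library's actual code; the trailing 000 appears to be a per-hash index directory):

import hashlib

def approximate_cache_dir(url: str) -> str:
    h = hashlib.md5(url.encode()).hexdigest()
    # the first three hex characters become nested directories,
    # the remaining 29 the leaf directory name (as in the tree above)
    return "/".join([h[0], h[1], h[2], h[3:], "000"])

print(approximate_cache_dir("https://pypi.org/project/beautifulsoup4/"))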

By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting the info for one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time that loop runs, it doesn't have to make any requests; it just grabs everything from the local cache.
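
Such a loop might look like this (a sketch; the URL list here is a stand-in for whatever data source you're iterating over):

from url_metadata import URLMetadataCache

urls = [  # in practice, pulled from bookmarks, watch history, etc.
    "https://pypi.org/project/beautifulsoup4/",
    "https://github.com/seanbreckenridge/url_metadata",
]

cache = URLMetadataCache(sleep_time=5)
for url in urls:
    cache.get(url)  # cached URLs return immediately; new ones wait sleep_time apart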

Originally created for HPI.
