url-metadata

A cache which saves URL metadata and summarizes content

These details have not been verified by PyPI

Project links

Homepage

Project description

This is still in development and far from perfect, so expect changes to the API/interface. It aims to walk the line between extracting enough text/data to be useful, but not so much that it takes up enormous amounts of space.

Current TODOs:

Improve CLI interface to match all functions
Improve HTML/text parsing (see #6)
Add more sites using the abstract interface, to get more info from sites I use commonly
Add a preprocessing step to the sites abstract interface/URLMetadataCache functions, which 'corrects' URLs, to avoid hash mismatches
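
The URL-correcting step in the last TODO item does not exist yet. As a rough sketch of what such a step might look like, the hypothetical normalize_url below uses only the standard library to lowercase the scheme and host, drop a default port and remove the fragment, so that trivially different spellings of the same URL would hash to the same value. The function name and the specific rules are assumptions, not part of url_metadata.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the meaning of a URL
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Hypothetical helper: return a canonical spelling of a URL.

    Sketch only: userinfo (user:password@) and IPv6 literals are not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # keep the port only if it is not the default one for the scheme
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # an empty path and "/" refer to the same resource
    path = parts.path or "/"
    # drop the fragment, since it is never sent to the server
    return urlunsplit((scheme, host, path, parts.query, ""))
```

Which rules are safe depends on the sites involved (some treat query parameters or trailing slashes as significant), so a real implementation would probably live next to the per-site handlers.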

This is meant to provide more context to any of my tools which use URLs. If I watched some YouTube video and I have a URL, I'd like to have the subtitles for it, so I can do a text search over all the videos I've watched. If I read an article, I want the article text! This requests, parses and abstracts away that data for me locally, so I can just do:

>>> from url_metadata import metadata
>>> m = metadata("https://pypi.org/project/beautifulsoup4/")
>>> len(m.info["images"])
46
>>> m.info["title"]
'beautifulsoup4'
>>> m.text_summary[:57]
"Beautiful Soup is a library that makes it easy to scrape"

If I ever request the same URL again, that info is grabbed from a local directory cache instead.

Installation

Requires Python 3.7+

To install with pip, run:

pip install url_metadata

This uses:

lassie to get generic metadata: the title, description, opengraph information, and links to images/videos on the page
readability to reduce the HTML to a summary of its main content
bs4 to convert the parsed HTML to text (to allow for nicer text searching)
youtube_subtitles_downloader to get manual/autogenerated captions (converted to a .srt file) from YouTube URLs

Usage:

The CLI provides some utility commands to get/list information from the cache.

$ url_metadata --help
Usage: url_metadata [OPTIONS] COMMAND [ARGS]...

Options:
  --cache-dir PATH          Override default cache directory location
  --debug / --no-debug      Increase log verbosity
  --sleep-time INTEGER      How long to sleep between requests
  --skip-subtitles          Don't attempt to download subtitles
  --subtitle-language TEXT  Subtitle language for Youtube captions
  --help                    Show this message and exit.

Commands:
  cachedir  Prints the location of the local cache directory
  export    Print all cached information as JSON
  get       Get information for one or more URLs Prints results as JSON
  list      List all cached URLs

In Python, this can be configured by using the url_metadata.URLMetadataCache class:

url_metadata.URLMetadataCache(loglevel: int = 30,
                            subtitle_language: str = 'en',
                            sleep_time: int = 5,
                            skip_subtitles: bool = False,
                            cache_dir: Optional[Union[str, pathlib.Path]] = None):
    """
    Main interface to the library

    subtitle_language: for youtube subtitle requests
    sleep_time: time to wait between HTTP requests
    skip_subtitles: don't attempt to download youtube subtitles
    cache_dir: location to store cached data
               uses default user cache directory if not provided
    """

get(self, url: str) -> url_metadata.model.Metadata
    """
    Gets metadata/summary for a URL
    Save the parsed information in a local data directory
    If the URL already has cached data locally, returns that instead
    """

get_cache_dir(self, url: str) -> Optional[str]
    """
    If this URL is in cache, returns the location of the cache directory
    Returns None if it couldn't find a matching directory
    """

in_cache(self, url: str) -> bool
    """
    Returns True if the URL already has cached information
    """

request_data(self, url: str) -> url_metadata.model.Metadata
    """
    Given a URL:

    If this is a youtube URL, this requests youtube subtitles
    Uses lassie to grab metadata
    Parses the HTML text with readability
    uses bs4 to parse that text into a plaintext summary
    """

For example:

import logging
from url_metadata import URLMetadataCache

# make requests every 2 seconds
# debug logs
# save to a folder in my home directory
cache = URLMetadataCache(loglevel=logging.DEBUG, sleep_time=2, cache_dir="~/mydata")
c = cache.get("https://github.com/seanbreckenridge/url_metadata")
# just request information, don't read/save to cache
data = cache.request_data("https://www.wikipedia.org/")

CLI Examples

The get command emits JSON, so it can be combined with other tools (e.g. jq) like:

$ url_metadata get "https://click.palletsprojects.com/en/7.x/arguments/" \
    | jq -r '.[] | .text_summary' | head -n5
Arguments
Arguments work similarly to options but are positional.
They also only support a subset of the features of options due to their
syntactical nature. Click will also not attempt to document arguments for
you and wants you to document them manually

$ url_metadata export | jq -r '.[] | .info | .title'
seanbreckenridge/youtube_subtitles_downloader
Arguments — Click Documentation (7.x)

$ url_metadata list --location
/home/sean/.local/share/url_metadata/data/b/a/a/c8e05501857a3c7d2d1a94071c68e/000
/home/sean/.local/share/url_metadata/data/9/4/4/1c380792a3d62302e1137850d177b/000

# to make a backup of the cache directory
$ tar -cvzf url_metadata.tar.gz "$(url_metadata cachedir)"

The CLI is accessible through the url_metadata script and python3 -m url_metadata.
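
The JSON emitted by get and export can also be consumed from Python. A minimal sketch, assuming the export is a list of objects with info.title and text_summary fields, as the jq filters above suggest; the sample data and the search_summaries helper are made up for illustration.

```python
import json
from typing import List

# Made-up sample standing in for the output of `url_metadata export`
SAMPLE_EXPORT = """
[
  {"info": {"title": "Arguments"},
   "text_summary": "Arguments work similarly to options but are positional."},
  {"info": {"title": "Example"},
   "text_summary": "Nothing relevant here."}
]
"""


def search_summaries(export_json: str, needle: str) -> List[str]:
    """Hypothetical helper: titles of cached pages whose summary mentions needle."""
    entries = json.loads(export_json)
    return [
        entry["info"]["title"]
        for entry in entries
        # tolerate entries with a missing or null summary
        if needle.lower() in (entry.get("text_summary") or "").lower()
    ]


print(search_summaries(SAMPLE_EXPORT, "positional"))
```

In practice the string would come from the CLI, for example by piping url_metadata export into a script that reads standard input.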

Implementation Notes

This stores all of this information as individual files in a cache directory (using appdirs). In particular, it MD5 hashes the URL and stores information like:

.
└── 7
    └── b
        └── 0
            └── d952fd7265e8e4cf10b351b6f8932
                └── 000
                    ├── epoch_timestamp.txt
                    ├── key
                    ├── metadata.json
                    ├── subtitles.srt
                    ├── summary.html
                    └── summary.txt
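
The directory layout above can be reproduced with a few lines of standard-library code. This is a sketch inferred from the tree, not the library's actual implementation: it assumes the URL is hashed as a UTF-8 string, that the first three hex characters become nested directories and that the remaining 29 name the leaf. cache_path_for is a hypothetical name.

```python
import hashlib
from pathlib import Path


def cache_path_for(url: str, base: Path) -> Path:
    """Hypothetical helper mirroring the tree above."""
    # an MD5 hex digest is always 32 characters long
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # first three characters become nested directories, the remaining 29 the leaf
    return base / digest[0] / digest[1] / digest[2] / digest[3:]


print(cache_path_for("https://example.com/", Path("data")))
```

Fanning out over the leading characters keeps any single directory from accumulating thousands of entries.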

You're free to delete any of the directories in the cache if you want. This doesn't maintain a strict index; it uses a hash of the URL and then searches for a matching key file. See comments here for implementation details.
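
The lookup described here might be sketched as follows. It assumes that the key file holds the original URL and that the numbered subdirectories (000, ...) distinguish URLs that share a hash directory; both points are inferred from the tree above rather than confirmed by the source, and find_cached_dir is a hypothetical name.

```python
import hashlib
from pathlib import Path
from typing import Optional


def find_cached_dir(url: str, base: Path) -> Optional[Path]:
    """Hypothetical lookup: return the directory whose key file matches url."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    hash_dir = base / digest[0] / digest[1] / digest[2] / digest[3:]
    if not hash_dir.is_dir():
        return None  # nothing cached under this hash
    # check each numbered subdirectory (000, 001, ...) for a matching key
    for candidate in sorted(hash_dir.iterdir()):
        key_file = candidate / "key"
        if key_file.is_file() and key_file.read_text().strip() == url:
            return candidate
    return None
```

Because the filesystem itself is the only index, deleting a directory simply turns the next lookup for that URL into a cache miss.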

By default this waits 5 seconds between requests. Since all the info is cached, I use this by requesting all the info from one data source (e.g. my bookmarks, or videos I've watched recently) in a loop in the background, which saves all the information to my computer. The next time I run that same loop, it doesn't have to make any requests and just grabs all the info from the local cache.
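
The background loop pattern can be illustrated without the library. In this sketch a plain dict stands in for the on-disk cache and fetch is a placeholder for the real network request; fetch_all and its parameters are hypothetical names.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(
    urls: Iterable[str],
    cache: Dict[str, str],
    fetch: Callable[[str], str],
    sleep_time: float = 5.0,
) -> Dict[str, str]:
    """Hypothetical batch loop: only uncached URLs cost a request and a sleep."""
    for url in urls:
        if url in cache:
            continue  # cache hit: no request, no waiting
        cache[url] = fetch(url)  # cache miss: make the request and store the result
        time.sleep(sleep_time)  # wait before the next request
    return cache
```

With the library itself, the body of the loop would presumably reduce to a single cache.get(url) call, since get is documented to return cached data when it exists.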

Originally created for HPI.

Testing

git clone 'https://github.com/seanbreckenridge/url_metadata'
cd ./url_metadata
git submodule update --init
pip install '.[testing]'
mypy ./src/url_metadata/
pytest

Project details

Release history

0.1.6 (Dec 30, 2020), this version
0.1.5 (Oct 8, 2020)
0.1.4 (Oct 7, 2020)
0.1.3 (Oct 4, 2020)
0.1.2 (Oct 3, 2020)
0.1.1 (Oct 3, 2020)
0.1.0 (Oct 3, 2020)

File details

Details for the file url_metadata-0.1.6.tar.gz.

File metadata

Download URL: url_metadata-0.1.6.tar.gz
Upload date: Dec 30, 2020
Size: 790.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.0.post20201221 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.9.1

File hashes

Hashes for url_metadata-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`aa971398f847aff3d213953fdf6db751101a6e000190697bf8c466565f97b496`
MD5	`d4827a71978aab8279a463270a4c577e`
BLAKE2b-256	`f514cfff7af9201490e0a405d6718d1cb8410707a9deefd99352709c7ced8d9a`

File details

Details for the file url_metadata-0.1.6-py3-none-any.whl.

File metadata

Download URL: url_metadata-0.1.6-py3-none-any.whl
Upload date: Dec 30, 2020
Size: 24.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.6.1 requests/2.25.1 setuptools/51.1.0.post20201221 requests-toolbelt/0.9.1 tqdm/4.54.1 CPython/3.9.1

File hashes

Hashes for url_metadata-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fde69ae3f9d6811910e9d541cb7f234a024219d24e45a9841311935b841313b4`
MD5	`b85029030a997f45900035ec6f2913c7`
BLAKE2b-256	`967ad3d08684ef1f2322b924f1f383c7710f502bef993d6b68ae2d7394651f8f`
