Media Cloud news article metadata extraction

Media Cloud Metadata Extractor

This package extracts the domain, title, publication date, text content, and language from the URL or HTML text of an online news story. The methods are drawn from the larger Media Cloud project and also build on numerous third-party libraries. The extracted metadata includes:

  • the original URL of publication
  • a normalized URL useful for de-duplication
  • the canonical domain published on
  • the date of publication
  • the primary language used in the article text
  • the title of the article
  • a normalized title useful for de-duplication
  • the text content of the news article
  • the name of the library used to extract the article content
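
To make the idea of a normalized URL for de-duplication concrete, here is a minimal sketch using only the Python standard library. It is not the package's actual normalization logic: the function name `naive_normalize_url`, the `utm_` tracking prefix, and the specific rules (dropping the scheme, port, `www.` prefix, fragment, and trailing slash) are all assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical tracking-parameter prefixes; not taken from the package.
TRACKING_PREFIXES = ("utm_",)


def naive_normalize_url(url: str) -> str:
    """Collapse trivially different URLs for the same story into one key."""
    parts = urlsplit(url.strip())

    # Lowercase the host and drop a leading "www." (the port is dropped too).
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    # Remove tracking parameters and sort the rest so order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )

    # Ignore a trailing slash; the scheme and fragment are left out entirely.
    path = parts.path.rstrip("/") or "/"

    normalized = f"{host}{path}"
    if query:
        normalized += "?" + urlencode(query)
    return normalized


a = naive_normalize_url("https://www.Example.com/story/?utm_source=x&id=7#top")
b = naive_normalize_url("http://example.com/story?id=7")
print(a == b)  # both variants map to the same key
```

Two stories whose keys match under a scheme like this can be treated as candidates for the same article, which is the purpose the normalized URL and normalized title fields serve.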

Other frequently reused methods and configuration related to the Media Cloud service also live in this package.

Installation

pip install mediacloud-metadata

Usage

If you pass in a URL, it will follow redirects and fetch the HTML for you.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")

You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
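
Once you have a result, a quick way to inspect it is to print each field on its own line. The helper below is a small sketch that assumes `extract` returns a dict-like mapping of field names to values; the name `summarize_metadata` is hypothetical, and the helper deliberately iterates over whatever keys are present rather than assuming specific key names.

```python
from typing import Any, Mapping


def summarize_metadata(metadata: Mapping[str, Any], max_len: int = 60) -> list[str]:
    """Return one 'key: value' line per field, truncating long values."""
    lines = []
    for key in sorted(metadata):
        # Flatten newlines so long article text stays on one line.
        value = str(metadata[key]).replace("\n", " ")
        if len(value) > max_len:
            value = value[:max_len] + "..."
        lines.append(f"{key}: {value}")
    return lines


# Usage with the package (needs mediacloud-metadata installed and network access):
#   from mcmetadata import extract
#   metadata = extract(url="https://my.awesome.news/story-path")
#   print("\n".join(summarize_metadata(metadata)))

# Stand-in mapping so the helper can be tried without the package.
print("\n".join(summarize_metadata({"example_key": "example value"})))
```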

Development

If you are interested in adding code to this module, first clone the GitHub repository.

Installing

  • flit install
  • pre-commit install

Testing

pytest

Distributing a New Version

  1. Run pytest to make sure all the tests pass
  2. Update the version number in pyproject.toml
  3. Make a brief note in CHANGELOG.md about what changed
  4. Commit the changes
  5. Tag the commit with a semantic version number - v*.*.*
  6. Push the repo to GitHub

Test Cache

Tests are run against cached fixtures by default. To change this, pass '--use-cache=False' when running tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.
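
For readers unfamiliar with custom pytest flags, here is one way an option like '--use-cache' could be wired up in a conftest.py. This is a hedged sketch, not the repository's actual implementation: the helper names `parse_use_cache` and `use_cache_enabled` are hypothetical, while `pytest_addoption`, `parser.addoption`, and `config.getoption` are standard pytest hooks and methods.

```python
def parse_use_cache(value: str) -> bool:
    """Interpret a command-line string such as 'False' or 'true' as a bool."""
    return value.strip().lower() not in ("false", "0", "no")


def pytest_addoption(parser):
    """pytest hook: register --use-cache, defaulting to cached fixtures."""
    parser.addoption(
        "--use-cache",
        action="store",
        default="True",
        help="Set to False to fetch live web content instead of fixtures",
    )


def use_cache_enabled(config) -> bool:
    """Read the flag from a pytest config object (e.g. request.config)."""
    return parse_use_cache(config.getoption("--use-cache"))
```

The flag arrives as a string, which is why '--use-cache=False' needs explicit parsing rather than a plain truthiness check.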

Contributors

Created as part of the Media Cloud Project. Contributors include:

  • Rahul Bhargava (Media Cloud, Northeastern University)
  • Paige Gulley (Media Cloud)
  • Phil Budne (Media Cloud)
  • Vangelis Banos (Internet Archive)

