Media Cloud news article metadata extraction

Media Cloud Metadata Extractor

This package extracts the domain, title, publication date, text content, and language from the URL or text of an online news story. The methods for each are drawn from the larger Media Cloud project and build on numerous third-party libraries. The metadata extracted includes:

  • the original URL of publication
  • a normalized URL useful for de-duplication
  • the canonical domain published on
  • the date of publication
  • the primary language used in the article text
  • the title of the article
  • a normalized title useful for de-duplication
  • the text content of the news article
  • the name of the library used to extract the article content
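To illustrate why the normalized fields are useful for de-duplication, here is a minimal sketch that keeps only the first story seen for each normalized value. It is not part of this package: the helper name dedupe_stories is hypothetical, and the dictionary keys normalized_url and normalized_article_title are assumptions about how the extracted metadata is keyed, so check them against the actual output before relying on them.

```python
from typing import Any, Dict, Iterable, List


def dedupe_stories(stories: Iterable[Dict[str, Any]],
                   key: str = "normalized_url") -> List[Dict[str, Any]]:
    """Keep the first story seen for each value of `key`.

    `key` is an assumed field name (hypothetical); stories that lack it
    are always kept, because there is nothing to compare them on.
    """
    seen = set()
    unique = []
    for story in stories:
        value = story.get(key)
        if value is None:
            unique.append(story)
            continue
        if value not in seen:
            seen.add(value)
            unique.append(story)
    return unique


# Two records that differ only in their original URL collapse into one.
stories = [
    {"original_url": "https://my.awesome.news/story-path?utm_source=a",
     "normalized_url": "my.awesome.news/story-path"},
    {"original_url": "https://my.awesome.news/story-path?utm_source=b",
     "normalized_url": "my.awesome.news/story-path"},
    {"original_url": "https://my.awesome.news/other-story",
     "normalized_url": "my.awesome.news/other-story"},
]
unique_stories = dedupe_stories(stories)
```

The same helper could be pointed at the normalized title instead by passing key="normalized_article_title", again assuming that is the field name.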

Other frequently reused methods and configuration related to the Media Cloud service also live in this package.

Installation

pip install mediacloud-metadata

Usage

If you pass in a URL, it will follow redirects and fetch the HTML for you.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")

You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
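When processing many stories, it can help to wrap the call so that one failure does not stop the whole batch. The sketch below assumes that failures surface as Python exceptions. The helper name try_extract and its extract_fn parameter are hypothetical, added here only so the wrapper can be exercised without network access or the package installed; only the url and html_text arguments come from the usage shown above.

```python
from typing import Any, Callable, Mapping, Optional


def try_extract(url: str,
                html_text: Optional[str] = None,
                extract_fn: Optional[Callable[..., Mapping[str, Any]]] = None
                ) -> Optional[Mapping[str, Any]]:
    """Call the extractor and return None instead of raising on failure.

    `extract_fn` is a hypothetical injection point; when omitted, the
    package's own `extract` function is imported and used.
    """
    if extract_fn is None:
        # Imported lazily so the helper itself has no hard dependency.
        from mcmetadata import extract as extract_fn
    kwargs = {"url": url}
    if html_text is not None:
        # Only forward HTML that the caller actually supplied.
        kwargs["html_text"] = html_text
    try:
        return extract_fn(**kwargs)
    except Exception as exc:  # broad on purpose: this is a batch guard
        print(f"extraction failed for {url}: {exc}")
        return None


# Stand-in extractors used only to demonstrate the wrapper's behavior.
def fake_extract(**kwargs):
    return dict(kwargs)


def failing_extract(**kwargs):
    raise RuntimeError("simulated failure")


ok_result = try_extract("https://my.awesome.news/story-path",
                        html_text="<html></html>",
                        extract_fn=fake_extract)
failed_result = try_extract("https://my.awesome.news/story-path",
                            extract_fn=failing_extract)
```

In real use you would omit extract_fn and let the wrapper call the package directly. Catching a narrower exception type would be preferable if you know which errors to expect.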

Development

If you are interested in adding code to this module, first clone the GitHub repository.

Installing

  • flit install
  • pre-commit install

Testing

pytest

Distributing a New Version

  1. Run pytest to make sure all the tests pass
  2. Update the version number in pyproject.toml
  3. Make a brief note in CHANGELOG.md about what changed
  4. Commit the changes
  5. Tag the commit with a semantic version number - v*.*.*
  6. Push the repo to GitHub

Test Cache

Tests are run against fixtures by default. To change this, pass '--use-cache=False' when running the tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.

Contributors

Created as part of the Media Cloud Project. Contributors include:

  • Rahul Bhargava (Media Cloud, Northeastern University)
  • Paige Gulley (Media Cloud)
  • Phil Budne (Media Cloud)
  • Vangelis Banos (Internet Archive)
