Media Cloud news article metadata extraction

Media Cloud Metadata Extractor

This package extracts the domain, title, publication date, text content, and language from the URL or text of an online news story. The methods for each are extracted from the larger Media Cloud project and build on numerous third-party libraries. The metadata extracted includes:

  • the original URL of publication
  • a normalized URL useful for de-duplication
  • the canonical domain published on
  • the date of publication
  • the primary language used in the article text
  • the title of the article
  • a normalized title useful for de-duplication
  • the text content of the news article
  • the name of the library used to extract the article content

Other frequently reused methods and configuration related to the Media Cloud service also live in this package.

Installation

pip install mediacloud-metadata

Usage

If you pass in a URL, it will follow redirects and fetch the HTML for you.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")

You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
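The normalized URL listed above is described as useful for de-duplication. The following is a minimal sketch of that idea, not the library's own method. It assumes each result is a dict and that the normalized URL is stored under a key named `normalized_url`; that key name, the `dedupe_stories` helper, and the sample dicts are all illustrative assumptions, so check the dict returned by your installed version before relying on them.

```python
from typing import Any, Dict, Iterable, List


def dedupe_stories(stories: Iterable[Dict[str, Any]],
                   key: str = "normalized_url") -> List[Dict[str, Any]]:
    """Keep the first story seen for each distinct value of `key`.

    `key` is an assumed field name; stories missing it are kept as-is
    because there is nothing to compare them on.
    """
    seen = set()
    unique = []
    for story in stories:
        value = story.get(key)
        if value is None:
            unique.append(story)
            continue
        if value in seen:
            continue  # a story with this normalized value was already kept
        seen.add(value)
        unique.append(story)
    return unique


# Hand-written stand-ins for results from extract(); no network access needed.
sample = [
    {"normalized_url": "http://my.awesome.news/story-path", "article_title": "A"},
    {"normalized_url": "http://my.awesome.news/story-path", "article_title": "A (repost)"},
    {"normalized_url": "http://my.awesome.news/other-path", "article_title": "B"},
]

unique_stories = dedupe_stories(sample)
print([story["article_title"] for story in unique_stories])
```

The same helper could be pointed at a normalized-title field by passing a different `key`, again assuming you have confirmed the field name in the returned metadata.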

Development

If you are interested in adding code to this module, first clone the GitHub repository.

Installing

  • flit install
  • pre-commit install

Testing

pytest

Distributing a New Version

  1. Run pytest to make sure all the tests pass
  2. Update the version number in pyproject.toml
  3. Make a brief note in CHANGELOG.md about what changed
  4. Commit the changes
  5. Tag the commit with a semantic version number - v*.*.*
  6. Push the repo to GitHub

Test Cache

Tests are run against fixtures by default. To change this, pass '--use-cache=False' when running tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.

Contributors

Created as part of the Media Cloud Project. Contributors include:

  • Rahul Bhargava (Media Cloud, Northeastern University)
  • Paige Gulley (Media Cloud)
  • Phil Budne (Media Cloud)
  • Vangelis Banos (Internet Archive)
