
Media Cloud news article metadata extraction


Media Cloud Metadata Extractor

This package extracts the domain, title, publication date, text content, and language from the URL or HTML text of an online news story. The extraction methods come from the larger Media Cloud project and build on numerous third-party libraries. The extracted metadata includes:

  • the original URL of publication
  • a normalized URL useful for de-duplication
  • the canonical domain published on
  • the date of publication
  • the primary language used in the article text
  • the title of the article
  • a normalized title useful for de-duplication
  • the text content of the news article
  • the name of the library used to extract the article content
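
To make the two normalized fields concrete, here is a minimal sketch of what normalizing a URL and a title for de-duplication can look like. It uses only the Python standard library and is not the package's actual algorithm; the function names and the list of tracking parameters are illustrative assumptions.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of tracking parameters to drop; not taken from the package.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def sketch_normalize_url(url: str) -> str:
    """Lowercase the host, drop "www.", tracking params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Keep only query parameters that are not in the tracking set.
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(kept), ""))


def sketch_normalize_title(title: str) -> str:
    """Lowercase the title, strip punctuation, and collapse runs of whitespace."""
    without_punctuation = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", without_punctuation).strip()
```

Two stories whose URLs differ only in tracking parameters or letter case would then map to the same string, which is what makes a normalized form useful as a de-duplication key.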

Other frequently reused methods and configuration related to the Media Cloud service also live in this package.

Installation

pip install mediacloud-metadata

Usage

If you pass in a URL, the package follows redirects and fetches the HTML for you.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")

You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
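
Once several stories have been processed, the results can be de-duplicated. The sketch below assumes that extract returns a dict-like object and that it exposes a normalized_url key; both are assumptions to verify against your installed version. The helper name dedupe_by_key is hypothetical.

```python
from typing import Any, Iterable, List, Mapping


def dedupe_by_key(stories: Iterable[Mapping[str, Any]],
                  key: str = "normalized_url") -> List[Mapping[str, Any]]:
    """Keep the first story seen for each value of `key`.

    Stories that lack the key are always kept, since they cannot be compared.
    """
    seen = set()
    unique = []
    for story in stories:
        marker = story.get(key)  # assumed key name; adjust to the real field
        if marker is None:
            unique.append(story)
        elif marker not in seen:
            seen.add(marker)
            unique.append(story)
    return unique
```

A call such as dedupe_by_key([extract(url=u) for u in urls]) would then return one entry per distinct normalized URL, provided the assumptions above hold.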

Development

If you are interested in adding code to this module, first clone the GitHub repository.

Installing

  • flit install
  • pre-commit install

Testing

pytest

Distributing a New Version

  1. Run pytest to make sure all the tests pass
  2. Update the version number in pyproject.toml
  3. Make a brief note in the CHANGELOG.md about what changed
  4. Commit the changes
  5. Tag the commit with a semantic version number - v*.*.*
  6. Push the repo to GitHub
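
Steps 2 and 5 are easy to get out of sync. As a rough pre-push check, the sketch below compares a tag against the version in pyproject.toml. It assumes the version appears as a literal version = "..." line and uses a regular expression rather than a TOML parser; the function names are hypothetical and not part of this repository.

```python
import re
from typing import Optional

# Tags are expected to look like v1.2.3, per the v*.*.* convention above.
SEMVER_TAG = re.compile(r"v\d+\.\d+\.\d+")


def version_from_pyproject(pyproject_text: str) -> Optional[str]:
    """Return the first literal `version = "..."` value, or None if absent."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def tag_matches_version(tag: str, pyproject_text: str) -> bool:
    """True when the tag is v<major>.<minor>.<patch> and equals the pyproject version."""
    if not SEMVER_TAG.fullmatch(tag):
        return False
    return tag[1:] == version_from_pyproject(pyproject_text)
```

For example, tag_matches_version("v1.4.1", open("pyproject.toml").read()) could be run before pushing the tag.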

Test Cache

Tests are run against fixtures by default. To change this, pass '--use-cache=False' when running the tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.
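
For readers unfamiliar with custom pytest flags, the following is a minimal sketch of how an option like '--use-cache' could be registered in a conftest.py. It is an illustration only and may differ from how this repository actually wires the flag; the str_to_bool helper is hypothetical.

```python
def str_to_bool(value: str) -> bool:
    """Convert a command-line string such as 'False' or 'true' to a bool."""
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    raise ValueError(f"cannot interpret {value!r} as a boolean")


def pytest_addoption(parser):
    """pytest hook: register --use-cache, defaulting to cached fixtures."""
    parser.addoption(
        "--use-cache",
        action="store",
        default=True,
        type=str_to_bool,
        help="Run tests against cached fixtures (default) or live web content.",
    )
```

With a hook like this in place, a test or fixture could read the setting through request.config.getoption("--use-cache").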

Contributors

Created as part of the Media Cloud Project. Contributors include:

  • Rahul Bhargava (Media Cloud, Northeastern University)
  • Paige Gulley (Media Cloud)
  • Phil Budne (Media Cloud)
  • Vangelis Banos (Internet Archive)
