Media Cloud news article metadata extraction
Media Cloud Metadata Extractor
This package extracts the domain, title, publication date, text content, and language from the URL or HTML text of an online news story. The methods for each are drawn from the larger Media Cloud project and also build on numerous third-party libraries. The metadata extracted includes:
- the original URL of publication
- a normalized URL useful for de-duplication
- the canonical domain published on
- the date of publication
- the primary language used in the article text
- the title of the article
- a normalized title useful for de-duplication
- the text content of the news article
- the name of the library used to extract the article content
Other frequently reused methods and configuration related to the Media Cloud service also live in this package.
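To illustrate why a normalized URL is useful for de-duplication, here is a minimal sketch using only the standard library. It is not the package's own normalization logic: the function name `simple_normalize_url`, the `utm_` prefix list, and the specific rules (forcing one scheme, dropping `www.`, ports, fragments, and trailing slashes) are assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking-parameter prefixes to drop; not taken from the package.
TRACKING_PREFIXES = ("utm_",)


def simple_normalize_url(url: str) -> str:
    """Illustrative normalization so that trivially different URLs compare equal."""
    parts = urlsplit(url.strip())
    # .hostname is already lowercased and excludes any port or credentials
    host = parts.hostname or ""
    if host.startswith("www."):
        host = host[4:]
    # Keep non-tracking query parameters, sorted so their order does not matter
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/")
    # Assumption: http and https variants are treated as the same story,
    # and the fragment is discarded
    return urlunsplit(("http", host, path, urlencode(query), ""))
```

With these rules, `https://www.Example.com/story/?utm_source=x&id=3#top` and `http://example.com/story?id=3` reduce to the same string, so the two can be recognized as one story.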
Installation
pip install mediacloud-metadata
Usage
If you pass in a URL, the package follows redirects and fetches the HTML for you.
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")
You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
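When processing several stories, it can help to keep one failing URL from stopping the whole run. The following is a minimal sketch under stated assumptions: `extract_many` is a hypothetical helper, not part of the package, and because this README does not say which exceptions `extract` can raise, the sketch catches them broadly. The extraction function is passed in as a parameter so the helper can be tried with a stand-in function as well.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_many(
    urls: Iterable[str],
    extract_fn: Callable[..., Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Call extract_fn(url=...) for each URL, collecting results and errors separately."""
    successes: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            successes[url] = extract_fn(url=url)
        except Exception as exc:  # broad on purpose: exception types are an unknown here
            failures[url] = str(exc)
    return successes, failures
```

In practice you would import `extract` from `mcmetadata` and call `extract_many(urls, extract)`, then inspect the second returned dictionary for any URLs that need a retry.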
Development
If you are interested in adding code to this module, first clone the GitHub repository.
Installing
flit install
pre-commit install
Testing
pytest
Distributing a New Version
- Run pytest to make sure all the tests pass
- Update the version number in pyproject.toml
- Make a brief note in the CHANGELOG.md about what changed
- Commit the changes
- Tag the commit with a semantic version number - v*.*.*
- Push the repo to GitHub
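A mismatch between the tag and the version number is easy to miss, so a small check before pushing may be worthwhile. This is a sketch only: `tag_matches_version` is a hypothetical helper, and it assumes tags are exactly `v` followed by MAJOR.MINOR.PATCH with no pre-release or build suffix.

```python
import re

# Assumed tag shape: "v" followed by three dot-separated integers, e.g. v1.2.0
TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def tag_matches_version(tag: str, version: str) -> bool:
    """Return True if the git tag has the v*.*.* shape and matches the version string."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return False
    return ".".join(match.groups()) == version
```

For example, `tag_matches_version("v1.2.0", "1.2.0")` is true, while a tag without the leading `v` or with a different patch number is rejected.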
Test Cache
Tests are run against fixtures by default. This can be changed by passing '--use-cache=False' when running tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.
Contributors
Created as part of the Media Cloud Project. Contributors include:
- Rahul Bhargava (Media Cloud, Northeastern University)
- Paige Gulley (Media Cloud)
- Phil Budne (Media Cloud)
- Vangelis Banos (Internet Archive)