Media Cloud news article metadata extraction
Meta Extractor
🚧 under construction 🚧
This package extracts the domain, title, publication date, text content, and language from the URL or HTML text of an online news story. The methods for each are drawn from the larger Media Cloud project, and also build on numerous third-party libraries. The metadata extracted includes:
- the original URL of publication
- the canonical domain published on
- the date of publication
- the primary language used in the article text
- the title of the article
- the text content of the news article
- the name of the library used to extract the article content
Installation
pip install mediacloud-metadata
Usage
If you pass in a URL, it will follow redirects and fetch the HTML for you.
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")
You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.
from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
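If the HTML was saved to disk earlier, a small wrapper can read it and pass it along with the URL. The following is a minimal sketch, not part of the package: the helper name `extract_from_saved_html`, the `html_path` argument, and the optional `extractor` parameter are all assumptions made for illustration. It relies only on the `extract(url=..., html_text=...)` call shown above and assumes the file is UTF-8 encoded.

```python
def extract_from_saved_html(url, html_path, extractor=None):
    """Read HTML saved on disk and hand it, with its URL, to the extractor.

    `extractor` is a hypothetical hook so the helper can be tried with a
    stand-in function; by default it uses mcmetadata's `extract`.
    """
    if extractor is None:
        # Imported lazily so the helper can be defined without the package installed
        from mcmetadata import extract as extractor

    # Assumption: the saved file is UTF-8 encoded
    with open(html_path, encoding="utf-8") as f:
        html_text = f.read()

    # Pass the URL too, since it is used for some of the metadata extraction
    return extractor(url=url, html_text=html_text)
```

Calling `extract_from_saved_html("https://my.awesome.news/story-path", "story.html")` would then behave like the second usage example above, with the HTML coming from a local file instead of a string literal.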
Distribution
- Run `pytest` to make sure all the tests pass
- Update the version number in `mcextractor/__init__.py`
- Make a brief note in the version history section below about the changes
- Run `python setup.py sdist` to create an install package
- Run `twine upload --repository-url https://test.pypi.org/legacy/ dist/*` to upload it to PyPI's test platform
- Run `twine upload dist/*` to upload it to PyPI
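The command-line steps in this checklist could be chained in a short script that stops at the first failure. This is only a sketch under stated assumptions: the function name `run_release` and its `run`, `dist_glob`, and `upload_to_pypi` parameters are hypothetical, and the version bump and version history note remain manual steps to do before running it.

```python
import glob
import subprocess

TEST_PYPI_URL = "https://test.pypi.org/legacy/"


def run_release(run=subprocess.run, dist_glob="dist/*", upload_to_pypi=False):
    """Run the checklist commands in order, raising on the first failure.

    `run` and `dist_glob` are hypothetical hooks that make a dry run possible;
    by default the real commands are executed via subprocess.
    """
    # check=True makes subprocess.run raise if a command exits non-zero
    run(["pytest"], check=True)
    run(["python", "setup.py", "sdist"], check=True)

    # No shell is involved, so expand the dist/* pattern ourselves
    dist_files = sorted(glob.glob(dist_glob))
    if not dist_files:
        raise RuntimeError("no files matched " + dist_glob)

    # Upload to PyPI's test platform first
    run(["twine", "upload", "--repository-url", TEST_PYPI_URL, *dist_files], check=True)

    # Only upload to PyPI itself when explicitly requested
    if upload_to_pypi:
        run(["twine", "upload", *dist_files], check=True)
```

Keeping the final upload behind `upload_to_pypi=True` means the default call ends at the test platform, which leaves room to inspect the test upload before publishing.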
Version History
- v0.3.0: more fault tolerant, faster regexes, track extraction rates, update requirements
- v0.2.0: first packaging release for use in other places
- v0.1.1: first version for testing with collaborators
Authors
Created as part of the Media Cloud Project:
- Tyler Horan
- Rahul Bhargava
Source Distribution
Hashes for mediacloud-metadata-0.3.0.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | 9f306dcdf5508a7b43cecd8e52c601d41a99f5ceb4e4b6c5cb40d9a866dd819e |
MD5 | 607b961ba3bee178e9bb0e08dc9863a5 |
BLAKE2b-256 | 71bfaf27b2a1abb4b9e852e8b30bc52160b1f7116c4eaa8e6fdf1650c4ddb279 |