Media Cloud news article metadata extraction

License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent

Project description

Media Cloud Metadata Extractor

This package extracts the domain, title, publication date, text content, and language from the URL or HTML text of an online news story. The methods are drawn from the larger Media Cloud project and build on numerous third-party libraries. The extracted metadata includes:

the original URL of publication
a normalized URL useful for de-duplication
the canonical domain published on
the date of publication
the primary language used in the article text
the title of the article
a normalized title useful for de-duplication
the text content of the news article
the name of the library used to extract the article content
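
As a minimal sketch of how the normalized fields might support de-duplication, the helper below keeps the first story seen for each normalized URL. The key names `normalized_url` and `original_url` and the sample records are assumptions made for illustration; check the dictionary returned by `extract` for the actual keys.

```python
from typing import Dict, Iterable, List


def dedupe_stories(stories: Iterable[Dict[str, str]],
                   key: str = "normalized_url") -> List[Dict[str, str]]:
    """Keep the first story seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for story in stories:
        value = story.get(key)
        # Stories missing the key cannot be compared, so keep them all.
        if value is None:
            unique.append(story)
            continue
        if value not in seen:
            seen.add(value)
            unique.append(story)
    return unique


# Hypothetical records shaped like extraction results (key names assumed).
sample = [
    {"original_url": "https://my.awesome.news/story-path?utm_source=x",
     "normalized_url": "my.awesome.news/story-path"},
    {"original_url": "https://my.awesome.news/story-path",
     "normalized_url": "my.awesome.news/story-path"},
    {"original_url": "https://my.awesome.news/other-story",
     "normalized_url": "my.awesome.news/other-story"},
]
unique_stories = dedupe_stories(sample)
print(len(unique_stories))  # prints 2
```

The same helper could be pointed at a normalized title field by passing a different `key`.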

Other often-reused methods and configuration related to the mediacloud service also live in this package.

Installation

pip install mediacloud-metadata

Usage

If you pass in a URL, it will follow redirects and fetch the HTML for you.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path")

You can also pass in HTML you already have on hand. In this case it is still useful to pass in the URL, because it is used for some of the metadata extraction.

from mcmetadata import extract
metadata = extract(url="https://my.awesome.news/story-path",
                   html_text="<html><head><title>my webpage ... </html>")
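
Fetching and parsing a live page can fail, so a caller may want to guard the call. The wrapper below is a minimal sketch, not part of the package: `safe_extract` is a hypothetical helper that takes the extraction function as an argument (so it can be tried with a stand-in) and returns `None` on any exception. Which exceptions `extract` actually raises is not stated here, so the sketch catches `Exception` broadly.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


def safe_extract(extract_fn: Callable[..., Dict[str, Any]],
                 url: str,
                 html_text: Optional[str] = None) -> Optional[Dict[str, Any]]:
    """Call extract_fn(url=..., html_text=...) and return None on failure."""
    kwargs: Dict[str, Any] = {"url": url}
    if html_text is not None:
        # Pass along HTML that is already on hand so no fetch is needed.
        kwargs["html_text"] = html_text
    try:
        return extract_fn(**kwargs)
    except Exception as exc:  # broad on purpose; narrow once failure modes are known
        logger.warning("extraction failed for %s: %s", url, exc)
        return None
```

With the package installed, this could be called as `safe_extract(extract, "https://my.awesome.news/story-path")` after `from mcmetadata import extract`.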

Development

If you are interested in adding code to this module, first clone the GitHub repository.

Installing

flit install
pre-commit install

Testing

pytest

Distributing a New Version

Run pytest to make sure all the tests pass
Update the version number in pyproject.toml
Make a brief note in CHANGELOG.md about what changed
Commit the changes
Tag the commit with a semantic version number - v*.*.*
Push the repo to GitHub
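
The tag in the release steps above follows the pattern v*.*.*. As a small illustration, the hypothetical helper below (not part of the repository) builds the tag name from a version string and rejects anything that is not three dot-separated numbers. That check is a simplification of full semantic versioning, since it ignores pre-release and build suffixes.

```python
import re

# Three dot-separated non-negative integers, e.g. "1.4.1".
_VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def release_tag(version: str) -> str:
    """Return the git tag name for a version, e.g. '1.4.1' -> 'v1.4.1'."""
    if not _VERSION_RE.fullmatch(version):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {version!r}")
    return f"v{version}"


print(release_tag("1.4.1"))  # prints v1.4.1
```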

Test Cache

Tests are run against cached fixtures by default. To change this, pass '--use-cache=False' when running tests. When adding new tests, re-run 'scripts/get-test-web-content.py'.

Contributors

Created as part of the Media Cloud Project. Contributors include:

Rahul Bhargava (Media Cloud, Northeastern University)
Paige Gulley (Media Cloud)
Phil Budne (Media Cloud)
Vangelis Banos (Internet Archive)

Release history

This version

1.4.1

Dec 22, 2024

1.4.0

Dec 17, 2024

1.3.1

Dec 3, 2024

1.3.0

Dec 3, 2024

1.2.0

Oct 23, 2024

1.1.0

Oct 8, 2024

1.0.2

May 7, 2024

1.0.1

May 7, 2024

1.0.0

Mar 25, 2024

0.12.0

Feb 15, 2024

0.11.2

Jan 22, 2024

0.11.1

Dec 13, 2023

0.11.0

Dec 13, 2023

0.10.0

Dec 3, 2023

0.9.5

Sep 28, 2023

0.9.4

Jun 28, 2023

0.9.3

Jun 22, 2023

0.9.2

Jan 23, 2023

0.9.1

Dec 24, 2022

0.9.0

Dec 21, 2022

0.8.2

Dec 15, 2022

0.8.1

Dec 15, 2022

0.8.0

Dec 6, 2022

0.7.9

Aug 30, 2022

0.7.8

Aug 15, 2022

0.7.6

Aug 12, 2022

0.7.5

Aug 12, 2022

0.7.4

Aug 2, 2022

0.7.3

Jul 31, 2022

0.7.2

Jul 25, 2022

0.7.1

Jul 24, 2022

0.7.0

Jul 21, 2022

0.6.0

Jul 18, 2022

0.5.5

Jul 14, 2022

0.5.4

Jul 11, 2022

0.5.3

Jul 6, 2022

0.5.2

Jun 2, 2022

0.5.1

May 27, 2022

0.5.0

May 25, 2022

0.4.3

Apr 28, 2022

0.4.2

Apr 27, 2022

0.4.1

Apr 25, 2022

0.4.0

Apr 25, 2022

0.3.0

Mar 22, 2022

0.2.0

Mar 11, 2022

Download files

Source Distribution

mediacloud_metadata-1.4.1.tar.gz (8.7 MB)

Uploaded Dec 22, 2024 Source

Built Distribution

mediacloud_metadata-1.4.1-py3-none-any.whl (8.8 MB)

Uploaded Dec 22, 2024 Python 3

File details

Details for the file mediacloud_metadata-1.4.1.tar.gz.

File metadata

Download URL: mediacloud_metadata-1.4.1.tar.gz
Upload date: Dec 22, 2024
Size: 8.7 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for mediacloud_metadata-1.4.1.tar.gz
Algorithm	Hash digest
SHA256	`f4498a0f3e50e10427bac1b3a5165cfea68e0aab8c2c0387ed16518ec60ed93e`
MD5	`498c26869e0ff86e80f30a7f5078fca8`
BLAKE2b-256	`0da3ea561111a68191145f4343e0d5a321eca395556d6ed8c6d0e03615d063e1`
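
A downloaded file can be checked against the SHA256 digest listed above. The sketch below uses only the Python standard library; the helper names are illustrative, and the local path in the usage comment is an assumption about where the file was saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example (assumes the sdist was saved to the current directory):
# matches_digest(Path("mediacloud_metadata-1.4.1.tar.gz"),
#                "f4498a0f3e50e10427bac1b3a5165cfea68e0aab8c2c0387ed16518ec60ed93e")
```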

Provenance

The following attestation bundles were made for mediacloud_metadata-1.4.1.tar.gz:

Publisher: publish-to-pypi.yml on mediacloud/metadata-lib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mediacloud_metadata-1.4.1.tar.gz
- Subject digest: f4498a0f3e50e10427bac1b3a5165cfea68e0aab8c2c0387ed16518ec60ed93e
- Sigstore transparency entry: 157304448
- Sigstore integration time: Dec 22, 2024
Source repository:
- Permalink: mediacloud/metadata-lib@930945f404ff85c8987069365dbaea759bda86ed
- Branch / Tag: refs/tags/v1.4.1
- Owner: https://github.com/mediacloud
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@930945f404ff85c8987069365dbaea759bda86ed
- Trigger Event: push

File details

Details for the file mediacloud_metadata-1.4.1-py3-none-any.whl.

File metadata

Download URL: mediacloud_metadata-1.4.1-py3-none-any.whl
Upload date: Dec 22, 2024
Size: 8.8 MB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for mediacloud_metadata-1.4.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c8537ffe9a3e29851e234b12b0619a203161a3cfb6052af499dc6b1a6577af97`
MD5	`d685bfcdfc02a12651e985fb0b74140f`
BLAKE2b-256	`a72e5b7b521fdddbae4b61d48bb1431d3afc8847ef19dd5285036b3d72663d15`

Provenance

The following attestation bundles were made for mediacloud_metadata-1.4.1-py3-none-any.whl:

Publisher: publish-to-pypi.yml on mediacloud/metadata-lib

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: mediacloud_metadata-1.4.1-py3-none-any.whl
- Subject digest: c8537ffe9a3e29851e234b12b0619a203161a3cfb6052af499dc6b1a6577af97
- Sigstore transparency entry: 157304449
- Sigstore integration time: Dec 22, 2024
Source repository:
- Permalink: mediacloud/metadata-lib@930945f404ff85c8987069365dbaea759bda86ed
- Branch / Tag: refs/tags/v1.4.1
- Owner: https://github.com/mediacloud
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish-to-pypi.yml@930945f404ff85c8987069365dbaea759bda86ed
- Trigger Event: push

mediacloud-metadata 1.4.1
