Google News Sitemap Parser

NewsGrabber is a Python library that parses Google News sitemap structures into Python objects, enabling developers to easily extract and analyze news-related metadata.

Features

  • Parses Google News sitemaps into structured Python objects.

  • Handles sitemap parsing with robust error tolerance.

  • Lightweight and efficient, leveraging:

    • lxml for fast XML parsing.

    • requests for HTTP requests.

    • python-dateutil for flexible date parsing.

  • Python 3.11+ compatible.
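To make the idea of "sitemap entries as Python objects" concrete, here is a minimal sketch of reading a Google News sitemap. It is not NewsGrabber's implementation: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without dependencies, and the NewsEntry class, the parse_news_sitemap function, and the inline sample document are all hypothetical. Entries missing a location or a news title are skipped, which is one simple form of error tolerance.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Namespaces defined by the sitemap protocol and the Google News extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Hypothetical inline sample; a real sitemap would be fetched over HTTP.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/articles/1</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-12-01T10:30:00+00:00</news:publication_date>
      <news:title>First example headline</news:title>
    </news:news>
  </url>
  <url>
    <loc>https://example.com/articles/2</loc>
  </url>
</urlset>
"""


@dataclass
class NewsEntry:
    """Hypothetical container; not the class NewsGrabber itself returns."""
    loc: str
    title: str
    publication_date: str


def parse_news_sitemap(xml_text: str) -> list[NewsEntry]:
    """Collect one NewsEntry per <url> that has both a location and a news title."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        title = url.findtext("news:news/news:title", default="", namespaces=NS).strip()
        date = url.findtext(
            "news:news/news:publication_date", default="", namespaces=NS
        ).strip()
        if not loc or not title:
            continue  # tolerate incomplete entries by skipping them
        entries.append(NewsEntry(loc=loc, title=title, publication_date=date))
    return entries


entries = parse_news_sitemap(SAMPLE)
print([entry.title for entry in entries])
```

The second <url> in the sample has no news metadata, so only the first one is returned.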

Installation

Install NewsGrabber via pip:

pip install newsgrabber

Usage

Parsing a Google News Sitemap

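from newsgrabber import NewsGrabber

# Point the parser at a Google News sitemap URL.
grabber = NewsGrabber("https://www.bbc.com/sitemaps/https-sitemap-com-news-1.xml")
# Parse the sitemap; the parsed entries are exposed on news_urls.
grabber.parse()
# Print the titles of the first five entries.
print("\n".join(x.title for x in grabber.news_urls[:5]))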

Example Output

BBC Look East: Latest weather forecast for the East
Syria country profile
Namibia country profile
How working parents can get 15 and 30 hours free childcare
South East England weather forecast
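If the listing needs a little more structure, a small helper can number the titles and skip entries without one. This is a sketch under assumptions: only the title attribute and the news_urls list come from the example above, format_titles is a hypothetical helper, and the SimpleNamespace objects stand in for parsed entries so the snippet runs without network access. Whether a real entry can lack a title is also an assumption.

```python
from types import SimpleNamespace


def format_titles(entries, limit=5):
    """Return numbered title lines for the first `limit` entries that have a title."""
    lines = []
    for entry in entries:
        if len(lines) >= limit:
            break
        # Treat a missing or empty title as "no title" and skip the entry.
        title = (getattr(entry, "title", None) or "").strip()
        if not title:
            continue
        lines.append(f"{len(lines) + 1}. {title}")
    return lines


# Stand-in objects; with the library this would be grabber.news_urls.
sample_entries = [
    SimpleNamespace(title="Syria country profile"),
    SimpleNamespace(title=None),
    SimpleNamespace(title="Namibia country profile"),
]
print("\n".join(format_titles(sample_entries)))
```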

Requirements

NewsGrabber requires Python 3.11+ and the following dependencies:

  • lxml>=5.3.0: For XML parsing.

  • requests>=2.32.3: For HTTP requests.

  • python-dateutil>=2.1,<3.0.0: For flexible date parsing.
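Publication dates in news sitemaps are typically written in W3C date-time form, either as a full timestamp or as a bare date. The sketch below shows what normalizing such values can look like. It deliberately uses only the standard library's datetime module, not python-dateutil, so it covers a narrower set of formats than the library's dependency would. The function name parse_w3c_datetime and the choice to treat values without an offset as UTC are assumptions made for illustration.

```python
from datetime import datetime, timezone


def parse_w3c_datetime(value: str) -> datetime:
    """Parse a W3C-style date or date-time string into an aware datetime."""
    text = value.strip()
    # Older Python versions do not accept a trailing "Z", so map it to +00:00.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: values without an offset are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


for raw in ("2024-12-01T10:30:00Z", "2024-12-01T12:30:00+02:00", "2024-12-01"):
    print(raw, "->", parse_w3c_datetime(raw).isoformat())
```

Normalizing everything to timezone-aware values keeps later comparisons and sorting from mixing naive and aware datetimes.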

Development and Testing

To set up a development environment:

  1. Clone the repository:

     git clone https://github.com/yibudak/newsgrabber
     cd newsgrabber

  2. Install dependencies:

     pip install -e .[test]

  3. Run tests:

     pytest

Contributing

Contributions are welcome! If you’d like to contribute, please fork the repository and submit a pull request. Make sure to include tests for any new functionality.

License

This library is licensed under the AGPL-3.0 License.

