A Python-based tool for scraping news articles from various sources, using different techniques.

# NewsCrawler


NewsCrawler is a Python-based web scraping tool that extracts news articles from various sources using multiple techniques. To retrieve content that sits behind paywalls and anti-bot measures, it draws on Google Cache, Selenium in stealth mode, and Archive.is for comprehensive coverage.
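
To make the idea of alternative retrieval routes concrete, the sketch below builds the candidate URLs one might try for a single article. The helper name `candidate_urls` and both URL patterns are assumptions about the public Google Cache and Archive.is endpoints, not something taken from NewsCrawler's source, and the Google Cache pattern in particular may no longer resolve.

```python
from typing import Dict
from urllib.parse import quote


def candidate_urls(article_url: str) -> Dict[str, str]:
    """Return alternative locations for one article (hypothetical helper)."""
    # Percent-encode the whole URL so it can be passed as a query value.
    encoded = quote(article_url, safe="")
    return {
        # The original page, fetched as-is.
        "direct": article_url,
        # Assumed Google Cache lookup pattern.
        "google_cache": (
            "https://webcache.googleusercontent.com/search?q=cache:" + encoded
        ),
        # Assumed Archive.is pattern for the most recent snapshot.
        "archive_is": "https://archive.is/newest/" + article_url,
    }


if __name__ == "__main__":
    for method, url in candidate_urls("https://example.com/story").items():
        print(method, url)
```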

## Features

- **Multiple Parsing Methods:** Fetches articles through Google Cache, Selenium with stealth mode, Archive.is, or direct requests.
- **HTML Validation:** Checks the downloaded HTML and filters out pages with insufficient or irrelevant content.
- **Dynamic News Source Handling:** Uses a custom `NewsUrlGetter` to fetch news URLs for a specified topic.
- **Robust Error Handling:** Raises custom exceptions for HTML validation and download errors, so failures are reported explicitly.
- **Extensible Design:** Can be extended with additional news sources or parsing methods.
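
The way several fetch methods, HTML validation, and custom exceptions can fit together is sketched below. This is a minimal illustration under stated assumptions, not NewsCrawler's actual implementation: the names `DownloadError`, `HtmlValidationError`, `validate_html`, and `fetch_with_fallback`, as well as the 500-character threshold, are hypothetical.

```python
from typing import Callable, List, Tuple


class DownloadError(Exception):
    """Raised when a page cannot be retrieved (hypothetical name)."""


class HtmlValidationError(Exception):
    """Raised when retrieved HTML looks too thin to be an article (hypothetical name)."""


def validate_html(html: str, min_chars: int = 500) -> str:
    """Return the HTML unchanged if it passes two basic sanity checks."""
    if not html or len(html.strip()) < min_chars:
        raise HtmlValidationError("content shorter than %d characters" % min_chars)
    if "<p" not in html.lower():
        raise HtmlValidationError("no paragraph tags found")
    return html


def fetch_with_fallback(
    url: str,
    fetchers: List[Tuple[str, Callable[[str], str]]],
    min_chars: int = 500,
) -> Tuple[str, str]:
    """Try each (name, fetcher) pair in order; return the first valid result."""
    errors = []
    for name, fetch in fetchers:
        try:
            return name, validate_html(fetch(url), min_chars)
        except (DownloadError, HtmlValidationError) as exc:
            # Remember why this method failed, then move on to the next one.
            errors.append("%s: %s" % (name, exc))
    raise DownloadError("all methods failed for %s: %s" % (url, "; ".join(errors)))
```

Each fetcher is any callable that takes a URL and returns HTML, so adding a new method means appending one more `(name, callable)` pair to the list.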

## Dependencies

- Python 3.x
- `requests`
- `selenium`
- `newspaper3k`
- `selenium-stealth`
- `beautifulsoup4`

Ensure that ChromeDriver (the WebDriver for Chrome) is installed and available on your system's PATH so that Selenium can launch the browser.
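
A quick way to confirm the environment is ready is sketched below. The function name `check_environment` is hypothetical; the import names (`newspaper` for `newspaper3k`, `selenium_stealth` for `selenium-stealth`, `bs4` for `beautifulsoup4`) are the modules those packages normally install, and `chromedriver` is assumed to be the executable name on PATH.

```python
import importlib.util
import shutil
from typing import Dict

# Maps each pip package name to the module name it is imported under.
REQUIRED_MODULES = {
    "requests": "requests",
    "selenium": "selenium",
    "newspaper3k": "newspaper",
    "selenium-stealth": "selenium_stealth",
    "beautifulsoup4": "bs4",
}


def check_environment() -> Dict[str, bool]:
    """Report which dependencies are importable and whether chromedriver is on PATH."""
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in REQUIRED_MODULES.items()
    }
    status["chromedriver"] = shutil.which("chromedriver") is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print("%-18s %s" % (name, "ok" if found else "MISSING"))
```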

## Installation

1. Clone the repository:
```sh
git clone https://github.com/yourgithubusername/newscrawler.git
```

2. Install the required Python packages from inside the cloned directory:
```sh
cd newscrawler
pip install -r requirements.txt
```

## Usage

To use NewsCrawler, instantiate the `NewsParser` class with a `NewsUrlGetter`, which controls how many URLs are collected and over which date range, plus optional settings such as headless browsing. Then call the `get_news` method with your topic of interest:

```python
from newscrawler import NewsParser, NewsUrlGetter

# Collect at most 20 URLs within the date range (tuples presumably year, month, day)
url_getter = NewsUrlGetter(
    max_results=20,
    start_date=(2023, 1, 20),
    end_date=(2023, 12, 25),
)

# Initialize the NewsParser; headless=True runs the browser without a window
news_parser = NewsParser(url_getter, headless=True)

# Fetch news articles about "Interest rates"
articles = news_parser.get_news("Interest rates")
```
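
For readers curious what headless, stealth-patched browsing looks like underneath, here is a minimal sketch using `selenium` and `selenium-stealth` directly. It is an illustration under assumptions rather than NewsCrawler's internal code: the helpers `chrome_arguments` and `fetch_page_source` and the chosen Chrome flags are hypothetical, and running `fetch_page_source` requires Chrome and ChromeDriver to be installed.

```python
from typing import List


def chrome_arguments(headless: bool = True) -> List[str]:
    """Build Chrome command-line flags (hypothetical helper)."""
    args = [
        "--disable-blink-features=AutomationControlled",
        "--window-size=1920,1080",
    ]
    if headless:
        args.append("--headless=new")  # Chrome's newer headless mode
    return args


def fetch_page_source(url: str, headless: bool = True) -> str:
    """Load a URL in stealth-patched Chrome and return the rendered HTML."""
    # Imported here so this module loads even without the browser dependencies.
    from selenium import webdriver
    from selenium_stealth import stealth

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        # Patch common automation fingerprints before loading the page.
        stealth(
            driver,
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
        )
        driver.get(url)
        return driver.page_source
    finally:
        # Always release the browser, even if loading fails.
        driver.quit()
```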

## Contributing

Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.

## License

This project is licensed under the MIT License - see the LICENSE file for details.


Source Distribution

KyaNewsScraper-1.0.5.tar.gz (7.5 kB)

Uploaded Source

Built Distribution

KyaNewsScraper-1.0.5-py3-none-any.whl (8.6 kB)

Uploaded Python 3

File details

Details for the file KyaNewsScraper-1.0.5.tar.gz.

File metadata

  • Download URL: KyaNewsScraper-1.0.5.tar.gz
  • Size: 7.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for KyaNewsScraper-1.0.5.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `ba54f79ef12195c73a0b5ad9ba51ffa12e42d690599618006608e866083e49ac` |
| MD5 | `5daafdc37f84fef54de75839b5f101f1` |
| BLAKE2b-256 | `88b280a1bdccc192651c123c0d01783f973a21ff19342896ecf2a5405f20dab9` |

File details

Details for the file KyaNewsScraper-1.0.5-py3-none-any.whl.

File hashes

Hashes for KyaNewsScraper-1.0.5-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `bc3060a04384b1dc0d41171bd1905cbd22f477183ebb041f2f53dae7e836ac39` |
| MD5 | `81bcd96758af4d52a2e765ba8bde7853` |
| BLAKE2b-256 | `5782ec07285dac64e7d49dcac7be9aea338b82e4221db5b628d1153a8236a8b2` |
