A Python-based tool for scraping news articles from various sources, using different techniques.

# NewsCrawler


NewsCrawler is a Python-based web scraping tool that extracts news articles from various sources using several retrieval techniques. To get past paywalls and anti-bot measures, it can fetch content through Google Cache, Selenium in stealth mode, and Archive.is, in addition to direct requests.
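
As a rough illustration of the cache and archive routes, the sketch below builds lookup URLs for an article. The helper names `google_cache_url` and `archive_is_url` are hypothetical and not part of NewsCrawler's documented API, and the URL patterns are assumptions based on the publicly known formats of those services. Google Cache in particular may no longer return content.

```python
from urllib.parse import quote


def google_cache_url(article_url: str) -> str:
    """Build a Google Cache lookup URL (hypothetical helper, assumed URL pattern)."""
    # The article URL is percent-encoded because it is passed as a query value.
    return (
        "https://webcache.googleusercontent.com/search?q=cache:"
        + quote(article_url, safe="")
    )


def archive_is_url(article_url: str) -> str:
    """Build an Archive.is URL for the newest snapshot (assumed URL pattern)."""
    return "https://archive.is/newest/" + article_url


if __name__ == "__main__":
    url = "https://example.com/news/story"
    print(google_cache_url(url))
    print(archive_is_url(url))
```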

## Features

- **Multiple Parsing Methods:** Fetches articles via Google Cache, Selenium in stealth mode, Archive.is, and direct requests.
- **HTML Validation:** Ensures the integrity of the downloaded content, filtering out insufficient or irrelevant data.
- **Dynamic News Source Handling:** Uses a custom `NewsUrlGetter` to fetch news URLs for the topics you specify.
- **Robust Error Handling:** Implements custom exceptions for HTML validation and download errors, ensuring reliability.
- **Extensible Design:** Easily adaptable to include more news sources or parsing methods.
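
The features above suggest a fallback chain: try one retrieval method, validate the HTML, and move to the next method on failure. The following is a minimal sketch of that idea, not NewsCrawler's actual implementation. The names `DownloadError`, `HTMLValidationError`, `validate_html`, and `fetch_with_fallback`, as well as the 500-character threshold, are assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple


class DownloadError(Exception):
    """Raised when a page cannot be retrieved (hypothetical name)."""


class HTMLValidationError(Exception):
    """Raised when retrieved HTML looks unusable (hypothetical name)."""


def validate_html(html: str, min_length: int = 500) -> str:
    """Return html unchanged if it passes basic checks, else raise."""
    if not html or len(html.strip()) < min_length:
        raise HTMLValidationError("content too short")
    if "<html" not in html.lower():
        raise HTMLValidationError("no <html> tag found")
    return html


Fetcher = Callable[[str], str]


def fetch_with_fallback(
    url: str,
    fetchers: Iterable[Tuple[str, Fetcher]],
    min_length: int = 500,
) -> Tuple[str, str]:
    """Try each (name, fetcher) pair in order; return (name, html) of the first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            return name, validate_html(fetch(url), min_length)
        except (DownloadError, HTMLValidationError) as exc:
            # Record the failure and fall through to the next method.
            errors.append(f"{name}: {exc}")
    raise DownloadError(f"all methods failed for {url}: " + "; ".join(errors))
```

Because each fetcher is just a callable that takes a URL and returns HTML, adding a new method in this sketch means appending one more `(name, fetcher)` pair to the list.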

## Dependencies

- Python 3.x
- `requests`
- `selenium`
- `newspaper3k`
- `selenium-stealth`
- `beautifulsoup4`

Selenium also requires ChromeDriver (the Chrome WebDriver) to be installed and available on your system's PATH.
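
One quick way to confirm the driver is reachable is to look it up on PATH before starting a crawl. This is a small sketch using only the standard library. The helper name `find_chromedriver` is hypothetical, and the executable name `chromedriver` is an assumption that may differ on your system.

```python
import shutil
from typing import Optional


def find_chromedriver(executable: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver if it is on PATH, otherwise None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_chromedriver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print(f"chromedriver found at {path}")
```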

## Installation

1. Clone the repository:
```sh
git clone https://github.com/yourgithubusername/newscrawler.git
```

2. From the root of the cloned repository, install the required Python packages:
```sh
pip install -r requirements.txt
```

## Usage

To use NewsCrawler, create a `NewsUrlGetter` with your search settings and pass it to the `NewsParser` class, along with optional parameters for headless browsing and URL filtering. Then call the `get_news` method with your topic of interest:

```python
from newscrawler import NewsParser, NewsUrlGetter

# Configure how news URLs are discovered
url_getter = NewsUrlGetter(
    max_results=20,
    start_date=(2023, 1, 20),
    end_date=(2023, 12, 25),
)

# Initialize the NewsParser with custom settings
news_parser = NewsParser(url_getter, headless=True)

# Fetch news articles about "Interest rates"
articles = news_parser.get_news("Interest rates")
```
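
The date arguments in the example look like `(year, month, day)` tuples, although that ordering is an assumption read off the example rather than documented behavior. Under that assumption, a small helper such as the hypothetical `check_date_range` below can catch invalid or reversed ranges before they reach `NewsUrlGetter`:

```python
from datetime import date
from typing import Tuple

DateTuple = Tuple[int, int, int]


def check_date_range(start: DateTuple, end: DateTuple) -> Tuple[date, date]:
    """Convert (year, month, day) tuples to dates and check their order."""
    # date(...) raises ValueError for impossible dates such as (2023, 2, 30).
    start_date, end_date = date(*start), date(*end)
    if start_date > end_date:
        raise ValueError("start_date must not be later than end_date")
    return start_date, end_date


if __name__ == "__main__":
    print(check_date_range((2023, 1, 20), (2023, 12, 25)))
```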

## Contributing

Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

KyaNewsScraper-1.0.3.tar.gz (7.3 kB)

Uploaded Source

Built Distribution

KyaNewsScraper-1.0.3-py3-none-any.whl (8.5 kB)

Uploaded Python 3

File details

Details for the file KyaNewsScraper-1.0.3.tar.gz.

File metadata

  • Download URL: KyaNewsScraper-1.0.3.tar.gz
  • Upload date:
  • Size: 7.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.13

File hashes

Hashes for KyaNewsScraper-1.0.3.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `979ff111a45dc5dab6af2a925a8964d75e11c859fb731cc15af3967e9ccf7595` |
| MD5 | `4c93e707316bf22c20bf673281800ce6` |
| BLAKE2b-256 | `4f7bf010c0e0a96accdf44c26843473c729545bfbe13ade9075914cd5d4fe796` |

File details

Details for the file KyaNewsScraper-1.0.3-py3-none-any.whl.

File hashes

Hashes for KyaNewsScraper-1.0.3-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `30038c8c8fab5fcf1a350fc14f6722ef64613ee9fe9a0867b77d185df6555310` |
| MD5 | `fb2d008d75c9b28ae52047faf84be484` |
| BLAKE2b-256 | `7ad0ceb4225de34c28f0649e0220c491cfface8cabf61de00e326faf5aaaa3bb` |
