A Python-based tool for scraping news articles from various sources, using different techniques.
# NewsCrawler
NewsCrawler is a Python-based web scraping tool that extracts news articles from various sources using multiple techniques. To retrieve content that sits behind paywalls or anti-bot measures, it can use Google Cache, Selenium in stealth mode, and Archive.is in addition to direct requests.
## Features
- **Multiple Parsing Methods:** Fetches articles through Google Cache, Selenium in stealth mode, Archive.is, or direct requests.
- **HTML Validation:** Ensures the integrity of the downloaded content, filtering out insufficient or irrelevant data.
- **Dynamic News Source Handling:** Utilizes a custom `NewsUrlGetter` to dynamically fetch news URLs based on specified topics.
- **Robust Error Handling:** Implements custom exceptions for HTML validation and download errors, ensuring reliability.
- **Extensible Design:** Easily adaptable to include more news sources or parsing methods.
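
The combination of several fetch methods, HTML validation, and custom exceptions described above could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual implementation: the names `HtmlValidationError`, `DownloadError`, `validate_html`, and `fetch_with_fallback`, as well as the 500-character threshold, are hypothetical, and the two fetchers are offline stand-ins for real Google Cache, Selenium, Archive.is, or `requests` calls.

```python
from typing import Callable, Iterable, Optional


class HtmlValidationError(Exception):
    """Raised when downloaded HTML looks too thin to be an article (hypothetical name)."""


class DownloadError(Exception):
    """Raised when every fetch method fails (hypothetical name)."""


def validate_html(html: Optional[str], min_chars: int = 500) -> str:
    # The length threshold is an illustrative assumption, not a project value.
    if not html or len(html.strip()) < min_chars:
        raise HtmlValidationError("HTML is empty or too short")
    return html


def fetch_with_fallback(url: str, fetchers: Iterable[Callable[[str], str]]) -> str:
    """Try each fetcher in order and return the first HTML that validates."""
    errors = []
    for fetch in fetchers:
        try:
            return validate_html(fetch(url))
        except Exception as exc:  # a real crawler would catch narrower types
            errors.append(f"{getattr(fetch, '__name__', repr(fetch))}: {exc}")
    raise DownloadError(f"All methods failed for {url}: " + "; ".join(errors))


# Stand-in fetchers so the sketch runs without network access.
def blocked_fetcher(url: str) -> str:
    return "<html>Access denied</html>"


def working_fetcher(url: str) -> str:
    return "<html><body>" + "Article text. " * 100 + "</body></html>"


html = fetch_with_fallback(
    "https://example.com/story", [blocked_fetcher, working_fetcher]
)
print(f"Fetched {len(html)} characters")
```

Keeping each method behind the same `str -> str` callable shape is one way to make new sources or parsing methods easy to add: a new method only needs to be appended to the list.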
## Dependencies
- Python 3.x
- `requests`
- `selenium`
- `newspaper3k`
- `selenium-stealth`
- `beautifulsoup4`
Ensure you have Chrome WebDriver installed and accessible in your system's PATH for Selenium to function properly.
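
A quick way to confirm the driver is reachable before running the crawler is to look it up on the PATH from Python. The sketch below uses only the standard library; the helper name `find_chromedriver` is hypothetical, and it assumes the executable is called `chromedriver`, which is the usual name but may differ on your system.

```python
import shutil
from typing import Optional


def find_chromedriver(name: str = "chromedriver") -> Optional[str]:
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on PATH; Selenium may fail to start Chrome")
else:
    print(f"Found chromedriver at {path}")
```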
## Installation
1. Clone the repository:
```sh
git clone https://github.com/yourgithubusername/newscrawler.git
```
2. Install the required Python packages:
```sh
pip install -r requirements.txt
```
## Usage
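To use NewsCrawler, create a `NewsUrlGetter` that sets the maximum number of results and the date range for URL discovery, then pass it to the `NewsParser` class along with optional parameters such as `headless` for headless browsing. Finally, call the `get_news` method with your topic of interest: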
```python
from newscrawler import NewsParser, NewsUrlGetter

# Initialize the NewsParser with custom settings
url_getter = NewsUrlGetter(
    max_results=20,
    start_date=(2023, 1, 20),
    end_date=(2023, 12, 25),
)
news_parser = NewsParser(url_getter, headless=True)

# Fetch news articles about "Interest rates"
articles = news_parser.get_news("Interest rates")
```
## Contributing
Contributions are welcome! Please feel free to submit pull requests or create issues for bugs and feature requests.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
Download files
- Source Distribution: KyaNewsScraper-1.0.3.tar.gz (7.3 kB)
- Built Distribution: KyaNewsScraper-1.0.3-py3-none-any.whl (8.5 kB)
File details
Details for the file KyaNewsScraper-1.0.3.tar.gz.
File metadata
- Download URL: KyaNewsScraper-1.0.3.tar.gz
- Upload date:
- Size: 7.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 979ff111a45dc5dab6af2a925a8964d75e11c859fb731cc15af3967e9ccf7595 |
| MD5 | 4c93e707316bf22c20bf673281800ce6 |
| BLAKE2b-256 | 4f7bf010c0e0a96accdf44c26843473c729545bfbe13ade9075914cd5d4fe796 |

File details
Details for the file KyaNewsScraper-1.0.3-py3-none-any.whl.
File metadata
- Download URL: KyaNewsScraper-1.0.3-py3-none-any.whl
- Upload date:
- Size: 8.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.13
File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | 30038c8c8fab5fcf1a350fc14f6722ef64613ee9fe9a0867b77d185df6555310 |
| MD5 | fb2d008d75c9b28ae52047faf84be484 |
| BLAKE2b-256 | 7ad0ceb4225de34c28f0649e0220c491cfface8cabf61de00e326faf5aaaa3bb |
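
The SHA256 digests listed above can be compared against a downloaded file with Python's standard `hashlib` module. The following is a small sketch; the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and the local file path in the commented usage line assumes the archive was saved to the current directory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example usage (assumes the file exists locally under this name):
# ok = matches_published_digest(
#     "KyaNewsScraper-1.0.3.tar.gz",
#     "979ff111a45dc5dab6af2a925a8964d75e11c859fb731cc15af3967e9ccf7595",
# )
```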