Ultra-fast and efficient web scraper with GPU text cleaning and JSON output. Supports generic and language-specific scraping.
Greek Web Scraper
A high-performance web scraper built with Scrapy, optimized for Greek-language content extraction. Supports GPU acceleration for text processing and provides robust retry mechanisms for reliable scraping.
🚀 Features
- Efficient Web Scraping: Uses Scrapy with custom middlewares.
- GPU Acceleration: Cleans and filters Greek text using CuPy.
- Robust Encoding Handling: Automatic encoding detection and conversion.
- Custom Retry Mechanism: Skips domains with persistent failures.
- Parallel Domain Scraping: Configurable concurrent requests.
- Automatic Text Extraction: Uses trafilatura and BeautifulSoup.
- Storage Pipelines: Outputs cleaned text to JSONL and JSON.
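As an illustration of the Greek text filtering listed above, here is a minimal plain-Python sketch. The package is described as doing this step with CuPy on the GPU; the criterion used here (share of letters in the Greek Unicode blocks) and the names `greek_ratio`, `keep_greek`, and `threshold` are assumptions for illustration, not the package's actual implementation.

```python
# Hypothetical CPU-only sketch of a Greek-language text filter.
# Unicode blocks: Greek and Coptic (U+0370-U+03FF), Greek Extended (U+1F00-U+1FFF).
GREEK_RANGES = ((0x0370, 0x03FF), (0x1F00, 0x1FFF))


def is_greek_char(ch: str) -> bool:
    """Return True if the character's code point lies in a Greek block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in GREEK_RANGES)


def greek_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Greek (0.0 for no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_greek_char(ch) for ch in letters) / len(letters)


def keep_greek(texts, threshold: float = 0.5):
    """Keep only texts whose Greek letter ratio meets the (assumed) threshold."""
    return [t for t in texts if greek_ratio(t) >= threshold]
```

A GPU version would apply the same range test to an array of code points instead of looping character by character.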
📂 Project Structure
greek_scraper/
│── middlewares.py # Custom Scrapy middlewares for encoding & retries
│── pipelines.py # Data processing & storage pipelines
│── spider.py # Main Scrapy spider for scraping Greek websites
│── gpu_processor.py # GPU-based text processing
│── cli.py # Command-line interface for running the scraper
│── __init__.py # Entry point & helper functions
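The retry behaviour attributed to middlewares.py (skipping domains with persistent failures) could be pictured roughly as follows. This is a standalone sketch, not the package's middleware: the class name `DomainFailureTracker` and the `max_failures` limit are hypothetical, and a real Scrapy middleware would call something like this from its exception and response hooks.

```python
from collections import defaultdict
from urllib.parse import urlparse


class DomainFailureTracker:
    """Counts failures per domain and flags domains that fail too often."""

    def __init__(self, max_failures: int = 3):
        # max_failures is an assumed threshold, not a documented default.
        self.max_failures = max_failures
        self._failures = defaultdict(int)

    @staticmethod
    def _domain(url: str) -> str:
        # Normalise to the lower-cased network location, e.g. "example.gr".
        return urlparse(url).netloc.lower()

    def record_failure(self, url: str) -> None:
        """Register one failed request for the URL's domain."""
        self._failures[self._domain(url)] += 1

    def should_skip(self, url: str) -> bool:
        """True once the URL's domain has reached the failure limit."""
        return self._failures.get(self._domain(url), 0) >= self.max_failures
```

Keeping the counter keyed by domain rather than by URL means one unreachable host cannot consume the retry budget of every page it serves.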
⚙️ Installation
- Clone the repository
git clone https://github.com/your-repo/greek_scraper.git
cd greek_scraper
- Install dependencies
pip install -r requirements.txt
- Ensure CUDA & CuPy are installed (if using GPU)
pip install cupy-cuda12x # Adjust CUDA version if needed
🕵️ Usage
Single Domain Scraping
import greek_scraper
greek_scraper.scrape("example.gr")
Multi-Domain Scraping
import greek_scraper
greek_scraper.multi_scrape(["example.gr", "another.gr"])
Scraping from a File
import greek_scraper
greek_scraper.from_file("domains.txt")
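The layout of domains.txt is not specified here. Assuming one domain per line, a list like this could be prepared or checked before passing it on. The helper `load_domains` and the handling of blank lines and `#` comments are assumptions for illustration, not documented behaviour of `from_file`.

```python
from pathlib import Path


def load_domains(path: str) -> list:
    """Read domains from a text file, one per line (assumed format).

    Blank lines and lines starting with '#' are skipped, and duplicates
    are dropped while the original order is preserved.
    """
    domains = []
    seen = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line not in seen:
            seen.add(line)
            domains.append(line)
    return domains
```

The resulting list has the same shape as the argument shown for `multi_scrape` above.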
Custom Configuration
import greek_scraper
greek_scraper.gpu(True) # Enable GPU processing
greek_scraper.output_path("output.jsonl")
greek_scraper.language("greek")
greek_scraper.threads(4)
greek_scraper.speed(7)
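Once a run has written its output, the JSONL file can be read back one record per line. The sketch below uses only the standard library. The field names inside each record are not documented here, so it makes no assumption about them, and `read_jsonl` is a hypothetical helper rather than part of the package.

```python
import json


def read_jsonl(path: str) -> list:
    """Parse a JSON Lines file into a list of Python objects."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

Opening the file as UTF-8 matters for Greek text, which would otherwise depend on the platform's default encoding.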
🛠 Configuration
| Function | Description | Default Value |
|---|---|---|
| gpu(True/False) | Enables/disables GPU processing | False |
| output_path("file.jsonl") | Sets output file name | scraped_data.jsonl |
| threads(n) | Sets concurrent requests per domain | 1 |
| speed(n) | Controls scraping speed (1-10) | 5 |
| language("greek") | Filters text by language (Greek only) | greek |
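How `speed(n)` and `threads(n)` translate into crawler behaviour is not spelled out above. One plausible reading is that they map onto Scrapy's `DOWNLOAD_DELAY` and `CONCURRENT_REQUESTS_PER_DOMAIN` settings. The sketch below shows such a mapping. The function `build_settings` and the linear delay scale (2.0 s at speed 1 down to 0.2 s at speed 10) are assumptions for illustration only.

```python
def build_settings(speed: int = 5, threads: int = 1) -> dict:
    """Translate speed/threads into Scrapy-style settings (assumed mapping)."""
    if not 1 <= speed <= 10:
        raise ValueError("speed must be between 1 and 10")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Hypothetical linear scale: each speed step removes 0.2 s of delay.
    delay = round(2.0 - (speed - 1) * 0.2, 2)
    return {
        "DOWNLOAD_DELAY": delay,
        "CONCURRENT_REQUESTS_PER_DOMAIN": threads,
    }
```

Validating the range up front keeps an out-of-range speed from silently producing a zero or negative delay.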
📜 License
This project is licensed under the MIT License.
👨‍💻 Author
Charis Nikolaidis – GitHub – ncharis97@gmail.com
🌟 Show Your Support!
Give a ⭐ if you like this project and find it useful!
File details
Details for the file greek_scraper-0.1.0-py3-none-any.whl.
File metadata
- Download URL: greek_scraper-0.1.0-py3-none-any.whl
- Upload date:
- Size: 3.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8156667085c95341cfe15fea382d9d93cf1a301e09ff32c77e0db3b0c86cb8ee |
| MD5 | 48f93dd01b75c3299315d45e6d582652 |
| BLAKE2b-256 | 2c9a7cdcfe856a94193e4527a00225a03e189afc7d49123f28c130fd1367ca27 |
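A downloaded wheel can be checked against the published SHA256 digest with Python's standard `hashlib` module. The helper names `sha256_of` and `matches_digest` below are illustrative, and the file path in the usage note assumes the wheel sits in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_digest("greek_scraper-0.1.0-py3-none-any.whl", "8156667085c95341cfe15fea382d9d93cf1a301e09ff32c77e0db3b0c86cb8ee")` should return True for an unmodified download.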