Ultra-fast and efficient web scraper with GPU utilization for text cleaning and JSON output. Supports generic and language-specific scraping.

Greek Web Scraper

A high-performance web scraper built on Scrapy and optimized for Greek-language content extraction. This tool leverages GPU acceleration (via CuPy) for text processing and features robust retry mechanisms to ensure reliable scraping across multiple domains.
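The idea of giving up on domains that keep failing can be illustrated with a small standalone sketch. This is not the package's middleware: the class name `DomainFailureTracker` and the `max_failures` threshold are hypothetical, chosen only to show the bookkeeping involved.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainFailureTracker:
    """Counts failures per domain and flags domains that exceed a threshold."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures  # hypothetical threshold, not a package setting
        self._failures = Counter()

    @staticmethod
    def _domain(url: str) -> str:
        # Accept both full URLs and bare domains such as "example.gr".
        parsed = urlparse(url if "://" in url else f"http://{url}")
        return parsed.netloc.lower()

    def record_failure(self, url: str) -> None:
        """Register one failed request for the URL's domain."""
        self._failures[self._domain(url)] += 1

    def should_skip(self, url: str) -> bool:
        """True once the domain has failed max_failures times or more."""
        return self._failures[self._domain(url)] >= self.max_failures
```

A scheduler could call `should_skip()` before issuing a request and `record_failure()` whenever a request times out or returns an error, so that one unresponsive site does not stall the rest of the crawl.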

🚀 Features

  • Efficient Web Scraping: Leverages Scrapy with custom middlewares for encoding handling and retries.
  • GPU Acceleration: Utilizes CuPy for GPU-based text processing to speed up cleaning and filtering of Greek text.
  • Robust Encoding Handling: Automatically detects and converts text encodings to handle diverse content.
  • Custom Retry Mechanism: Skips problematic domains to minimize downtime and maximize throughput.
  • Parallel Domain Scraping: Configurable concurrency allows simultaneous scraping of multiple domains.
  • Automatic Text Extraction: Integrates trafilatura and BeautifulSoup for precise content extraction.
  • Flexible Storage Pipelines: Outputs cleaned and structured data to JSONL or JSON formats for easy downstream processing.
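As an illustration of the encoding handling listed above, the following sketch tries a short list of encodings commonly seen on Greek websites and falls back to a lossy decode. It is an assumption about how such detection could work, not the package's actual middleware; the function name `decode_with_fallback` and the candidate list are hypothetical.

```python
# Encodings to try in order; UTF-8 first, then two legacy Greek code pages.
GREEK_CANDIDATES = ("utf-8", "cp1253", "iso8859_7")


def decode_with_fallback(raw: bytes, candidates=GREEK_CANDIDATES) -> tuple[str, str]:
    """Return (text, encoding_used) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep what we can and mark the result as lossy.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Trying UTF-8 first matters because legacy single-byte code pages accept almost any byte sequence, so they would otherwise mask valid UTF-8 input and produce mojibake.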

📂 Project Structure

greek_scraper/
├── __init__.py          # Package initialization and helper functions
├── cli.py               # Command-line interface for running the scraper
├── gpu_processor.py     # GPU-based text processing routines
├── middlewares.py       # Custom Scrapy middlewares for encoding and retry mechanisms
├── pipelines.py         # Data processing and storage pipelines
├── spider.py            # Main Scrapy spider for scraping Greek websites
└── utils.py             # Additional utility functions (if applicable)

Note: Additional modules or directories (e.g., tests/ or docs/) might be present in the repository.

⚙️ Installation

Prerequisites

  • Python: Version 3.10 or above.
  • CUDA: Ensure you have a compatible CUDA toolkit if using GPU acceleration.
  • CuPy: Install the version matching your CUDA setup.

Steps

Clone the Repository

git clone https://github.com/Charisn/Greek-web-scraper.git
cd Greek-web-scraper

Create a Virtual Environment (Optional but Recommended)

python -m venv venv
source venv/bin/activate      # Linux/MacOS
venv\Scripts\activate       # Windows

Install Dependencies

pip install -r requirements.txt

Install CuPy for GPU Acceleration

pip install cupy-cuda12x  # Adjust version to match your CUDA toolkit
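After installing CuPy, it can be useful to confirm that a CUDA device is actually visible before enabling GPU processing. The helper below is a minimal sketch; `gpu_available` is a hypothetical name and is not part of greek_scraper.

```python
def gpu_available() -> bool:
    """Return True if CuPy imports and reports at least one CUDA device."""
    try:
        import cupy  # optional dependency; may not be installed

        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # ImportError (CuPy missing) or a CUDA runtime error (no driver or device).
        return False


if __name__ == "__main__":
    print("GPU available:", gpu_available())
```

The result could then be passed to the configuration call, for example `greek_scraper.gpu(gpu_available())`, so the same script runs on machines with and without a GPU.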

🕵️ Usage

Command-Line Interface

If the package includes a CLI entry point (as defined in the setup), you can run:

greek-scraper --help

This should provide usage information and available options.

Python API

Single Domain Scraping

import greek_scraper

# Scrape a single domain
greek_scraper.scrape("example.gr")

Multi-Domain Scraping

import greek_scraper

# Scrape multiple domains simultaneously
domains = ["example.gr", "another.gr"]
greek_scraper.multi_scrape(domains)

Scraping from a File

import greek_scraper

# Provide a file containing a list of domains (one per line)
greek_scraper.from_file("domains.txt")
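If the domain list needs cleaning before it is handed to the scraper, a small loader like the one below may help. It is a sketch under assumptions: the README only states that the file holds one domain per line, so skipping blank lines, `#` comments, and duplicates is an added convenience, and `load_domains` is a hypothetical helper.

```python
from pathlib import Path


def load_domains(path: str) -> list[str]:
    """Read one domain per line, dropping blanks, '#' comments, and duplicates."""
    domains: list[str] = []
    seen: set[str] = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        entry = line.strip().lower()
        if not entry or entry.startswith("#"):
            continue
        if entry not in seen:
            seen.add(entry)
            domains.append(entry)  # keep first-seen order
    return domains
```

The cleaned list can then be passed to `greek_scraper.multi_scrape(load_domains("domains.txt"))`.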

Custom Configuration Example

import greek_scraper

# Configure settings before starting the scrape
greek_scraper.gpu(True)              # Enable GPU processing
greek_scraper.output_path("output.jsonl")  # Set output file name
greek_scraper.language("greek")      # Focus on Greek language content
greek_scraper.threads(4)             # Set concurrent requests per domain
greek_scraper.speed(7)               # Increase scraping speed (scale 1-10)

# Start scraping after configuration
greek_scraper.scrape("example.gr")
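Once a run has finished, the JSONL output can be read back one record per line. The sketch below makes no assumption about which fields each record contains, since the README does not document the schema; `iter_records` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Inspect the keys of each record to discover the actual schema.
    for record in iter_records("output.jsonl"):
        print(sorted(record.keys()))
```

Reading line by line keeps memory use flat even for large crawls, which is the main practical advantage of JSONL over a single JSON array.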

🛠 Configuration

The scraper exposes several configurable functions to tailor its behavior:

Function                     Description                                       Default Value
gpu(True/False)              Enable/disable GPU processing                     False
output_path("file.jsonl")    Specify the output file name                      scraped_data.jsonl
threads(n)                   Set number of concurrent requests per domain      1
speed(n)                     Adjust scraping speed (scale 1-10)                5
language("greek")            Filter extracted text by language (Greek only)    greek

Note: The functions can be chained or called independently before initiating the scraping process.
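To give a sense of what a Greek-only language filter might involve, the sketch below scores text by the share of its letters that fall in the Greek Unicode blocks. This is an assumed approach on the CPU, not the package's GPU routine; `greek_ratio`, `is_mostly_greek`, and the 0.5 threshold are hypothetical.

```python
def greek_ratio(text: str) -> float:
    """Fraction of alphabetic characters that belong to Greek Unicode blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        1
        for ch in letters
        # Greek and Coptic (U+0370..U+03FF) and Greek Extended (U+1F00..U+1FFF).
        if "\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"
    )
    return greek / len(letters)


def is_mostly_greek(text: str, threshold: float = 0.5) -> bool:
    """True when at least `threshold` of the letters are Greek."""
    return greek_ratio(text) >= threshold
```

Digits, punctuation, and whitespace are ignored, so a page with many numbers or URLs is judged only on its actual words.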

📜 License

This project is licensed under the GNU Lesser General Public License v2.1.
For details, see LGPL v2.1 License.

👨‍💻 Author

Charis Nikolaidis
GitHub | ncharis97@gmail.com

🌟 Show Your Support!

If you find this project useful, please consider giving it a ⭐ on GitHub!

❓ Contributing

Contributions, suggestions, and bug reports are always welcome!

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature/YourFeature.
  3. Commit your changes: git commit -m 'Add new feature'.
  4. Push to the branch: git push origin feature/YourFeature.
  5. Open a Pull Request.

For major changes, please open an issue first to discuss what you would like to change.
