Ultra-fast and efficient web scraper with GPU utilization for text cleaning and JSON output. Supports generic and language-specific scraping.

Greek Web Scraper

A high-performance web scraper built on Scrapy and optimized for Greek-language content extraction. This tool leverages GPU acceleration (via CuPy) for text processing and features robust retry mechanisms to ensure reliable scraping across multiple domains.

🚀 Features

  • Efficient Web Scraping: Leverages Scrapy with custom middlewares for encoding handling and retries.
  • GPU Acceleration: Utilizes CuPy for GPU-based text processing to speed up cleaning and filtering of Greek text.
  • Robust Encoding Handling: Automatically detects and converts text encodings to handle diverse content.
  • Custom Retry Mechanism: Skips problematic domains to minimize downtime and maximize throughput.
  • Parallel Domain Scraping: Configurable concurrency allows simultaneous scraping of multiple domains.
  • Automatic Text Extraction: Integrates trafilatura and BeautifulSoup for precise content extraction.
  • Flexible Storage Pipelines: Outputs cleaned and structured data to JSONL or JSON formats for easy downstream processing.
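To make the "cleaning and filtering of Greek text" feature more concrete, here is a minimal CPU-only sketch of a Greek-content check. It is not the package's implementation: the names greek_ratio and is_mostly_greek, the 0.5 threshold, and the choice of Unicode ranges are all assumptions made for illustration.

```python
import unicodedata

# Assumed ranges: "Greek and Coptic" (U+0370-U+03FF) and "Greek Extended" (U+1F00-U+1FFF).
GREEK_RANGES = ((0x0370, 0x03FF), (0x1F00, 0x1FFF))


def _is_greek(ch: str) -> bool:
    """Return True if the character's code point falls in an assumed Greek range."""
    code = ord(ch)
    return any(lo <= code <= hi for lo, hi in GREEK_RANGES)


def greek_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters in `text` that are Greek."""
    # Normalise so accented letters are single precomposed code points.
    normalised = unicodedata.normalize("NFC", text)
    letters = [ch for ch in normalised if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if _is_greek(ch)) / len(letters)


def is_mostly_greek(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: keep text whose Greek-letter share meets the threshold."""
    return greek_ratio(text) >= threshold
```

A filter like this would keep a page whose letters are mostly Greek and drop one that is mostly Latin script; the real package may use a different criterion.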

📂 Project Structure

greek_scraper/
├── __init__.py          # Package initialization and helper functions
├── cli.py               # Command-line interface for running the scraper
├── gpu_processor.py     # GPU-based text processing routines
├── middlewares.py       # Custom Scrapy middlewares for encoding and retry mechanisms
├── pipelines.py         # Data processing and storage pipelines
├── spider.py            # Main Scrapy spider for scraping Greek websites
└── utils.py             # Additional utility functions (if applicable)

Note: Additional modules or directories (e.g., tests/ or docs/) might be present in the repository.

⚙️ Installation

Prerequisites

  • Python: Version 3.10 or above.
  • CUDA: Ensure you have a compatible CUDA toolkit if using GPU acceleration.
  • CuPy: Install the version matching your CUDA setup.

Steps

Clone the Repository

git clone https://github.com/Charisn/Greek-web-scraper.git
cd Greek-web-scraper

Create a Virtual Environment (Optional but Recommended)

python -m venv venv
source venv/bin/activate      # Linux/MacOS
venv\Scripts\activate       # Windows

Install Dependencies

pip install -r requirements.txt

Install CuPy for GPU Acceleration

pip install cupy-cuda12x  # Adjust version to match your CUDA toolkit

Note: CUDA Toolkit 12.1 is required.

https://developer.nvidia.com/cuda-12-1-0-download-archive
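After installing CuPy, it can be useful to confirm that a CUDA device is actually visible before enabling GPU processing. The helper below is a small sketch under that assumption; cupy_gpu_available is a hypothetical name and is not part of greek_scraper.

```python
def cupy_gpu_available() -> bool:
    """Return True if CuPy imports and reports at least one CUDA device."""
    try:
        import cupy  # raises ImportError when CuPy is not installed

        # getDeviceCount() raises if the CUDA driver or runtime is unusable.
        return cupy.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # Treat any import or CUDA runtime failure as "no usable GPU".
        return False
```

The returned boolean could then be passed to greek_scraper.gpu(...), so the scraper falls back to CPU processing on machines without a working CUDA setup.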

🕵️ Usage

Command-Line Interface

If the package includes a CLI entry point (as defined in the setup), you can run:

greek-scraper --help

This should provide usage information and available options.

Python API

Single Domain Scraping

import greek_scraper

# Scrape a single domain
greek_scraper.scrape("example.gr")

Multi-Domain Scraping

import greek_scraper

# Scrape multiple domains simultaneously
domains = ["example.gr", "another.gr"]
greek_scraper.multi_scrape(domains)

Scraping from a File

import greek_scraper

# Provide a file containing a list of domains (one per line)
greek_scraper.from_file("domains.txt")
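If you want to inspect or clean the domain list before handing it to the scraper, a file in the one-domain-per-line format can be parsed with a few lines of standard-library code. This is a sketch, not what from_file does internally: load_domains is a hypothetical helper, and skipping blank lines, skipping lines that start with "#", lower-casing, and de-duplicating are assumptions.

```python
from pathlib import Path


def load_domains(path):
    """Read one domain per line, dropping blanks, '#' comments, and duplicates."""
    domains = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        entry = line.strip().lower()
        if not entry or entry.startswith("#"):
            continue  # assumed convention: ignore empty lines and comments
        if entry not in seen:
            seen.add(entry)
            domains.append(entry)  # keep first-seen order
    return domains
```

The resulting list has the same shape as the one passed to greek_scraper.multi_scrape in the multi-domain example.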

Custom Configuration Example

import greek_scraper

# Configure settings before starting the scrape
greek_scraper.gpu(True)              # Enable GPU processing
greek_scraper.output_path("output.jsonl")  # Set output file name
greek_scraper.language("greek")      # Focus on Greek language content
greek_scraper.threads(4)             # Set concurrent requests per domain
greek_scraper.speed(7)               # Increase scraping speed (scale 1-10)

# Start scraping after configuration
greek_scraper.scrape("example.gr")
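Once a scrape has finished, the JSONL output can be read back with the standard library, since JSONL stores one JSON object per line. The sketch below assumes nothing about which fields the scraper writes; read_jsonl is a hypothetical helper and simply returns each line as a dictionary.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping empty lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, read_jsonl("output.jsonl") would load the file named in the configuration above; inspect the keys of the first record to see the actual schema.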

🛠 Configuration

The scraper exposes several configurable functions to tailor its behavior:

Function                     Description                                       Default Value
gpu(True/False)              Enable/disable GPU processing                     False
output_path("file.jsonl")    Specify the output file name                      scraped_data.jsonl
threads(n)                   Set number of concurrent requests per domain      1
speed(n)                     Adjust scraping speed (scale 1-10)                5
language("greek")            Filter extracted text by language (Greek only)    greek

Note: The functions can be chained or called independently before initiating the scraping process.

📜 License

This project is licensed under the GNU Lesser General Public License v2.1.
For details, see LGPL v2.1 License.

👨‍💻 Author

Charis Nikolaidis
GitHub | ncharis97@gmail.com

🌟 Show Your Support!

If you find this project useful, please consider giving it a ⭐ on GitHub!

❓ Contributing

Contributions, suggestions, and bug reports are always welcome!

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature/YourFeature.
  3. Commit your changes: git commit -m 'Add new feature'.
  4. Push to the branch: git push origin feature/YourFeature.
  5. Open a Pull Request.

For major changes, please open an issue first to discuss what you would like to change.
