A robust web scraping pipeline with smart static/dynamic fallback and semantic text classification.

DeScraper 🕷️

The Intelligent, AI-Ready Web Scraping Pipeline

PyPI version | License: MIT | Python 3.8+

DeScraper is a resilient Python library for scraping the modern web. Unlike traditional scrapers, it semantically classifies page content, automatically handles dynamic (JavaScript-heavy) sites, and produces clean, structured data suited to LLM and RAG applications.


🇬🇧 English Documentation

Why DeScraper?

If you've ever built a scraper, you've probably run into these problems:

  1. Dynamic Sites: You get nearly empty HTML because the content is rendered client-side by a framework such as React or Vue.
  2. Noisy Data: Navigation menus, footers, and ad text get mixed in with the main article.
  3. Bot Blocks: Simple HTTP requests are rejected with a 403 error.

DeScraper solves these for you:

  • 🚀 Smart Strategy: It first tries a fast static request (requests). If the site requires JavaScript or the content looks sparse, it automatically switches to a full browser engine (selenium).
  • 🧠 AI-Ready Output: It doesn't just strip HTML tags. It detects headings, paragraphs, lists, and tables, and writes clean, markdown-like text to the content field, ready to feed into a RAG pipeline or to use for fine-tuning an LLM.
  • 🛡️ Production-Ready: Built-in features like automatic retries, configurable timeouts, and User-Agent management make it robust for real-world use.
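
The static-first strategy described above can be sketched in a few lines. This is only an illustration of the idea, not DeScraper's actual implementation: the names fetch_smart and visible_text_length, the injected fetch_static / fetch_dynamic callables, and the 200-character threshold are all assumptions made for this example. In a real setup the two callables would wrap requests and selenium respectively.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html):
    """Return the number of visible text characters in an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return sum(len(chunk) for chunk in parser.chunks)


def fetch_smart(url, fetch_static, fetch_dynamic, min_chars=200):
    """Try the cheap static fetch first; use the browser fetch as a fallback.

    fetch_static and fetch_dynamic are hypothetical callables that take a URL
    and return HTML. The fallback triggers when the static fetch raises
    (for example on a 403) or returns too little visible text.
    """
    try:
        html = fetch_static(url)
    except Exception:
        html = None
    if html is not None and visible_text_length(html) >= min_chars:
        return html, "static"
    return fetch_dynamic(url), "dynamic"


if __name__ == "__main__":
    # Fake fetchers so the sketch runs without network access or a browser.
    empty_shell = "<html><body><div id='root'></div><script>boot()</script></body></html>"
    rendered = "<html><body><p>" + "rendered text " * 50 + "</p></body></html>"
    html, mode = fetch_smart("https://example.com", lambda u: empty_shell, lambda u: rendered)
    print(mode, visible_text_length(html))
```

Passing the fetchers in as arguments keeps the decision logic testable on its own; the sparse-content heuristic here is deliberately crude, and the library may use different signals.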

Installation

pip install descraper

Note: Dynamic mode requires Firefox to be installed. The necessary driver is downloaded automatically.

Usage

1. As a Python Library (Recommended)

The simplest way to use DeScraper:

from descraper import run_scrape

url = "https://en.wikipedia.org/wiki/List_of_Byzantine_emperors"

# 'smart' mode is default: tries static, falls back to dynamic if needed.
data = run_scrape(url)

if data:
    # 1. AI-Ready Clean Content
    print("--- Clean Content for AI/RAG ---")
    print(data['content']) 
    # Example Output:
    # # List of Byzantine emperors
    # This is a list of the Byzantine emperors from the foundation of Constantinople...
    #
    # ## Palaiologan dynasty (1261–1453)
    # | Emperor | Reign | Notes |
    # | --- | --- | --- |
    # | Michael VIII Palaiologos | 1261–1282 | Reconquered Constantinople... |
    
    # 2. Metadata
    print(f"\nTitle: {data['title']}")
    print(f"Image Count: {len(data['images'])}")
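
For RAG use, the content string usually needs to be split into chunks before indexing. The sketch below splits it at markdown-style heading lines. It assumes content follows the format shown in the example output above (headings starting with one or more # characters); split_by_headings is a hypothetical helper written for this example, not part of DeScraper.

```python
import re

# A heading line: one to six '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(content):
    """Split markdown-like text into sections, one per heading.

    Returns a list of dicts with 'heading', 'level' and 'text' keys.
    Text that appears before the first heading gets heading=None.
    """
    sections = []
    heading, level, lines = None, 0, []

    def flush():
        # Store the section collected so far, unless it is completely empty.
        text = "\n".join(lines).strip()
        if heading is not None or text:
            sections.append({"heading": heading, "level": level, "text": text})

    for line in content.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = (
        "# List of Byzantine emperors\n"
        "This is a list of the Byzantine emperors...\n"
        "\n"
        "## Palaiologan dynasty (1261–1453)\n"
        "| Emperor | Reign | Notes |\n"
        "| --- | --- | --- |\n"
        "| Michael VIII Palaiologos | 1261–1282 | Reconquered Constantinople... |\n"
    )
    for section in split_by_headings(sample):
        print(section["level"], section["heading"], len(section["text"]))
```

In practice you would call split_by_headings(data['content']) and embed each section separately, keeping the heading as metadata. Tables stay attached to the section they appear in because their lines do not start with #.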

2. CLI Usage

For quick tests, or to get JSON output directly:

# Print to screen
web-scraper https://www.python.org

# Save to file
web-scraper https://www.python.org -o result.json

# Run in dynamic mode with a 10-second scroll/wait time
web-scraper "https://medium.com/tag/python" -s dynamic -w 10
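
A file saved with -o can then be inspected from Python. The sketch below assumes the file holds a single JSON object with the fields listed under Output Structure (JSON); summarize_result is a hypothetical helper for this example, and result.json is the file name used in the command above.

```python
import json
from pathlib import Path


def summarize_result(path):
    """Load a saved scrape result and return a few basic counts.

    Assumes the file contains one JSON object with 'title', 'content',
    'links' and 'images' keys; missing keys are treated as empty.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    links = data.get("links") or {}
    return {
        "title": data.get("title"),
        "content_chars": len(data.get("content") or ""),
        "internal_links": len(links.get("internal") or []),
        "external_links": len(links.get("external") or []),
        "images": len(data.get("images") or []),
    }


if __name__ == "__main__":
    result_path = Path("result.json")
    if result_path.exists():
        print(summarize_result(result_path))
```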

Output Structure (JSON)

DeScraper returns a dictionary with the following structure:

{
  "url": "https://...",
  "title": "Page Title",
  "meta_description": "SEO description...",
  "content": "# Title\n\nThe main article text, cleaned and formatted in markdown, including tables...",
  "structured_text": [
    {"type": "heading_h1", "text": "Page Title", "tag": "h1"},
    {"type": "article_paragraph", "text": "First paragraph...", "tag": "p"}
  ],
  "links": {
    "internal": ["https://..."],
    "external": ["https://google.com..."]
  },
  "images": [{"src": "...", "alt": "..."}]
}
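
If you want more control than the ready-made content field gives you, structured_text can be filtered and re-rendered. The sketch below keeps only headings and article paragraphs. Only the two type values shown above (heading_h1 and article_paragraph) come from this document; the assumption that other headings follow the heading_h<N> pattern, the nav_link type in the demo, and the render_blocks helper itself are hypothetical.

```python
def render_blocks(blocks, keep_prefixes=("heading_", "article_")):
    """Render selected structured_text blocks as markdown-like text.

    Blocks whose 'type' does not start with one of keep_prefixes, or whose
    text is empty, are dropped. Types of the form 'heading_h<N>' are rendered
    with N leading '#' characters; everything else is emitted as plain text.
    """
    lines = []
    for block in blocks:
        block_type = block.get("type", "")
        text = (block.get("text") or "").strip()
        if not text or not block_type.startswith(keep_prefixes):
            continue
        level_part = block_type[len("heading_h"):]
        if block_type.startswith("heading_h") and level_part.isdigit():
            lines.append("#" * int(level_part) + " " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


if __name__ == "__main__":
    demo = [
        {"type": "heading_h1", "text": "Page Title", "tag": "h1"},
        {"type": "nav_link", "text": "Home", "tag": "a"},  # hypothetical type
        {"type": "article_paragraph", "text": "First paragraph...", "tag": "p"},
    ]
    print(render_blocks(demo))
```

Filtering by type prefix is one way to use the semantic classification; check the type values your own scrapes actually produce before relying on a particular prefix.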

License

This project is licensed under the MIT License. See the LICENSE file for details.
