A robust web scraping pipeline with smart static/dynamic fallback and semantic text classification.


DeScraper 🕷️

The Intelligent, AI-Ready Web Scraping Pipeline

DeScraper is a resilient and intelligent Python library designed for the modern web. Unlike traditional scrapers, it semantically analyzes content, automatically handles dynamic (JavaScript-heavy) sites, and produces clean, structured data perfect for LLM / RAG applications.

🇬🇧 English Documentation

Why DeScraper?

If you've ever built a scraper, you've faced these problems:

Dynamic Sites: You get empty HTML because the content is rendered with React/Vue.
Noisy Data: Navigation menus, footers, and ad text get mixed with the main article.
Bot Blocks: Simple HTTP requests are blocked with a 403 error.

DeScraper solves these for you:

🚀 Smart Strategy: It first tries a fast static request (requests). If the site requires JavaScript or the content looks sparse, it automatically switches to a full browser engine (selenium).
🧠 AI-Ready Output: It doesn't just strip HTML tags. It detects headings, paragraphs, lists, and tables, and writes clean, markdown-like text to the content field, ready to feed into a RAG pipeline or to use for fine-tuning an LLM.
🛡️ Production-Ready: Built-in features like automatic retries, configurable timeouts, and User-Agent management make it robust for real-world use.
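
The static-first fallback described above can be sketched roughly as follows. This is not DeScraper's actual implementation: `smart_fetch`, `visible_text_length`, the `min_chars` threshold, and the two injected fetcher callables are hypothetical names chosen for illustration, and "content looks sparse" is approximated here by a simple count of visible characters.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible characters, ignoring <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chars = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars += len(data.strip())


def visible_text_length(html):
    """Return the number of non-script, non-style text characters in html."""
    parser = _TextCounter()
    parser.feed(html)
    parser.close()
    return parser.chars


def smart_fetch(url, static_fetch, dynamic_fetch, min_chars=200):
    """Try the cheap static fetcher first; fall back to the browser fetcher.

    static_fetch and dynamic_fetch are caller-supplied callables that take
    a URL and return an HTML string (hypothetical interface).
    Returns a (html, mode) tuple where mode is "static" or "dynamic".
    """
    try:
        html = static_fetch(url)
    except Exception:
        # A blocked or failed request (e.g. HTTP 403) counts as "no content".
        html = ""
    if visible_text_length(html) >= min_chars:
        return html, "static"
    return dynamic_fetch(url), "dynamic"
```

In practice the static callable might wrap `requests.get(url, timeout=10).text` and the dynamic one a Selenium Firefox driver's `page_source`; injecting them keeps the decision logic testable without network access.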

Installation

pip install descraper

Note: The dynamic mode requires Firefox. The necessary driver is downloaded automatically.

Usage

1. As a Python Library (Recommended)

The simplest way to use DeScraper:

from descraper import run_scrape

url = "https://en.wikipedia.org/wiki/List_of_Byzantine_emperors"

# 'smart' mode is default: tries static, falls back to dynamic if needed.
data = run_scrape(url)

if data:
    # 1. AI-Ready Clean Content
    print("--- Clean Content for AI/RAG ---")
    print(data['content']) 
    # Example Output:
    # # List of Byzantine emperors
    # This is a list of the Byzantine emperors from the foundation of Constantinople...
    #
    # ## Palaiologan dynasty (1261–1453)
    # | Emperor | Reign | Notes |
    # | --- | --- | --- |
    # | Michael VIII Palaiologos | 1261–1282 | Reconquered Constantinople... |
    
    # 2. Metadata
    print(f"\nTitle: {data['title']}")
    print(f"Image Count: {len(data['images'])}")
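
Because the content field is markdown-like, one way to prepare it for a RAG pipeline is to split it into one chunk per heading. The helper below is a minimal sketch that works on any string shaped like the example output above; `split_by_headings` is a hypothetical name and is not part of DeScraper.

```python
import re

# Matches markdown-style headings such as "# Title" or "## Section".
_HEADING = re.compile(r"#{1,6}\s+(.*)")


def split_by_headings(content):
    """Split markdown-like text into a list of {"heading", "text"} chunks."""
    chunks = []
    heading, body = "", []

    def flush():
        # Emit the chunk collected so far, skipping completely empty ones.
        if heading or body:
            chunks.append({"heading": heading, "text": "\n".join(body)})

    for line in content.splitlines():
        match = _HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(1).strip(), []
        elif line.strip():
            body.append(line)
    flush()
    return chunks
```

Called as `split_by_headings(data['content'])`, this yields chunks that can be embedded individually, with each heading kept as lightweight context for its text.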

2. CLI Usage

For quick tests or getting a JSON output directly:

# Print to screen
web-scraper https://www.python.org

# Save to file
web-scraper https://www.python.org -o result.json

# Run in dynamic mode with a 10-second scroll/wait time
web-scraper "https://medium.com/tag/python" -s dynamic -w 10
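
A file saved with -o can then be inspected from Python. The sketch below assumes the saved JSON has the same fields as the dictionary documented in the output structure section (title, content, links, images); `load_result` and `summarize` are hypothetical helper names, not part of DeScraper.

```python
import json
from pathlib import Path


def load_result(path):
    """Read a JSON file such as result.json and return it as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize(result):
    """Return a few headline numbers for one scrape result.

    Missing fields are treated as empty so partial results do not raise.
    """
    links = result.get("links", {})
    return {
        "title": result.get("title", ""),
        "content_chars": len(result.get("content", "")),
        "internal_links": len(links.get("internal", [])),
        "external_links": len(links.get("external", [])),
        "images": len(result.get("images", [])),
    }
```

For example, `summarize(load_result("result.json"))` gives a quick check that a run produced non-empty content before the file is passed downstream.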

Output Structure (JSON)

DeScraper returns a rich dictionary:

{
  "url": "https://...",
  "title": "Page Title",
  "meta_description": "SEO description...",
  "content": "# Title\n\nThe main article text, cleaned and formatted in markdown, including tables...",
  "structured_text": [
    {"type": "heading_h1", "text": "Page Title", "tag": "h1"},
    {"type": "article_paragraph", "text": "First paragraph...", "tag": "p"}
  ],
  "links": {
    "internal": ["https://..."],
    "external": ["https://google.com..."]
  },
  "images": [{"src": "...", "alt": "..."}]
}
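
To illustrate how the structured_text entries relate to the markdown-like content field, the sketch below renders a list of such entries back into text. Only the two type values shown above (heading_h1 and article_paragraph) come from this document; treating any heading_hN type as a level-N heading and everything else as a plain paragraph is an assumption, and `render_blocks` is a hypothetical helper, not DeScraper's own renderer.

```python
import re


def render_blocks(structured_text):
    """Render structured_text entries as markdown-like text."""
    parts = []
    for block in structured_text:
        text = block.get("text", "").strip()
        if not text:
            continue
        # Assumption: heading types follow the pattern "heading_h1".."heading_h6".
        match = re.fullmatch(r"heading_h([1-6])", block.get("type", ""))
        if match:
            parts.append("#" * int(match.group(1)) + " " + text)
        else:
            parts.append(text)
    return "\n\n".join(parts)
```

Working from structured_text rather than content is useful when only certain block types are wanted, for example keeping headings and paragraphs while dropping everything else.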

🇹🇷 Turkish Documentation

Why DeScraper?

If you have written a scraper before, you have almost certainly run into these problems:

Dynamic Sites: You get empty HTML because the content is rendered with React/Vue.
Noisy Data: Menus, footers, and ad text get mixed into the main article.
Bot Protection: Simple HTTP requests are blocked with a 403 error.

DeScraper solves these for you:

🚀 Smart Strategy: It first tries the fast static method (requests). If the site requires JavaScript or the content is sparse, it automatically switches to a full browser engine (selenium).
🧠 AI-Ready Output: It doesn't just strip HTML tags. It detects headings, paragraphs, lists, and tables, and produces clean, markdown-like text in the content field that is suitable as AI input. This output is well suited to RAG (Retrieval-Augmented Generation) systems or LLM (Large Language Model) training.
🛡️ Production-Grade: Built-in features such as automatic retries, configurable timeouts, and User-Agent management make it robust for real-world use.

Installation

pip install descraper

Note: Dynamic mode requires the Firefox browser. The necessary driver is downloaded automatically.

Usage

1. As a Python Library (Recommended)

The simplest way to use it:

from descraper import run_scrape

url = "https://tr.wikipedia.org/wiki/Bizans_imparatorları_listesi"

# 'smart' mode is the default: tries static first, switches to dynamic if needed.
data = run_scrape(url)

if data:
    # 1. Clean content for AI/RAG
    print("--- Clean Text for AI ---")
    print(data['content'])
    # Example output (Turkish, as scraped from the Turkish Wikipedia page):
    # # Bizans imparatorları listesi
    # Bu madde, Konstantinopolis'in imparatorluk başkenti olarak kurulmasından...
    #
    # ## Palaiologos Hanedanı (1261-1453)
    # | Resim | İsim | Hüküm süresi | Notlar |
    # | --- | --- | --- | --- |
    # | | VIII. Mihail | 1261-1282 | Konstantinopolis'i Latinlerden geri aldı... |

    # 2. Metadata
    print(f"\nTitle: {data['title']}")
    print(f"Image Count: {len(data['images'])}")

2. Command Line (CLI)

To quickly test a URL or get JSON output:

# Prints to the screen
web-scraper https://www.python.org

# Saves to a file
web-scraper https://www.python.org -o sonuc.json

# Runs in dynamic mode, scrolling for 10 seconds
web-scraper "https://medium.com/tag/python" -s dynamic -w 10

Output Structure (JSON)

DeScraper returns a rich dictionary in the following format:

{
  "url": "https://...",
  "title": "Page Title",
  "meta_description": "SEO description...",
  "content": "# Title\n\nThe main text, cleaned and in markdown format, including tables...",
  "structured_text": [
    {"type": "heading_h1", "text": "Page Title", "tag": "h1"},
    {"type": "article_paragraph", "text": "First paragraph...", "tag": "p"}
  ],
  "links": {
    "internal": ["https://..."],
    "external": ["https://google.com..."]
  },
  "images": [{"src": "...", "alt": "..."}]
}

License

This project is licensed under the MIT License. See the LICENSE file for details.

Release history

0.2.3 (Feb 18, 2026)
0.2.2 (Feb 9, 2026)
0.2.1 (Feb 9, 2026)
0.2.0 (Feb 9, 2026), this version
0.1.4 (Feb 9, 2026)

Download files


Source Distribution

descraper-0.2.0.tar.gz (15.7 kB view details)

Uploaded Feb 9, 2026 Source

Built Distribution


descraper-0.2.0-py3-none-any.whl (14.2 kB view details)

Uploaded Feb 9, 2026 Python 3

File details

Details for the file descraper-0.2.0.tar.gz.

File metadata

Download URL: descraper-0.2.0.tar.gz
Upload date: Feb 9, 2026
Size: 15.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for descraper-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`2c734ea0463270b21eb9860bf133fb63583ee3c83bcd78a1fee80aa99e5301d3`
MD5	`7642a0e8d23a3737e979b7f7303774de`
BLAKE2b-256	`4dd6d8846667706e55e4bbb1df14334e8d2f9f255e89579b2e776ff7c36c0ca0`


File details

Details for the file descraper-0.2.0-py3-none-any.whl.

File metadata

Download URL: descraper-0.2.0-py3-none-any.whl
Upload date: Feb 9, 2026
Size: 14.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for descraper-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`758860ecb438791c99cda5ad74c285c34db4bcba52e66a3743ec32565a0745cd`
MD5	`64670ec159137a4734fdd9787eecad45`
BLAKE2b-256	`72535c2d88c1fdf0c045bcc54162398132c765165661d90fc720fc946ff75045`

