
A robust web scraping pipeline with smart static/dynamic fallback and semantic text classification.


DeScraper 🕷️

DeScraper is a lightweight Python library that converts web pages into clean, structured, LLM-ready text and collects each page's internal and external links so that crawling is straightforward.

from descraper import run_scrape

# Scrape an article, a list, or a table-heavy page
data = run_scrape("https://en.wikipedia.org/wiki/List_of_Byzantine_emperors")

# Get clean, LLM-ready markdown content and all links
print(data['content'])
print(data['links']['internal'])
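
Because every result lists the page's internal links, a small crawler can be layered on top. The following is a minimal sketch, not part of DeScraper: the `crawl` helper and its `max_pages` parameter are hypothetical, and it assumes only the `content` and `links` keys described in this README. How `run_scrape` reports a failed request is not documented here, so the sketch treats an empty result as an empty page.

```python
from collections import deque
from typing import Callable, Dict


def crawl(start_url: str, scrape: Callable[[str], dict], max_pages: int = 10) -> Dict[str, str]:
    """Breadth-first crawl that follows internal links only.

    `scrape` is any callable returning a dict shaped like the run_scrape
    output (keys 'content' and 'links' -> 'internal').
    """
    seen = {start_url}          # URLs already queued, to avoid revisiting
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    pages: Dict[str, str] = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        data = scrape(url) or {}  # assumption: a falsy result means "nothing scraped"
        pages[url] = data.get("content", "")
        for link in data.get("links", {}).get("internal", []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing `run_scrape` as the `scrape` argument, for example `crawl("https://en.wikipedia.org/wiki/List_of_Byzantine_emperors", run_scrape, max_pages=5)`, would collect the Markdown content of up to five pages keyed by URL. A real crawler would also want rate limiting and robots.txt handling, which are left out here.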

🇬🇧 English Documentation

Key Features

  • 🧠 AI-Ready Content: Converts messy HTML into clean Markdown and renders <table> elements as Markdown tables, which makes the output a good fit for RAG pipelines.
  • 🚀 Smart Strategy: Automatically switches from a fast static scraper to a full browser engine (Selenium) when it detects a JavaScript-rendered page or thin static content.
  • 🛡️ Noise Reduction: Intelligently removes ads, navigation menus, footers, and other boilerplate to isolate the main content of a page.
  • 📦 Production-Ready: Built-in retries, timeouts, and user-agent management for robust and reliable scraping.
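
The static-to-dynamic fallback described under Smart Strategy can be pictured with a short sketch. This README does not show DeScraper's actual detection logic, so everything below is an illustrative assumption: the function names, the 200-character threshold, and the "too little visible text" heuristic are hypothetical. The two fetchers are passed in as plain callables, so the sketch itself needs neither a network connection nor a browser.

```python
import re
from typing import Callable

# Hypothetical threshold, chosen for illustration only.
MIN_TEXT_CHARS = 200

_SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)
_TAG_RE = re.compile(r"<[^>]+>")


def visible_text_length(html: str) -> int:
    """Rough count of visible characters after dropping scripts, styles, and tags."""
    without_scripts = _SCRIPT_RE.sub(" ", html)
    text = _TAG_RE.sub(" ", without_scripts)
    return len(" ".join(text.split()))


def needs_browser(html: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Guess that a page needs JavaScript rendering when its static HTML is nearly empty."""
    return visible_text_length(html) < min_chars


def fetch_with_fallback(
    url: str,
    fetch_static: Callable[[str], str],
    fetch_dynamic: Callable[[str], str],
) -> str:
    """Try the cheap static fetch first; use the browser fetch only if the result looks thin."""
    html = fetch_static(url)
    if needs_browser(html):
        html = fetch_dynamic(url)
    return html
```

The point of this design is cost: a plain HTTP request is far cheaper than starting a browser, so the browser is launched only for pages whose static HTML carries too little text to be the real content.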

Installation

pip install descraper

Note: The dynamic mode requires Firefox. The necessary driver is downloaded automatically.

Output Structure

run_scrape returns a dictionary. Its most important key is content, a clean, LLM-ready Markdown string holding the page's main content.

{
  "url": "https://...",
  "title": "Page Title",
  "content": "# Title\n\nThe main article text, cleaned and formatted in markdown, including tables...",
  "links": {
    "internal": ["https://.../page1", "https://.../page2"],
    "external": ["https://google.com..."]
  },
  "structured_text": "[...]",
  "images": "[...]"
}
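
For a RAG pipeline, the content string usually has to be split into chunks before embedding. The sketch below is one possible approach and not part of DeScraper: `chunk_by_heading` and `max_chars` are hypothetical names, and it relies only on the `url`, `title`, and `content` keys shown above. It splits at Markdown headings and then caps each chunk's length. A line starting with `#` inside a code block would also trigger a split, which a production splitter would need to handle.

```python
import re
from typing import Dict, List


def chunk_by_heading(result: dict, max_chars: int = 1000) -> List[Dict[str, str]]:
    """Split result['content'] at Markdown headings, then cap each chunk's size."""
    content = result.get("content", "")
    # Zero-width split before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"^(?=#{1,6} )", content, flags=re.MULTILINE)
    chunks: List[Dict[str, str]] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Cut overly long sections into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            chunks.append({
                "url": result.get("url", ""),
                "title": result.get("title", ""),
                "text": section[start:start + max_chars],
            })
    return chunks
```

Each chunk keeps the source URL and title, so a retrieved passage can be traced back to the page it came from.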

🇹🇷 Turkish Documentation

Key Features

  • 🧠 AI-Ready Content: Converts messy HTML into clean Markdown text, including converting <table> elements into Markdown tables. Ideal for RAG systems.
  • 🚀 Smart Strategy: Automatically switches from the fast static scraper to a full browser engine (Selenium) when it detects JavaScript-rendered sites or thin content.
  • 🛡️ Noise Reduction: Strips ads, menus, footers, and other irrelevant boilerplate to isolate the page's main content.
  • 📦 Production-Grade: Includes built-in retries, timeouts, and user-agent management for robust and reliable scraping.

Installation

pip install descraper

Note: Dynamic mode requires the Firefox browser. The necessary driver is downloaded automatically.

Output Structure

DeScraper returns a dictionary whose most important key is content. This key holds a clean, LLM-ready text version of the page's main information.

{
  "url": "https://...",
  "title": "Page Title",
  "content": "# Title\n\nThe main text, cleaned and formatted in markdown, including tables...",
  "links": {
    "internal": ["https://.../sayfa1", "https://.../sayfa2"],
    "external": ["https://google.com..."]
  },
  "structured_text": "[...]",
  "images": "[...]"
}

License

This project is licensed under the MIT License. See the LICENSE file for details.
