A robust web scraping pipeline with smart static/dynamic fallback and semantic text classification.

DeScraper 🕷️

A Python library that turns web pages into clean, structured, AI-ready content.

from descraper import run_scrape

# Scrape an article, a list, or a table-heavy page
data = run_scrape("https://en.wikipedia.org/wiki/List_of_Byzantine_emperors")

# Get clean, LLM-ready markdown content
print(data['content'])

🇬🇧 English Documentation

Key Features

  • 🧠 AI-Ready Content: Converts messy HTML into clean Markdown, including <table> elements, which are rendered as Markdown tables. Well suited to RAG pipelines.
  • 🚀 Smart Strategy: Starts with a fast static scraper and automatically switches to a full browser engine (Selenium) when it detects a JavaScript-rendered page or thin content.
  • 🛡️ Noise Reduction: Intelligently removes ads, navigation menus, footers, and other boilerplate to isolate the main content of a page.
  • 📦 Production-Ready: Built-in retries, timeouts, and user-agent management for robust and reliable scraping.
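
The static-to-dynamic switch listed above can be pictured with a small heuristic. The sketch below is not DeScraper's actual detection logic, which this page does not document. The names `looks_js_rendered` and `_TextCounter` and the 200-character threshold are hypothetical and chosen for illustration only. It uses only the standard library to measure how much visible text a statically fetched page contains and flags pages that ship scripts but almost no text.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; count everything else.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_js_rendered(html, min_text_chars=200):
    """Hypothetical heuristic: scripts present but very little visible text."""
    counter = _TextCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


# An empty application shell versus a page with real static content.
spa_page = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
static_page = "<html><body><p>" + "word " * 100 + "</p></body></html>"
```

A pipeline built on this idea would fetch the page statically first and retry in a browser only when the check returns True, so the slower engine is used only where it is likely to help.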

Installation

pip install descraper

Note: The dynamic mode requires Firefox. The necessary driver is downloaded automatically.
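
Before relying on dynamic mode, it can help to check that a Firefox executable is reachable. The snippet below is a minimal sketch using only the standard library; `find_firefox` and the candidate names are assumptions, not part of DeScraper. On macOS and Windows, Firefox is often installed outside PATH, so a `None` result is a hint to investigate rather than proof that Firefox is missing.

```python
import shutil


def find_firefox(candidates=("firefox", "firefox-esr")):
    """Return the path of the first Firefox executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_firefox()
    print(found or "Firefox not found on PATH; dynamic mode may be unavailable.")
```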

Output Structure

run_scrape returns a dictionary. Its most important key is content, a clean, LLM-ready Markdown string holding the page's main content.

{
  "url": "https://...",
  "title": "Page Title",
  "content": "# Title\n\nThe main article text, cleaned and formatted in markdown, including tables...",
  "structured_text": "[...]",
  "links": "{...}",
  "images": "[...]"
}
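
As a small illustration of working with this structure, the sketch below checks a result for the keys shown above and pulls out the first line of content. It runs on a hand-written sample dictionary rather than a live scrape. `EXPECTED_KEYS`, `missing_keys`, and `content_preview` are hypothetical helpers, not part of DeScraper, and the sketch makes no assumption about the types of structured_text, links, or images.

```python
# Keys taken from the output structure documented above.
EXPECTED_KEYS = ("url", "title", "content", "structured_text", "links", "images")


def missing_keys(data):
    """Return the documented keys that are absent from a result dictionary."""
    return [key for key in EXPECTED_KEYS if key not in data]


def content_preview(data, max_chars=80):
    """Return the first non-empty line of data['content'], truncated to max_chars."""
    text = data.get("content", "").strip()
    first_line = text.splitlines()[0] if text else ""
    return first_line[:max_chars]


# Sample dictionary mirroring the documented structure (values are placeholders).
sample = {
    "url": "https://example.com/article",
    "title": "Page Title",
    "content": "# Title\n\nThe main article text...",
    "structured_text": "[...]",
    "links": "{...}",
    "images": "[...]",
}

print(missing_keys(sample))
print(content_preview(sample))
```

With a real result, the same helpers would be called on the dictionary returned by `run_scrape(url)`.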

License

This project is licensed under the MIT License. See the LICENSE file for details.
