A robust web scraping pipeline with smart static/dynamic fallback and semantic text classification.
DeScraper 🕷️
DeScraper is a lightweight and intelligent Python library that converts web pages into clean, structured, and LLM-ready text, while also discovering all links for easy crawling.
from descraper import run_scrape
# Scrape an article, a list, or a table-heavy page
data = run_scrape("https://en.wikipedia.org/wiki/List_of_Byzantine_emperors")
# Get clean, LLM-ready markdown content and all links
print(data['content'])
print(data['links']['internal'])
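Because every result lists the internal links it found, a small crawler can be layered on top. The following is a minimal sketch, not part of DeScraper: `crawl` and `max_pages` are hypothetical names, and the only assumption about the `scrape` argument is that it returns a dictionary with the `content` and `links` keys shown above.

```python
from collections import deque
from typing import Callable


def crawl(start_url: str, scrape: Callable[[str], dict], max_pages: int = 10) -> dict[str, str]:
    """Breadth-first crawl that follows internal links only.

    `scrape` is any callable that returns a dict with "content" and
    "links" -> "internal" keys (assumed to match the documented output).
    """
    seen = {start_url}           # URLs already queued, to avoid loops
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    pages: dict[str, str] = {}   # url -> cleaned content

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        data = scrape(url)
        pages[url] = data["content"]
        for link in data["links"]["internal"]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical usage with the library's entry point:
#   from descraper import run_scrape
#   pages = crawl("https://example.com", run_scrape, max_pages=5)
```

Passing the scraper in as an argument keeps the crawl logic independent of the network, so it can be exercised with a stub before pointing it at a real site.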
🇬🇧 English Documentation
Key Features
- 🧠 AI-Ready Content: Converts messy HTML into clean Markdown, including full support for converting <table> elements into Markdown tables. Perfect for RAG pipelines.
- 🚀 Smart Strategy: Automatically switches from a fast static scraper to a full browser engine (Selenium) if JavaScript rendering is detected or needed.
- 🛡️ Noise Reduction: Intelligently removes ads, navigation menus, footers, and other boilerplate to isolate the main content of a page.
- 📦 Production-Ready: Built-in retries, timeouts, and user-agent management for robust and reliable scraping.
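The static-to-dynamic switch described in the list above can be illustrated with a simple heuristic. This is a sketch under stated assumptions, not DeScraper's actual detection logic: it treats a page as needing a browser when the static HTML contains too little visible text. The names `visible_text_length`, `fetch_with_fallback`, and the `min_chars` threshold are hypothetical, and the two fetchers are passed in as plain callables.

```python
from html.parser import HTMLParser
from typing import Callable


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text_length(html: str) -> int:
    """Number of characters of human-visible text in the HTML."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks))


def fetch_with_fallback(
    url: str,
    fetch_static: Callable[[str], str],
    fetch_dynamic: Callable[[str], str],
    min_chars: int = 200,  # assumed threshold, tune per site
) -> tuple[str, str]:
    """Try the cheap static fetch first; use the browser only if the page looks empty."""
    html = fetch_static(url)
    if visible_text_length(html) >= min_chars:
        return html, "static"
    return fetch_dynamic(url), "dynamic"
```

In practice `fetch_static` might wrap an HTTP client and `fetch_dynamic` a Selenium session; keeping them as parameters means the expensive browser only starts when the heuristic asks for it.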
Installation
pip install descraper
Note: The dynamic mode requires Firefox. The necessary driver is downloaded automatically.
Output Structure
DeScraper returns a dictionary, with the most important key being content: a clean, LLM-ready string of the page's main information.
{
"url": "https://...",
"title": "Page Title",
"content": "# Title\n\nThe main article text, cleaned and formatted in markdown, including tables...",
"links": {
"internal": ["https://.../page1", "https://.../page2"],
"external": ["https://google.com..."]
},
"structured_text": "[...]",
"images": "[...]"
}
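Since `content` is Markdown, one common way to prepare it for a RAG index is to split it into sections at each heading. The helper below is an illustrative sketch and not part of DeScraper; `split_by_headings` is a hypothetical name, and it assumes headings use the `#` style shown in the sample output. It does not special-case fenced code blocks, so a line starting with `#` inside one would be treated as a heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX headings: "# Title", "## Section", ...


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading.

    Text before the first heading becomes a section with an empty heading.
    """
    sections = []
    current = {"heading": "", "text": []}

    def flush() -> None:
        # Keep the section only if it has a heading or some non-blank text.
        if current["heading"] or any(line.strip() for line in current["text"]):
            sections.append(
                {"heading": current["heading"], "text": "\n".join(current["text"]).strip()}
            )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"heading": match.group(2).strip(), "text": []}
        else:
            current["text"].append(line)
    flush()
    return sections
```

Each returned section can then be embedded on its own, with the heading kept as metadata for retrieval.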
🇹🇷 Turkish Documentation
Key Features
- 🧠 AI-Ready Content: Converts messy HTML into clean Markdown text, including conversion of <table> tags into Markdown tables. Ideal for RAG systems.
- 🚀 Smart Strategy: Automatically switches from the fast static scraper to a full browser engine (Selenium) when it detects JavaScript-rendered sites or thin content.
- 🛡️ Noise Blocking: Intelligently strips ads, menus, footers, and other irrelevant boilerplate to isolate the page's main content.
- 📦 Production-Grade: Includes built-in retries, timeouts, and user-agent management for robust, reliable scraping.
Installation
pip install descraper
Note: Dynamic mode requires the Firefox browser. The required driver is downloaded automatically.
Output Structure
DeScraper returns a dictionary whose most important key is content. This key holds a clean, LLM-ready text version of the page's main information.
{
"url": "https://...",
"title": "Page Title",
"content": "# Title\n\nThe main text, including tables, cleaned and formatted as markdown...",
"links": {
"internal": ["https://.../sayfa1", "https://.../sayfa2"],
"external": ["https://google.com..."]
},
"structured_text": "[...]",
"images": "[...]"
}
License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file descraper-0.2.2.tar.gz.
File metadata
- Download URL: descraper-0.2.2.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b02237fd577120389d394d16185b85fa99e763c4a6a362137bd7a00535967ce |
| MD5 | b14c41a550e70f14794535f5f2563e35 |
| BLAKE2b-256 | 76787ac7519973500f6aada08a1e45893566f592e8fe50dc082167a64674e54a |
File details
Details for the file descraper-0.2.2-py3-none-any.whl.
File metadata
- Download URL: descraper-0.2.2-py3-none-any.whl
- Upload date:
- Size: 12.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 88c1c0c6383699e94cdd112922e1d8cf3e0eb65f637db1b0b17204339c0b1e12 |
| MD5 | 85faed2252173a5b9107f29f751b6812 |
| BLAKE2b-256 | 621ec769e8d5f4e6773ced6e37ae3b7ee21903c655741e640a8315638b58a2dc |