
A robust web scraping pipeline with smart static/dynamic fallback and semantic text classification.


DeScraper 🕷️

DeScraper is a lightweight and intelligent Python library that converts web pages into clean, structured, and LLM-ready text, while also discovering all links for easy crawling.

from descraper import run_scrape

# Scrape an article, a list, or a table-heavy page
data = run_scrape("https://en.wikipedia.org/wiki/List_of_Byzantine_emperors")

# Get clean, LLM-ready markdown content and all links
print(data['content'])
print(data['links']['internal'])
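
Because the result lists internal links, it can drive a simple crawler. The sketch below is illustrative only: `crawl` and its `max_pages` parameter are hypothetical helpers, not part of DeScraper, and the code assumes the scraper returns a dictionary shaped like the one shown under Output Structure. How `run_scrape` reports failures is not documented here, so the sketch skips both exceptions and empty results.

```python
from collections import deque
from typing import Callable, Dict


def crawl(start_url: str, scrape: Callable[[str], dict], max_pages: int = 10) -> Dict[str, str]:
    """Breadth-first crawl over internal links (hypothetical helper).

    `scrape` is any callable returning a dict with 'content' and
    'links' -> 'internal'; in real use, pass descraper.run_scrape.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages: Dict[str, str] = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            data = scrape(url)
        except Exception:
            continue  # assumption: a failed page is skipped, not fatal
        if not data:
            continue
        pages[url] = data.get("content", "")
        # Queue each unseen internal link exactly once.
        for link in data.get("links", {}).get("internal", []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Real usage (requires network access):
# from descraper import run_scrape
# pages = crawl("https://en.wikipedia.org/wiki/List_of_Byzantine_emperors", run_scrape, max_pages=5)
```

Passing the scraper in as an argument keeps the crawl logic testable without network access.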

🇬🇧 English Documentation

Key Features

  • 🧠 AI-Ready Content: Converts messy HTML into clean Markdown, including full support for converting <table> elements into Markdown tables. Perfect for RAG pipelines.
  • 🚀 Smart Strategy: Starts with a fast static scraper and automatically falls back to a full browser engine (Selenium) when it detects a JavaScript-rendered page or thin content.
  • 🛡️ Noise Reduction: Intelligently removes ads, navigation menus, footers, and other boilerplate to isolate the main content of a page.
  • 📦 Production-Ready: Built-in retries, timeouts, and user-agent management for robust and reliable scraping.
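
The static-to-dynamic fallback described above can be pictured with a rough sketch. This is not DeScraper's actual implementation: the function names, the 200-character threshold, and the "too little visible text" heuristic are all assumptions made for illustration. The two fetchers are passed in as callables, so any HTTP client or browser driver can stand behind them.

```python
import re
from typing import Callable

MIN_TEXT_CHARS = 200  # hypothetical threshold, not DeScraper's real value


def visible_text_length(html: str) -> int:
    """Rough count of visible characters: drop scripts/styles, then all tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return len(" ".join(text.split()))


def fetch_with_fallback(
    url: str,
    fetch_static: Callable[[str], str],
    fetch_dynamic: Callable[[str], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Try the cheap static fetch first; use the browser only if it looks empty."""
    try:
        html = fetch_static(url)
    except Exception:
        html = ""  # treat a failed static fetch like an empty page
    if visible_text_length(html) >= min_chars:
        return html
    return fetch_dynamic(url)
```

A page that ships only an empty root element plus a large script bundle has almost no visible text, so it would be routed to the browser engine, while a server-rendered article would never pay the cost of launching one.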

Installation

pip install descraper

Note: The dynamic mode requires Firefox. The necessary driver is downloaded automatically.
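
Before relying on the dynamic mode, it can help to check whether a Firefox executable is visible on the PATH. The helper below is a hypothetical convenience, not part of DeScraper, and the candidate binary names are assumptions. On macOS and Windows, Firefox is often installed outside the PATH, so a None result does not prove that Firefox is missing.

```python
import shutil
from typing import Optional, Sequence


def find_firefox(candidates: Sequence[str] = ("firefox", "firefox-esr")) -> Optional[str]:
    """Return the path of the first candidate binary found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_firefox() or "Firefox not found on PATH")
```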

Output Structure

DeScraper returns a dictionary. Its most important key is content, a clean, LLM-ready string holding the page's main information.

{
  "url": "https://example.com/actual-content",  # Final URL after redirects
  "original_url": "http://example.com",         # The URL you provided
  "title": "Example Domain",
  "content": "# Title\n\nThe main article text, cleaned and formatted in markdown, including tables...",
  "links": {
    "internal": ["https://.../page1", "https://.../page2"],
    "external": ["https://google.com..."]
  },
  "structured_text": "[...]",
  "images": "[...]"
}
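
For a RAG pipeline, the content string usually has to be cut into smaller pieces before embedding. The sketch below shows one simple way to do that, splitting at Markdown headings and capping chunk size. `split_by_headings` and `max_chars` are hypothetical names, not part of DeScraper, and the sample dictionary only mimics the structure shown above.

```python
import re
from typing import List


def split_by_headings(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split Markdown into chunks that begin at headings, each at most max_chars long."""
    # Zero-width split before every line that starts with 1-6 '#' and a space
    # (splitting on empty matches requires Python 3.7+).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Hard-wrap oversized sections into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


# Stand-in for the dictionary returned by the scraper.
data = {"content": "# Title\n\nIntro text.\n\n## Section\n\nMore text."}
for chunk in split_by_headings(data["content"]):
    print(repr(chunk))
```

The fixed-size cut can split a Markdown table mid-row, so table-heavy pages may need a larger max_chars or a splitter that respects row boundaries.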

🇹🇷 Turkish Documentation

Key Features

  • 🧠 AI-Ready Content: Converts messy HTML into clean Markdown text, including converting <table> elements into Markdown tables. Ideal for RAG systems.
  • 🚀 Smart Strategy: Automatically switches from the fast static scraper to a full browser engine (Selenium) when it detects a JavaScript-rendered site or thin content.
  • 🛡️ Noise Reduction: Intelligently strips ads, menus, footers, and other irrelevant boilerplate to isolate the main content of the page.
  • 📦 Production-Grade: Includes built-in retries, timeouts, and user-agent management for robust and reliable scraping.

Installation

pip install descraper

Note: The dynamic mode requires the Firefox browser. The necessary driver is downloaded automatically.

Output Structure

DeScraper returns a dictionary whose most important key is content. This key holds a clean, LLM-ready text version of the page's main information.

{
  "url": "https://...",
  "title": "Page Title",
  "content": "# Title\n\nThe main text, including tables, cleaned and formatted in markdown...",
  "links": {
    "internal": ["https://.../sayfa1", "https://.../sayfa2"],
    "external": ["https://google.com..."]
  },
  "structured_text": "[...]",
  "images": "[...]"
}

License

This project is licensed under the MIT License. See the LICENSE file for details.
