Convert web content to Markdown and JSON files to fuel your GPTs and AI agents!

Web Scraper to Markdown 🌐✍️

This Python-based web scraper fetches content from URLs and exports it to Markdown and JSON. It is designed for simplicity and extensibility, and its JSON output is ready to upload to GPT models. It suits anyone who wants to use web content for AI training or analysis. 🤖💡

🚀 Quick Start

(Or even better, use Docker! 🐳)

Recommended installation using pipx (isolated environment)

pipx install crawler-to-md

Alternatively, install with pip

pip install crawler-to-md

Then run the scraper:

crawler-to-md --url https://www.example.com

🌟 Features

  • Scrapes web pages for content and metadata. 📄
  • Filters links by base URL. 🔍
  • Excludes URLs containing certain strings. ❌
  • Discovers links automatically, or scrapes the URLs listed in a file. 🔗
  • Rate limiting and delay support. 🕘
  • Exports data to Markdown and JSON, ready for GPT uploads. 📤
  • Exports each page as an individual Markdown file if --export-individual is used. 📝
  • Uses SQLite for efficient data management. 📊
  • Configurable via command-line arguments. ⚙️
  • Include or exclude specific HTML elements using CSS-like selectors (#id, .class, tag) during Markdown conversion. 🧩
  • Docker support. 🐳
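
The base-URL filter and the exclusion strings listed above can be pictured with a small sketch. This is not the project's own code: the function name `filter_links`, the prefix-match rule, and the sample URLs are assumptions made for illustration, and the real crawler may normalize or compare URLs differently.

```python
from urllib.parse import urldefrag, urljoin


def filter_links(page_url, hrefs, base_url, exclude=()):
    """Return absolute links under base_url that contain no excluded substring.

    Hypothetical helper: illustrates the idea, not crawler-to-md internals.
    """
    kept = []
    for href in hrefs:
        # Resolve relative links against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if not absolute.startswith(base_url):
            continue  # outside the base URL
        if any(token in absolute for token in exclude):
            continue  # contains an exclusion string
        if absolute not in kept:
            kept.append(absolute)  # de-duplicate, keeping first-seen order
    return kept


links = filter_links(
    page_url="https://www.example.com/docs/index.html",
    hrefs=[
        "intro.html",
        "/docs/api.html#top",
        "/blog/post",
        "https://other.example/x",
        "/docs/print/api.html",
    ],
    base_url="https://www.example.com/docs/",
    exclude=["/print/"],
)
print(links)
```

Only the two pages under `/docs/` that do not contain `/print/` survive in this sample; the blog post, the external link, and the print view are dropped.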

📋 Requirements

Python 3.10 or higher is required.

Project dependencies are declared in pyproject.toml. To install the project and its dependencies from a source checkout, run this from the repository root:

pip install .

🛠 Usage

Start scraping with the following command:

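crawler-to-md (--url <URL> | --urls-file <URLS_FILE>) [--output-folder ./output] [--cache-folder ./cache] [--overwrite-cache|-w] [--base-url <BASE_URL>] [--exclude-url <KEYWORD_IN_URL>] [--title <TITLE>] [--export-individual] [--rate-limit <REQUESTS_PER_MINUTE>] [--delay <SECONDS>] [--proxy <PROXY_URL>] [--include <SELECTOR>] [--exclude <SELECTOR>]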

Options:

  • --url, -u: The starting URL. 🌍
  • --urls-file: Path to a file containing URLs to scrape, one URL per line. If '-', read from stdin. 📁
  • --output-folder, -o: Where to save Markdown files (default: ./output). 📂
  • --cache-folder, -c: Where to store the database (default: ./cache). 💾
  • --overwrite-cache, -w: Overwrite existing cache database before scraping. 🧹
  • --base-url, -b: Filter links by base URL (default: the base of the starting URL). 🔎
  • --title, -t: Final title of the Markdown file (default: the URL). 🏷️
  • --exclude-url, -e: Exclude URLs containing this string (repeatable). ❌
  • --export-individual, -ei: Export each page as an individual Markdown file. 📝
  • --rate-limit, -rl: Maximum number of requests per minute (default: 0, no rate limit). ⏱️
  • --delay, -d: Delay between requests in seconds (default: 0, no delay). 🕒
  • --proxy, -p: Proxy URL for HTTP or SOCKS requests. 🌐
  • --include, -i: CSS-like selector (#id, .class, tag) to include before Markdown conversion (repeatable). ✅
  • --exclude, -x: CSS-like selector (#id, .class, tag) to exclude before Markdown conversion (repeatable). 🚫

One of the --url or --urls-file options is required.
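
When the scraper is driven from a script, the options above can be assembled into an argument list. The sketch below is a hypothetical wrapper, not part of crawler-to-md: `build_command` and its keyword names are invented here, and it assumes that "repeatable" means the flag is passed once per value. Only flags from the list above are used.

```python
import shlex


def build_command(url=None, urls_file=None, exclude_urls=(), include=(),
                  exclude=(), rate_limit=None, delay=None,
                  export_individual=False):
    """Build an argv list for crawler-to-md (hypothetical helper)."""
    if not url and not urls_file:
        raise ValueError("one of url or urls_file is required")
    argv = ["crawler-to-md"]
    if url:
        argv += ["--url", url]
    if urls_file:
        argv += ["--urls-file", urls_file]
    # Repeatable options: emit the flag once per value (assumed convention).
    for keyword in exclude_urls:
        argv += ["--exclude-url", keyword]
    for selector in include:
        argv += ["--include", selector]
    for selector in exclude:
        argv += ["--exclude", selector]
    if rate_limit is not None:
        argv += ["--rate-limit", str(rate_limit)]
    if delay is not None:
        argv += ["--delay", str(delay)]
    if export_individual:
        argv.append("--export-individual")
    return argv


argv = build_command(
    url="https://www.example.com",
    exclude_urls=["/login", "/tag/"],
    include=["#content"],
    exclude=[".sidebar", "footer"],
    rate_limit=30,
    export_individual=True,
)
# shlex.join quotes arguments such as "#content" for safe pasting into a shell.
print(shlex.join(argv))
```

The list can be handed to `subprocess.run(argv, check=True)` without going through a shell. When typing the command by hand, quote selectors that start with `#`, since an unquoted `#` at the start of a word begins a comment in POSIX shells.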

📚 Log level

By default, the WARN level is used. You can change it with the LOG_LEVEL environment variable.
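
The accepted values for LOG_LEVEL are not listed here. As a rough sketch of how such a variable is commonly handled in Python, assuming the standard logging level names (DEBUG, INFO, WARN, ERROR), a resolver could look like this; `resolve_log_level` is a hypothetical name and not the project's code.

```python
import logging
import os


def resolve_log_level(environ=os.environ, default="WARN"):
    """Map a LOG_LEVEL-style variable to a numeric logging level.

    Hypothetical helper; assumes standard Python logging level names.
    """
    name = environ.get("LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)  # an int for known level names
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {name!r}")
    return level


# Falls back to WARN when the variable is unset.
logging.basicConfig(level=resolve_log_level())
```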

🐳 Docker Support

Run with Docker:

docker run --rm \
  -v "$(pwd)/output:/app/output" \
  -v cache:/home/app/.cache/crawler-to-md \
  ghcr.io/obeone/crawler-to-md --url <URL>

Build from source:

docker build -t crawler-to-md .

docker run --rm \
  -v "$(pwd)/output:/app/output" \
  crawler-to-md --url <URL>

🤝 Contributing

Contributions are welcome! Feel free to submit pull requests or open issues. 🌟
