Convert web content to Markdown & JSON files to fuel your GPTs and AI agents!

Development Status
- 3 - Alpha
Intended Audience
- Developers
Programming Language
Topic
- Software Development :: Build Tools

Project description

Web Scraper to Markdown 🌐✍️

This Python-based web scraper fetches content from URLs and exports it to Markdown and JSON. It is designed for simplicity and extensibility, and its JSON output is ready to upload to GPT models. It is ideal for those looking to use web content for AI training or analysis. 🤖💡

🚀 Quick Start

(Or even better, use Docker! 🐳)

Recommended installation using pipx (isolated environment)

pipx install crawler-to-md

Alternatively, install with pip

pip install crawler-to-md

Then run the scraper:

crawler-to-md --url https://www.example.com

🌟 Features

Scrapes web pages for content and metadata. 📄
Filters links by base URL. 🔍
Excludes URLs containing certain strings. ❌
Finds links automatically, or scrapes a list of URLs read from a file. 🔗
Rate limiting and delay support. 🕘
Exports data to Markdown and JSON, ready for GPT uploads. 📤
Exports each page as an individual Markdown file if --export-individual is used. 📝
Uses SQLite for efficient data management. 📊
Configurable via command-line arguments. ⚙️
Docker support. 🐳
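
The base-URL filter and the exclusion filter listed above can be pictured with a short sketch. This is not the package's own code: the function name `filter_links` and its parameters are hypothetical, and it assumes "filter by base URL" means a simple prefix match on the absolute link.

```python
from urllib.parse import urldefrag, urljoin


def filter_links(page_url, hrefs, base_url, exclude=()):
    """Return absolute links under base_url that contain no excluded string.

    Hypothetical helper for illustration; crawler-to-md may filter differently.
    """
    kept = []
    seen = set()
    for href in hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        # Assumption: the base-URL filter is a plain prefix check.
        if not absolute.startswith(base_url):
            continue
        # Assumption: an excluded string matches anywhere in the URL.
        if any(term in absolute for term in exclude):
            continue
        # Keep the first occurrence of each link, preserving order.
        if absolute not in seen:
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

Under these assumptions, a link such as `/docs/private/a` found on `https://www.example.com/docs/intro` would be dropped when `exclude=["private"]`, while `guide` would resolve to `https://www.example.com/docs/guide` and be kept.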

📋 Requirements

Python 3.10 or higher is required.

Project dependencies are declared in pyproject.toml. To install them from a checkout of the source, run:

pip install .

🛠 Usage

Start scraping with the following command:

crawler-to-md --url <URL> [--output-folder ./output] [--cache-folder ./cache] [--overwrite-cache|-w] [--base-url <BASE_URL>] [--exclude <KEYWORD_IN_URL>] [--title <TITLE>] [--urls-file <URLS_FILE>] [--export-individual] [--rate-limit <REQUESTS_PER_MINUTE>] [--delay <SECONDS>] [-p <PROXY_URL>]

Options:

--url, -u: The starting URL. 🌍
--urls-file: Path to a file containing URLs to scrape, one URL per line. If '-', read from stdin. 📁
--output-folder, -o: Where to save Markdown files (default: ./output). 📂
--cache-folder, -c: Where to store the database (default: ./cache). 💾
--overwrite-cache, -w: Overwrite existing cache database before scraping. 🧹
--base-url, -b: Filter links by base URL (default: URL's base). 🔎
--title, -t: Final title of the Markdown file. Defaults to the URL. 🏷️
--exclude, -e: Exclude URLs containing this string (repeatable). ❌
--export-individual, -ei: Export each page as an individual Markdown file. 📝
--rate-limit, -rl: Maximum number of requests per minute (default: 0, no rate limit). ⏱️
--delay, -d: Delay between requests in seconds (default: 0, no delay). 🕒
--proxy, -p: Proxy URL for HTTP or SOCKS requests. 🌐

One of the --url or --urls-file options is required.
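
To make the --rate-limit and --delay options concrete, here is a rough sketch of how a requests-per-minute cap translates into a wait between requests. The README does not say how the two options interact, so this sketch assumes the stricter one wins. The names `min_interval` and `paced` are hypothetical and not part of crawler-to-md.

```python
import time


def min_interval(rate_limit=0, delay=0.0):
    """Seconds to wait between requests; 0.0 means no throttling.

    rate_limit is in requests per minute (0 disables it), delay is in seconds.
    Assumption: when both are set, the longer wait applies.
    """
    per_request = 60.0 / rate_limit if rate_limit > 0 else 0.0
    return max(per_request, delay)


def paced(items, interval, sleep=time.sleep, clock=time.monotonic):
    """Yield items so that consecutive yields are at least `interval` seconds apart."""
    last = None
    for item in items:
        if last is not None and interval > 0:
            # Sleep only for the part of the interval that has not yet elapsed.
            remaining = interval - (clock() - last)
            if remaining > 0:
                sleep(remaining)
        last = clock()
        yield item
```

For example, `min_interval(rate_limit=30)` gives 2.0 seconds between requests, and looping over `paced(urls, 2.0)` would space the fetches accordingly.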

📚 Log level

By default, the WARN level is used. You can change it with the LOG_LEVEL environment variable.
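
As an illustration of this behavior, the sketch below resolves a LOG_LEVEL value to a Python logging level and falls back to WARN. It assumes the variable takes the standard Python level names, which the README does not spell out. `resolve_log_level` is a hypothetical name, not a function of the package.

```python
import logging
import os

# Assumption: LOG_LEVEL accepts the usual Python logging level names.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(env=None, default="WARN"):
    """Return the numeric logging level named by LOG_LEVEL, or WARNING if unset or unknown."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", default).strip().upper()
    return _LEVELS.get(name, logging.WARNING)


if __name__ == "__main__":
    logging.basicConfig(level=resolve_log_level())
    logging.getLogger(__name__).warning("log level resolved from LOG_LEVEL")
```

With such a scheme, running `LOG_LEVEL=DEBUG crawler-to-md --url <URL>` would be the way to get more verbose output.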

🐳 Docker Support

Run with Docker:

docker run --rm -v $(pwd)/output:/app/output -v cache:/app/cache ghcr.io/obeone/crawler-to-md --url <URL>

Build from source:

docker build -t crawler-to-md .
docker run --rm -v $(pwd)/output:/app/output crawler-to-md --url <URL>

🤝 Contributing

Contributions are welcome! Feel free to submit pull requests or open issues. 🌟

Release history

0.6.0 (Nov 19, 2025)
0.5.0 (Aug 6, 2025)
0.4.0 (Jul 13, 2025), this version
0.3.0 (Jul 9, 2025)
0.2.4 (Jul 6, 2025)
0.2.3 (Jul 6, 2025)
0.2.2 (Jul 6, 2025)
0.2.1 (Jul 6, 2025)
0.2.0 (Jul 6, 2025)
0.1.1 (Jul 6, 2025)
0.1.0 (Jul 6, 2025)

Download files

Source Distribution

crawler_to_md-0.4.0.tar.gz (21.8 kB)

Uploaded Jul 13, 2025 Source

Built Distribution

crawler_to_md-0.4.0-py3-none-any.whl (14.9 kB)

Uploaded Jul 13, 2025 Python 3

File details

Details for the file crawler_to_md-0.4.0.tar.gz.

File metadata

Download URL: crawler_to_md-0.4.0.tar.gz
Upload date: Jul 13, 2025
Size: 21.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for crawler_to_md-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`da2acbaec33d477604ee7217cb422bbd12fcad51b13496373d14d19a02e08060`
MD5	`aaa17b8021bee6d8e3a7b7d67f54e182`
BLAKE2b-256	`3f3c3bf8b73616d361d8a29e47823dac8d7bcd5309e1edfc7f0ce96fb1b75821`
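
A published digest like the SHA256 above can be checked against a downloaded file with a few lines of standard-library Python. This is a minimal sketch: `sha256_of` and `matches` are illustrative names, and the file is assumed to sit in the current directory under its published name.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 equals the expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the check would be `matches("crawler_to_md-0.4.0.tar.gz", "da2acbaec33d477604ee7217cb422bbd12fcad51b13496373d14d19a02e08060")`.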

File details

Details for the file crawler_to_md-0.4.0-py3-none-any.whl.

File metadata

Download URL: crawler_to_md-0.4.0-py3-none-any.whl
Upload date: Jul 13, 2025
Size: 14.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for crawler_to_md-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b5dbe04bb1bc414065d85a3bdf37fca366a94b6d4be17b1be4b0158216bf9b1f`
MD5	`4bc9c5c536c97178e5938e6308f0570a`
BLAKE2b-256	`8a8867b22f20f4ab5f3f7520c5bfa97ed0e655a68083b3afd140d16ecae4e377`

crawler-to-md 0.4.0
