
The easiest way to crawl a website and produce LLM-ready markdown files

Project description

url2llm

I needed a super simple tool to crawl a website (or the links in an llms.txt) into formatted markdown files (without headers, navigation, etc.) to add to Claude or ChatGPT project documents.

I couldn't find an easy solution. There are some web-based tools with a few free credits, but if you are already paying for an LLM API, why also pay someone else?

What it does

The script uses Crawl4AI:

  1. For each URL in the crawl, the script produces a markdown file.
  2. It then asks the LLM to extract only the content relevant to the given instruction and saves each result to disk (a rough sketch of these two steps follows the list).
  3. Finally, it merges all the files into one and saves the merged file.
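
A minimal sketch of steps 1–2 for a single URL, assuming Crawl4AI's AsyncWebCrawler plus a litellm-style completion call; the prompt wording, the crawl_and_filter helper, and the file-naming scheme are illustrative assumptions, not the actual url2llm implementation:

# Illustrative sketch only -- not the actual url2llm code.
# Assumes crawl4ai's AsyncWebCrawler and litellm's completion(); check both docs for the current APIs.
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler
from litellm import completion


async def crawl_and_filter(url: str, instruction: str, provider: str, api_key: str, out_dir: Path) -> None:
    # Step 1: fetch the page and let Crawl4AI turn it into markdown.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
    raw_md = str(result.markdown)

    # Step 2: ask the LLM to keep only the content relevant to the instruction.
    response = completion(
        model=provider,  # e.g. "gemini/gemini-2.5-flash-preview-04-17" or "openai/gpt-4o"
        api_key=api_key,
        messages=[{
            "role": "user",
            "content": f"{instruction}\n\nFrom the page below, keep only the relevant content "
                       f"(no headers, navigation or footers) and answer in markdown:\n\n{raw_md}",
        }],
    )
    filtered_md = response.choices[0].message.content

    # Save one markdown file per URL; the merge step combines them afterwards.
    out_dir.mkdir(parents=True, exist_ok=True)
    name = url.rstrip("/").split("/")[-1] or "index"
    (out_dir / f"{name}.md").write_text(filtered_md)


if __name__ == "__main__":
    asyncio.run(crawl_and_filter(
        "https://docs.crawl4ai.com/core/quickstart/",
        "I need documents related to crawling web pages",
        "gemini/gemini-2.5-flash-preview-04-17",
        "YOUR_API_KEY",
        Path("crawl_out"),
    ))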

Installation

  1. Clone the repo, then:

    • (Recommended, with uv) – Nothing to do

    • (Alternative, with pip) – Install crawl4ai and fire (pip install crawl4ai fire)

How to use

Run script with arguments:

uv run main.py \
   --url "<URL_OR_LLMS.TXT>" \
   --depth 1 \
   --instruction "I need documents related to <GOAL>" \
   --provider "<PROVIDER>/<MODELNAME>" \
   --api-key ${GEMINI_API_KEY} \
   --output-dir "<OUTPUT_DIR>"
  • To use another LLM provider, just change --provider, e.g. to openai/gpt-4o
    • Always set --api-key; it is not always inferred correctly from env vars
  • Provide a clear goal to --instruction. This will guide the LLM to filter out irrelevant pages.
  • Recommended depth (default = 2):
    • 2 or 1 for a normal website
    • 1 for an llms.txt
  • You can specify the concurrency with --concurrency (default = 16)
  • The script deletes files shorter than --min_chars characters (default = 1000); a rough sketch of this filter and the merge step follows below
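
As a rough illustration of the --min_chars filter and the merge step (the filter_and_merge name, the separator, and the merged/ layout are assumptions based on the behaviour described above, not the actual implementation):

# Illustrative sketch of the "drop short files, then merge" behaviour -- not the actual url2llm code.
from pathlib import Path


def filter_and_merge(out_dir: Path, merged_name: str = "merged.md", min_chars: int = 1000) -> Path:
    kept = []
    for md_file in sorted(out_dir.glob("*.md")):
        text = md_file.read_text()
        if len(text) < min_chars:
            md_file.unlink()  # too short to be useful context
        else:
            kept.append(text)

    # Write the merged document into a merged/ subfolder of the output directory.
    merged_dir = out_dir / "merged"
    merged_dir.mkdir(parents=True, exist_ok=True)
    merged_path = merged_dir / merged_name
    merged_path.write_text("\n\n---\n\n".join(kept))
    return merged_path


print(filter_and_merge(Path("crawl_out")))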

[!CAUTION] If you need to do something more complex, use Crawl4AI directly and build it yourself: https://docs.crawl4ai.com/
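
For reference, calling Crawl4AI directly looks roughly like this; the CrawlerRunConfig options shown are assumptions based on the Crawl4AI docs, so verify the parameter names there:

# Sketch of direct Crawl4AI usage with a custom run config -- verify against https://docs.crawl4ai.com/.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main() -> None:
    config = CrawlerRunConfig(
        excluded_tags=["nav", "header", "footer"],  # strip boilerplate before markdown conversion
        word_count_threshold=10,                    # ignore very small text blocks
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.crawl4ai.com/", config=config)
        print(result.markdown)


asyncio.run(main())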

How I use it

Thanks to uv, I can easily run it from anywhere on my system:

uv \
   --directory ~/Dev/url2llm/ \
   run main.py \
   --url "https://modelcontextprotocol.io/llms.txt" \
   --instruction "I need documents related to developing MCP (model context protocol) servers" \
   --provider "gemini/gemini-2.5-flash-preview-04-17" \
   --api_key ${GEMINI_API_KEY} \
   --output-dir ~/Desktop/crawl_out/

And drag ~/Desktop/crawl_out/merged/model-context-protocol-documentation.md into ChatGPT/Claude!

Install locally

uv pip install .

Publish

uv run pip install --upgrade twine

twine upload dist/*

(This assumes the dist/ directory has already been built, e.g. with uv build or python -m build.)

Download files

Download the file for your platform.

Source Distribution

url2llm-0.1.0.tar.gz (6.2 kB)

Uploaded Source

Built Distribution


url2llm-0.1.0-py3-none-any.whl (6.7 kB)

Uploaded Python 3

File details

Details for the file url2llm-0.1.0.tar.gz.

File metadata

  • Download URL: url2llm-0.1.0.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for url2llm-0.1.0.tar.gz:

  • SHA256: 41c338b541a3b743d7ee3c54ddca88b92995219b998e07ee33fdc627bfde06f5
  • MD5: d1ccacbe7ab40919d5b13f7541ff9f86
  • BLAKE2b-256: 2e9f5bbb6bb4454886f490d06ede82adaa6c40c9511f676456bdeb96abc20b71


File details

Details for the file url2llm-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: url2llm-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes

Hashes for url2llm-0.1.0-py3-none-any.whl:

  • SHA256: c97345e38ffa57de672d30c2200a53b23bbcd5ff42df79f2d05358c69a4f59ec
  • MD5: 6ba40ab7ff9a6242e5e2a1c69014ae41
  • BLAKE2b-256: 0217f56e483300c894643875eff3b102fb9f485874c6bc94cfe153e85ee01b72

