The easiest way to crawl a website and produce LLM-ready markdown files

Project description

url2llm

I needed a super simple tool to crawl a website (or the links in an llms.txt file) into a formatted markdown file (without headers, navigation, etc.) to add to Claude or ChatGPT project documents.

I couldn't find an easy solution. There are some web-based tools with a few free credits, but if you are already paying for an LLM API, why pay someone else as well?

Quickstart

With uv (recommended):

Thanks to uv, you can easily run it from anywhere without installing anything:

uvx url2llm \
   --depth 1 \
   --url "https://modelcontextprotocol.io/llms.txt" \
   --instruction "I need documents related to developing MCP (model context protocol) servers" \
   --provider "gemini/gemini-2.5-flash-preview-04-17" \
   --api_key ${GEMINI_API_KEY}

Then drag ./model-context-protocol-documentation.md into ChatGPT/Claude!

[!TIP] You can install it as a proper CLI tool with uv tool install url2llm and then invoke it directly as url2llm.

With pip (alternative):

pip install url2llm
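
Once installed, the quickstart crawl runs the same way through the url2llm entry point:

url2llm \
   --depth 1 \
   --url "https://modelcontextprotocol.io/llms.txt" \
   --instruction "I need documents related to developing MCP (model context protocol) servers" \
   --provider "gemini/gemini-2.5-flash-preview-04-17" \
   --api_key ${GEMINI_API_KEY}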

What it does

The script uses Crawl4AI:

  1. For each URL in the crawl, the script produces a markdown file.
  2. It then asks the LLM to extract from each page only the content relevant to the given instruction.
  3. Finally, it merges all pages into one file and saves it (see the example after this list).
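
For example, keeping the intermediate pages makes steps 1 and 2 inspectable before the merge (the per-page file naming here is an assumption, not documented output; --keep_pages is described in the next section):

uvx url2llm \
   --depth 1 \
   --url "https://modelcontextprotocol.io/llms.txt" \
   --instruction "I need documents related to developing MCP (model context protocol) servers" \
   --provider "gemini/gemini-2.5-flash-preview-04-17" \
   --api_key ${GEMINI_API_KEY} \
   --keep_pages True

ls *.md   # per-page markdown files alongside the merged document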

Command args and hints

  • To use another LLM provider, just change --provider, e.g. to openai/gpt-4o (a combined example follows this list).
    • Always set --api_key; it is not always inferred correctly from env vars.
  • Give --instruction a clear goal. This guides the LLM in filtering out irrelevant pages.
  • Recommended depth (default = 2):
    • 2 or 1 for a normal website
    • 1 for an llms.txt
  • Provide --output_dir to change where files are saved (default = .).
  • If you need the individual pages, use --keep_pages True (default = False).
  • You can set the concurrency with --concurrency (default = 16).
  • The script deletes files shorter than --min_chars (default = 1000).
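
Putting these hints together, a deeper crawl of a normal website with a different provider might look like this (the URL, instruction, and model name are only illustrative):

url2llm \
   --depth 2 \
   --url "https://docs.crawl4ai.com/" \
   --instruction "I need documentation about building custom crawlers" \
   --provider "openai/gpt-4o" \
   --api_key ${OPENAI_API_KEY} \
   --output_dir ./docs \
   --keep_pages True \
   --concurrency 8 \
   --min_chars 500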

[!CAUTION] If you need to do anything more complex, use Crawl4AI directly and build it yourself: https://docs.crawl4ai.com/
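
If you go that route, installation is roughly the following (a sketch; check the Crawl4AI docs for the current instructions):

pip install crawl4ai
crawl4ai-setup   # sets up the headless browser Crawl4AI depends on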

Download files

Download the file for your platform.

Source Distribution

url2llm-0.3.3.tar.gz (6.6 kB)

Built Distribution


url2llm-0.3.3-py3-none-any.whl (7.1 kB)

File details

Details for the file url2llm-0.3.3.tar.gz.

File metadata

  • Download URL: url2llm-0.3.3.tar.gz
  • Size: 6.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for url2llm-0.3.3.tar.gz

  • SHA256: ea090a12b25ac627a5d6573690800cac5e1f9404a0c3f9ef8f8e8daa7ff04e13
  • MD5: f9dde6e779412749a9c419c8f071b071
  • BLAKE2b-256: c959a6af56a452c564630ab579b4160e596f695c19d0e640ce953bb5fa5ad9e2

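To verify a downloaded artifact against these values yourself, standard tools suffice (on macOS, shasum -a 256 stands in for sha256sum):

pip download url2llm==0.3.3 --no-deps --no-binary :all:
sha256sum url2llm-0.3.3.tar.gz
# should print ea090a12b25ac627a5d6573690800cac5e1f9404a0c3f9ef8f8e8daa7ff04e13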

File details

Details for the file url2llm-0.3.3-py3-none-any.whl.

File metadata

  • Download URL: url2llm-0.3.3-py3-none-any.whl
  • Size: 7.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.10

File hashes

Hashes for url2llm-0.3.3-py3-none-any.whl

  • SHA256: 746a1c3c14a903bb2eafd54fad54c908929315fcaf7c95f36f9acf51fcf3aa5f
  • MD5: 99b0cd42bd0316b6a15f70fecde0e6e5
  • BLAKE2b-256: 16a3f2e3459c583dcc8b97a117bdc9556bcab90eb1147d1258b8d41721147ed0

