# url2llm

The easiest way to crawl a website and produce LLM-ready markdown files.
I needed a super simple tool to crawl a website (or the links in an llms.txt) into formatted markdown files (without headers, navigation, etc.) to add to Claude or ChatGPT project documents.

I couldn't find an easy solution. There are some web-based tools with a few free credits, but if you are already paying for an LLM API, why pay someone else as well?
## What it does

The script uses Crawl4AI:

- For each URL in the crawl, the script produces a markdown file.
- It then asks the LLM to extract only the content relevant to the given instruction and saves all files to disk.
- Finally, it merges all files into one and saves the merged file.
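The filter-and-merge flow above can be sketched in a few lines. This is a simplified illustration, not the script's actual code: `llm_extract` is a hypothetical stand-in for the real LLM call, and the crawl is assumed to have already produced raw markdown per URL.

```python
from pathlib import Path


def llm_extract(markdown: str, instruction: str) -> str:
    """Hypothetical stand-in for the LLM call that keeps only the content
    relevant to `instruction`; here it simply passes the text through."""
    return markdown


def save_and_merge(pages: dict[str, str], out_dir: Path, instruction: str) -> Path:
    """Save one filtered markdown file per crawled URL, then merge them all."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for i, (url, raw_md) in enumerate(pages.items()):
        filtered = llm_extract(raw_md, instruction)
        page_file = out_dir / f"page_{i:03d}.md"
        # Keep the source URL as a comment so the merged file stays traceable.
        page_file.write_text(f"<!-- source: {url} -->\n\n{filtered}")
        parts.append(page_file.read_text())
    merged = out_dir / "merged.md"
    merged.write_text("\n\n---\n\n".join(parts))
    return merged
```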
## Installation

Clone the repo, then:

- (Recommended, with uv) – nothing to do.
- (Alternative, with pip) – install `crawl4ai` and `fire`.
## How to use

Run the script with arguments:

```shell
uv run main.py \
  --url "<URL_OR_LLMS.TXT>" \
  --depth 1 \
  --instruction "I need documents related to <GOAL>" \
  --provider "<PROVIDER>/<MODELNAME>" \
  --api-key ${GEMINI_API_KEY} \
  --output-dir "<OUTPUT_DIR>"
```
- To use another LLM provider, just change `--provider`, e.g. to `openai/gpt-4o`.
- Always set `--api-key`; it is not always inferred correctly from env vars.
- Provide a clear goal in `--instruction`. This guides the LLM to filter out irrelevant pages.
- Recommended depth (default = `2`): `2` or `1` for a normal website, `1` for an llms.txt.
- You can set the concurrency with `--concurrency` (default = 16).
- The script deletes files shorter than `--min_chars` (default = 1000).
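The `--min_chars` cleanup can be pictured like this. A minimal sketch, not the script's actual implementation; the function name is mine:

```python
from pathlib import Path


def prune_short_files(out_dir: Path, min_chars: int = 1000) -> list[Path]:
    """Delete markdown files shorter than min_chars. Very short pages are
    usually error pages or stubs with no useful content for the LLM."""
    removed = []
    for md_file in sorted(out_dir.glob("*.md")):
        if len(md_file.read_text()) < min_chars:
            md_file.unlink()
            removed.append(md_file)
    return removed
```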
> [!CAUTION]
> If you need to do more complex stuff, use Crawl4AI directly and build it yourself: https://docs.crawl4ai.com/
## How I use it

Thanks to uv, I can easily run it from anywhere on my system:

```shell
uv \
  --directory ~/Dev/url2llm/ \
  run main.py \
  --url "https://modelcontextprotocol.io/llms.txt" \
  --instruction "I need documents related to developing MCP (model context protocol) servers" \
  --provider "gemini/gemini-2.5-flash-preview-04-17" \
  --api-key ${GEMINI_API_KEY} \
  --output-dir ~/Desktop/crawl_out/
```
And drag ~/Desktop/crawl_out/merged/model-context-protocol-documentation.md into ChatGPT/Claude!
## Locally

```shell
uv pip install .
```

## Publish

```shell
uv run pip install --upgrade twine
twine upload dist/*
```
## Download files
### File details

Details for the file url2llm-0.1.0.tar.gz.

- Download URL: url2llm-0.1.0.tar.gz
- Upload date:
- Size: 6.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `41c338b541a3b743d7ee3c54ddca88b92995219b998e07ee33fdc627bfde06f5` |
| MD5 | `d1ccacbe7ab40919d5b13f7541ff9f86` |
| BLAKE2b-256 | `2e9f5bbb6bb4454886f490d06ede82adaa6c40c9511f676456bdeb96abc20b71` |
### File details

Details for the file url2llm-0.1.0-py3-none-any.whl.

- Download URL: url2llm-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c97345e38ffa57de672d30c2200a53b23bbcd5ff42df79f2d05358c69a4f59ec` |
| MD5 | `6ba40ab7ff9a6242e5e2a1c69014ae41` |
| BLAKE2b-256 | `0217f56e483300c894643875eff3b102fb9f485874c6bc94cfe153e85ee01b72` |