A tool to scrape web content into clean Markdown for LLMs.
Web2LLM
A command-line tool to scrape web pages, GitHub repos, local folders, and PDFs into clean, aggregated Markdown suitable for Large Language Models.
Description
This tool provides a unified interface to process various sources—from live websites and code repositories to local directories and PDF files—and convert them into a structured Markdown format. The clean, token-efficient output is ideal for use as context in prompts for Large Language Models, for Retrieval-Augmented Generation (RAG) pipelines, or for documentation archiving.
Installation
For standard scraping of static websites, local files, and GitHub repositories, install the base package:
```bash
pip install web2llm
```
To enable JavaScript rendering for Single-Page Applications (SPAs) and other dynamic websites, you must install the `[js]` extra, which includes Playwright:
```bash
pip install "web2llm[js]"
```
After installing the `js` extra, you must also download the necessary browser binaries for Playwright to function:
```bash
playwright install
```
Usage
Command-Line Interface
The tool is run from the command line with the following structure:
```bash
web2llm <SOURCE> -o <OUTPUT_NAME> [OPTIONS]
```
- `<SOURCE>`: The URL or local path to scrape.
- `-o, --output`: The base name for the output folder and the `.md` and `.json` files created inside it.
All scraped content is saved to a new directory at output/<OUTPUT_NAME>/.
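As a rough illustration of that layout, the shell helper below prints the paths a run would be expected to produce. The function name `web2llm_output_paths` is hypothetical, and the assumption that both files reuse `<OUTPUT_NAME>` as their base name is inferred from the description of `-o`, not verified against the tool.

```bash
# Hypothetical helper (not part of web2llm): print the paths that a run
# with `-o NAME` is expected to create, ASSUMING the .md and .json files
# reuse NAME as their base name inside output/NAME/.
web2llm_output_paths() {
  name=$1
  printf '%s\n' "output/$name/$name.md" "output/$name/$name.json"
}

web2llm_output_paths fastapi-src
```

A helper like this can be handy in scripts that pass the generated Markdown to another program after a scrape finishes.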
General Options:
- `--debug`: Enable debug mode for verbose, step-by-step output to stderr.
Web Scraper Options (For URLs):
- `--render-js`: Render JavaScript using a headless browser. Slower but necessary for SPAs. Requires installation with the `[js]` extra.
- `--check-content-type`: Force a network request to check the page's `Content-Type` header. Use for URLs that serve PDFs without a `.pdf` extension.
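To make the idea behind the content-type check concrete, here is a minimal sketch of deciding from HTTP response headers whether a URL serves a PDF. It is not web2llm's implementation; the function name `is_pdf_response` is hypothetical, and the example feeds it canned headers so that it runs without network access.

```bash
# Sketch only: reads raw HTTP response headers on stdin and succeeds when
# the last Content-Type header found is application/pdf. This illustrates
# the idea of a content-type check; it is NOT web2llm's actual code.
is_pdf_response() {
  tr -d '\r' \
    | grep -i '^content-type:' \
    | tail -n 1 \
    | grep -qi 'application/pdf'
}

# Example with canned headers, so no network request is made.
if printf 'HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n' | is_pdf_response; then
  echo "looks like a PDF"
fi
```

With a live URL, the headers could be supplied by something like `curl -sIL "$url" | is_pdf_response`, keeping in mind that some servers answer HEAD requests differently from GET requests.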
Filesystem Options (For GitHub & Local Folders):
- `--exclude <PATTERN>`: A `.gitignore`-style pattern for files/directories to exclude. Can be used multiple times.
- `--include <PATTERN>`: A pattern to re-include a file that would otherwise be ignored by default or by an `--exclude` rule. Can be used multiple times.
- `--include-all`: Disables all default and project-level ignore patterns. Explicit `--exclude` flags are still respected.
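The precedence described above (a re-include wins over an exclude) can be sketched in a few lines of shell. This is a simplified model under stated assumptions: the function `should_keep` is hypothetical, and it uses plain shell globs in which `*` also matches `/`, not the full `.gitignore` pattern syntax that web2llm accepts.

```bash
# should_keep PATH "EXCLUDE_PATTERNS" "INCLUDE_PATTERNS"
# Hypothetical model of include/exclude precedence, NOT web2llm's matcher.
# Patterns are whitespace-separated shell globs. Returns 0 (keep) or 1 (skip).
should_keep() {
  path=$1 excludes=$2 includes=$3
  set -f                      # keep pattern lists from expanding as filenames
  for pat in $includes; do    # an include match always keeps the file
    case $path in $pat) set +f; return 0 ;; esac
  done
  for pat in $excludes; do    # otherwise an exclude match drops it
    case $path in $pat) set +f; return 1 ;; esac
  done
  set +f
  return 0                    # no rule matched: keep by default
}

for f in src/app.py tests/test_app.py docs/LICENSE; do
  if should_keep "$f" 'tests/* docs/*' 'docs/LICENSE'; then
    echo "keep $f"
  else
    echo "skip $f"
  fi
done
```

Checking includes first is what lets a single file survive inside an otherwise excluded directory.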
Configuration
web2llm uses a hierarchical configuration system that gives you precise control over the scraping process:
- Default Config: The tool comes with a built-in `default_config.yaml` containing a robust set of ignore patterns for common development files and selectors for web scraping.
- Project-Specific Config: You can create a `.web2llm.yaml` file in the root of your project to override or extend the default settings. This is the recommended way to manage project-specific rules.
- CLI Arguments: Command-line flags provide the final layer of control, overriding any settings from the configuration files for a single run.
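The layering can be pictured as a three-level fallback. The sketch below shows only the precedence order for a single scalar setting; the function `resolve_setting` and the values passed to it are hypothetical, and the real tool merges YAML files (including extending lists), which this sketch does not attempt.

```bash
# resolve_setting DEFAULT PROJECT CLI
# Hypothetical illustration of the precedence order only: the CLI value
# beats the project config value, which beats the built-in default.
# An empty string stands for "not set at this layer".
resolve_setting() {
  default_value=$1 project_value=$2 cli_value=$3
  printf '%s\n' "${cli_value:-${project_value:-$default_value}}"
}

resolve_setting "false" ""     ""       # only the built-in default is set
resolve_setting "false" "true" ""       # project config overrides the default
resolve_setting "false" "true" "false"  # CLI value overrides both
```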
Examples
1. Scrape a specific directory within a GitHub repo:
```bash
web2llm 'https://github.com/tiangolo/fastapi' -o fastapi-src --include 'fastapi/'
```
2. Scrape a local project, excluding test and documentation folders:
```bash
web2llm '~/dev/my-project' -o my-project-code --exclude 'tests/' --exclude 'docs/'
```
3. Scrape a local project but re-include the LICENSE file, which is ignored by default:
```bash
web2llm '.' -o my-project-with-license --include '!LICENSE'
```
4. Scrape everything in a project except the `.git` directory:
```bash
web2llm . -o my-project-full --include-all --exclude '.git/'
```
5. Scrape just the "Installation" section from a webpage:
```bash
web2llm 'https://fastapi.tiangolo.com/#installation' -o fastapi-install
```
6. Scrape a PDF from an arXiv URL:
```bash
web2llm 'https://arxiv.org/pdf/1706.03762.pdf' -o attention-is-all-you-need
```
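Example 5 selects a section through the URL fragment (the part after `#`). The sketch below shows how that fragment can be split off a URL in plain shell. The function `url_fragment` is hypothetical, and how web2llm maps a fragment to a page section is not covered here.

```bash
# Hypothetical helper (not part of web2llm): print the fragment of a URL,
# i.e. everything after the first '#', or an empty line if there is none.
url_fragment() {
  case $1 in
    *'#'*) printf '%s\n' "${1#*\#}" ;;
    *)     printf '\n' ;;
  esac
}

url_fragment 'https://fastapi.tiangolo.com/#installation'
```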
Contributing
Contributions are welcome. Please refer to the project's issue tracker and CONTRIBUTING.md file for information on how to participate.
License
This project is licensed under the MIT License. See the LICENSE file for details.