A tool to scrape web content into clean Markdown for LLMs.

Web2LLM

A command-line tool to scrape web pages, GitHub repos, local folders, and PDFs into clean, aggregated Markdown suitable for Large Language Models.

Description

This tool provides a unified interface to process various sources—from live websites and code repositories to local directories and PDF files—and convert them into a structured Markdown format. The clean, token-efficient output is ideal for use as context in prompts for Large Language Models, for Retrieval-Augmented Generation (RAG) pipelines, or for documentation archiving.
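To illustrate what a unified interface over several source types involves, here is a minimal sketch of how a source string could be classified before it is handed to a scraper. The function name `classify_source` and its rules are assumptions made for illustration; they are not taken from web2llm's code, which may decide differently (for example by inspecting the Content-Type header).

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Guess the kind of source from its shape (hypothetical helper)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Remote sources: GitHub repositories, PDFs, or ordinary web pages.
        if parsed.netloc.lower() in ("github.com", "www.github.com"):
            return "github"
        if parsed.path.lower().endswith(".pdf"):
            return "pdf"
        return "web"
    # Anything without an http(s) scheme is treated as a local path.
    if source.lower().endswith(".pdf"):
        return "pdf"
    return "local"


for example in (
    "https://github.com/tiangolo/fastapi",
    "https://arxiv.org/pdf/1706.03762.pdf",
    "https://fastapi.tiangolo.com/#installation",
    ".",
):
    print(example, "->", classify_source(example))
```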

Installation

For standard scraping of static websites, local files, and GitHub repositories, install the base package:

```bash
pip install web2llm
```

To enable JavaScript rendering for Single-Page Applications (SPAs) and other dynamic websites, you must install the [js] extra, which includes Playwright:

```bash
pip install "web2llm[js]"
```

After installing the js extra, you must also download the necessary browser binaries for Playwright to function:

```bash
playwright install
```
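A script that drives the tool may want to confirm that the optional Playwright dependency is present before asking for JavaScript rendering. The sketch below only checks that the `playwright` package is importable; the helper name `js_rendering_available` is an assumption for illustration, and the check does not verify that the browser binaries have been downloaded.

```python
from importlib.util import find_spec


def js_rendering_available() -> bool:
    """Return True if the 'playwright' package can be imported (hypothetical helper)."""
    # find_spec returns None when a top-level package is not installed.
    return find_spec("playwright") is not None


if js_rendering_available():
    print("Playwright found; JavaScript rendering can be requested.")
else:
    print('Playwright missing; install it with: pip install "web2llm[js]"')
```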


Usage

Command-Line Interface

The tool is run from the command line with the following structure:

```bash
web2llm <SOURCE> -o <OUTPUT_NAME> [OPTIONS]
```


  • <SOURCE>: The URL or local path to scrape.
  • -o, --output: The base name for the output folder and the .md and .json files created inside it.

All scraped content is saved to a new directory at output/<OUTPUT_NAME>/.
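The output layout described above can be sketched as a small path helper. It assumes the .md and .json files are named exactly after the output name and that the output/ directory is relative to the current working directory; the function name `expected_output_paths` and the dictionary keys are hypothetical.

```python
from pathlib import Path


def expected_output_paths(output_name: str, root: str = "output") -> dict:
    """Build the paths a run with '-o output_name' is expected to create (assumed layout)."""
    folder = Path(root) / output_name
    return {
        "folder": folder,
        "markdown": folder / f"{output_name}.md",   # aggregated Markdown
        "metadata": folder / f"{output_name}.json",  # accompanying JSON file
    }


paths = expected_output_paths("fastapi-src")
print(paths["markdown"].as_posix())
```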

General Options:

  • --debug: Enable debug mode for verbose, step-by-step output to stderr.

Web Scraper Options (For URLs):

  • --render-js: Render JavaScript using a headless browser. Slower but necessary for SPAs. Requires installation with the [js] extra.
  • --check-content-type: Force a network request to check the page's Content-Type header. Use for URLs that serve PDFs without a .pdf extension.
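The idea behind a Content-Type check can be sketched with the standard library alone. This illustrates the general technique, not web2llm's implementation: the helper names are assumptions, and some servers reject HEAD requests, in which case a GET would be needed instead.

```python
from urllib.request import Request, urlopen


def is_pdf_content_type(header_value) -> bool:
    """Return True if a Content-Type header value denotes a PDF."""
    if not header_value:
        return False
    # Drop parameters such as '; charset=binary' and compare the media type.
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/pdf"


def url_serves_pdf(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and inspect the Content-Type header (hypothetical helper)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return is_pdf_content_type(response.headers.get("Content-Type"))


print(is_pdf_content_type("application/pdf; charset=binary"))
print(is_pdf_content_type("text/html; charset=utf-8"))
```

This is why the flag helps for URLs that lack a .pdf extension: the decision is based on what the server reports, not on how the URL is spelled.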

Filesystem Options (For GitHub & Local Folders):

When scraping a local folder or a GitHub repository, web2llm will automatically find and respect the rules in the project's .gitignore file. This ensures that the scrape accurately reflects the intended source code of the project.

  • --exclude <PATTERN>: A .gitignore-style pattern for files/directories to exclude. Can be used multiple times.
  • --include <PATTERN>: A pattern to re-include a file that would otherwise be ignored by default or by an --exclude rule. Can be used multiple times.
  • --include-all: Disables all default, project-level, and .gitignore ignore patterns, providing a complete scrape of all text-based files. Explicit --exclude flags are still respected.
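The precedence among these options can be sketched as a small filter. This is a simplified model under stated assumptions: the matching uses `fnmatch` on single path components and is far less complete than real .gitignore semantics, the function names are hypothetical, and treating --include as the strongest rule is one reading of the descriptions above.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def matches(path: str, pattern: str) -> bool:
    """Very rough .gitignore-style match; real gitignore rules are richer."""
    parts = PurePosixPath(path).parts
    if pattern.endswith("/"):
        # Directory pattern: match if any parent directory has that name.
        name = pattern.rstrip("/")
        return any(fnmatchcase(part, name) for part in parts[:-1])
    # File pattern: match the full path or the file name alone.
    return fnmatchcase(path, pattern) or fnmatchcase(parts[-1], pattern)


def should_keep(path, ignored, excluded, included, include_all=False) -> bool:
    """Decide whether a file is scraped (assumed precedence, see lead-in)."""
    if any(matches(path, p) for p in included):
        return True   # --include re-includes an otherwise ignored file
    if any(matches(path, p) for p in excluded):
        return False  # explicit --exclude is always respected
    if include_all:
        return True   # default and .gitignore rules are disabled
    return not any(matches(path, p) for p in ignored)


default_rules = ["LICENSE", "build/", ".git/"]  # hypothetical ignore rules
print(should_keep("tests/test_api.py", default_rules, ["tests/"], []))
print(should_keep("LICENSE", default_rules, [], ["LICENSE"]))
print(should_keep("build/out.txt", default_rules, [], [], include_all=True))
```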

Configuration

web2llm uses a hierarchical configuration system that gives you precise control over the scraping process:

  1. Default Config: The tool comes with a built-in default_config.yaml containing a robust set of ignore patterns for common development files and selectors for web scraping.
  2. Project-Specific Config: You can create a .web2llm.yaml file in the root of your project to override or extend the default settings. This is the recommended way to manage project-specific rules.
  3. CLI Arguments: Command-line flags provide the final layer of control, overriding any settings from the configuration files for a single run.
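The three layers above can be pictured as successive dictionary merges, with later layers winning. The sketch below is an assumption about the general shape of such a system: the keys (`web`, `render_js`, `ignore`) are hypothetical and not taken from default_config.yaml, and replacing lists outright rather than extending them is a simplification.

```python
def merge_config(base: dict, override: dict) -> dict:
    """Return a new dict where values from 'override' win over 'base'."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists are replaced by the higher-priority layer.
            merged[key] = value
    return merged


def resolve_config(defaults: dict, project: dict, cli: dict) -> dict:
    """Apply the layers in order: defaults, then project file, then CLI flags."""
    return merge_config(merge_config(defaults, project), cli)


defaults = {"web": {"render_js": False}, "ignore": ["node_modules/"]}
project = {"ignore": ["node_modules/", "dist/"]}  # e.g. from .web2llm.yaml
cli = {"web": {"render_js": True}}                # e.g. from --render-js
settings = resolve_config(defaults, project, cli)
print(settings)
```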

Examples

1. Scrape a specific directory within a GitHub repo:

```bash
web2llm 'https://github.com/tiangolo/fastapi' -o fastapi-src --include 'fastapi/'
```


2. Scrape a local project, excluding test and documentation folders (the path is left unquoted so that the shell expands `~`):

```bash
web2llm ~/dev/my-project -o my-project-code --exclude 'tests/' --exclude 'docs/'
```


3. Scrape a local project but re-include the LICENSE file, which is ignored by default:

```bash
web2llm '.' -o my-project-with-license --include '!LICENSE'
```

4. Scrape everything in a project, including files normally ignored by .gitignore:

```bash
web2llm . -o my-project-full --include-all --exclude '.git/'
```

5. Scrape just the "Installation" section from a webpage:

```bash
web2llm 'https://fastapi.tiangolo.com/#installation' -o fastapi-install
```

6. Scrape a PDF from an arXiv URL:

```bash
web2llm 'https://arxiv.org/pdf/1706.03762.pdf' -o attention-is-all-you-need
```
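When several sources need to be scraped from a script, the same invocations can be assembled programmatically. The sketch below builds an argument list using only the flags documented above; the helper `build_command` is hypothetical, and the source does not describe a Python API, so the tool is treated purely as a command-line program.

```python
import shlex


def build_command(source, output, *, excludes=(), includes=(),
                  include_all=False, render_js=False):
    """Assemble a web2llm argument list from documented flags (hypothetical helper)."""
    cmd = ["web2llm", source, "-o", output]
    for pattern in excludes:
        cmd += ["--exclude", pattern]
    for pattern in includes:
        cmd += ["--include", pattern]
    if include_all:
        cmd.append("--include-all")
    if render_js:
        cmd.append("--render-js")
    return cmd


cmd = build_command(".", "my-project-full", include_all=True, excludes=[".git/"])
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without going through a shell, which avoids quoting problems with patterns and URLs.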


Contributing

Contributions are welcome. Please refer to the project's issue tracker and CONTRIBUTING.md file for information on how to participate.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Download files

Source Distribution

web2llm-0.5.1.tar.gz (25.2 kB)

Uploaded Source

Built Distribution

web2llm-0.5.1-py3-none-any.whl (23.5 kB)

Uploaded Python 3

File details

Details for the file web2llm-0.5.1.tar.gz.

File metadata

  • Download URL: web2llm-0.5.1.tar.gz
  • Upload date:
  • Size: 25.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for web2llm-0.5.1.tar.gz
Algorithm Hash digest
SHA256 785fe986fac390ec6a2a123135eae8569fa550d25c5829a602f7c34801add415
MD5 1e48c2e73acb98ce08eaf54ef751d83d
BLAKE2b-256 f2fab5f4024f58ed8d88ff0f630e4595d04294d5961cbff9d00d90bbcaf35f22

Provenance

The following attestation bundles were made for web2llm-0.5.1.tar.gz:

Publisher: publish-to-pypi.yml on herruzo99/web2llm

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file web2llm-0.5.1-py3-none-any.whl.

File metadata

  • Download URL: web2llm-0.5.1-py3-none-any.whl
  • Upload date:
  • Size: 23.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for web2llm-0.5.1-py3-none-any.whl
Algorithm Hash digest
SHA256 7bac2ec9868e71d6505493a8051970e19ab9fbbc610040f091467b4bf0b7125b
MD5 788d679bd991e481a883171b49fda391
BLAKE2b-256 5623ab12e770902d1e3aeb69cc550ae6d99a48455ab2807c13338e2ebe91bbe3

Provenance

The following attestation bundles were made for web2llm-0.5.1-py3-none-any.whl:

Publisher: publish-to-pypi.yml on herruzo99/web2llm
