Convert web content into JSON using a local Ollama LLM

web2json

web2json converts web content into structured JSON using a local Ollama server. It exposes a simple command line interface.

This repository began from code by abdo-Mansour and was adapted for use at the NOAA Global Systems Laboratory.

Installation

Clone the repository.
Install dependencies:
```
pip install -r requirements.txt
```
Install the package in editable mode so the web2json command is available:
```
pip install -e .
```
Optionally set OLLAMA_HOST (or OLLAMA_BASE_URL) and OLLAMA_MODEL to point to your Ollama instance and model. The dev container defaults to gemma3:12b.
Set OPENAI_API_KEY to use the OpenAI API instead of the local Ollama server. When this variable is present, the OLLAMA_* settings are ignored. Use OPENAI_MODEL to choose a model (default: gpt-3.5-turbo). In the dev container, an undefined OPENAI_API_KEY may appear as a placeholder value like ${{ secrets.OPENAI_API_KEY }}; this placeholder is ignored and the local Ollama server is used.
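
The precedence rules above can be sketched as a small resolver. This is an illustration only, not the package's code: resolve_backend is a hypothetical helper, checking OLLAMA_HOST before OLLAMA_BASE_URL is an assumption, and gemma3:12b is used as the fallback model only because it is the dev container default. http://localhost:11434 is Ollama's standard local address.

```python
import os

OLLAMA_DEFAULT_URL = "http://localhost:11434"  # Ollama's standard local address


def resolve_backend(env=None):
    """Pick a backend from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    # An unexpanded template such as "${{ secrets.OPENAI_API_KEY }}" counts as unset.
    if api_key and not api_key.startswith("${{"):
        return {"backend": "openai", "model": env.get("OPENAI_MODEL", "gpt-3.5-turbo")}
    # Assumption: OLLAMA_HOST wins when both host variables are set.
    base_url = env.get("OLLAMA_HOST") or env.get("OLLAMA_BASE_URL") or OLLAMA_DEFAULT_URL
    return {
        "backend": "ollama",
        "base_url": base_url,
        # Assumption: fall back to the dev container's default model.
        "model": env.get("OLLAMA_MODEL", "gemma3:12b"),
    }
```

Passing a plain dict instead of os.environ makes the rules easy to check without touching the real environment.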

Command line usage

Run the CLI module with the content to process and your schema definition. The tool can also crawl multiple pages from a starting URL:

```
python -m web2json.cli --schema SCHEMA [--url] [--crawl] [--max_pages N] [--debug] [--output FILE] CONTENT
```

CONTENT can be a URL or raw text.
--url tells the tool to treat CONTENT as a URL.
--schema accepts a JSON Schema definition directly or the path to a JSON file containing it.
When loading a schema from a file, an optional prompt string may be included to append additional instructions for the language model.
The file may also contain a postprocess section with link_patterns that map field names to regular expressions and optional css_selectors for extracting values using CSS paths. These settings help fill in missing data based on the cleaned HTML.
--crawl treats the content as a starting URL and processes each discovered page.
--max_pages limits how many pages are crawled when using --crawl (default: 10).
--debug prints the preprocessed content and other intermediate information to stderr.
--output writes the resulting JSON to FILE instead of only printing to stdout.
When a URL is provided, relative links in the page are converted to absolute URLs so they can be extracted correctly. The page URL itself is assigned to the url field if that key exists in the schema. Missing URLs can be filled using regex patterns in the post-processor when provided.
Character encoding is determined automatically when downloading pages so accented characters are preserved correctly.
If your schema defines a content field, the CLI removes common header, footer and navigation sections (including the official U.S. government banner) so that field only contains the main page body.
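
The relative-to-absolute link conversion mentioned above can be illustrated with the standard library alone. The LinkCollector class and absolute_links function below are hypothetical names for this sketch and are not taken from the package, which may do this differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from anchor tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs untouched and resolves relative ones.
                self.links.append(urljoin(self.base_url, value))


def absolute_links(html, base_url):
    """Return every anchor target in html as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Resolving against the page URL rather than the site root matters for links such as b.html, which are relative to the current directory.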

Example:

```
python -m web2json.cli https://example.com --url --schema '{"properties": {"title": {"type": "string", "description": "Page title"}}}'
```
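
Because --schema accepts either inline JSON or a file path, a loader has to tell the two apart. The sketch below shows one way to do that. load_schema is a hypothetical function, not the package's parse_schema_input, and the "prompt" key name is an assumption based on the option description above.

```python
import json
from pathlib import Path


def load_schema(value):
    """Return (schema, prompt, postprocess) from inline JSON or a JSON file path."""
    text = value.strip()
    if not text.startswith("{"):
        # Not inline JSON, so treat the argument as a path to a JSON file.
        text = Path(value).read_text(encoding="utf-8")
    schema = json.loads(text)
    # Pull out the optional extras so only the schema itself remains.
    prompt = schema.pop("prompt", None)  # assumed key name
    postprocess = schema.pop("postprocess", {})
    return schema, prompt, postprocess
```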

You can place the schema in a file instead. For example schema.json:

```
{
  "properties": {"title": {"type": "string"}},
  "postprocess": {
    "link_patterns": {"ftp_download": "(ftp://\\S+)"},
    "css_selectors": {"preview": {"selector": "a.preview", "attr": "href"}}
  }
}
```

Run the CLI using that file:

```
python -m web2json.cli https://example.com --url --schema schema.json
```
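
To make the role of link_patterns concrete, here is a minimal sketch of how a regex from the schema file could fill a field the model left empty. fill_from_link_patterns is a hypothetical function, and taking the first match in the cleaned text is an assumption about the behavior, not a description of the package's post-processor.

```python
import re


def fill_from_link_patterns(record, cleaned_html, link_patterns):
    """Fill empty fields in record using the first regex match in cleaned_html."""
    filled = dict(record)
    for field, pattern in link_patterns.items():
        if filled.get(field):
            continue  # keep values the model already produced
        match = re.search(pattern, cleaned_html)
        if match:
            # Prefer the first capture group; fall back to the whole match.
            filled[field] = match.group(1) if match.groups() else match.group(0)
    return filled
```

With the ftp_download pattern from schema.json, a page containing the text "ftp://data.example.com/file.zip" would yield that URL in the ftp_download field.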

To crawl and process multiple pages under https://example.com/docs/:

```
python -m web2json.cli https://example.com/docs/ --crawl --schema '{"properties": {"title": {"type": "string"}}}'
```
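
A crawl bounded by --max_pages can be sketched as a breadth-first traversal. The sketch rests on assumptions: the crawl function and its fetch_links callback are hypothetical, and both the breadth-first order and the restriction to URLs under the starting URL are guesses at reasonable behavior rather than documented facts about the package.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, fetch_links, max_pages=10):
    """Breadth-first crawl that stays under start_url and stops at max_pages.

    fetch_links(url) is a caller-supplied function returning the hrefs on a page.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Injecting fetch_links keeps the traversal logic testable without network access; a real implementation would download each page and extract its links at that point.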

The extracted JSON is printed to standard output. Unicode characters are preserved so accent marks appear correctly. Any schema validation errors are reported to standard error. When --debug is used, intermediate output such as the cleaned HTML is also sent to standard error.
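
The split between standard output and standard error, and the preservation of Unicode, can be illustrated as follows. The render and emit functions are hypothetical names for this sketch; the one certain detail is that json.dumps escapes non-ASCII characters unless ensure_ascii=False is passed.

```python
import json
import sys


def render(result):
    """Serialize result as indented JSON, keeping non-ASCII characters readable."""
    # ensure_ascii=False keeps "é" as-is instead of escaping it to \u00e9.
    return json.dumps(result, ensure_ascii=False, indent=2)


def emit(result, errors=(), out=sys.stdout, err=sys.stderr):
    """Write the JSON to out and any validation messages to err."""
    out.write(render(result) + "\n")
    for message in errors:
        print(f"schema validation error: {message}", file=err)
```

Keeping diagnostics on standard error means the JSON on standard output can be piped into another tool unchanged.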

Library usage

The pipeline components are exposed as Python classes so you can build custom workflows.

```
from web2json.cli import parse_schema_input
from web2json.preprocessor import BasicPreprocessor
from web2json.postprocessor import PostProcessor
from web2json.pipeline import Pipeline
from web2json.ai_extractor import OllamaLLMClient

# Parse an inline JSON schema with a title and a content field
schema_json = '{"properties": {"title": {"type": "string"}, "content": {"type": "string"}}}'
schema = parse_schema_input(schema_json)
# Exclude header, footer and navigation markup when cleaning HTML
pre = BasicPreprocessor(config={"remove_boilerplate": True})
llm = OllamaLLMClient()
post = PostProcessor()
# Wire the three stages together and run them on a raw HTML snippet
pipe = Pipeline(pre, llm, post)
result = pipe.run("<h1>Title</h1>", False, schema)
```

Regex link_patterns and css_selectors can be supplied for special cases where the language model misses links or other values.

The post-processor also applies simple heuristics to recover categories, keywords, and notable features directly from the cleaned HTML. The recovered values replace the model output when the model output differs from the page content.

Code overview

Preprocessor - cleans and normalizes HTML or text input. When remove_boilerplate is enabled, common header, footer and navigation elements (like the U.S. government banner) are stripped before text extraction. The CLI turns this setting on automatically if your schema includes a content field.
AIExtractor - sends a prompt to the LLM and returns the raw JSON text.
PostProcessor - repairs malformed JSON and adds missing URLs.

These pieces are wired together by the Pipeline class and driven by the CLI script.
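
The three-stage flow can be pictured with a stripped-down stand-in. None of the functions below are the package's classes: preprocess, repair_json, and run are hypothetical, the tag stripping is deliberately crude, and the repair step only handles the simple case of extra text around a JSON object.

```python
import json
import re


def preprocess(html):
    """Strip tags and collapse whitespace (a crude stand-in for HTML cleaning)."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())


def repair_json(raw):
    """Parse model output, tolerating stray text around the JSON object."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(raw[start:end + 1])


def run(html, extract, url=None):
    """Chain the stages; extract is any callable mapping text to raw JSON text."""
    data = repair_json(extract(preprocess(html)))
    if url and not data.get("url"):
        data["url"] = url  # fill in a missing URL during post-processing
    return data
```

Because extract is just a callable, a fake that returns canned text can stand in for the language model when experimenting.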

Running tests

Install pytest and run the suite:

```
pip install -r requirements.txt
pip install pytest
pytest
```

Tests also run automatically through GitHub Actions on every push and pull request.

Dev container

The .devcontainer folder provides a configuration for Dev Containers and GitHub Codespaces. Open the project in Visual Studio Code and choose Reopen in Container to automatically build the image and install the dependencies listed in requirements.txt. Environment variables like OPENAI_API_KEY, OLLAMA_BASE_URL, and OLLAMA_MODEL are set through the remoteEnv section of devcontainer.json. The repository is mounted inside the container at /workspace/web2json and the site-config directory is available at /workspace/site-config.

Additional tests

The test suite now covers the CLI utilities as well as core components. Additional tests live under tests/ and exercise:

The AIExtractor prompt formatting logic.
Error handling in PostProcessor.process when invalid JSON is returned.
The _fetch_content method in BasicPreprocessor.
run_pipeline success and error scenarios.
Pipeline operation with a mocked LLM.

Project details

Maintainers

HacksHaven

Release history

Version	Released
0.0.13 (this version)	Jun 27, 2025
0.0.12	Jun 20, 2025
0.0.11	Jun 20, 2025
0.0.10	Jun 17, 2025
0.0.9	Jun 17, 2025
0.0.8	Jun 17, 2025
0.0.7	Jun 17, 2025
0.0.6	Jun 17, 2025
0.0.5	Jun 17, 2025
0.0.4	Jun 17, 2025
0.0.2	Jun 17, 2025
0.0.1	Jun 17, 2025

Download files

Source Distribution

web2json-0.0.13.tar.gz (21.6 kB)

Uploaded Jun 27, 2025 Source

Built Distribution

web2json-0.0.13-py3-none-any.whl (17.5 kB)

Uploaded Jun 27, 2025 Python 3

File details

Details for the file web2json-0.0.13.tar.gz.

File metadata

Download URL: web2json-0.0.13.tar.gz
Upload date: Jun 27, 2025
Size: 21.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for web2json-0.0.13.tar.gz
Algorithm	Hash digest
SHA256	`f324543b6f2d4c4abfc271f0ae0e4831c2286fa9ff94cfd8de94194f9dafbd68`
MD5	`b92b9f4adcae3d4a2b4304863b1bf054`
BLAKE2b-256	`01e9d412950f00a8a2515013a7397d9cc3c9ca74e9209aa764af68b8c88fc634`

Provenance

The following attestation bundles were made for web2json-0.0.13.tar.gz:

Publisher: publish.yml on NOAA-GSL/web2json

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web2json-0.0.13.tar.gz
- Subject digest: f324543b6f2d4c4abfc271f0ae0e4831c2286fa9ff94cfd8de94194f9dafbd68
- Sigstore transparency entry: 253672407
- Sigstore integration time: Jun 27, 2025
Source repository:
- Permalink: NOAA-GSL/web2json@87e8c6a7a25a3e8d4269da8793fbe61b12e0cb22
- Branch / Tag: refs/heads/main
- Owner: https://github.com/NOAA-GSL
- Access: internal
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@87e8c6a7a25a3e8d4269da8793fbe61b12e0cb22
- Trigger Event: push

File details

Details for the file web2json-0.0.13-py3-none-any.whl.

File metadata

Download URL: web2json-0.0.13-py3-none-any.whl
Upload date: Jun 27, 2025
Size: 17.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for web2json-0.0.13-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8dfe4e777c34178467dc19e88062c44c53317fdf546a52326fc3229101fceca2`
MD5	`f9653787820f849c9582ef5b30094457`
BLAKE2b-256	`c6f01f12bad9ae6b5872ae25b02d973ae40596822694c9a6865750cdd17824f8`

Provenance

The following attestation bundles were made for web2json-0.0.13-py3-none-any.whl:

Publisher: publish.yml on NOAA-GSL/web2json

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: web2json-0.0.13-py3-none-any.whl
- Subject digest: 8dfe4e777c34178467dc19e88062c44c53317fdf546a52326fc3229101fceca2
- Sigstore transparency entry: 253672416
- Sigstore integration time: Jun 27, 2025
Source repository:
- Permalink: NOAA-GSL/web2json@87e8c6a7a25a3e8d4269da8793fbe61b12e0cb22
- Branch / Tag: refs/heads/main
- Owner: https://github.com/NOAA-GSL
- Access: internal
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@87e8c6a7a25a3e8d4269da8793fbe61b12e0cb22
- Trigger Event: push
