Convert web content into JSON using local Ollama LLM
web2json
web2json converts web content into structured JSON using a local Ollama server. It exposes a simple command line interface.
This repository began from code by abdo-Mansour and was adapted for use at the NOAA Global Systems Laboratory.
Installation
- Clone the repository.
- Install dependencies:
pip install -r requirements.txt
- Optionally set OLLAMA_HOST and OLLAMA_MODEL to point to your Ollama instance and model.
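As a rough illustration of how these two variables might be consumed, the sketch below reads them with fallbacks. The function name resolve_ollama_settings and both default values are assumptions made for this example (http://localhost:11434 is Ollama's usual local address, and the model name is a placeholder); web2json's own defaults may differ.

```python
import os

# Assumed fallbacks for illustration; web2json's real defaults may differ.
DEFAULT_HOST = "http://localhost:11434"  # Ollama's usual local address
DEFAULT_MODEL = "llama3"                 # placeholder model name


def resolve_ollama_settings(env=None):
    """Return (host, model), preferring OLLAMA_HOST and OLLAMA_MODEL when set."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST") or DEFAULT_HOST
    model = env.get("OLLAMA_MODEL") or DEFAULT_MODEL
    # Trim a trailing slash so later path joins do not produce "//".
    return host.rstrip("/"), model


host, model = resolve_ollama_settings()
print(f"Using model {model!r} at {host}")
```

Passing a plain dictionary instead of the real environment makes the lookup easy to exercise in tests.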
Command line usage
Run the CLI module with the content to process and your schema definition. The tool can also crawl multiple pages from a starting URL:
python -m web2json.cli --schema SCHEMA [--url] [--crawl] [--max_pages N] [--debug] [--output FILE] CONTENT
- CONTENT can be a URL or raw text.
- --url tells the tool to treat CONTENT as a URL.
- --schema accepts the schema definition directly or the path to a file containing it. Schemas may be defined using simple field definitions, JSON Schema, or a Python BaseModel.
- --crawl treats the content as a starting URL and processes each discovered page.
- --max_pages limits how many pages are crawled when using --crawl (default: 10).
- --debug prints the preprocessed content and other intermediate information to stderr.
- --output writes the resulting JSON to FILE instead of only printing to stdout.
- When a URL is provided, relative links in the page are converted to absolute URLs so they can be extracted correctly. The page URL itself is assigned to the url field if that key exists in the schema. Missing URLs may be filled automatically using regex patterns in the post-processor (default patterns handle download and preview links).
- Character encoding is determined automatically when downloading pages so accented characters are preserved correctly.
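The relative-to-absolute link conversion mentioned above can be pictured with the standard library alone. This is a minimal sketch, not web2json's actual preprocessor: the helper absolutize_links is a hypothetical name, and it only handles double-quoted href and src attributes.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." attributes with double-quoted values.
_LINK_ATTR = re.compile(r'\b(href|src)="([^"]*)"')


def absolutize_links(html, page_url):
    """Rewrite relative href/src values in html so they resolve against page_url."""
    def _fix(match):
        attr, target = match.group(1), match.group(2)
        # urljoin leaves already-absolute URLs unchanged.
        return f'{attr}="{urljoin(page_url, target)}"'

    return _LINK_ATTR.sub(_fix, html)


page = '<a href="/files/report.pdf">Report</a> <img src="img/logo.png">'
print(absolutize_links(page, "https://example.com/docs/index.html"))
```

A real implementation would more likely walk a parsed HTML tree, since a regular expression misses single-quoted or unquoted attributes.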
Example:
python -m web2json.cli https://example.com --url --schema "title: str = Page title"
To crawl and process multiple pages under https://example.com/docs/:
python -m web2json.cli https://example.com/docs/ --crawl --schema "title: str"
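To make the effect of --max_pages concrete, here is a hedged sketch of a bounded breadth-first crawl. It is not the tool's crawler: the crawl function, the get_links callback, and the rule that only URLs beginning with the start URL are followed are all assumptions for illustration, and the "site" is an in-memory dictionary so no network access is needed.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, get_links, max_pages=10):
    """Breadth-first walk from start_url, visiting at most max_pages pages.

    get_links(url) is a caller-supplied function returning the hrefs on a page.
    Only URLs that begin with start_url are followed (an assumed scoping rule).
    """
    seen = {start_url}
    order = []
    queue = deque([start_url])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop "#fragment" suffixes.
            target, _fragment = urldefrag(urljoin(url, href))
            if target.startswith(start_url) and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site map standing in for real HTTP fetches.
site = {
    "https://example.com/docs/": ["intro.html", "/about", "api/"],
    "https://example.com/docs/intro.html": ["./", "api/#top"],
    "https://example.com/docs/api/": ["../intro.html"],
}
print(crawl("https://example.com/docs/", lambda url: site.get(url, [])))
```

The seen set prevents revisiting pages that link back to each other, and the max_pages check bounds the work even on large sites.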
The extracted JSON is printed to standard output. Unicode characters are
preserved so accent marks appear correctly. Any schema validation errors are
reported to standard error. When --debug is used, intermediate output such as the cleaned HTML is also sent to standard error.
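The split between standard output and standard error, and the handling of accented characters, can be sketched as follows. The emit_result helper is hypothetical and only illustrates the behaviour described above; the one detail taken from Python itself is that json.dumps escapes non-ASCII characters unless ensure_ascii=False is passed.

```python
import json
import sys


def emit_result(data, errors, out=None, err=None):
    """Write data as JSON to out and each validation message to err."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    # ensure_ascii=False keeps accented characters as written instead of
    # replacing them with \uXXXX escape sequences.
    out.write(json.dumps(data, ensure_ascii=False, indent=2) + "\n")
    for message in errors:
        err.write(f"validation error: {message}\n")


emit_result({"title": "Prévisions météo"}, ["field 'url' is missing"])
```

Keeping diagnostics on standard error means the JSON on standard output can be piped straight into another program.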
Library usage
The pipeline components are exposed as Python classes so you can build custom workflows.
from web2json.cli import parse_schema_input
from web2json.preprocessor import BasicPreprocessor
from web2json.postprocessor import PostProcessor
from web2json.pipeline import Pipeline
from web2json.ai_extractor import OllamaLLMClient
# Build the schema from simple field definitions, one field per line.
schema = parse_schema_input("title: str\ncontent: str")
pre = BasicPreprocessor()
llm = OllamaLLMClient()
# Extra regex used to recover a "preview" URL when the model leaves it out.
post = PostProcessor(link_patterns={"preview": r"(https?://[^\s]+\.mp4)"})
pipe = Pipeline(pre, llm, post)
# The input here is raw HTML rather than a URL.
result = pipe.run("<h1>Title</h1>", False, schema)
The link_patterns option helps recover URLs when the LLM omits them from the output JSON.
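One way to picture this recovery step is sketched below. The fill_missing_links function is a hypothetical stand-in, not the PostProcessor's real code: it assumes a pattern is tried only when its key is empty in the extracted record, and that the first match in the source text wins.

```python
import re


def fill_missing_links(record, source_text, link_patterns):
    """Fill empty keys in record with the first regex match from source_text."""
    filled = dict(record)
    for key, pattern in link_patterns.items():
        if filled.get(key):
            continue  # the LLM already supplied a value for this key
        match = re.search(pattern, source_text)
        if match:
            # Prefer the first capture group when the pattern defines one.
            filled[key] = match.group(1) if match.groups() else match.group(0)
    return filled


text = "Watch the clip at https://example.com/media/storm.mp4 for details."
patterns = {"preview": r"(https?://[^\s]+\.mp4)"}
print(fill_missing_links({"title": "Storm", "preview": None}, text, patterns))
```

Values the model did return are left untouched, so the patterns act only as a fallback.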
Code overview
- Preprocessor - cleans and normalizes HTML or text input.
- AIExtractor - sends a prompt to the LLM and returns the raw JSON text.
- PostProcessor - repairs malformed JSON and adds missing URLs.
These pieces are wired together by the Pipeline class and driven by the CLI script.
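As an illustration of what "repairs malformed JSON" can involve, the sketch below handles two common problems in LLM output: chatter around the object and trailing commas. The repair_json function is a hypothetical example of the general technique, not the PostProcessor's actual logic.

```python
import json
import re


def repair_json(raw):
    """Best-effort parse of LLM output that should contain one JSON object."""
    # Keep only the text between the first "{" and the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in LLM output")
    candidate = raw[start:end + 1]
    # Drop trailing commas that sit directly before a closing brace or bracket.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)


reply = 'Sure! {"title": "Hi", "tags": ["a", "b",],} Hope that helps.'
print(repair_json(reply))
```

The comma cleanup is deliberately naive: it would also alter a string value that happens to contain ",}" or ",]", so a production repair step needs more care.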
Running tests
Install pytest and run the suite:
pip install -r requirements.txt
pip install pytest
pytest
Tests also run automatically through GitHub Actions on every push and pull request.
Additional tests
The test suite now covers the CLI utilities as well as core components.
Additional tests live under tests/ and exercise:
- The AIExtractor prompt formatting logic.
- Error handling in PostProcessor.process when invalid JSON is returned.
- The _fetch_content method in BasicPreprocessor.
- run_pipeline success and error scenarios.
- Pipeline operation with a mocked LLM.
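A test with a mocked LLM might look roughly like the sketch below. To stay self-contained it does not import web2json: MiniPipeline, FakeLLM, the generate method name, and the prompt wording are all stand-ins invented for this example, so the repository's real tests will differ in detail.

```python
import json


class FakeLLM:
    """Test double that records prompts and returns a canned reply."""

    def __init__(self, reply):
        self.reply = reply
        self.prompts = []

    def generate(self, prompt):  # hypothetical method name
        self.prompts.append(prompt)
        return self.reply


class MiniPipeline:
    """Stand-in with the preprocess -> extract -> postprocess shape."""

    def __init__(self, preprocess, llm, postprocess):
        self.preprocess = preprocess
        self.llm = llm
        self.postprocess = postprocess

    def run(self, content):
        cleaned = self.preprocess(content)
        raw = self.llm.generate(f"Extract JSON from:\n{cleaned}")
        return self.postprocess(raw)


def test_pipeline_with_fake_llm():
    llm = FakeLLM('{"title": "Title"}')
    pipe = MiniPipeline(str.strip, llm, json.loads)
    assert pipe.run("  <h1>Title</h1>  ") == {"title": "Title"}
    # The fake model saw the cleaned content exactly once.
    assert llm.prompts == ["Extract JSON from:\n<h1>Title</h1>"]


test_pipeline_with_fake_llm()
```

Because the fake returns a fixed reply, the test runs without an Ollama server and stays deterministic.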