Convert web content into JSON using a local Ollama LLM

web2json

web2json converts web content into structured JSON using a local Ollama server. It exposes a simple command-line interface.

This repository began from code by abdo-Mansour and was adapted for use at the NOAA Global Systems Laboratory.

Installation

  1. Clone the repository.
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Optionally set OLLAMA_HOST and OLLAMA_MODEL to point to your Ollama instance and model.
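
To illustrate step 3, the sketch below shows one way a client could resolve these two variables. It is not taken from web2json: the function name `resolve_ollama_config` and the fallback model `llama3` are assumptions made for illustration, while the fallback host is Ollama's standard local address.

```python
import os

# Ollama's standard local endpoint.
DEFAULT_HOST = "http://localhost:11434"
# Hypothetical fallback model; web2json's real default is not stated in this README.
DEFAULT_MODEL = "llama3"


def resolve_ollama_config(env=None):
    """Return (host, model), preferring OLLAMA_HOST / OLLAMA_MODEL when set."""
    env = os.environ if env is None else env
    host = env.get("OLLAMA_HOST") or DEFAULT_HOST
    model = env.get("OLLAMA_MODEL") or DEFAULT_MODEL
    # Strip a trailing slash so later path joins stay predictable.
    return host.rstrip("/"), model


if __name__ == "__main__":
    host, model = resolve_ollama_config()
    print(f"Using model {model!r} at {host}")
```

Passing a plain dictionary instead of `os.environ` makes the lookup easy to exercise in tests.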

Command line usage

Run the CLI module with the content to process and your schema definition. The tool can also crawl multiple pages from a starting URL:

python -m web2json.cli --schema SCHEMA [--url] [--crawl] [--max_pages N] [--debug] [--output FILE] CONTENT
  • CONTENT can be a URL or raw text.
  • --url tells the tool to treat CONTENT as a URL.
  • --schema accepts the schema definition directly or the path to a file containing it. Schemas may be defined using simple field definitions, JSON Schema, or a Python BaseModel.
  • --crawl treats the content as a starting URL and processes each discovered page.
  • --max_pages limits how many pages are crawled when using --crawl (default: 10).
  • --debug prints the preprocessed content and other intermediate information to stderr.
  • --output writes the resulting JSON to FILE instead of only printing to stdout.
  • When a URL is provided, relative links in the page are converted to absolute URLs so they can be extracted correctly. The page URL itself is assigned to the url field if the schema defines that key. URLs missing from the LLM output may be filled in automatically by regex patterns in the post-processor; the default patterns cover download and preview links.
  • Character encoding is determined automatically when downloading pages so accented characters are preserved correctly.
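
The relative-link conversion mentioned above can be sketched with the standard library. This is an illustrative approximation, not web2json's preprocessor: the helper name `absolutize_links` is hypothetical, and a regex is used for brevity where a real implementation would more likely use an HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches href="..." or src="..." attributes (double- or single-quoted).
_LINK_ATTR = re.compile(r'''\b(href|src)=(["'])(.*?)\2''', re.IGNORECASE)


def absolutize_links(html, page_url):
    """Rewrite relative href/src values in html so they resolve against page_url."""
    def _fix(match):
        attr, quote, target = match.groups()
        # urljoin leaves already-absolute URLs unchanged.
        return f"{attr}={quote}{urljoin(page_url, target)}{quote}"

    return _LINK_ATTR.sub(_fix, html)


html = '<a href="/files/report.pdf">Report</a> <img src="img/logo.png">'
print(absolutize_links(html, "https://example.com/docs/index.html"))
```

Resolving links before the content reaches the LLM means the model only ever sees complete URLs and does not have to guess the site root.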

Example:

python -m web2json.cli https://example.com --url --schema "title: str = Page title"
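
The simple field definition syntax used in this example follows the shape `name: type = description`. The sketch below parses that shape into a plain dictionary to make the format concrete. It is an assumption-based illustration: `parse_simple_schema` is a hypothetical helper, and the real `parse_schema_input` may accept more forms or return a different structure.

```python
def parse_simple_schema(text):
    """Parse 'name: type = description' lines into a dict keyed by field name."""
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines between definitions
        name, sep, rest = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'name: type', got {line!r}")
        # The description after '=' is optional.
        type_name, _, description = rest.partition("=")
        fields[name.strip()] = {
            "type": type_name.strip(),
            "description": description.strip() or None,
        }
    return fields


print(parse_simple_schema("title: str = Page title\ncontent: str"))
```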

To crawl and process multiple pages under https://example.com/docs/:

python -m web2json.cli https://example.com/docs/ --crawl --schema "title: str"
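
A bounded crawl of this kind can be sketched as a breadth-first traversal. The README does not describe web2json's crawl policy, so the rules here are assumptions: only links under the starting URL are followed, fragments are ignored, and the `fetch` callable is a hypothetical stand-in for the page downloader. The default of 10 pages mirrors the documented --max_pages default.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl from start_url, collecting at most max_pages pages.

    fetch takes a URL and returns its HTML, or None if the page is unavailable.
    """
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for href in re.findall(r'href="([^"]+)"', html):
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "site" standing in for real HTTP requests.
site = {
    "https://example.com/docs/": '<a href="a.html">A</a> <a href="/blog/">Blog</a>',
    "https://example.com/docs/a.html": '<a href="b.html#top">B</a>',
    "https://example.com/docs/b.html": "<p>leaf</p>",
}
print(list(crawl("https://example.com/docs/", site.get, max_pages=2)))
```

Injecting `fetch` keeps the traversal logic testable without network access.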

The extracted JSON is printed to standard output. Unicode characters are preserved so accent marks appear correctly. Any schema validation errors are reported to standard error. When --debug is used, intermediate output such as the cleaned HTML is also sent to standard error.
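
The output behavior described above can be approximated as follows. This is a minimal sketch rather than the CLI's actual code: `emit_result` and its parameters are hypothetical names. The key detail is `ensure_ascii=False`, which makes `json.dumps` keep accented characters instead of escaping them.

```python
import json
import sys


def emit_result(data, errors=(), output_path=None):
    """Print JSON to stdout (keeping non-ASCII text) and report errors on stderr."""
    text = json.dumps(data, ensure_ascii=False, indent=2)
    print(text)
    if output_path is not None:
        # Also write a copy to disk, mirroring the --output option.
        with open(output_path, "w", encoding="utf-8") as handle:
            handle.write(text + "\n")
    for message in errors:
        print(f"validation error: {message}", file=sys.stderr)
    return text


emit_result({"title": "Météo à Genève"}, errors=["content: field required"])
```

Keeping diagnostics on standard error means the JSON on standard output can be piped directly into another tool.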

Library usage

The pipeline components are exposed as Python classes so you can build custom workflows.

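from web2json.cli import parse_schema_input
from web2json.preprocessor import BasicPreprocessor
from web2json.postprocessor import PostProcessor
from web2json.pipeline import Pipeline
from web2json.ai_extractor import OllamaLLMClient

# Build a schema from simple field definitions, one per line.
schema = parse_schema_input("title: str\ncontent: str")

# Assemble the three pipeline stages.
pre = BasicPreprocessor()
llm = OllamaLLMClient()
# link_patterns maps a field name to a regex used to recover its URL.
post = PostProcessor(link_patterns={"preview": r"(https?://[^\s]+\.mp4)"})
pipe = Pipeline(pre, llm, post)

# The second argument is False because the input here is raw HTML, not a URL
# (assumed meaning, by analogy with the CLI's --url flag).
result = pipe.run("<h1>Title</h1>", False, schema)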

The link_patterns option helps recover URLs when the LLM omits them from the output JSON.
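
The recovery step can be pictured with a small sketch. This is an assumption about how such a fallback might work, not the PostProcessor implementation: `fill_missing_links` is a hypothetical helper that searches the source text only for fields the LLM left empty.

```python
import re


def fill_missing_links(data, source_text, link_patterns):
    """Fill empty or absent keys in data with the first regex match in source_text."""
    filled = dict(data)
    for key, pattern in link_patterns.items():
        if filled.get(key):
            continue  # the LLM already supplied a value
        match = re.search(pattern, source_text)
        if match:
            # Use the first capture group when present, else the whole match.
            filled[key] = match.group(1) if match.groups() else match.group(0)
    return filled


text = "Watch the clip at https://example.com/media/intro.mp4 for details."
patterns = {"preview": r"(https?://[^\s]+\.mp4)"}
print(fill_missing_links({"title": "Intro", "preview": None}, text, patterns))
```

Values the model did return are never overwritten, so the regex acts purely as a fallback.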

Code overview

  1. Preprocessor - cleans and normalizes HTML or text input.
  2. AIExtractor - sends a prompt to the LLM and returns the raw JSON text.
  3. PostProcessor - repairs malformed JSON and adds missing URLs.

These pieces are wired together by the Pipeline class and driven by the CLI script.
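
The three-stage flow, including the JSON repair step, can be sketched without the library. Everything here is illustrative: `repair_json`, `run_stages`, and `fake_llm` are hypothetical names, and the repair heuristics (cutting to the outermost braces and dropping trailing commas) are one simple approach, not necessarily what PostProcessor does.

```python
import json
import re


def repair_json(raw):
    """Best-effort parse of LLM output that should contain one JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM output")
    candidate = raw[start:end + 1]
    # Drop trailing commas before a closing brace or bracket.
    # Caveat: this would also alter such sequences inside string values.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)


def run_stages(content, preprocess, extract, postprocess):
    """Chain the stages: clean the input, query the LLM, repair the output."""
    return postprocess(extract(preprocess(content)))


def fake_llm(prompt):
    # Stand-in for the LLM: wraps the JSON in chatter and leaves a trailing comma.
    return 'Sure! Here is the JSON: {"title": "Title",} Hope that helps.'


print(run_stages("  <h1>Title</h1>  ", str.strip, fake_llm, repair_json))
```

Passing each stage as a callable mirrors the way the components are composed, and lets a stub replace the LLM during testing.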

Running tests

Install pytest and run the suite:

pip install -r requirements.txt
pip install pytest
pytest

Tests also run automatically through GitHub Actions on every push and pull request.

Additional tests

The test suite covers the CLI utilities as well as the core components. The tests live under tests/ and exercise:

  • The AIExtractor prompt formatting logic.
  • Error handling in PostProcessor.process when invalid JSON is returned.
  • The _fetch_content method in BasicPreprocessor.
  • run_pipeline success and error scenarios.
  • Pipeline operation with a mocked LLM.
