
Project description

ainfo


gather structured information from any website - ready for LLMs

Architecture

The project separates concerns into distinct modules:

  • fetching – obtain raw data from a source
  • parsing – transform raw data into a structured form
  • extraction – pull relevant information from the parsed data
  • output – handle presentation of the extracted results
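Each stage maps onto a public function, so a full pipeline is only a few lines of Python (the same calls are covered in detail under Programmatic API below):

from ainfo import fetch_data, parse_data
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")           # fetching
doc = parse_data(html, url="https://example.com")  # parsing
contacts = AVAILABLE_EXTRACTORS["contacts"](doc)   # extraction
print(contacts)                                    # output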

Usage

Command line

Install the project and run the CLI against a URL:

pip install ainfo
ainfo run https://example.com

The command fetches the page, parses its content and prints the page text. Specify one or more built-in extractors with --extract to pull extra information. For example, to collect contact details and hyperlinks:

ainfo run https://example.com --extract contacts --extract links

Available extractors include:

  • contacts – emails, phone numbers, addresses and social profiles
  • links – all hyperlinks on the page
  • headings – text of headings (h1–h6)

Use --json to emit machine-readable JSON instead of the default human-friendly format. The JSON keys mirror the selected extractors, with text included by default. Pass --no-text when you only need the extraction results. Retrieve the JSON schema for contact details with ainfo.output.json_schema.
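For example, to emit only the contact extraction results as JSON:

ainfo run https://example.com --extract contacts --json --no-text

The output is keyed by extractor name, along the lines of {"contacts": {"emails": [...], ...}} (an illustrative shape, not a stable schema).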

For use within an existing asyncio application, the package exposes an async_fetch_data coroutine:

import asyncio
from ainfo import async_fetch_data

async def main():
    # Await the raw HTML without blocking the event loop.
    html = await async_fetch_data("https://example.com")
    print(html[:60])  # first 60 characters of the page

asyncio.run(main())
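Because async_fetch_data is an ordinary coroutine, several pages can be fetched concurrently with asyncio.gather:

import asyncio
from ainfo import async_fetch_data

async def main():
    urls = ["https://example.com", "https://example.org"]
    # Results come back in the same order as the input URLs.
    pages = await asyncio.gather(*(async_fetch_data(u) for u in urls))
    for url, html in zip(urls, pages):
        print(url, len(html))

asyncio.run(main())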

To delegate information extraction or summarisation to an LLM, provide an OpenRouter API key via the OPENROUTER_API_KEY environment variable and pass --use-llm or --summarize:

export OPENROUTER_API_KEY=your_key
ainfo run https://example.com --use-llm --summarize

If the target site relies on client-side JavaScript, enable rendering with a headless browser:

ainfo run https://example.com --render-js

To crawl multiple pages starting from a URL and optionally run extractors on each page:

ainfo crawl https://example.com --depth 2 --extract contacts

The crawler visits pages breadth-first up to the specified depth and prints results for every page encountered. Pass --json to output the aggregated results as JSON instead.
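For example, to store the aggregated crawl results for later processing:

ainfo crawl https://example.com --depth 2 --extract contacts --json > results.json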

Both commands accept --render-js to execute JavaScript before scraping. Rendering uses Playwright, so you may first need to install its browser drivers by running playwright install.

Utilities chunk_text and stream_chunks are available to break large pages into manageable pieces when sending content to LLMs.

Programmatic API

Most components can also be used directly from Python. Fetch and parse a page, then run the extractors yourself:

from ainfo import fetch_data, parse_data, extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

# Contact details via built-in extractor
contacts = AVAILABLE_EXTRACTORS["contacts"](doc)

# All links
links = AVAILABLE_EXTRACTORS["links"](doc)

# Any additional data via regular expressions
extra = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})
print(contacts.emails, extra["prices"])

Serialise results with to_json or inspect the JSON schema with json_schema(ContactDetails).
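A short sketch of both helpers, continuing from the example above (json_schema lives in ainfo.output as noted earlier; the to_json import path and the ContactDetails model location are assumptions):

from ainfo.output import json_schema, to_json  # to_json location assumed
from ainfo.models import ContactDetails        # model location assumed

print(to_json(contacts))            # serialise the extraction result above
print(json_schema(ContactDetails))  # JSON schema for contact details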

Custom extractors

Define your own extractor by writing a function that accepts a Document and registering it in ainfo.extractors.AVAILABLE_EXTRACTORS.

# my_extractors.py
from ainfo.models import Document
from ainfo.extraction import extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS

def extract_prices(doc: Document) -> list[str]:
    data = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})
    return data.get("prices", [])

AVAILABLE_EXTRACTORS["prices"] = extract_prices

After importing my_extractors, your extractor becomes available on the command line:

ainfo run https://example.com --extract prices --no-text
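The registered extractor works from Python as well, following the same pattern as the built-ins:

import my_extractors  # registering "prices" happens as an import side effect
from ainfo import fetch_data, parse_data
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")
print(AVAILABLE_EXTRACTORS["prices"](doc))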

LLM-based extraction

extract_custom can also delegate to a large language model. Supply an LLMService and a prompt describing the desired output:

from ainfo import fetch_data, parse_data
from ainfo.extraction import extract_custom
from ainfo.llm_service import LLMService

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

with LLMService() as llm:
    data = extract_custom(
        doc,
        llm=llm,
        prompt="List all products with their prices as JSON under 'products'",
    )
print(data["products"])

Workflow examples

Save contact details to JSON

pip install ainfo
ainfo run https://example.com --extract contacts --json --no-text > contacts.json

Summarize a large page with chunk_text

from ainfo import fetch_data, parse_data, chunk_text
from some_llm import summarize  # pseudo-code

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

parts = [summarize(chunk) for chunk in chunk_text(doc.text_content(), 1000)]
print(" ".join(parts))

Stream chunks on the fly

Fetch and chunk a page directly by URL or pass in raw text:

from ainfo import stream_chunks

for chunk in stream_chunks("https://example.com", size=1000):
    handle(chunk)  # send to LLM or other processor
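To chunk text you already have in memory, pass it in place of the URL (a sketch, assuming stream_chunks accepts raw text directly, as the description above suggests):

from ainfo import stream_chunks

text = open("report.txt").read()
for chunk in stream_chunks(text, size=1000):
    handle(chunk)  # send to LLM or other processor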

Environment configuration

Copy .env.example to .env and fill in OPENROUTER_API_KEY, OPENROUTER_MODEL, and OPENROUTER_BASE_URL to enable LLM-powered features.
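A minimal .env sketch (the model name and base URL below are placeholders, not documented defaults):

OPENROUTER_API_KEY=your_key
OPENROUTER_MODEL=openai/gpt-4o-mini
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1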

Limitations

  • The built-in extract_information targets contact and social media details. Use extract_custom for other patterns or implement your own domain-specific extractors.
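A sketch of the built-in (assuming extract_information takes a parsed Document, mirroring the extractors above; its exact signature is not documented here):

from ainfo import fetch_data, parse_data, extract_information

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")
print(extract_information(doc))  # contact and social media details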

Download files

Source Distribution

ainfo-0.2.2.tar.gz (20.3 kB)

Built Distribution

ainfo-0.2.2-py3-none-any.whl (22.6 kB)

File details

Details for the file ainfo-0.2.2.tar.gz.

File metadata

  • Download URL: ainfo-0.2.2.tar.gz
  • Size: 20.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ainfo-0.2.2.tar.gz:

  • SHA256: 150c1eb917519f81e17245e78f2133e6f4edaaa842c285c5b15cd5254a7f87c7
  • MD5: 4a366574a57beedc3b1be7745ba54396
  • BLAKE2b-256: 46638b20b56d07ffe72c7b97e20c9b9bfa1563448ee1cd19cbfa134d1f776951

Provenance

Attestation bundles for ainfo-0.2.2.tar.gz were published by the python-publish.yml workflow on MisterXY89/ainfo.

File details

Details for the file ainfo-0.2.2-py3-none-any.whl.

File metadata

  • Download URL: ainfo-0.2.2-py3-none-any.whl
  • Size: 22.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ainfo-0.2.2-py3-none-any.whl:

  • SHA256: 005b49845d55a132da2c42c347b6a6b2f2663919882b67d5d3a5ebd5f13b09b0
  • MD5: d6f4a81ef22c766975951d471b6fcf21
  • BLAKE2b-256: 03cd7a411c198b5edb5b42127a69d50bf337a5473f0b8a902f7564e5ca0c9fc0

Provenance

Attestation bundles for ainfo-0.2.2-py3-none-any.whl were published by the python-publish.yml workflow on MisterXY89/ainfo.
