# ainfo

Gather structured information from any website – ready for LLMs.
## Architecture

The project separates concerns into distinct modules:

- `fetching` – obtain raw data from a source
- `parsing` – transform raw data into a structured form
- `extraction` – pull relevant information from the parsed data
- `output` – handle presentation of the extracted results
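The four stages compose into a straight pipeline. The toy sketch below illustrates the flow only – none of these functions are the real `ainfo` implementations, and the HTML sample is made up:

```python
import re

def fetch(source: str) -> str:
    # Stand-in fetcher: a real one would perform an HTTP request to `source`.
    return "<html><body><a href='mailto:info@example.com'>mail</a></body></html>"

def parse(raw: str) -> dict:
    # Stand-in parser: a real one would build a DOM-like document.
    return {"raw": raw}

def extract(doc: dict) -> dict:
    # Stand-in extractor: pull email addresses out of the parsed document.
    return {"emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", doc["raw"])}

def output(results: dict) -> str:
    # Stand-in output stage: render results as a comma-separated string.
    return ", ".join(results["emails"])

print(output(extract(parse(fetch("https://example.com")))))  # info@example.com
```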
## Usage

### Command line

Install the project and run the CLI against a URL:

```shell
pip install -e .
ainfo run https://example.com
```
The command fetches the page, parses its content and prints the page text.
Specify one or more built-in extractors with `--extract` to pull extra
information. For example, to collect contact details and hyperlinks:

```shell
ainfo run https://example.com --extract contacts --extract links
```
Available extractors include:

- `contacts` – emails, phone numbers, addresses and social profiles
- `links` – all hyperlinks on the page
- `headings` – text of headings (h1–h6)
Use `--json` to emit machine-readable JSON instead of the default
human-friendly format. The JSON keys mirror the selected extractors, with
`text` always included. Retrieve the JSON schema for contact details with
`ainfo.output.json_schema`.
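Given those rules, the JSON for the `--extract contacts --extract links` invocation above might be shaped roughly like this (field names inside `contacts` are illustrative assumptions, not the documented schema):

```json
{
  "text": "Example Domain ...",
  "contacts": {"emails": ["info@example.com"], "phone_numbers": []},
  "links": ["https://www.iana.org/domains/example"]
}
```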
For use within an existing asyncio application, the package exposes an
`async_fetch_data` coroutine:

```python
import asyncio

from ainfo import async_fetch_data

async def main():
    html = await async_fetch_data("https://example.com")
    print(html[:60])

asyncio.run(main())
```
To delegate information extraction or summarisation to an LLM, provide an
OpenRouter API key via the `OPENROUTER_API_KEY` environment variable and pass
`--use-llm` or `--summarize`:

```shell
export OPENROUTER_API_KEY=your_key
ainfo run https://example.com --use-llm --summarize
```
If the target site relies on client-side JavaScript, enable rendering with a
headless browser:

```shell
ainfo run https://example.com --render-js
```
To crawl multiple pages starting from a URL and optionally run extractors on
each page:

```shell
ainfo crawl https://example.com --depth 2 --extract contacts
```
The crawler visits pages breadth-first up to the specified depth and prints
results for every page encountered. Pass `--json` to output the aggregated
results as JSON instead.

Both commands accept `--render-js` to execute JavaScript before scraping,
which uses Playwright. Installing the browser drivers may require running
`playwright install`.
Utilities `chunk_text` and `stream_chunks` are available to break large
pages into manageable pieces when sending content to LLMs.
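Conceptually, chunking just slices a long string into bounded pieces. A minimal sketch of the idea (the real `chunk_text` may split on word or sentence boundaries rather than fixed offsets):

```python
def chunk_text_sketch(text: str, size: int):
    """Yield successive pieces of at most `size` characters.

    Illustrative only; not the actual ainfo implementation.
    """
    for start in range(0, len(text), size):
        yield text[start:start + size]

pieces = list(chunk_text_sketch("a" * 2500, 1000))
print([len(p) for p in pieces])  # [1000, 1000, 500]
```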
## Programmatic API

Most components can also be used directly from Python. Fetch and parse a
page, then run the extractors yourself:

```python
from ainfo import fetch_data, parse_data, extract_custom
from ainfo.extractors import AVAILABLE_EXTRACTORS

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")

# Contact details via built-in extractor
contacts = AVAILABLE_EXTRACTORS["contacts"](doc)

# All links
links = AVAILABLE_EXTRACTORS["links"](doc)

# Any additional data via regular expressions
extra = extract_custom(doc, {"prices": r"\$\d+(?:\.\d{2})?"})

print(contacts.emails, extra["prices"])
```
Serialise results with `to_json` or inspect the JSON schema with
`json_schema(ContactDetails)`.
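The price pattern passed to `extract_custom` above is plain `re` syntax, so it can be checked standalone before wiring it into an extractor (the sample text here is made up):

```python
import re

# Same pattern as in the extract_custom example: matches "$15" or "$4.50".
price_pattern = r"\$\d+(?:\.\d{2})?"

sample = "Tickets cost $15, drinks are $4.50 and parking is free."
prices = re.findall(price_pattern, sample)
print(prices)  # ['$15', '$4.50']
```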
## Workflow examples

### Save contact details to JSON

```shell
pip install -e .
ainfo run https://example.com --extract contacts --json > contacts.json
```
### Summarize a large page with `chunk_text`

```python
from ainfo import fetch_data, parse_data, chunk_text
from some_llm import summarize  # pseudo-code: substitute your LLM client

html = fetch_data("https://example.com")
doc = parse_data(html, url="https://example.com")
parts = [summarize(chunk) for chunk in chunk_text(doc.text_content(), 1000)]
print(" ".join(parts))
```
### Stream chunks on the fly

Fetch and chunk a page directly by URL or pass in raw text:

```python
from ainfo import stream_chunks

for chunk in stream_chunks("https://example.com", size=1000):
    handle(chunk)  # send to LLM or other processor
```
## Environment configuration

Copy `.env.example` to `.env` and fill in `OPENROUTER_API_KEY`,
`OPENROUTER_MODEL`, and `OPENROUTER_BASE_URL` to enable LLM-powered features.
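A filled-in `.env` might look like the following. The values are placeholders: the model identifier and base URL shown here are examples of OpenRouter conventions, not values prescribed by this project:

```
OPENROUTER_API_KEY=your_key
OPENROUTER_MODEL=openrouter/auto
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
```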
## Limitations

- The built-in `extract_information` targets contact and social media
  details. Use `extract_custom` for other patterns or implement your own
  domain-specific extractors.