🔍 tofu-search
Multi-engine web search + content fetching with optional LLM filtering — a standalone Python library extracted from the Tofu AI assistant.
This is a full re-extraction that keeps 100% of Tofu's current search/fetch capabilities: every engine, the structured "vertical" lookups, one-hop deepening, the SPA/bot-protection Playwright fallback, authenticated-source fetching, and the host-browser fallback. The last two are exposed through optional provider seams, so the library stays dependency-free when used standalone.
Features
- Multi-engine search (parallel): DuckDuckGo (HTML + API), Brave, Bing, SearXNG, Marginalia — plus Xiaohongshu when an auth-source provider supplies a logged-in session.
- Vertical / structured search: auto-detects CVE IDs, arXiv IDs, DOIs, stock tickers, PyPI/npm packages, GitHub repos, IP addresses, Hugging Face daily papers, and Semantic Scholar related-work — answered from the relevant free API alongside web results.
- Content deduplication: Jaccard similarity on shingles (CJK + Latin aware).
- Concurrent page fetching: Race-to-N strategy with SSL fallback + a per-domain circuit breaker.
- One-hop deepening (opt-in): follow the best query-relevant outbound links one hop deeper, bounded like a crawl budget.
- LLM content filter (optional): relevance verdict + noise removal. When no LLM is configured the step is silently skipped (raw text returned as-is).
- BM25 reranking: pure-Python, no external API calls.
- SPA / bot-protection support: optional Playwright fallback for JS-rendered and challenge pages.
- PDF extraction: optional pymupdf / pymupdf4llm integration.
- Host integration seams: register a browser provider (fetch/search via a real browser the user controls) and an auth-source provider (cookies/proxy for login-walled domains) — both no-ops by default.
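The content-deduplication bullet above can be illustrated with a minimal sketch. It is not the library's implementation: the tokenizer (one token per CJK character, one per ASCII word), the shingle size `k=2`, the `0.7` threshold, and the helper names `tokenize`, `shingles`, `jaccard`, and `dedupe` are all assumptions made for illustration.

```python
import re

# Assumption: CJK text is split per character, Latin text per ASCII word.
_CJK = r"\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af"
_TOKEN_RE = re.compile(rf"[{_CJK}]|[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and return CJK characters and ASCII words."""
    return _TOKEN_RE.findall(text.lower())


def shingles(text, k=2):
    """Return the set of k-token shingles for a piece of text."""
    tokens = tokenize(text)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe(results, threshold=0.7):
    """Keep a result only if it is not too similar to one already kept."""
    kept = []  # list of (result, shingle set)
    for r in results:
        s = shingles(f"{r.get('title', '')} {r.get('snippet', '')}")
        if all(jaccard(s, seen) < threshold for _, seen in kept):
            kept.append((r, s))
    return [r for r, _ in kept]
```

Splitting CJK text per character avoids the need for a word segmenter, at the cost of coarser shingles than a real segmenter would give.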
Quick Start
pip install tofu-search
Basic search (no LLM required)
from tofu_search import search
results = search("Python asyncio tutorial")
for r in results:
    print(f"{r['title']}: {r['url']}")
    if r.get('full_content'):
        print(f" {r['full_content'][:200]}...")
With OpenAI content filtering
from tofu_search import search, configure
configure(
    llm_api_key="sk-...",
    llm_base_url="https://api.openai.com/v1",
    llm_model="gpt-4o-mini",
)
results = search("Python asyncio tutorial")
With a custom LLM callable
from tofu_search import search, configure
def my_llm(messages, **kwargs):
    # Your LLM call: receives OpenAI-format messages.
    # kwargs may include: stop, temperature, timeout
    return "response text"
configure(llm_function=my_llm)
results = search("Python asyncio tutorial")
Fetch a single URL
from tofu_search import fetch_url
content = fetch_url("https://example.com")
if content:
    print(f"Got {len(content)} characters")
Vertical (structured-identifier) search
from tofu_search import detect_vertical_intent, search_vertical
domain, identifier, params = detect_vertical_intent("CVE-2021-44228")
record = search_vertical(domain, identifier, params)
print(record['content']) # CVSS score, description, references from NVD
# Or force a domain-level fan-out (free-text → Hugging Face + Semantic Scholar):
from tofu_search import search_vertical_domain
print(search_vertical_domain('academic', 'mamba state space models')['content'])
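To make the idea of identifier auto-detection concrete, here is a rough sketch of how a query could be matched against a few identifier patterns. It does not reproduce `detect_vertical_intent`: the function name `sketch_detect_intent`, the kind labels, the two-element return value, and the regular expressions are all assumptions for illustration.

```python
import re

# Hypothetical patterns, checked in order. DOI comes before arXiv because
# some DOIs embed an arXiv-style number.
_PATTERNS = [
    ("cve", re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)),
    ("doi", re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")),
    ("arxiv", re.compile(r"\b\d{4}\.\d{4,5}(?:v\d+)?\b")),
]


def sketch_detect_intent(query):
    """Return (kind, identifier) for the first matching pattern, else None."""
    for kind, pattern in _PATTERNS:
        match = pattern.search(query)
        if match:
            identifier = match.group(0)
            if kind == "cve":
                identifier = identifier.upper()  # normalise "cve-..." input
            return kind, identifier
    return None
```

A query with no recognisable identifier returns `None`, which a caller could treat as "plain web search only".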
Host integration (provider seams)
The standalone library never imports a host application. To unlock the two host-only capabilities, register a provider. The dependency points inward (host → library), exactly like a plugin.
from tofu_search import (
    BrowserProvider, AuthSourceProvider,
    register_browser_provider, register_auth_source_provider,
)

class MyBrowser(BrowserProvider):
    def is_connected(self): return True
    def fetch_url(self, url, *, max_chars=None, timeout=15): ...
    def search(self, query, *, max_results=8): ...

class MyAuth(AuthSourceProvider):
    def match_source(self, url): ...  # → {'domain','cookies','proxy',...} | None
    def get_source(self, domain): ...

register_browser_provider(MyBrowser())  # last-resort fetch/search fallback
register_auth_source_provider(MyAuth())  # cookies for login-walled domains
When no provider is registered, the browser fallback and authenticated fetch paths are inert no-ops — the anonymous HTTP + Playwright pipeline runs as normal.
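The no-op default can be pictured with a small, self-contained sketch of the seam pattern. Everything here is hypothetical (`BrowserSeam`, `register`, `fetch_with_fallback`, and the `http_fetch` callable are illustrative names, not tofu-search internals); it only shows why an unregistered provider leaves the normal pipeline untouched.

```python
class BrowserSeam:
    """Hypothetical base provider whose defaults do nothing."""

    def is_connected(self):
        return False

    def fetch_url(self, url, *, max_chars=None, timeout=15):
        return None


_provider = BrowserSeam()  # inert default until a host registers its own


def register(provider):
    """Swap in a host-supplied provider (the host depends on the library)."""
    global _provider
    _provider = provider


def fetch_with_fallback(url, http_fetch):
    """Try the ordinary fetcher first; consult the provider only as a last resort."""
    text = http_fetch(url)
    if text:
        return text
    if _provider.is_connected():
        return _provider.fetch_url(url)
    return None
```

Because the default provider reports `is_connected()` as `False`, the fallback branch is never taken unless a host explicitly registers something else.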
Configuration
from tofu_search import configure
configure(
    # Search / fetch settings
    fetch_top_n=6,                  # Max results to return
    fetch_timeout=15,               # HTTP timeout per request (seconds)
    fetch_max_chars_search=60000,   # Max chars per page in search results
    fetch_max_chars_direct=200000,  # Max chars for direct fetch_url()
    # LLM settings (for content filter)
    llm_api_key="sk-...",
    llm_base_url="https://api.openai.com/v1",
    llm_model="gpt-4o-mini",
    # Or a custom callable instead:
    # llm_function=my_callable,
    # Filter settings
    filter_enabled=True,            # Enable/disable LLM filter
    filter_min_chars=3000,          # Min chars to trigger LLM filter
)
Many settings also read from environment variables: FETCH_TOP_N,
FETCH_TIMEOUT, FETCH_MAX_CHARS_SEARCH, FETCH_MAX_CHARS_DIRECT,
FETCH_MAX_CHARS_PDF, FETCH_MAX_BYTES. One-hop deepening is enabled with
SEARCH_DEEPEN_HOPS=1 (or per call: perform_web_search(..., deepen=True)).
Semantic Scholar raises its rate limit with SEMANTIC_SCHOLAR_API_KEY.
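As a rough illustration of environment-backed settings, the sketch below resolves a value from an explicit override, then an environment variable, then a fallback. The precedence order, the helper name `resolve_settings`, and the fallback numbers (copied from the `configure()` example above, not confirmed defaults) are assumptions; the library's own resolution logic may differ.

```python
import os

# Illustrative fallbacks taken from the configure() example, not confirmed defaults.
_FALLBACKS = {
    "fetch_top_n": 6,
    "fetch_timeout": 15,
    "fetch_max_chars_search": 60000,
    "fetch_max_chars_direct": 200000,
}


def resolve_settings(overrides=None, environ=None):
    """Assumed precedence: explicit override > environment variable > fallback."""
    environ = os.environ if environ is None else environ
    settings = {}
    for key, fallback in _FALLBACKS.items():
        value = fallback
        raw = environ.get(key.upper())  # e.g. fetch_top_n -> FETCH_TOP_N
        if raw is not None:
            try:
                value = int(raw)
            except ValueError:
                pass  # ignore malformed values and keep the fallback
        settings[key] = value
    settings.update(overrides or {})
    return settings
```

Passing `environ` explicitly keeps the function easy to test without touching the real process environment.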
Pipeline
perform_web_search runs an overlapping streaming pipeline:
1. Multi-engine search: engines fire in parallel; each engine's URLs are deduped and submitted to the fetch pool the moment they arrive (the first page fetch starts before slow engines finish).
2. URL dedup: scheme/trailing-slash-insensitive keys.
3. Content dedup: Jaccard similarity on title+snippet shingles.
4. Page fetch: concurrent HTTP with race-to-N; SSL retry, circuit breaker, Playwright fallback for SPA/bot-protection pages.
   4b. Deepen (opt-in): one hop along the best query-relevant links.
5. LLM content filter (optional): relevance verdict + noise removal.
6. BM25 rerank: score documents against the query, select top-N.
Step 5 is automatically skipped when no LLM is configured.
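The final rerank step can be sketched in pure Python with the standard Okapi BM25 formula. This is a minimal illustration, not the library's code: the function name `bm25_rank`, the ASCII-only tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions.

```python
import math
import re
from collections import Counter


def _tokens(text):
    """Assumed tokenizer: lowercase ASCII words only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices ordered from most to least relevant."""
    tokenized = [_tokens(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    df = Counter()  # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    query_terms = set(_tokens(query))
    scored = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, i))
    # Highest score first; ties keep the original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]
```

Selecting the top-N results is then a slice of the returned index list, for example `bm25_rank(query, docs)[:6]`.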
Optional Dependencies
# SPA / JS-rendered page support
pip install tofu-search[playwright]
python -m playwright install chromium
# PDF extraction
pip install tofu-search[pdf]
# Everything
pip install tofu-search[all]
Or just run ./install.sh (see below).
Install script
./install.sh # core deps
./install.sh --all # core + playwright + pdf, and installs chromium
./install.sh --playwright
./install.sh --pdf
License
MIT
File details
Details for the file tofu_search-0.2.0.tar.gz.
File metadata
- Download URL: tofu_search-0.2.0.tar.gz
- Upload date:
- Size: 82.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b082d2901242f6645218d54c87b90d6c1bbf9c5851e03544895770156ba8840f |
| MD5 | 977e8f0d3b07dfc1d8201d7daa605a65 |
| BLAKE2b-256 | f9f290c2e5acbeafa2b36a154c8b916fde7a23d369f678d7759b33f5c8d8c06a |
File details
Details for the file tofu_search-0.2.0-py3-none-any.whl.
File metadata
- Download URL: tofu_search-0.2.0-py3-none-any.whl
- Upload date:
- Size: 94.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.17
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4404c2ef288d0d3b52d708335b9a340873a634cbdcd37b8f9cfd512068043bfa |
| MD5 | c05271f7e65eebe41f1203379732f94e |
| BLAKE2b-256 | 4fd07b76b8a1e98eebf45076444c7c5760e65357a100ca7288e461ef873e73c7 |