ReAct agent that searches, scrapes, and reasons across the web — Gemini-powered with model fallback chain
Web Research Agent
A from-scratch implementation of the ReAct (Reasoning and Acting) paradigm for autonomous web research. No LangChain, no agent frameworks — just a minimal, well-structured reasoning loop with real tool use.
Originally built as a take-home coding challenge before "deep research" became a product category. The focus was on understanding where the ReAct loop actually breaks on adversarial research tasks, and fixing those failure modes one by one.
Install
pip install web-research-agent
Optional extras:
pip install "web-research-agent[providers]" # Groq / OpenRouter fallback via openai package
pip install "web-research-agent[browser]" # JS-rendered page scraping via Playwright
pip install "web-research-agent[all]" # providers + browser
Requires Python 3.8+.
Setup
webresearch
On first run, an interactive setup wizard prompts for API keys and stores them securely in the system keyring (no .env files needed):
- Gemini API key — Google AI Studio (free tier)
- Serper API key — Serper.dev (free tier: 2,500 searches/month)
Optional fallback providers (used if Gemini quota is exhausted):
- Groq API key — fast inference, free tier
- OpenRouter API key — multi-model routing
- Ollama base URL — local model serving
Keys can be reconfigured at any time from the [6] reconfigure keys menu option.
Usage
Interactive TUI
webresearch
Menu options:
| Key | Action |
|---|---|
| [1] | Run a single research query |
| [2] | Deep research (parallel sub-queries) |
| [3] | Process a task file (batch) |
| [4] | View query history |
| [5] | View execution logs |
| [6] | Reconfigure API keys |
| [7] | Clear conversation memory |
| [q] | Exit |
The live research panel shows a spinner, elapsed time, and contextual progress phrases keyed to the current tool in use (searching / scraping / running code / writing file). Phrases rotate every 6 seconds so they're readable without flickering.
Multi-line queries are supported in both [1] and [2] modes. End your first line with : to enter continuation mode — subsequent lines are collected until you press Enter on a blank line:
❯ Research question: Compile a list of companies matching:
(continuing — press Enter on a blank line to finish)
- Based in the EU
- Revenue > €1B in 2023
- Motor vehicle sector
[blank line]
Single-line queries work exactly as before — type and press Enter once.
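The continuation rule above can be sketched as a small function over an iterable of input lines. This is an illustration only: `collect_query` is a hypothetical name, and joining the collected lines with newlines is an assumption, not the TUI's confirmed behaviour.

```python
from typing import Iterable


def collect_query(lines: Iterable[str]) -> str:
    """Collect one query; a first line ending in ':' opens continuation mode."""
    it = iter(lines)
    first = next(it, "").rstrip()
    if not first.endswith(":"):
        return first  # single-line query: one Enter submits it
    collected = [first]
    for line in it:
        if not line.strip():
            break  # a blank line ends continuation mode
        collected.append(line.rstrip())
    # Assumption: continuation lines are joined with newlines.
    return "\n".join(collected)


example = collect_query([
    "Compile a list of companies matching:",
    "- Based in the EU",
    "- Motor vehicle sector",
    "",
])
```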
Batch Mode
python main.py tasks.txt -o results.txt -v
Task file format — one task per block, separated by blank lines:
Find the name of the COO of the organization that mediated secret talks
between US and Chinese AI companies in Geneva in 2023.

By what percentage did Volkswagen reduce the sum of their Scope 1 and
Scope 2 greenhouse gas emissions in 2023 compared to 2021?
Python API
from webresearch import initialize_agent
agent = initialize_agent()
result = agent.run("Your research question here")
print(result)
Architecture
The agent implements the ReAct paradigm (Yao et al., 2023):
Thought → Action → Observation → [repeat] → Final Answer
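As a rough illustration of that cycle, the sketch below runs a loop against any callable that returns the next step as text. It is not the code in agent.py: `react_loop`, the regexes, and the step format (taken from the traces shown in this README) are assumptions made for the example.

```python
import json
import re
from typing import Callable, Dict, List

# Assumed step format: "Action: <tool> {json args}" or "Final Answer: <text>".
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*(\{.*\})", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[..., str]],
    task: str,
    max_iterations: int = 15,
) -> str:
    """Repeat Thought -> Action -> Observation until a Final Answer appears."""
    history: List[str] = []
    for _ in range(max_iterations):
        step = llm(task + "\n" + "\n".join(history))
        history.append(step)

        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip()

        action = ACTION_RE.search(step)
        if not action:
            history.append("Observation: could not parse an Action.")
            continue

        name, raw_args = action.group(1), action.group(2)
        if name not in tools:
            history.append(f"Observation: unknown tool '{name}'.")
            continue
        try:
            observation = tools[name](**json.loads(raw_args))
        except (json.JSONDecodeError, TypeError) as exc:
            observation = f"tool call failed: {exc}"
        history.append(f"Observation: {observation}")
    return "No final answer within the iteration budget."


# Usage with a scripted stand-in for the model.
scripted = iter([
    'Thought: I need to look this up.\nAction: lookup {"key": "France"}',
    "Thought: I have the answer.\nFinal Answer: Paris",
])
answer = react_loop(
    lambda prompt: next(scripted),
    {"lookup": lambda key: {"France": "Paris"}[key]},
    "What is the capital of France?",
)
```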
webresearch/
├── agent.py # ReAct loop, step parsing, sliding-window prompt history
├── llm.py # Gemini LLM interface
├── llm_compat.py # OpenAI-compatible interface (Groq, OpenRouter, Ollama)
├── llm_chain.py # Model fallback chain with thread-safe provider rotation
├── config.py # Configuration (env vars + keyring)
├── credentials.py # Keyring-backed secure credential storage
├── memory.py # Conversation memory (within-session Q&A context)
├── parallel.py # Parallel deep research: decomposes task → fan-out → synthesize
└── tools/
    ├── base.py # Tool abstract base class
    ├── think.py # Reasoning scratchpad: no external call, pure planning/verification
    ├── search.py # Serper.dev web search
    ├── scrape.py # HTTP + BeautifulSoup; tables → markdown, encoding fix, 5xx retry
    ├── pdf.py # pdfplumber PDF extraction with table parsing and page targeting
    ├── browser.py # Playwright JS-rendered scraping
    ├── code_executor.py # Sandboxed Python subprocess
    └── file_ops.py # Read/write for cross-step data persistence
The Think Tool and the ReAct Reasoning Gap
The standard ReAct loop has a structural constraint: every Thought must be followed by an Action that calls an external tool. There is no way to have a reasoning step that is the action — thinking is always a preamble to doing something.
This works well on single-hop lookups. It breaks on multi-hop queries where the agent needs to plan before acting, or where it needs to stop and verify whether an entity found in results actually matches the task description before spending more iterations on it.
The failure mode without think:
Task: "Find the COO of the organization that mediated secret AI talks in Geneva in 2023"
Step 1 Thought: I'll search for organizations that mediated AI talks in Geneva.
Action: search {"query": "organization mediated US China AI talks Geneva 2023"}
Obs: Results mention WEF, UN Geneva offices, diplomatic bodies...
Step 2 Thought: The World Economic Forum is based in Geneva and hosts AI discussions.
Action: search {"query": "World Economic Forum COO"}
Obs: Sarita Nayyar / Julien Gattoni...
Step 3 Thought: The COO of WEF is [name].
Final Answer: [wrong]
The agent never asked: does WEF actually match "mediated secret talks between US and Chinese AI companies"? It anchored on the most prominent Geneva organization in the search results and ran with it. By step 2, recovery requires contradicting its own prior reasoning — something the model resists without an explicit prompt to verify.
With think:
Step 1 Thought: This is a multi-hop question. I need to find the org first, then its COO.
Action: think {"thought": "The question describes a specific event — secret talks
between US and Chinese AI companies — not a general Geneva forum. I should
search for the event itself, not assume WEF or UN."}
Obs: Reasoning recorded.
Step 2 Thought: I'll search for the specific event described.
Action: search {"query": "secret talks US Chinese AI companies Geneva 2023 mediator"}
Obs: Results mention the Shaikh Group...
Step 3 Thought: Before proceeding, I should verify the Shaikh Group actually fits the
description — a private advisory firm that arranged AI company meetings.
Action: think {"thought": "The Shaikh Group is described as a geopolitical advisory
firm that brokered the Geneva meetings. This matches the task: 'mediated
secret talks between US and Chinese AI companies'. Not WEF. Proceed."}
Obs: Reasoning recorded.
Step 4 Action: search {"query": "Shaikh Group COO"}
...
Final Answer: Yannis Pallikaris, COO of the Shaikh Group.
The think tool does three things the bare ReAct loop cannot:
- Upfront planning: the agent can spend a step decomposing the query and choosing a search strategy before any API call is made. On a 15-iteration budget, one planning step up front is cheaper than recovering from three wrong searches.
- Entity verification: the agent can explicitly check whether an entity from search results matches the task description. This is the critical gate that prevents anchoring on a superficially related but wrong entity.
- Mid-run reorientation: if evidence contradicts the current path, the agent can reason about the contradiction and pivot without the reasoning appearing inconsistent in the next Thought: prefix.
The tool itself is a no-op: it accepts a thought string and returns a neutral confirmation. Its value is entirely in the reasoning trace it forces into the step history, where subsequent Thought: steps can reference it.
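A minimal sketch of such a no-op tool is shown below. The `Tool` base class here is a stand-in whose shape is assumed from the "Adding a Tool" section, not copied from tools/base.py; the "Reasoning recorded." string matches the observation shown in the trace above.

```python
from abc import ABC, abstractmethod


class Tool(ABC):
    """Stand-in for the abstract base class (shape assumed, not the real one)."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def description(self) -> str: ...

    @abstractmethod
    def execute(self, **kwargs) -> str: ...


class ThinkTool(Tool):
    """Accepts a thought and returns a neutral confirmation."""

    @property
    def name(self) -> str:
        return "think"

    @property
    def description(self) -> str:
        return "Record a planning or verification thought. Makes no external call."

    def execute(self, thought: str) -> str:
        # The thought is not processed here; its value is that it lands in
        # the step history, where later steps can refer back to it.
        return "Reasoning recorded."
```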
Why the Worked Example in the Prompt Matters
The think tool existed for several versions before it reliably fired. The instruction said:
"Use the think tool to plan your approach on multi-step questions..."
That is advisory, conditional language. The model has to evaluate whether the task is "complex enough" — which is itself the reasoning you wanted it to do. Under token pressure or on tasks that feel familiar, it categorises the query as routine and skips straight to searching.
The fix comes from a well-established technique: few-shot prompting.
Few-shot prompting was demonstrated at scale in the original GPT-3 paper (Brown et al., 2020): showing the model one or two worked examples of the exact pattern you want reliably outperforms any amount of instructional text describing the same pattern. The model infers format, depth, and decision logic from examples in a way that rules alone don't produce.
The specific numbers for our case come from Anthropic's own study of the think tool on τ-bench (Anthropic Engineering, 2025):
| Configuration | Performance on τ-bench (airline domain) |
|---|---|
| No think tool | 0.332 (baseline) |
| Think tool, advisory prompt ("use when complex") | 0.404 (+22%) |
| Think tool, mandatory prompt + worked example | 0.584 (+76%) |
The worked example in the system prompt — showing the agent exactly how to decompose a task, call think, then verify an entity before proceeding — is the difference between a 22% and a 76% lift over baseline.
This is also consistent with findings across other agent frameworks. AWS Bedrock's production Claude template (Amazon Bedrock docs) uses: "Always output your thoughts within <thinking></thinking> xml tags before and after you invoke a function." — unconditional, anchored to a structural event (function invocation), not a quality judgement.
How this is implemented here:
The system prompt in webresearch/agent.py now contains three layers:
- Mandatory, unconditional language: "Your very first action on every task MUST be think. You are not permitted to call search, scrape, or any other tool before calling think first."
- A worked example: the exact three-step think → search → think pattern for the kind of multi-hop entity query where the failure mode is most acute.
- Parser-level enforcement: a _think_called flag in the agent's run() loop. If the model still skips think and calls any other tool first, the observation is replaced with a corrective error message and the actual tool call is not made. The model self-corrects on the next iteration. This converts the prompt rule into a structural feedback loop, the same pattern used in LangChain's output parser for format violations.
The result: think is now a core part of the reasoning loop, not a suggestion.
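The parser-level enforcement described above might look roughly like the following. Only the `_think_called` flag name comes from this README; `ThinkFirstGate`, `dispatch`, and the wording of the corrective message are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical wording; the real corrective message is not shown in this README.
CORRECTION = (
    "ERROR: your first action must be think. "
    "The requested tool was not run; call think, then retry."
)


class ThinkFirstGate:
    """Refuses every tool call until think has been called once in this run."""

    def __init__(self, tools: Dict[str, Callable[..., str]]) -> None:
        self._tools = tools
        self._think_called = False

    def dispatch(self, name: str, **kwargs) -> str:
        if name == "think":
            self._think_called = True
        elif not self._think_called:
            # Skip the real call; the corrective text becomes the observation.
            return CORRECTION
        return self._tools[name](**kwargs)
```

Because the rejected call never executes, a skipped think step costs one iteration but no search quota.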
Model Fallback Chain
When a provider hits a quota or rate limit, the agent automatically falls back to the next available provider:
Gemini 2.5 Flash → Groq → OpenRouter → Ollama
The chain is thread-safe: a threading.Lock guards provider rotation and a Semaphore(1) per provider serializes concurrent calls to the same endpoint. Retry backoff is max(10s, 2^(attempt+2)) — generous enough to avoid hammering free-tier rate limits.
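The sketch below illustrates the backoff formula and the lock-plus-semaphore arrangement. It is not the code in llm_chain.py: `RateLimitError`, `FallbackChain`, and the 0-based `attempt` counter are assumptions, and where the backoff sleep sits relative to rotation is left out.

```python
import threading
from typing import Callable, List, Tuple


class RateLimitError(Exception):
    """Hypothetical marker for a quota or rate-limit response."""


def backoff_seconds(attempt: int) -> int:
    """max(10, 2^(attempt+2)) seconds, with attempt assumed to start at 0."""
    return max(10, 2 ** (attempt + 2))


class FallbackChain:
    def __init__(self, providers: List[Tuple[str, Callable[[str], str]]]) -> None:
        self._providers = providers
        self._index = 0
        self._rotate_lock = threading.Lock()  # guards provider rotation
        # One Semaphore(1) per provider serializes calls to that endpoint.
        self._slots = {name: threading.Semaphore(1) for name, _ in providers}

    def _current(self):
        with self._rotate_lock:
            return self._index, self._providers[self._index]

    def _rotate(self, failed_index: int) -> bool:
        """Advance past a failed provider; False when none are left."""
        with self._rotate_lock:
            if self._index != failed_index:
                return True  # another thread already rotated
            if self._index + 1 < len(self._providers):
                self._index += 1
                return True
            return False

    def generate(self, prompt: str) -> str:
        while True:
            index, (name, call) = self._current()
            try:
                with self._slots[name]:
                    return call(prompt)
            except RateLimitError:
                if not self._rotate(index):
                    raise
```

Under the 0-based assumption, the first two retries wait the 10 s floor and the exponential term takes over from the third (16 s, then 32 s).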
Prompt Injection Defence
All tool observations are sanitised before entering the prompt. Patterns like ignore all previous instructions, <system>, [INST], etc. are replaced with [FILTERED]. The scraper applies a second pass on raw HTML output.
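A minimal sketch of this kind of filter, assuming a regex-based approach: the three patterns below cover only the examples named above, and `sanitise_observation` is a hypothetical name. The project's actual pattern list is not reproduced here.

```python
import re

# Illustrative patterns only, based on the examples named in the text.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(?:all\s+)?previous\s+instructions", re.IGNORECASE),
    re.compile(r"</?\s*system\s*>", re.IGNORECASE),
    re.compile(r"\[/?INST\]", re.IGNORECASE),
]


def sanitise_observation(text: str) -> str:
    """Replace known injection markers with [FILTERED] before prompting."""
    for pattern in INJECTION_PATTERNS:
        text = pattern.sub("[FILTERED]", text)
    return text
```

Pattern filters like this catch only known phrasings, so they reduce rather than remove the risk described by Greshake et al. (2023).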
Scraper Hardening
The scraper handles several failure modes that would otherwise silently waste iterations:
- HTML tables → markdown: BeautifulSoup extracts <table> elements and converts them to aligned markdown tables before html2text processes the page. Column values and numbers are preserved exactly, which is critical for emissions data, financial statements, and any structured tabular source.
- JS-only pages: large HTML with <400 chars extracted text → returns a scrape_js suggestion
- Paywall teasers: 200 OK with <600 chars + subscribe/sign-in keywords → skips and suggests alternatives
- Auth redirects: raw HTML scanned for login form signatures (<input type="password">, sign-in prose) even when the page returns 200 with substantial content
- 5xx retry: 500/502/503/504 retried twice with 2s/4s backoff before giving up
- Encoding: apparent_encoding used when server omits charset or defaults to ISO-8859-1, which fixes mangled characters in EU/government filings
- Content selectors: 30+ CSS selectors covering common CMS and news-site patterns (.article-body, .post-content, .entry-content, data-component attributes, etc.) before falling back to full <body>
- UA rotation: request headers rotate across a pool of current browser strings per request
- HTTP 401/403/406/429: returned as actionable skip messages, not exceptions
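Several of the checks above amount to classifying a response before its text reaches the agent. The sketch below shows one way to order them. The 400- and 600-character thresholds and the status codes come from the list; the function name, the keyword tuple, the 20,000-character definition of "large HTML", and the order of the checks are assumptions.

```python
SUBSCRIBE_KEYWORDS = ("subscribe", "sign in", "sign-in", "log in")
LARGE_HTML_CHARS = 20_000  # assumption: the README does not define "large"


def classify_scrape(status: int, html: str, text: str) -> str:
    """Return a coarse label for a fetched page (labels are hypothetical)."""
    if status in (401, 403, 406, 429):
        return "skip"        # reported as an actionable skip message
    if status in (500, 502, 503, 504):
        return "retry"       # caller retries twice with 2s/4s backoff
    if 'type="password"' in html.lower():
        return "auth_wall"   # login form present despite a 200 response
    lowered = text.lower()
    if len(text) < 600 and any(k in lowered for k in SUBSCRIBE_KEYWORDS):
        return "paywall"
    if len(html) >= LARGE_HTML_CHARS and len(text) < 400:
        return "js_only"     # suggest the scrape_js tool instead
    return "ok"
```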
PDF Extraction
The pdf_extract tool downloads and parses PDFs using pdfplumber:
- Tables extracted as aligned grids with exact cell values (not OCR text)
- pages="12-18" parameter lets the agent target specific page ranges once it knows the document structure
- Total page count shown in output header so the agent can navigate large documents
- Handles login-wall redirects (server returns HTML instead of PDF)
The typical pattern for a sustainability report task: scrape the report landing page to find the PDF URL → pdf_extract with pages="all" to see the table of contents → pdf_extract with a targeted range to get the GHG table → execute_code to compute the percentage change.
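Two small pieces of that workflow can be sketched in plain Python: turning a pages value into page numbers, and the final percentage computation. Both helpers are hypothetical. The README confirms only the "12-18" and "all" forms, so single-page input and clamping to the document length are assumptions.

```python
from typing import List


def parse_pages(spec: str, total_pages: int) -> List[int]:
    """Return 1-based page numbers for "all", "N", or "M-N"."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total_pages + 1))
    start_text, _, end_text = spec.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    # Assumption: ranges past the end of the document are clamped.
    return list(range(start, min(end, total_pages) + 1))


def percentage_change(old: float, new: float) -> float:
    """Signed change from old to new, as a percentage of old."""
    return (new - old) / old * 100.0


# Placeholder figures for illustration, not real emissions data.
change = percentage_change(200.0, 150.0)
```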
Session Memory
ConversationMemory keeps a fixed-capacity FIFO of (query, answer) pairs for the lifetime of the CLI process (up to 5 pairs). Context is injected differently depending on the mode:
- Single query [1]: prior Q&A pairs are prepended to the task before the ReAct loop starts, so the agent can reference earlier findings directly.
- Deep research [2]: prior context is passed only to the final synthesis step, never to the sub-question decomposer. This prevents unrelated prior queries from polluting the fan-out questions.
Session memory is cleared with [7] or when the process exits.
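A fixed-capacity FIFO of this kind maps directly onto `collections.deque` with `maxlen`. The class below is a stand-in for illustration, not the real ConversationMemory; its method names and the Q:/A: rendering are assumptions.

```python
from collections import deque
from typing import Deque, Tuple


class MemorySketch:
    """Keeps the most recent (query, answer) pairs, dropping the oldest."""

    def __init__(self, capacity: int = 5) -> None:
        self._pairs: Deque[Tuple[str, str]] = deque(maxlen=capacity)

    def add(self, query: str, answer: str) -> None:
        self._pairs.append((query, answer))  # deque evicts the oldest pair

    def clear(self) -> None:
        self._pairs.clear()

    def as_context(self) -> str:
        # Assumed rendering; the real prompt format is not shown here.
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self._pairs)

    def __len__(self) -> int:
        return len(self._pairs)
```

In single-query mode the rendered context would be prepended to the task, while in deep research mode it would be passed only to the synthesis step.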
Benchmarking
Standard NLP benchmarks (MMLU, HellaSwag, BIG-Bench, GPQA) measure a model's parametric knowledge — what it already knows from pre-training. They test the model in isolation: fixed prompt in, fixed answer out. For an agentic web research system, that tests the wrong thing entirely.
What actually breaks in a research agent is different:
- Multi-hop reasoning chains: the agent must find fact A, then use A to find fact B. A wrong answer at hop 1 compounds — the agent searches for the wrong thing at hop 2, finds plausible-sounding results, and hallucinates a confident final answer. Static benchmarks have no tool calls, so they never surface this failure mode.
- Hallucination under rate pressure: when the primary model is rate-limited and the chain falls back to a weaker model mid-run, reasoning quality drops. The weaker model may anchor on a plausible-sounding entity from a previous step rather than continuing to search. This looks like "the model knows the answer" but is actually "the model stopped researching too early".
- Context contamination: session memory from a previous failed query can leak into the next one, steering the agent toward a wrong entity it found earlier. Fixed-prompt benchmarks have no session state, so they never catch this.
- Scraping failures: a JS-rendered page returns 80 chars of content; the agent should detect this and switch to the browser tool. Whether it does that correctly is not a knowledge question — it is a tool-use and observation-interpretation question.
The benchmark in benchmarks/ addresses these directly. Each case is a real-world multi-hop research question with a known ground-truth answer:
benchmarks/
├── benchmark.json # cases: query + expected_contains + expected_not_contains + source
└── run_benchmark.py # runner: executes each case, checks answer, reports PASS/FAIL
Evaluation is keyword-based rather than LLM-as-judge: expected_contains lists anchor terms that must appear in the answer (e.g. "shaikh group"), and expected_not_contains lists known hallucinations to reject (e.g. "world economic forum"). This is similar in spirit to the string-matching evaluation used by QA benchmarks such as BioASQ, TriviaQA, and Natural Questions: robust to phrasing variation while still catching wrong entities.
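The check itself is simple enough to sketch in a few lines. This is not the code in run_benchmark.py: `check_answer`, the result dictionary, and case-insensitive matching are assumptions, while the field names and anchor terms are the ones given above.

```python
from typing import Dict, List


def check_answer(
    answer: str,
    expected_contains: List[str],
    expected_not_contains: List[str],
) -> Dict[str, object]:
    """PASS only if every anchor term appears and no rejected term does."""
    lowered = answer.lower()  # assumption: matching is case-insensitive
    missing = [t for t in expected_contains if t.lower() not in lowered]
    forbidden = [t for t in expected_not_contains if t.lower() in lowered]
    status = "PASS" if not missing and not forbidden else "FAIL"
    return {"status": status, "missing": missing, "forbidden": forbidden}


case = {
    "id": "geneva-ai-talks-coo",
    "expected_contains": ["shaikh group"],
    "expected_not_contains": ["world economic forum"],
}
good = check_answer(
    "The Shaikh Group mediated the talks.",
    case["expected_contains"],
    case["expected_not_contains"],
)
bad = check_answer(
    "The COO of the World Economic Forum is unknown.",
    case["expected_contains"],
    case["expected_not_contains"],
)
```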
# Run one case
python benchmarks/run_benchmark.py --ids geneva-ai-talks-coo
# Run all cases
python benchmarks/run_benchmark.py
# Save results to a specific file
python benchmarks/run_benchmark.py --out results/2026-03-29.json
Output:
[geneva-ai-talks-coo]
status: PASS (215s)
answer: The Shaikh Group mediated the talks. Their COO is...
[vw-scope-emissions]
status: FAIL
missing: ['%']
answer: Volkswagen reduced Scope 1 and Scope 2 emissions...
Adding a new case: add an entry to benchmark.json with id, query, expected_contains, expected_not_contains, source, and notes. No code changes required. The recommended workflow is: run the query manually in the TUI, verify the answer against the primary source, then record the discriminating anchor terms.
Context Window Management
The prompt uses a sliding window: the 8 most recent steps are included in full; earlier steps are condensed to one-line summaries. Tool output is truncated at MAX_TOOL_OUTPUT_LENGTH characters before entering the prompt.
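One way to picture the sliding window: keep the last 8 steps verbatim and collapse older ones to a single line each. In the sketch below, `render_history`, the step dictionary keys, and the 80-character summary format are assumptions. Only the window size of 8 and the truncation limit come from this README.

```python
from typing import Dict, List


def render_history(
    steps: List[Dict[str, str]],
    recent: int = 8,
    max_tool_output: int = 3000,
) -> str:
    """Render steps for the prompt: old ones summarised, recent ones in full."""
    cutoff = max(0, len(steps) - recent)
    parts: List[str] = []
    for number, step in enumerate(steps, start=1):
        # Truncate every observation before it can enter the prompt.
        observation = step["observation"][:max_tool_output]
        if number <= cutoff:
            flat = " ".join(observation.split())  # collapse to one line
            parts.append(f"[step {number}] {step['action']} -> {flat[:80]}")
        else:
            parts.append(
                f"Thought: {step['thought']}\n"
                f"Action: {step['action']}\n"
                f"Observation: {observation}"
            )
    return "\n".join(parts)
```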
Configuration
All settings can be overridden with environment variables:
| Variable | Default | Description |
|---|---|---|
| MAX_ITERATIONS | 15 | ReAct loop iterations before forced termination |
| MAX_TOOL_OUTPUT_LENGTH | 3000 | Characters of observation fed back to LLM |
| TEMPERATURE | 0.1 | LLM temperature; lower = more deterministic |
| MODEL_NAME | gemini-2.5-flash | Primary model identifier |
| WEB_REQUEST_TIMEOUT | 30 | Seconds before HTTP request timeout |
| CODE_EXECUTION_TIMEOUT | 60 | Seconds before subprocess kill |
| LOG_LEVEL | WARNING | Python logging level (DEBUG, INFO, WARNING, ERROR). Set to DEBUG or INFO to see per-step reasoning and tool calls in the terminal. |
| QUIET_FALLBACK | false | Set to true to suppress the "Rate limit reached… Switching to…" console message when the model fallback chain activates. Useful in scripted or CI contexts where provider churn is expected. |
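For readers who want to mirror these defaults in their own scripts, the sketch below reads the same variables with the same fallbacks. The `Settings` dataclass and `load_settings` are hypothetical and are not the interface of webresearch/config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    max_iterations: int
    max_tool_output_length: int
    temperature: float
    model_name: str
    web_request_timeout: int
    code_execution_timeout: int
    log_level: str
    quiet_fallback: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from env (os.environ by default), using the table defaults."""
    env = os.environ if env is None else env
    return Settings(
        max_iterations=int(env.get("MAX_ITERATIONS", "15")),
        max_tool_output_length=int(env.get("MAX_TOOL_OUTPUT_LENGTH", "3000")),
        temperature=float(env.get("TEMPERATURE", "0.1")),
        model_name=env.get("MODEL_NAME", "gemini-2.5-flash"),
        web_request_timeout=int(env.get("WEB_REQUEST_TIMEOUT", "30")),
        code_execution_timeout=int(env.get("CODE_EXECUTION_TIMEOUT", "60")),
        log_level=env.get("LOG_LEVEL", "WARNING").upper(),
        # Assumption: only the literal "true" (any case) enables the flag.
        quiet_fallback=env.get("QUIET_FALLBACK", "false").strip().lower() == "true",
    )
```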
Adding a Tool
The ToolManager uses a registration pattern. No changes to core agent logic required:
# webresearch/tools/my_tool.py
from .base import Tool


class MyTool(Tool):
    @property
    def name(self) -> str:
        return "my_tool"

    @property
    def description(self) -> str:
        return """Use this tool to [description].
Parameters:
- param1 (str): Description of param1"""

    def execute(self, param1: str) -> str:
        # Placeholder body: replace with the tool's real work.
        result = f"processed: {param1}"
        return result


# Registration, e.g. in your own entry point
from webresearch import ToolManager
from webresearch.tools.my_tool import MyTool

tool_manager = ToolManager()
tool_manager.register_tool(MyTool())
From Source
git clone https://github.com/ashioyajotham/web_research_agent.git
cd web_research_agent
pip install -e .
webresearch
Known Limitations
Anti-bot fingerprinting defeats the scraper on major commercial sites. UA rotation helps against trivial checks, but Cloudflare, Akamai, and DataDome fingerprint the TLS handshake, header order, and timing, none of which the requests-based scraper controls. Affected sites return a challenge page or a silent 403. The scrape_js (Playwright) tool passes these more often but is not immune.
Observation truncation loses context. At MAX_TOOL_OUTPUT_LENGTH=3000, long scraped pages are cut off. Key facts in the truncated portion are permanently lost. A chunking + retrieval approach would address this but adds latency.
PDF table extraction degrades on scanned/image PDFs. pdfplumber works on text-layer PDFs (the majority of corporate reports). Scanned documents with no text layer return empty pages. There is no OCR fallback.
Synthesis strategy is LLM-selected, not validated. The model chooses a reasoning strategy (factual lookup, list compilation, structured extraction, open synthesis) inside its own reasoning pass. There is no external classifier validating the selection. Ambiguous tasks sometimes get the wrong strategy.
Free-tier rate limits constrain throughput. Groq allows ~6,000 tokens/min on the free tier. With max_tokens=4096 per step, the first rate limit typically hits around step 2-3 of a 15-step run. The fallback chain and 10s+ backoff floor mitigate this but do not eliminate it. The MAX_TOOL_OUTPUT_LENGTH=3000 cap on observations reduces prompt size and partially offsets the increased response budget.
References
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. arxiv.org/abs/2210.03629
Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. arxiv.org/abs/2201.11903
Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS 2023. arxiv.org/abs/2302.04761
Greshake, K., et al. (2023). Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arxiv.org/abs/2302.12173
License
MIT. See LICENSE.