OpenQuery
Query public data sources worldwide through a unified CLI and REST API.
OpenQuery provides a plugin-based framework for scraping government websites, public registries, and open data APIs. It handles the hard parts — browser automation, CAPTCHA solving, WAF bypass, caching, and rate limiting — so you can focus on the data.
Features
- Unified interface — one CLI and one API endpoint for all data sources
- Browser automation — Playwright-based scraping for JavaScript-heavy sites
- Multi-engine CAPTCHA solving — PaddleOCR (100%), EasyOCR+Tesseract voting (90%), with cloud and paid fallbacks
- LLM-powered knowledge CAPTCHAs — Ollama (local), HuggingFace, Anthropic, OpenAI fallback chain
- Audit & evidence — screenshots, network logs, and PDF evidence reports for compliance
- WAF bypass — browser-context API calls preserve session cookies
- Caching — in-memory, Redis, or SQLite backends with configurable TTL
- Rate limiting — per-source token-bucket to respect server limits
- REST API — FastAPI server with auto-generated OpenAPI docs
- Extensible — add new data sources by implementing a single class
- Country-organized — sources grouped by country code (co, us, etc.)
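The per-source token-bucket idea from the feature list can be sketched as follows. This is a minimal illustration, not OpenQuery's actual `rate_limit.py`: the `TokenBucket` class, the `allow` helper, and the injectable `clock` parameter are hypothetical names introduced here.

```python
import time


class TokenBucket:
    """Hypothetical bucket holding up to `rpm` tokens, refilled continuously."""

    def __init__(self, rpm: int, clock=time.monotonic):
        self.capacity = float(rpm)
        self.refill_per_sec = rpm / 60.0
        self.tokens = float(rpm)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last) * self.refill_per_sec,
        )
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per source name, created lazily (assumed layout).
_buckets: dict[str, TokenBucket] = {}


def allow(source: str, rpm: int = 10) -> bool:
    """Return True if a request to `source` may proceed right now."""
    if source not in _buckets:
        _buckets[source] = TokenBucket(rpm)
    return _buckets[source].try_acquire()
```

A caller would check `allow("co.simit")` before each request and wait or return an error when it yields `False`. Keeping one bucket per source means a slow registry does not consume the budget of the others.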
Built-in Sources
| Source | Country | Description | Inputs | Browser |
|---|---|---|---|---|
| co.simit | CO | Traffic fines and violations | cedula, placa | Yes |
| co.runt | CO | Vehicle registry (SOAT, RTM, ownership) | vin, placa, cedula | Yes |
| co.procuraduria | CO | Disciplinary records | cedula | Yes |
| co.policia | CO | Criminal background | cedula | Yes |
| co.adres | CO | Health system enrollment (EPS) | cedula | Yes |
| co.pico_y_placa | CO | Driving restrictions (Bogota/Medellin/Cali) | placa | No |
| co.peajes | CO | Toll road tariffs (ANI) | custom | No |
| co.combustible | CO | Fuel prices by city/station | custom | No |
| co.estaciones_ev | CO | EV charging stations | custom | No |
| co.siniestralidad | CO | Road crash hotspots (ANSV) | custom | No |
| co.vehiculos | CO | National vehicle fleet data | placa, custom | No |
| co.fasecolda | CO | Vehicle reference prices (insurance) | custom | Yes |
| co.recalls | CO | Vehicle safety recalls (SIC) | custom | Yes |
Installation
pip install openquery
Or with uv:
uv add openquery
System Dependencies
Playwright browsers are required for web scraping:
playwright install chromium
CAPTCHA Engines (pick one or more)
OpenQuery auto-detects installed OCR engines and builds an optimal solver chain:
| Engine | Accuracy | Speed | Install |
|---|---|---|---|
| PaddleOCR (recommended) | 100% | ~130ms | pip install "openquery[paddleocr]" |
| EasyOCR + Tesseract (voting) | 90% | ~500ms | pip install "openquery[easyocr]" + brew install tesseract |
| Tesseract alone | 80% | ~390ms | brew install tesseract (included by default) |
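One way the auto-detection described above could work is to probe for each engine's Python package and order the result by the accuracy column. The sketch below is an assumption about the approach, not OpenQuery's code: the module names `paddleocr`, `easyocr`, and `pytesseract` are the usual import names for those packages, and `detect_engines` and `build_chain` are hypothetical helpers.

```python
from importlib.util import find_spec

# (engine label, importable module) in order of preference.
# Which modules OpenQuery really probes for is an assumption here.
ENGINE_MODULES = [
    ("paddleocr", "paddleocr"),
    ("easyocr", "easyocr"),
    ("tesseract", "pytesseract"),
]


def _installed(module: str) -> bool:
    # find_spec returns None when a top-level module cannot be imported.
    return find_spec(module) is not None


def detect_engines(is_installed=_installed) -> list[str]:
    """Return the labels of the engines whose package is importable."""
    return [name for name, mod in ENGINE_MODULES if is_installed(mod)]


def build_chain(available: list[str]) -> list[str]:
    """Order solvers from most to least accurate, per the table above."""
    chain = []
    if "paddleocr" in available:
        chain.append("paddleocr")
    if "easyocr" in available and "tesseract" in available:
        chain.append("easyocr+tesseract (voting)")
    elif "easyocr" in available:
        chain.append("easyocr")
    if "tesseract" in available:
        chain.append("tesseract")
    return chain


print(build_chain(detect_engines()))
```

Passing `is_installed` as a parameter keeps the detection logic testable on a machine with none of the engines installed.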
For knowledge-based CAPTCHAs (Procuraduria), you need at least one LLM backend:
| Backend | Cost | Setup |
|---|---|---|
| Ollama (recommended) | Free | ollama pull llama3.2:1b |
| HuggingFace Inference | Free | Set HF_TOKEN env var |
| Anthropic | Paid | Set ANTHROPIC_API_KEY env var |
| OpenAI | Paid | Set OPENAI_API_KEY env var |
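The fallback order in this table can be pictured as a loop that tries each configured backend until one answers. The following is only a sketch under stated assumptions: `BACKENDS`, `configured_backends`, and `answer` are hypothetical names, the solver callables stand in for real client code, and treating Ollama as always available is a simplification.

```python
import os

# (backend name, environment variable that enables it).
# None means no key is needed; a local Ollama is assumed to be reachable.
BACKENDS = [
    ("ollama", None),
    ("huggingface", "HF_TOKEN"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def configured_backends(env=os.environ) -> list[str]:
    """List backends in fallback order, skipping those without credentials."""
    return [name for name, var in BACKENDS if var is None or env.get(var)]


def answer(question: str, solvers: dict, env=os.environ) -> tuple[str, str]:
    """Try each configured backend in order; return (backend, answer)."""
    errors = []
    for name in configured_backends(env):
        solver = solvers.get(name)
        if solver is None:
            continue
        try:
            result = solver(question)
        except Exception as exc:  # any backend failure moves on to the next
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Here `solvers` maps a backend name to any callable that takes the question text and returns a string, so free backends are tried before paid ones.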
Optional Extras
pip install "openquery[paddleocr]" # PaddleOCR — best CAPTCHA accuracy (100%)
pip install "openquery[easyocr]" # EasyOCR — good accuracy (85%), combines with Tesseract for 90%
pip install "openquery[huggingface]" # HuggingFace Inference API (OCR + QA)
pip install "openquery[serve]" # FastAPI server (fastapi, uvicorn)
pip install "openquery[redis]" # Redis cache backend
pip install "openquery[captcha]" # 2captcha paid CAPTCHA solving (last resort)
Quick Start
CLI
# List available data sources
openquery sources
# Query Colombian traffic fines by cedula
openquery query co.simit --cedula 12345678
# Query Colombian vehicle registry by plate
openquery query co.runt --placa ABC123
# Query by VIN
openquery query co.runt --vin 5YJ3E1EA1PF000001
# Disciplinary records
openquery query co.procuraduria --cedula 12345678
# Criminal background
openquery query co.policia --cedula 12345678
# Health system enrollment
openquery query co.adres --cedula 12345678
# Pico y placa — is my plate restricted today?
openquery query co.pico_y_placa --placa ABC123
# Toll tariffs
openquery query co.peajes --custom peaje --extra '{"peaje": "ALVARADO"}'
# Fuel prices in Bogota
openquery query co.combustible --custom fuel --extra '{"municipio": "BOGOTA"}'
# EV charging stations in Medellin
openquery query co.estaciones_ev --custom ev --extra '{"ciudad": "Medellin"}'
# Road crash hotspots
openquery query co.siniestralidad --custom stats --extra '{"departamento": "CUNDINAMARCA"}'
# Vehicle fleet lookup by plate
openquery query co.vehiculos --placa ABC123
# Output raw JSON
openquery query co.simit --cedula 12345678 --json
# Generate audit evidence (screenshots + PDF report)
openquery query co.runt --placa ABC123 --audit --audit-dir ./evidence
REST API
# Start the API server
openquery serve
# Or with custom host/port
openquery serve --host 127.0.0.1 --port 3000
Then query via HTTP:
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "source": "co.simit",
    "document_type": "cedula",
    "document_number": "12345678"
  }'
Response:
{
  "ok": true,
  "source": "co.simit",
  "queried_at": "2026-03-31T10:30:00Z",
  "cached": false,
  "latency_ms": 4523,
  "data": {
    "comparendos": 0,
    "multas": 0,
    "total_deuda": 0.0,
    "paz_y_salvo": true
  }
}
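The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl call and the response shape shown above. `build_query_request` and `run_query` are hypothetical helper names, and the shape of an error response (anything other than `"ok": true`) is not documented here, so the error handling is an assumption.

```python
import json
import urllib.request


def build_query_request(source: str, document_type: str, document_number: str,
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for /api/v1/query without sending it."""
    payload = {
        "source": source,
        "document_type": document_type,
        "document_number": document_number,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(source: str, document_type: str, document_number: str,
              base_url: str = "http://localhost:8000", timeout: float = 60.0) -> dict:
    """Send the query and return the "data" object of the response."""
    req = build_query_request(source, document_type, document_number, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    if not body.get("ok"):
        # Assumed failure shape: the whole body is surfaced for debugging.
        raise RuntimeError(f"query failed: {body}")
    return body["data"]
```

With the server running, `run_query("co.simit", "cedula", "12345678")` would return the `data` object from the response. The generous timeout reflects that browser-backed sources can take several seconds, as the `latency_ms` value above suggests.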
API Endpoints:
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/query | Query a data source |
| GET | /api/v1/sources | List available sources |
| GET | /api/v1/health | Health check and cache stats |
| GET | /docs | Interactive API documentation |
Docker
docker compose up
This starts the API server with Redis caching on port 8000.
Configuration
All settings use environment variables with the OPENQUERY_ prefix:
| Variable | Default | Description |
|---|---|---|
| OPENQUERY_API_KEY | (none) | API key for server authentication |
| OPENQUERY_CACHE_BACKEND | memory | Cache backend: memory, redis, sqlite |
| OPENQUERY_CACHE_TTL_DEFAULT | 3600 | Default cache TTL in seconds |
| OPENQUERY_REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| OPENQUERY_BROWSER_HEADLESS | true | Run browser in headless mode |
| OPENQUERY_BROWSER_TIMEOUT | 30.0 | Browser operation timeout in seconds |
| OPENQUERY_RATE_LIMIT_DEFAULT_RPM | 10 | Default requests per minute per source |
| OPENQUERY_LOG_LEVEL | INFO | Logging level |
| TWO_CAPTCHA_API_KEY | (none) | 2captcha.com API key (paid fallback) |
| HF_TOKEN | (none) | HuggingFace token (free OCR + QA) |
| ANTHROPIC_API_KEY | (none) | Anthropic API key (paid QA fallback) |
| OPENAI_API_KEY | (none) | OpenAI API key (paid QA fallback) |
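To illustrate how prefixed environment variables map onto typed settings, here is a small loader that uses the defaults from the table above. It is a sketch, not OpenQuery's configuration code: the `Settings` dataclass and `load_settings` function are hypothetical, the boolean parsing rules are an assumption, and the key-style variables with no default are left out for brevity.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the configuration table.
    cache_backend: str = "memory"
    cache_ttl_default: int = 3600
    redis_url: str = "redis://localhost:6379/0"
    browser_headless: bool = True
    browser_timeout: float = 30.0
    rate_limit_default_rpm: int = 10
    log_level: str = "INFO"


def _parse_bool(raw: str) -> bool:
    # Assumed truthy spellings; the real parser may accept others.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=os.environ, prefix: str = "OPENQUERY_") -> Settings:
    """Override defaults with any OPENQUERY_* variables found in `env`."""
    overrides = {}
    for f in fields(Settings):
        raw = env.get(prefix + f.name.upper())
        if raw is None:
            continue
        kind = type(f.default)  # coerce to the type of the default value
        overrides[f.name] = _parse_bool(raw) if kind is bool else kind(raw)
    return Settings(**overrides)
```

For example, exporting `OPENQUERY_CACHE_BACKEND=redis` before starting the process would yield `Settings(cache_backend="redis", ...)` with every other field at its default.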
Adding a New Source
Create a new source by implementing the BaseSource class:
# src/openquery/sources/us/nhtsa.py
from pydantic import BaseModel

from openquery.sources import register
from openquery.sources.base import BaseSource, DocumentType, QueryInput, SourceMeta


class NhtsaResult(BaseModel):
    manufacturer: str = ""
    model: str = ""
    year: int = 0
    recalls: list[dict] = []


@register
class NhtsaSource(BaseSource):
    def meta(self) -> SourceMeta:
        return SourceMeta(
            name="us.nhtsa",
            display_name="NHTSA Vehicle Safety",
            description="US vehicle safety recalls and VIN decoding",
            country="US",
            url="https://vpic.nhtsa.dot.gov/api/",
            supported_inputs=[DocumentType.VIN],
            requires_captcha=False,
            requires_browser=False,
            rate_limit_rpm=30,
        )

    def query(self, input: QueryInput) -> NhtsaResult:
        import httpx

        resp = httpx.get(
            f"https://vpic.nhtsa.dot.gov/api/vehicles/decodevin/{input.document_number}",
            params={"format": "json"},
        )
        data = resp.json()
        # Parse and return NhtsaResult...
The @register decorator automatically makes the source available in the CLI, API, and source listing.
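A registration decorator of this kind is commonly backed by a module-level dictionary keyed by source name. The sketch below shows that general pattern only and does not reproduce `openquery.sources.register`: the `Meta` class, the `_SOURCES` registry, and the `get_source` and `list_sources` helpers are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Meta:
    # Stand-in for SourceMeta with just the fields this sketch needs.
    name: str
    country: str


_SOURCES: dict[str, type] = {}


def register(cls):
    """Class decorator: record the source under the name its meta() reports."""
    meta = cls().meta()
    if meta.name in _SOURCES:
        raise ValueError(f"duplicate source: {meta.name}")
    _SOURCES[meta.name] = cls
    return cls  # the class itself is returned unchanged


def get_source(name: str):
    """Instantiate a registered source by its dotted name."""
    return _SOURCES[name]()


def list_sources(country=None) -> list[str]:
    """Return registered names, optionally filtered by country code."""
    return sorted(
        name for name, cls in _SOURCES.items()
        if country is None or cls().meta().country == country
    )


@register
class DemoSource:
    def meta(self) -> Meta:
        return Meta(name="xx.demo", country="XX")

    def query(self, document_number: str) -> dict:
        return {"echo": document_number}
```

Because registration happens when the class is defined, importing a source module is enough for a CLI or API layer to find it through the registry.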
Architecture
openquery/
├── core/
│ ├── browser.py # Playwright browser management
│ ├── captcha.py # Multi-engine CAPTCHA solvers (PaddleOCR, EasyOCR, Tesseract, voting)
│ ├── llm.py # LLM QA chain (Ollama, HuggingFace, Anthropic, OpenAI)
│ ├── audit.py # Evidence capture (screenshots, network logs, PDF reports)
│ ├── cache.py # Caching backends (memory, Redis, SQLite)
│ └── rate_limit.py # Token-bucket rate limiting
├── sources/ # Data source plugins, organized by country
│ ├── base.py # BaseSource ABC — implement this to add sources
│ ├── co/ # Colombia (SIMIT, RUNT, Procuraduria, Policia, ADRES)
│ └── us/ # United States (future)
├── models/ # Pydantic response models, organized by country
├── server/ # FastAPI REST API
└── commands/ # Typer CLI commands
Development
git clone https://github.com/dacrypt/openquery.git
cd openquery
uv sync --all-extras
playwright install chromium
# Run tests
uv run pytest
# Lint
uv run ruff check src/ tests/
See CONTRIBUTING.md for detailed guidelines.
Documentation
| Guide | Description |
|---|---|
| Getting Started | Installation, first query, engine setup |
| Sources Guide | Colombian sources with field reference |
| CAPTCHA Guide | OCR engines, voting, LLM backends, benchmarks |
| Audit Guide | Evidence capture, PDF reports, compliance |
| API Guide | REST endpoints, authentication, deployment |
| Adding Sources | Step-by-step guide to create new source plugins |
| Changelog | Version history |
License