Query public data sources worldwide via scraping and APIs

OpenQuery

Query public data sources worldwide through a unified CLI and REST API.

OpenQuery provides a plugin-based framework for scraping government websites, public registries, and open data APIs. It handles the hard parts — browser automation, CAPTCHA solving, WAF bypass, caching, and rate limiting — so you can focus on the data.

Features

  • Unified interface — one CLI and one API endpoint for all data sources
  • Browser automation — Playwright-based scraping for JavaScript-heavy sites
  • Multi-engine CAPTCHA solving — PaddleOCR (100%), EasyOCR+Tesseract voting (90%), with cloud and paid fallbacks
  • LLM-powered knowledge CAPTCHAs — Ollama (local), HuggingFace, Anthropic, OpenAI fallback chain
  • Audit & evidence — screenshots, network logs, and PDF evidence reports for compliance
  • WAF bypass — browser-context API calls preserve session cookies
  • Caching — in-memory, Redis, or SQLite backends with configurable TTL
  • Rate limiting — per-source token-bucket to respect server limits
  • REST API — FastAPI server with auto-generated OpenAPI docs
  • Extensible — add new data sources by implementing a single class
  • Country-organized — sources grouped by country code (co, us, etc.)
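As an illustration of the per-source token-bucket idea listed above, here is a minimal sketch. It is not OpenQuery's implementation; the class name `TokenBucket`, the `try_acquire` method, and the injectable `clock` parameter are assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow up to `rpm` requests per minute, refilling continuously."""

    def __init__(self, rpm: int, clock=time.monotonic):
        self.capacity = float(rpm)
        self.tokens = float(rpm)           # start with a full bucket
        self.refill_per_sec = rpm / 60.0   # tokens added per second
        self.clock = clock                 # injectable so tests can fake time
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per source name keeps a slow source from throttling the others.
buckets = {"co.simit": TokenBucket(rpm=10), "co.runt": TokenBucket(rpm=10)}
```

A caller would check `buckets[source].try_acquire()` before each request and sleep or queue the request when it returns `False`.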

Built-in Sources

Source | Country | Description | Inputs | Browser
co.simit | CO | Traffic fines and violations | cedula, placa | Yes
co.runt | CO | Vehicle registry (SOAT, RTM, ownership) | vin, placa, cedula | Yes
co.procuraduria | CO | Disciplinary records | cedula | Yes
co.policia | CO | Criminal background | cedula | Yes
co.adres | CO | Health system enrollment (EPS) | cedula | Yes
co.pico_y_placa | CO | Driving restrictions (Bogota/Medellin/Cali) | placa | No
co.peajes | CO | Toll road tariffs (ANI) | custom | No
co.combustible | CO | Fuel prices by city/station | custom | No
co.estaciones_ev | CO | EV charging stations | custom | No
co.siniestralidad | CO | Road crash hotspots (ANSV) | custom | No
co.vehiculos | CO | National vehicle fleet data | placa, custom | No
co.fasecolda | CO | Vehicle reference prices (insurance) | custom | Yes
co.recalls | CO | Vehicle safety recalls (SIC) | custom | Yes

Installation

pip install openquery

Or with uv:

uv add openquery

System Dependencies

Playwright browsers are required for web scraping:

playwright install chromium

CAPTCHA Engines (pick one or more)

OpenQuery auto-detects installed OCR engines and builds an optimal solver chain:

Engine | Accuracy | Speed | Install
PaddleOCR (recommended) | 100% | ~130ms | pip install "openquery[paddleocr]"
EasyOCR + Tesseract (voting) | 90% | ~500ms | pip install "openquery[easyocr]" + brew install tesseract
Tesseract alone | 80% | ~390ms | brew install tesseract (included by default)
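The auto-detection step could look roughly like the sketch below, which checks whether each engine's Python package is importable and returns the installed ones in preference order. This is an assumption about the approach, not OpenQuery's code: the function name `detect_engines` and the module names `paddleocr`, `easyocr`, and `pytesseract` are chosen for illustration.

```python
from importlib.util import find_spec

# (label, importable module name), best engine first, following the table above.
# The module names are assumptions about what each extra installs.
ENGINE_MODULES = [
    ("paddleocr", "paddleocr"),
    ("easyocr", "easyocr"),
    ("tesseract", "pytesseract"),
]


def _is_installed(module_name: str) -> bool:
    """True when the top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def detect_engines(is_installed=_is_installed) -> list[str]:
    """Return the labels of the installed engines, most accurate first."""
    return [label for label, module in ENGINE_MODULES if is_installed(module)]


if __name__ == "__main__":
    print("Solver chain:", detect_engines() or "no OCR engine found")
```

Using `find_spec` avoids importing heavy OCR libraries just to learn whether they exist; the real import can be deferred until the first CAPTCHA is solved.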

For knowledge-based CAPTCHAs (Procuraduria), you need at least one LLM backend:

Backend | Cost | Setup
Ollama (recommended) | Free | ollama pull llama3.2:1b
HuggingFace Inference | Free | Set HF_TOKEN env var
Anthropic | Paid | Set ANTHROPIC_API_KEY env var
OpenAI | Paid | Set OPENAI_API_KEY env var
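A fallback chain over these backends can be sketched as follows. Only the environment variable names come from the table above; the function names, the always-available treatment of Ollama, and the caller-supplied `ask(backend, question)` callable are hypothetical and stand in for whatever OpenQuery does internally.

```python
import os

# Backends in the order listed above. Ollama runs locally and needs no key,
# so this sketch always treats it as a candidate (an assumption).
BACKEND_ENV_VARS = [
    ("ollama", None),
    ("huggingface", "HF_TOKEN"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def available_backends(env=None) -> list[str]:
    """Return the backends whose credentials are present, in fallback order."""
    env = os.environ if env is None else env
    return [name for name, var in BACKEND_ENV_VARS if var is None or env.get(var)]


def answer_with_fallback(question: str, backends: list[str], ask):
    """Try each backend in turn and return the first successful answer.

    `ask(backend, question)` is a hypothetical callable supplied by the caller.
    """
    errors = []
    for backend in backends:
        try:
            return ask(backend, question)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{backend}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Ordering free backends first means a paid API is only called when everything cheaper has failed.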

Optional Extras

pip install "openquery[paddleocr]"   # PaddleOCR — best CAPTCHA accuracy (100%)
pip install "openquery[easyocr]"     # EasyOCR — good accuracy (85%), combines with Tesseract for 90%
pip install "openquery[huggingface]" # HuggingFace Inference API (OCR + QA)
pip install "openquery[serve]"       # FastAPI server (fastapi, uvicorn)
pip install "openquery[redis]"       # Redis cache backend
pip install "openquery[captcha]"     # 2captcha paid CAPTCHA solving (last resort)

Quick Start

CLI

# List available data sources
openquery sources

# Query Colombian traffic fines by cedula
openquery query co.simit --cedula 12345678

# Query Colombian vehicle registry by plate
openquery query co.runt --placa ABC123

# Query by VIN
openquery query co.runt --vin 5YJ3E1EA1PF000001

# Disciplinary records
openquery query co.procuraduria --cedula 12345678

# Criminal background
openquery query co.policia --cedula 12345678

# Health system enrollment
openquery query co.adres --cedula 12345678

# Pico y placa — is my plate restricted today?
openquery query co.pico_y_placa --placa ABC123

# Toll tariffs
openquery query co.peajes --custom peaje --extra '{"peaje": "ALVARADO"}'

# Fuel prices in Bogota
openquery query co.combustible --custom fuel --extra '{"municipio": "BOGOTA"}'

# EV charging stations in Medellin
openquery query co.estaciones_ev --custom ev --extra '{"ciudad": "Medellin"}'

# Road crash hotspots
openquery query co.siniestralidad --custom stats --extra '{"departamento": "CUNDINAMARCA"}'

# Vehicle fleet lookup by plate
openquery query co.vehiculos --placa ABC123

# Output raw JSON
openquery query co.simit --cedula 12345678 --json

# Generate audit evidence (screenshots + PDF report)
openquery query co.runt --placa ABC123 --audit --audit-dir ./evidence

REST API

# Start the API server
openquery serve

# Or with custom host/port
openquery serve --host 127.0.0.1 --port 3000

Then query via HTTP:

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "source": "co.simit",
    "document_type": "cedula",
    "document_number": "12345678"
  }'

Response:

{
  "ok": true,
  "source": "co.simit",
  "queried_at": "2026-03-31T10:30:00Z",
  "cached": false,
  "latency_ms": 4523,
  "data": {
    "comparendos": 0,
    "multas": 0,
    "total_deuda": 0.0,
    "paz_y_salvo": true
  }
}
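The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call and response shape shown above; the helper names `build_query_request` and `run_query` are made up for this example, and treating `"ok": false` as an error is an assumption about how failures are reported.

```python
import json
import urllib.request


def build_query_request(source, document_type, document_number,
                        base_url="http://localhost:8000"):
    """Build the POST request for /api/v1/query without sending it."""
    body = json.dumps({
        "source": source,
        "document_type": document_type,
        "document_number": document_number,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request, timeout=60.0):
    """Send the request and return the `data` object from the response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    if not payload.get("ok"):
        # Assumption: failed queries come back with "ok": false.
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]


if __name__ == "__main__":
    req = build_query_request("co.simit", "cedula", "12345678")
    print(run_query(req))  # requires a running `openquery serve`
```

A generous timeout is used because browser-backed sources can take several seconds, as the `latency_ms` value in the sample response suggests.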

API Endpoints:

Method | Path | Description
POST | /api/v1/query | Query a data source
GET | /api/v1/sources | List available sources
GET | /api/v1/health | Health check and cache stats
GET | /docs | Interactive API documentation

Docker

docker compose up

This starts the API server with Redis caching on port 8000.

Configuration

All settings use environment variables with the OPENQUERY_ prefix:

Variable | Default | Description
OPENQUERY_API_KEY | (none) | API key for server authentication
OPENQUERY_CACHE_BACKEND | memory | Cache backend: memory, redis, sqlite
OPENQUERY_CACHE_TTL_DEFAULT | 3600 | Default cache TTL in seconds
OPENQUERY_REDIS_URL | redis://localhost:6379/0 | Redis connection URL
OPENQUERY_BROWSER_HEADLESS | true | Run browser in headless mode
OPENQUERY_BROWSER_TIMEOUT | 30.0 | Browser operation timeout in seconds
OPENQUERY_RATE_LIMIT_DEFAULT_RPM | 10 | Default requests per minute per source
OPENQUERY_LOG_LEVEL | INFO | Logging level
TWO_CAPTCHA_API_KEY | (none) | 2captcha.com API key (paid fallback)
HF_TOKEN | (none) | HuggingFace token (free OCR + QA)
ANTHROPIC_API_KEY | (none) | Anthropic API key (paid QA fallback)
OPENAI_API_KEY | (none) | OpenAI API key (paid QA fallback)
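To make the prefix convention concrete, the sketch below reads a few of the `OPENQUERY_` variables and falls back to the defaults from the table. The `Settings` dataclass and `load_settings` function are illustrative names, not OpenQuery's own settings code, and the accepted spellings for boolean values are an assumption.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the table above.
    cache_backend: str = "memory"
    cache_ttl_default: int = 3600
    redis_url: str = "redis://localhost:6379/0"
    browser_headless: bool = True
    browser_timeout: float = 30.0
    rate_limit_default_rpm: int = 10
    log_level: str = "INFO"


def _parse(raw: str, default):
    """Convert the raw string to the type of the field's default value."""
    if isinstance(default, bool):  # check bool first: bool is a subclass of int
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return type(default)(raw)


def load_settings(env=None) -> Settings:
    """Build Settings from OPENQUERY_* variables, keeping defaults when unset."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(Settings):
        raw = env.get("OPENQUERY_" + field.name.upper())
        if raw is not None:
            overrides[field.name] = _parse(raw, field.default)
    return Settings(**overrides)
```

For example, running with `OPENQUERY_CACHE_BACKEND=redis` in the environment would yield `load_settings().cache_backend == "redis"` while every other field keeps its default.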

Adding a New Source

Create a new source by subclassing BaseSource and implementing its meta() and query() methods:

# src/openquery/sources/us/nhtsa.py
from pydantic import BaseModel
from openquery.sources import register
from openquery.sources.base import BaseSource, DocumentType, QueryInput, SourceMeta


class NhtsaResult(BaseModel):
    manufacturer: str = ""
    model: str = ""
    year: int = 0
    recalls: list[dict] = []


@register
class NhtsaSource(BaseSource):
    def meta(self) -> SourceMeta:
        return SourceMeta(
            name="us.nhtsa",
            display_name="NHTSA Vehicle Safety",
            description="US vehicle safety recalls and VIN decoding",
            country="US",
            url="https://vpic.nhtsa.dot.gov/api/",
            supported_inputs=[DocumentType.VIN],
            requires_captcha=False,
            requires_browser=False,
            rate_limit_rpm=30,
        )

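    def query(self, input: QueryInput) -> NhtsaResult:
        import httpx

        resp = httpx.get(
            f"https://vpic.nhtsa.dot.gov/api/vehicles/decodevin/{input.document_number}",
            params={"format": "json"},
            timeout=30.0,
        )
        resp.raise_for_status()
        data = resp.json()
        # vPIC answers with a "Results" list of {"Variable": ..., "Value": ...} rows.
        # The variable names used below are assumptions; verify them against a live response.
        rows = {row.get("Variable"): row.get("Value") for row in data.get("Results", [])}
        year = rows.get("Model Year") or ""
        return NhtsaResult(
            manufacturer=rows.get("Manufacturer Name") or "",
            model=rows.get("Model") or "",
            year=int(year) if year.isdigit() else 0,
        )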

The @register decorator automatically makes the source available in the CLI, API, and source listing.

Architecture

openquery/
├── core/
│   ├── browser.py    # Playwright browser management
│   ├── captcha.py    # Multi-engine CAPTCHA solvers (PaddleOCR, EasyOCR, Tesseract, voting)
│   ├── llm.py        # LLM QA chain (Ollama, HuggingFace, Anthropic, OpenAI)
│   ├── audit.py      # Evidence capture (screenshots, network logs, PDF reports)
│   ├── cache.py      # Caching backends (memory, Redis, SQLite)
│   └── rate_limit.py # Token-bucket rate limiting
├── sources/          # Data source plugins, organized by country
│   ├── base.py       # BaseSource ABC — implement this to add sources
│   ├── co/           # Colombia (SIMIT, RUNT, Procuraduria, Policia, ADRES)
│   └── us/           # United States (future)
├── models/           # Pydantic response models, organized by country
├── server/           # FastAPI REST API
└── commands/         # Typer CLI commands

Development

git clone https://github.com/dacrypt/openquery.git
cd openquery
uv sync --all-extras
playwright install chromium

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

See CONTRIBUTING.md for detailed guidelines.

Documentation

Guide | Description
Getting Started | Installation, first query, engine setup
Sources Guide | All 5 Colombian sources with field reference
CAPTCHA Guide | OCR engines, voting, LLM backends, benchmarks
Audit Guide | Evidence capture, PDF reports, compliance
API Guide | REST endpoints, authentication, deployment
Adding Sources | Step-by-step guide to create new source plugins
Changelog | Version history

License

MIT
