Query public data sources worldwide via scraping and APIs

OpenQuery

Query public data sources worldwide through a unified CLI and REST API.

OpenQuery provides a plugin-based framework for scraping government websites, public registries, and open data APIs. It handles the hard parts — browser automation, CAPTCHA solving, WAF bypass, caching, and rate limiting — so you can focus on the data.

Features

  • Unified interface — one CLI and one API endpoint for all data sources
  • Browser automation — Playwright-based scraping for JavaScript-heavy sites
  • Multi-engine CAPTCHA solving — PaddleOCR (100%), EasyOCR+Tesseract voting (90%), with cloud and paid fallbacks
  • LLM-powered knowledge CAPTCHAs — Ollama (local), HuggingFace, Anthropic, OpenAI fallback chain
  • Audit & evidence — screenshots, network logs, and PDF evidence reports for compliance
  • WAF bypass — browser-context API calls preserve session cookies
  • Caching — in-memory, Redis, or SQLite backends with configurable TTL
  • Rate limiting — per-source token-bucket to respect server limits
  • REST API — FastAPI server with auto-generated OpenAPI docs
  • Extensible — add new data sources by implementing a single class
  • Country-organized — sources grouped by country code (co, us, etc.)
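The per-source token-bucket rate limiting mentioned above can be illustrated with a minimal sketch. This is not OpenQuery's implementation; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class TokenBucket:
    """Allow roughly `rpm` requests per minute, with bursts up to `capacity`."""

    def __init__(
        self,
        rpm: float,
        capacity: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate = rpm / 60.0  # tokens added per second
        self.capacity = capacity if capacity is not None else rpm
        self.tokens = self.capacity  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Hypothetical wiring: one bucket per source name.
buckets = {"co.simit": TokenBucket(rpm=10)}
```

A caller would check `buckets[source].try_acquire()` before each upstream request and wait or reject when it returns False.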

Built-in Sources

| Source | Country | Description | Inputs | Browser |
|---|---|---|---|---|
| co.simit | CO | Traffic fines and violations | cedula, placa | Yes |
| co.runt | CO | Vehicle registry (SOAT, RTM, ownership) | vin, placa, cedula | Yes |
| co.procuraduria | CO | Disciplinary records | cedula | Yes |
| co.policia | CO | Criminal background | cedula | Yes |
| co.adres | CO | Health system enrollment (EPS) | cedula | Yes |
| co.pico_y_placa | CO | Driving restrictions (Bogota/Medellin/Cali) | placa | No |
| co.peajes | CO | Toll road tariffs (ANI) | custom | No |
| co.combustible | CO | Fuel prices by city/station | custom | No |
| co.estaciones_ev | CO | EV charging stations | custom | No |
| co.siniestralidad | CO | Road crash hotspots (ANSV) | custom | No |
| co.vehiculos | CO | National vehicle fleet data | placa, custom | No |
| co.fasecolda | CO | Vehicle reference prices (insurance) | custom | Yes |
| co.recalls | CO | Vehicle safety recalls (SIC) | custom | Yes |

Installation

pip install openquery

Or with uv:

uv add openquery

System Dependencies

Playwright browsers are required for web scraping:

playwright install chromium

CAPTCHA Engines (pick one or more)

OpenQuery auto-detects installed OCR engines and builds an optimal solver chain:

| Engine | Accuracy | Speed | Install |
|---|---|---|---|
| PaddleOCR (recommended) | 100% | ~130ms | pip install "openquery[paddleocr]" |
| EasyOCR + Tesseract (voting) | 90% | ~500ms | pip install "openquery[easyocr]" + brew install tesseract |
| Tesseract alone | 80% | ~390ms | brew install tesseract (included by default) |
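The "voting" row can be pictured as a majority vote over the answers returned by several engines. The sketch below is one plausible reading, not the project's actual algorithm; the `Solver` type and the `vote` function are hypothetical names, and real engines would be wrapped behind the same callable shape.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical solver shape: image bytes in, recognized text out.
Solver = Callable[[bytes], str]


def vote(image: bytes, solvers: Sequence[Solver]) -> Optional[str]:
    """Return the most common non-empty answer, or None if nothing was read."""
    answers = []
    for solve in solvers:
        try:
            text = solve(image).strip()
        except Exception:
            continue  # a failing engine simply does not vote
        if text:
            answers.append(text)
    if not answers:
        return None
    # On ties, most_common keeps first-seen order, so list the most
    # trusted engine first.
    return Counter(answers).most_common(1)[0][0]
```

Listing the stronger engine first means it wins when two engines disagree and no third opinion breaks the tie.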

For knowledge-based CAPTCHAs (Procuraduria), you need at least one LLM backend:

| Backend | Cost | Setup |
|---|---|---|
| Ollama (recommended) | Free | ollama pull llama3.2:1b |
| HuggingFace Inference | Free | Set HF_TOKEN env var |
| Anthropic | Paid | Set ANTHROPIC_API_KEY env var |
| OpenAI | Paid | Set OPENAI_API_KEY env var |
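A fallback chain over these backends could look roughly like the sketch below. The environment variable names come from the table; everything else (`BACKEND_ENV`, `configured_backends`, `answer`, and the `askers` mapping of callables) is hypothetical and only illustrates the "try the next backend on failure" idea.

```python
import os
from typing import Callable, List, Mapping, Optional

# Order follows the table above. Ollama needs no key, so it is always tried.
BACKEND_ENV = (
    ("ollama", None),
    ("huggingface", "HF_TOKEN"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
)


def configured_backends(env: Mapping[str, str] = os.environ) -> List[str]:
    """Names of backends that need no key or whose key is set and non-empty."""
    return [name for name, var in BACKEND_ENV if var is None or env.get(var)]


def answer(
    question: str,
    askers: Mapping[str, Callable[[str], str]],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Try each configured backend in order; return the first non-empty reply."""
    for name in configured_backends(env):
        ask = askers.get(name)
        if ask is None:
            continue
        try:
            reply = ask(question).strip()
        except Exception:
            continue  # fall through to the next backend
        if reply:
            return reply
    return None
```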

Optional Extras

pip install "openquery[paddleocr]"   # PaddleOCR — best CAPTCHA accuracy (100%)
pip install "openquery[easyocr]"     # EasyOCR — good accuracy (85%), combines with Tesseract for 90%
pip install "openquery[huggingface]" # HuggingFace Inference API (OCR + QA)
pip install "openquery[serve]"       # FastAPI server (fastapi, uvicorn)
pip install "openquery[redis]"       # Redis cache backend
pip install "openquery[captcha]"     # 2captcha paid CAPTCHA solving (last resort)

Quick Start

CLI

# List available data sources
openquery sources

# Query Colombian traffic fines by cedula
openquery query co.simit --cedula 12345678

# Query Colombian vehicle registry by plate
openquery query co.runt --placa ABC123

# Query by VIN
openquery query co.runt --vin 5YJ3E1EA1PF000001

# Disciplinary records
openquery query co.procuraduria --cedula 12345678

# Criminal background
openquery query co.policia --cedula 12345678

# Health system enrollment
openquery query co.adres --cedula 12345678

# Pico y placa — is my plate restricted today?
openquery query co.pico_y_placa --placa ABC123

# Toll tariffs
openquery query co.peajes --custom peaje --extra '{"peaje": "ALVARADO"}'

# Fuel prices in Bogota
openquery query co.combustible --custom fuel --extra '{"municipio": "BOGOTA"}'

# EV charging stations in Medellin
openquery query co.estaciones_ev --custom ev --extra '{"ciudad": "Medellin"}'

# Road crash hotspots
openquery query co.siniestralidad --custom stats --extra '{"departamento": "CUNDINAMARCA"}'

# Vehicle fleet lookup by plate
openquery query co.vehiculos --placa ABC123

# Output raw JSON
openquery query co.simit --cedula 12345678 --json

# Generate audit evidence (screenshots + PDF report)
openquery query co.runt --placa ABC123 --audit --audit-dir ./evidence
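The same commands can be driven from a script. The sketch below shells out to the CLI with the `--json` flag shown above and parses the result. It assumes that `--json` prints a single JSON document to stdout, which is not confirmed here; `build_command` and `run_query` are illustrative helper names.

```python
import json
import subprocess
from typing import Any, List


def build_command(source: str, **inputs: str) -> List[str]:
    """Build an `openquery query <source> --<flag> <value> --json` argv."""
    cmd = ["openquery", "query", source]
    for flag, value in inputs.items():
        cmd += [f"--{flag}", str(value)]
    return cmd + ["--json"]


def run_query(source: str, **inputs: str) -> Any:
    """Run the CLI and parse stdout as JSON (assumes stdout is pure JSON)."""
    proc = subprocess.run(
        build_command(source, **inputs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)
```

For example, `run_query("co.simit", cedula="12345678")` corresponds to the `--json` command above.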

REST API

# Start the API server
openquery serve

# Or with custom host/port
openquery serve --host 127.0.0.1 --port 3000

Then query via HTTP:

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "source": "co.simit",
    "document_type": "cedula",
    "document_number": "12345678"
  }'

Response:

{
  "ok": true,
  "source": "co.simit",
  "queried_at": "2026-03-31T10:30:00Z",
  "cached": false,
  "latency_ms": 4523,
  "data": {
    "comparendos": 0,
    "multas": 0,
    "total_deuda": 0.0,
    "paz_y_salvo": true
  }
}
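The curl call translates directly to Python. This sketch uses only the standard library and mirrors the request and response fields shown above. The helper names are illustrative, and treating `"ok": false` as the failure signal is an assumption. Authentication (when OPENQUERY_API_KEY is set on the server) is not covered because the header name is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the curl example


def build_request(source, document_type, document_number, base_url=BASE_URL):
    """Create the POST request for /api/v1/query without sending it."""
    payload = {
        "source": source,
        "document_type": document_type,
        "document_number": document_number,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query(source, document_type, document_number, timeout=60.0):
    """Send the query and return the `data` object from the response."""
    req = build_request(source, document_type, document_number)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    if not body.get("ok"):  # assumed failure convention
        raise RuntimeError(f"query failed: {body}")
    return body["data"]
```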

API Endpoints:

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/query | Query a data source |
| GET | /api/v1/sources | List available sources |
| GET | /api/v1/health | Health check and cache stats |
| GET | /docs | Interactive API documentation |

Docker

docker compose up

This starts the API server with Redis caching on port 8000.

Configuration

Settings are read from environment variables. OpenQuery's own settings use the OPENQUERY_ prefix; third-party credentials keep their conventional names:

| Variable | Default | Description |
|---|---|---|
| OPENQUERY_API_KEY | (none) | API key for server authentication |
| OPENQUERY_CACHE_BACKEND | memory | Cache backend: memory, redis, sqlite |
| OPENQUERY_CACHE_TTL_DEFAULT | 3600 | Default cache TTL in seconds |
| OPENQUERY_REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| OPENQUERY_BROWSER_HEADLESS | true | Run browser in headless mode |
| OPENQUERY_BROWSER_TIMEOUT | 30.0 | Browser operation timeout in seconds |
| OPENQUERY_RATE_LIMIT_DEFAULT_RPM | 10 | Default requests per minute per source |
| OPENQUERY_LOG_LEVEL | INFO | Logging level |
| TWO_CAPTCHA_API_KEY | (none) | 2captcha.com API key (paid fallback) |
| HF_TOKEN | (none) | HuggingFace token (free OCR + QA) |
| ANTHROPIC_API_KEY | (none) | Anthropic API key (paid QA fallback) |
| OPENAI_API_KEY | (none) | OpenAI API key (paid QA fallback) |
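To make the defaults and types concrete, here is a sketch of how the OPENQUERY_ variables could be parsed into a typed object. The variable names and defaults come from the table; the `Settings` dataclass, `from_env`, and the accepted boolean spellings are assumptions, and the project may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Accept common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    api_key: Optional[str] = None
    cache_backend: str = "memory"
    cache_ttl_default: int = 3600
    redis_url: str = "redis://localhost:6379/0"
    browser_headless: bool = True
    browser_timeout: float = 30.0
    rate_limit_default_rpm: int = 10
    log_level: str = "INFO"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        def get(name: str, default: str) -> str:
            return env.get(f"OPENQUERY_{name}", default)

        backend = get("CACHE_BACKEND", "memory")
        if backend not in {"memory", "redis", "sqlite"}:
            raise ValueError(f"unknown cache backend: {backend!r}")
        return cls(
            api_key=env.get("OPENQUERY_API_KEY"),
            cache_backend=backend,
            cache_ttl_default=int(get("CACHE_TTL_DEFAULT", "3600")),
            redis_url=get("REDIS_URL", "redis://localhost:6379/0"),
            browser_headless=_as_bool(get("BROWSER_HEADLESS", "true")),
            browser_timeout=float(get("BROWSER_TIMEOUT", "30.0")),
            rate_limit_default_rpm=int(get("RATE_LIMIT_DEFAULT_RPM", "10")),
            log_level=get("LOG_LEVEL", "INFO"),
        )
```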

Adding a New Source

Create a new source by subclassing BaseSource and implementing its meta() and query() methods:

# src/openquery/sources/us/nhtsa.py
from pydantic import BaseModel
from openquery.sources import register
from openquery.sources.base import BaseSource, DocumentType, QueryInput, SourceMeta


class NhtsaResult(BaseModel):
    manufacturer: str = ""
    model: str = ""
    year: int = 0
    recalls: list[dict] = []


@register
class NhtsaSource(BaseSource):
    def meta(self) -> SourceMeta:
        return SourceMeta(
            name="us.nhtsa",
            display_name="NHTSA Vehicle Safety",
            description="US vehicle safety recalls and VIN decoding",
            country="US",
            url="https://vpic.nhtsa.dot.gov/api/",
            supported_inputs=[DocumentType.VIN],
            requires_captcha=False,
            requires_browser=False,
            rate_limit_rpm=30,
        )

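    def query(self, input: QueryInput) -> NhtsaResult:
        import httpx

        resp = httpx.get(
            f"https://vpic.nhtsa.dot.gov/api/vehicles/decodevin/{input.document_number}",
            params={"format": "json"},
            timeout=30.0,
        )
        resp.raise_for_status()  # surface HTTP errors instead of parsing an error page
        data = resp.json()
        # Assumption: vPIC returns a "Results" list of {"Variable": ..., "Value": ...}
        # entries; the variable names below should be checked against a live response.
        fields = {r.get("Variable"): r.get("Value") for r in data.get("Results", [])}
        year = str(fields.get("Model Year") or "")
        return NhtsaResult(
            manufacturer=fields.get("Manufacturer Name") or "",
            model=fields.get("Model") or "",
            year=int(year) if year.isdigit() else 0,
        )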

The @register decorator automatically makes the source available in the CLI, API, and source listing.

Architecture

openquery/
├── core/
│   ├── browser.py    # Playwright browser management
│   ├── captcha.py    # Multi-engine CAPTCHA solvers (PaddleOCR, EasyOCR, Tesseract, voting)
│   ├── llm.py        # LLM QA chain (Ollama, HuggingFace, Anthropic, OpenAI)
│   ├── audit.py      # Evidence capture (screenshots, network logs, PDF reports)
│   ├── cache.py      # Caching backends (memory, Redis, SQLite)
│   └── rate_limit.py # Token-bucket rate limiting
├── sources/          # Data source plugins, organized by country
│   ├── base.py       # BaseSource ABC — implement this to add sources
│   ├── co/           # Colombia (SIMIT, RUNT, Procuraduria, Policia, ADRES)
│   └── us/           # United States (future)
├── models/           # Pydantic response models, organized by country
├── server/           # FastAPI REST API
└── commands/         # Typer CLI commands
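As an illustration of what the memory backend in cache.py has to do, here is a small in-memory cache with per-entry TTL. The class, the `cache_key` format, and the injectable `clock` are assumptions for the sketch; the real backends and key scheme may differ.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class MemoryCache:
    """In-memory cache where each entry expires after its TTL (in seconds)."""

    def __init__(
        self,
        default_ttl: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.default_ttl = default_ttl
        self.clock = clock
        self._items: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any, ttl: Optional[float] = None) -> None:
        expires = self.clock() + (self.default_ttl if ttl is None else ttl)
        self._items[key] = (expires, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires, value = entry
        if self.clock() >= expires:
            del self._items[key]  # drop expired entries lazily on read
            return None
        return value


def cache_key(source: str, document_type: str, document_number: str) -> str:
    """Hypothetical key scheme: one entry per source and input."""
    return f"{source}:{document_type}:{document_number}"
```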

Development

git clone https://github.com/dacrypt/openquery.git
cd openquery
uv sync --all-extras
playwright install chromium

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

See CONTRIBUTING.md for detailed guidelines.

Documentation

| Guide | Description |
|---|---|
| Getting Started | Installation, first query, engine setup |
| Sources Guide | All 5 Colombian sources with field reference |
| CAPTCHA Guide | OCR engines, voting, LLM backends, benchmarks |
| Audit Guide | Evidence capture, PDF reports, compliance |
| API Guide | REST endpoints, authentication, deployment |
| Adding Sources | Step-by-step guide to create new source plugins |
| Changelog | Version history |

License

MIT
