Query public data sources worldwide via scraping and APIs

OpenQuery

Query public data sources worldwide through a unified CLI and REST API.

OpenQuery provides a plugin-based framework for scraping government websites, public registries, and open data APIs. It handles the hard parts — browser automation, CAPTCHA solving, WAF bypass, caching, and rate limiting — so you can focus on the data.

Features

  • Unified interface — one CLI and one API endpoint for all data sources
  • Browser automation — Playwright-based scraping for JavaScript-heavy sites
  • Multi-engine CAPTCHA solving — PaddleOCR (100%), EasyOCR+Tesseract voting (90%), with cloud and paid fallbacks
  • LLM-powered knowledge CAPTCHAs — Ollama (local), HuggingFace, Anthropic, OpenAI fallback chain
  • Audit & evidence — screenshots, network logs, and PDF evidence reports for compliance
  • WAF bypass — browser-context API calls preserve session cookies
  • Caching — in-memory, Redis, or SQLite backends with configurable TTL
  • Rate limiting — per-source token-bucket to respect server limits
  • REST API — FastAPI server with auto-generated OpenAPI docs
  • Extensible — add new data sources by implementing a single class
  • Country-organized — sources grouped by country code (co, us, etc.)
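The per-source token-bucket limiting mentioned above can be sketched as follows. This is a minimal illustration of the general technique, not OpenQuery's implementation: the `TokenBucket` class, its `try_acquire` method, and the injectable `clock` parameter are hypothetical names chosen for the example.

```python
import time


class TokenBucket:
    """Hypothetical per-source limiter; not OpenQuery's actual class."""

    def __init__(self, rpm: int, clock=time.monotonic):
        self.capacity = float(rpm)   # maximum burst size
        self.rate = rpm / 60.0       # tokens refilled per second
        self.tokens = float(rpm)     # start with a full bucket
        self.clock = clock           # injectable so the bucket can be tested
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per source name, e.g. 10 requests per minute for co.simit.
buckets = {"co.simit": TokenBucket(rpm=10)}
print(buckets["co.simit"].try_acquire())
```

Keeping one bucket per source name lets a slow registry be throttled without holding back queries to the others.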

Built-in Sources

| Source | Country | Description | Inputs | CAPTCHA |
|---|---|---|---|---|
| co.simit | CO | Traffic fines and violations | cedula, placa | No |
| co.runt | CO | National vehicle registry (SOAT, RTM, ownership) | vin, placa, cedula | Yes (image OCR) |
| co.procuraduria | CO | Disciplinary records (antecedentes) | cedula | Yes (knowledge QA) |
| co.policia | CO | Criminal background (antecedentes penales) | cedula | No |
| co.adres | CO | Health system enrollment (EPS/regime) | cedula | No |

Installation

pip install openquery

Or with uv:

uv add openquery

System Dependencies

Playwright browsers are required for web scraping:

playwright install chromium

CAPTCHA Engines (pick one or more)

OpenQuery auto-detects installed OCR engines and builds an optimal solver chain:

| Engine | Accuracy | Speed | Install |
|---|---|---|---|
| PaddleOCR (recommended) | 100% | ~130ms | pip install "openquery[paddleocr]" |
| EasyOCR + Tesseract (voting) | 90% | ~500ms | pip install "openquery[easyocr]" + brew install tesseract |
| Tesseract alone | 80% | ~390ms | brew install tesseract (included by default) |
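One way the auto-detection described above could work is to check which OCR packages are importable and order the solvers by the accuracy figures listed. The sketch below is an assumption about the approach, not OpenQuery's code: the chain labels, the `detect_engines` and `build_chain` helpers, and the use of `pytesseract` as the Tesseract binding are all hypothetical.

```python
from importlib.util import find_spec

# (label, import name) pairs in the preference order listed above.
# Labels and the pytesseract binding are illustrative assumptions.
ENGINE_MODULES = [
    ("paddleocr", "paddleocr"),
    ("easyocr", "easyocr"),
    ("tesseract", "pytesseract"),
]


def detect_engines(is_installed=lambda module: find_spec(module) is not None):
    """Return the labels of OCR engines whose Python package can be imported."""
    return [label for label, module in ENGINE_MODULES if is_installed(module)]


def build_chain(available):
    """Order solvers: PaddleOCR first, then the voting pair, then Tesseract alone."""
    chain = []
    if "paddleocr" in available:
        chain.append("paddleocr")
    if "easyocr" in available and "tesseract" in available:
        chain.append("easyocr+tesseract-voting")
    if "tesseract" in available:
        chain.append("tesseract")
    return chain


print(build_chain(detect_engines()))
```

`find_spec` only checks that a package can be located, so detection stays cheap and does not load the OCR models themselves.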

For knowledge-based CAPTCHAs (Procuraduria), you need at least one LLM backend:

| Backend | Cost | Setup |
|---|---|---|
| Ollama (recommended) | Free | ollama pull llama3.2:1b |
| HuggingFace Inference | Free | Set HF_TOKEN env var |
| Anthropic | Paid | Set ANTHROPIC_API_KEY env var |
| OpenAI | Paid | Set OPENAI_API_KEY env var |
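A fallback chain over these backends might look like the sketch below. It assumes each backend can be wrapped as a plain callable that takes the CAPTCHA question and returns an answer string. The `available_backends` and `answer` helpers are hypothetical and the real backends are replaced by stubs, so nothing here reflects OpenQuery's internal API.

```python
import os
from typing import Callable, Mapping, Optional

# A backend answers a question or raises on failure (illustrative type).
Backend = Callable[[str], str]

# Env var each backend needs; Ollama runs locally and needs none.
REQUIRED_ENV = {
    "ollama": None,
    "huggingface": "HF_TOKEN",
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def available_backends(env: Mapping[str, str], backends: Mapping[str, Backend]):
    """Keep backends in the order listed above, dropping those without credentials."""
    return [
        (name, backends[name])
        for name, var in REQUIRED_ENV.items()
        if name in backends and (var is None or env.get(var))
    ]


def answer(question: str, chain) -> Optional[str]:
    """Try each backend in order and return the first non-empty answer."""
    for _name, backend in chain:
        try:
            result = backend(question).strip()
        except Exception:
            continue  # fall through to the next backend
        if result:
            return result
    return None


def offline(_question: str) -> str:  # stub standing in for an unreachable backend
    raise RuntimeError("backend offline")


def stub(_question: str) -> str:  # stub standing in for a working backend
    return "Bogota"


chain = available_backends(os.environ, {"ollama": offline, "openai": stub})
print(answer("What is the capital of Colombia?", chain))
```

Ordering free backends ahead of paid ones means a paid API is only called when everything before it has failed or is not configured.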

Optional Extras

pip install "openquery[paddleocr]"   # PaddleOCR — best CAPTCHA accuracy (100%)
pip install "openquery[easyocr]"     # EasyOCR — good accuracy (85%), combines with Tesseract for 90%
pip install "openquery[huggingface]" # HuggingFace Inference API (OCR + QA)
pip install "openquery[serve]"       # FastAPI server (fastapi, uvicorn)
pip install "openquery[redis]"       # Redis cache backend
pip install "openquery[captcha]"     # 2captcha paid CAPTCHA solving (last resort)

Quick Start

CLI

# List available data sources
openquery sources

# Query Colombian traffic fines by cedula
openquery query co.simit --cedula 12345678

# Query Colombian vehicle registry by plate
openquery query co.runt --placa ABC123

# Query by VIN
openquery query co.runt --vin 5YJ3E1EA1PF000001

# Disciplinary records
openquery query co.procuraduria --cedula 12345678

# Criminal background
openquery query co.policia --cedula 12345678

# Health system enrollment
openquery query co.adres --cedula 12345678

# Output raw JSON
openquery query co.simit --cedula 12345678 --json

# Generate audit evidence (screenshots + PDF report)
openquery query co.runt --placa ABC123 --audit --audit-dir ./evidence

REST API

# Start the API server
openquery serve

# Or with custom host/port
openquery serve --host 127.0.0.1 --port 3000

Then query via HTTP:

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "source": "co.simit",
    "document_type": "cedula",
    "document_number": "12345678"
  }'

Response:

{
  "ok": true,
  "source": "co.simit",
  "queried_at": "2026-03-31T10:30:00Z",
  "cached": false,
  "latency_ms": 4523,
  "data": {
    "comparendos": 0,
    "multas": 0,
    "total_deuda": 0.0,
    "paz_y_salvo": true
  }
}
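The same request can be issued from Python with only the standard library. This sketch assumes the server is reachable at `http://localhost:8000` as in the curl example. The helper names `build_query_request` and `run_query` are hypothetical, and treating `"ok": false` as an error is an assumption, since only the success shape is documented here.

```python
import json
import urllib.request


def build_query_request(source, document_type, document_number,
                        base_url="http://localhost:8000"):
    """Build the POST request for /api/v1/query with the JSON body shown above."""
    body = json.dumps({
        "source": source,
        "document_type": document_type,
        "document_number": document_number,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request, timeout=60.0):
    """Send the request and return the "data" payload of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    if not payload.get("ok"):
        # Assumption: failures are signalled with "ok": false.
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]


# With `openquery serve` running:
#   data = run_query(build_query_request("co.simit", "cedula", "12345678"))
#   print(data["total_deuda"], data["paz_y_salvo"])
```

The generous timeout reflects the sample latency above: a browser-backed query can take several seconds before the server replies.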

API Endpoints:

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/query | Query a data source |
| GET | /api/v1/sources | List available sources |
| GET | /api/v1/health | Health check and cache stats |
| GET | /docs | Interactive API documentation |

Docker

docker compose up

This starts the API server on port 8000, with Redis as the cache backend.

Configuration

All settings use environment variables with the OPENQUERY_ prefix:

| Variable | Default | Description |
|---|---|---|
| OPENQUERY_API_KEY | (none) | API key for server authentication |
| OPENQUERY_CACHE_BACKEND | memory | Cache backend: memory, redis, sqlite |
| OPENQUERY_CACHE_TTL_DEFAULT | 3600 | Default cache TTL in seconds |
| OPENQUERY_REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| OPENQUERY_BROWSER_HEADLESS | true | Run browser in headless mode |
| OPENQUERY_BROWSER_TIMEOUT | 30.0 | Browser operation timeout in seconds |
| OPENQUERY_RATE_LIMIT_DEFAULT_RPM | 10 | Default requests per minute per source |
| OPENQUERY_LOG_LEVEL | INFO | Logging level |
| TWO_CAPTCHA_API_KEY | (none) | 2captcha.com API key (paid fallback) |
| HF_TOKEN | (none) | HuggingFace token (free OCR + QA) |
| ANTHROPIC_API_KEY | (none) | Anthropic API key (paid QA fallback) |
| OPENAI_API_KEY | (none) | OpenAI API key (paid QA fallback) |
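To show how the prefixed variables and their defaults fit together, here is a small loader written against the values listed above. It is only a sketch: the `Settings` dataclass and `load_settings` function are hypothetical, the accepted spellings for boolean values are a guess, and OpenQuery's own settings handling may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Illustrative mirror of the OPENQUERY_ variables; not the real settings class."""
    api_key: Optional[str]
    cache_backend: str
    cache_ttl_default: int
    redis_url: str
    browser_headless: bool
    browser_timeout: float
    rate_limit_default_rpm: int
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    def get(name: str, default: str) -> str:
        return env.get(f"OPENQUERY_{name}", default)

    return Settings(
        api_key=env.get("OPENQUERY_API_KEY"),
        cache_backend=get("CACHE_BACKEND", "memory"),
        cache_ttl_default=int(get("CACHE_TTL_DEFAULT", "3600")),
        redis_url=get("REDIS_URL", "redis://localhost:6379/0"),
        # Assumption: "1", "true", and "yes" (any case) count as true.
        browser_headless=get("BROWSER_HEADLESS", "true").lower() in ("1", "true", "yes"),
        browser_timeout=float(get("BROWSER_TIMEOUT", "30.0")),
        rate_limit_default_rpm=int(get("RATE_LIMIT_DEFAULT_RPM", "10")),
        log_level=get("LOG_LEVEL", "INFO"),
    )


print(load_settings({"OPENQUERY_CACHE_BACKEND": "redis"}))
```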

Adding a New Source

Create a new source by subclassing BaseSource and implementing meta() and query():

# src/openquery/sources/us/nhtsa.py
from pydantic import BaseModel
from openquery.sources import register
from openquery.sources.base import BaseSource, DocumentType, QueryInput, SourceMeta


class NhtsaResult(BaseModel):
    manufacturer: str = ""
    model: str = ""
    year: int = 0
    recalls: list[dict] = []


@register
class NhtsaSource(BaseSource):
    def meta(self) -> SourceMeta:
        return SourceMeta(
            name="us.nhtsa",
            display_name="NHTSA Vehicle Safety",
            description="US vehicle safety recalls and VIN decoding",
            country="US",
            url="https://vpic.nhtsa.dot.gov/api/",
            supported_inputs=[DocumentType.VIN],
            requires_captcha=False,
            requires_browser=False,
            rate_limit_rpm=30,
        )

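    def query(self, input: QueryInput) -> NhtsaResult:
        import httpx

        resp = httpx.get(
            f"https://vpic.nhtsa.dot.gov/api/vehicles/decodevin/{input.document_number}",
            params={"format": "json"},
        )
        resp.raise_for_status()
        # Assumed vPIC response shape: "Results" is a list of
        # {"Variable": ..., "Value": ...} entries; verify against the live API.
        fields = {r["Variable"]: r["Value"] for r in resp.json().get("Results", [])}
        year = fields.get("Model Year") or ""
        return NhtsaResult(
            manufacturer=fields.get("Make") or "",
            model=fields.get("Model") or "",
            year=int(year) if year.isdigit() else 0,
        )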

The @register decorator automatically makes the source available in the CLI, API, and source listing.

Architecture

openquery/
├── core/
│   ├── browser.py    # Playwright browser management
│   ├── captcha.py    # Multi-engine CAPTCHA solvers (PaddleOCR, EasyOCR, Tesseract, voting)
│   ├── llm.py        # LLM QA chain (Ollama, HuggingFace, Anthropic, OpenAI)
│   ├── audit.py      # Evidence capture (screenshots, network logs, PDF reports)
│   ├── cache.py      # Caching backends (memory, Redis, SQLite)
│   └── rate_limit.py # Token-bucket rate limiting
├── sources/          # Data source plugins, organized by country
│   ├── base.py       # BaseSource ABC — implement this to add sources
│   ├── co/           # Colombia (SIMIT, RUNT, Procuraduria, Policia, ADRES)
│   └── us/           # United States (future)
├── models/           # Pydantic response models, organized by country
├── server/           # FastAPI REST API
└── commands/         # Typer CLI commands
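The in-memory cache backend listed under core/cache.py, with its configurable TTL, could be as simple as the sketch below. The `MemoryCache` class, its method names, and the cache key format are hypothetical; the Redis and SQLite backends are not shown.

```python
import time
from typing import Any, Optional


class MemoryCache:
    """Illustrative TTL cache; not OpenQuery's actual backend."""

    def __init__(self, default_ttl: float = 3600.0, clock=time.monotonic):
        self._default_ttl = default_ttl
        self._clock = clock  # injectable so expiry can be tested
        self._entries: dict = {}

    def set(self, key: str, value: Any, ttl: Optional[float] = None) -> None:
        """Store a value that expires after ttl seconds (default TTL if omitted)."""
        lifetime = self._default_ttl if ttl is None else ttl
        self._entries[key] = (self._clock() + lifetime, value)

    def get(self, key: str) -> Optional[Any]:
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # evict lazily on read
            return None
        return value


cache = MemoryCache(default_ttl=3600)
cache.set("co.simit:cedula:12345678", {"multas": 0})
print(cache.get("co.simit:cedula:12345678"))
```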

Development

git clone https://github.com/dacrypt/openquery.git
cd openquery
uv sync --all-extras
playwright install chromium

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

See CONTRIBUTING.md for detailed guidelines.

License

MIT
