Query public data sources worldwide via scraping and APIs

OpenQuery

Query public data sources worldwide through a unified CLI and REST API.

OpenQuery provides a plugin-based framework for scraping government websites, public registries, and open data APIs. It handles the hard parts — browser automation, CAPTCHA solving, WAF bypass, caching, and rate limiting — so you can focus on the data.

Current rollout note: wave 1 covers the Americas first. The machine-readable Americas inventory lives in docs/americas-source-inventory.json, and docs/sources.md derives its public inventory counts from that snapshot. Callable INTL (international) runtime connectors remain outside the Americas completeness contract for this wave. The repo also intentionally defers a new type-check gate to a later wave.

Features

  • Unified interface — one CLI and one API endpoint for all data sources
  • Browser automation — Playwright-based scraping for JavaScript-heavy sites
  • Multi-engine CAPTCHA solving — PaddleOCR (100%), EasyOCR+Tesseract voting (90%), with cloud and paid fallbacks
  • LLM-powered knowledge CAPTCHAs — Ollama (local), HuggingFace, Anthropic, OpenAI fallback chain
  • Audit & evidence — screenshots, network logs, and PDF evidence reports for compliance
  • WAF bypass — browser-context API calls preserve session cookies
  • Caching — in-memory, Redis, or SQLite backends with configurable TTL
  • Rate limiting — per-source token-bucket to respect server limits
  • REST API — FastAPI server with auto-generated OpenAPI docs
  • Document OCR — extract structured data from ID documents (cedula, INE, DNI, carnet, passport)
  • Face verification — 1:1 face comparison with liveness detection (DeepFace/ArcFace)
  • Health monitoring — per-source circuit breaker with automatic failover
  • Dashboard — web UI for source browsing, querying, and health monitoring
  • Extensible — add new data sources by implementing a single class
  • Country-organized — sources grouped by country code (co, us, etc.)
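
The per-source circuit breaker listed above could behave roughly as follows. This is a minimal sketch and not OpenQuery's implementation: the class name `CircuitBreaker`, the failure threshold, the cool-down period, and the injectable clock are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Hypothetical per-source breaker that opens after repeated failures."""

    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a request to the source may be attempted."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, trial requests are allowed again.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        """Close the circuit and forget earlier failures."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each query and report the outcome afterwards, so a source that keeps failing is skipped until its cool-down expires.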

Built-in Sources — 300 sources across 21 country and region namespaces

For current Americas rollout status, treat docs/sources.md and docs/americas-source-inventory.json as the source of truth. The public inventory counts there are derived from the Americas snapshot. docs/test_results.md is a live accountability report and can include callable runtime connectors that are outside the Americas inventory contract (for example, INTL).

Colombia

Background Checks and Justice

Source | Description | Inputs | Browser
co.policia | Criminal background (Policía Nacional) | cedula | Yes
co.procuraduria | Disciplinary records (Procuraduría) | cedula | Yes
co.contraloria | Fiscal responsibility (Contraloría) | cedula, nit, pasaporte | Yes
co.rnmc | Police corrective measures (RNMC) | cedula, pasaporte | Yes
co.consulta_procesos | Judicial processes (Rama Judicial) | cedula, nit | Yes
co.tutelas | Constitutional protection actions (Tutelas) | cedula, nit | Yes
co.jep | Transitional justice (JEP) | cedula | Yes
co.inpec | Prison population (INPEC) [Deprecated] | cedula | Yes

Identity and Civil Registry

Source | Description | Inputs | Browser
co.estado_cedula | Cédula status (Registraduría) [Deprecated] | cedula | Yes
co.estado_tramite_cedula | ID card processing status [Deprecated] | cedula | Yes
co.defuncion | Cédula validity (alive/deceased) [Deprecated] | cedula | Yes
co.puesto_votacion | Voting station lookup [Deprecated] | cedula | Yes
co.registro_civil | Civil registry certificate [Deprecated] | cedula | Yes
co.nombre_completo | Full name lookup by document [Deprecated] | cedula | Yes
co.libreta_militar | Military service status [Deprecated] | cedula | Yes
co.migracion_ppt | PPT temporary protection permit | custom | Yes
co.estado_cedula_extranjeria | Foreign ID card status (Migración) [Deprecated] | custom | Yes
co.validar_policia | Police officer validation | custom | Yes

Compliance and AML

Source | Description | Inputs | Browser
co.pep | Politically Exposed Persons (SIGEP) | cedula | No
co.proveedores_ficticios | DIAN fictitious providers | nit | No
co.rne | Do Not Call registry (RNE/CRC) | custom | No

Social Security

Source | Description | Inputs | Browser
co.adres | Health system enrollment (EPS/BDUA) [Deprecated] | cedula | Yes
co.colpensiones | Pension affiliation (Colpensiones) [Deprecated] | cedula | Yes
co.fopep | Pensioners payroll (FOPEP) | cedula | Yes
co.ruaf | Unified affiliates registry (SISPRO) [Deprecated] | cedula | Yes
co.rethus | Health workforce registry (RETHUS) [Deprecated] | cedula | Yes
co.soi | Social security payments (SOI/PILA) | cedula, nit | Yes
co.seguridad_social | Integrated social security status [Deprecated] | cedula, nit | Yes
co.afiliados_compensado | Compensation fund affiliation | cedula | Yes
co.sisben | Socioeconomic classification (SISBEN) | cedula | Yes

Business and Commerce

Source | Description | Inputs | Browser
co.dian_rut | Tax registry status (DIAN RUT) | cedula, nit | Yes
co.rues | Business registry (RUES/Confecámaras) | cedula, nit | Yes
co.secop | Public procurement (SECOP) | nit | No
co.cufe_dian | Electronic invoice verification (CUFE) | custom | Yes
co.einforma | Business intelligence (eInforma) [Deprecated] | nit | Yes
co.camara_comercio_medellin | Medellín Chamber of Commerce | nit, custom | Yes
co.directorio_empresas | Business directory (datos.gov.co) | nit, custom | No
co.empresas_google | Business search (Google Maps) | custom | Yes
co.supersociedades | Insolvency proceedings (Ley 1116) | nit, cedula, custom | Yes

Property and Real Estate

Source | Description | Inputs | Browser
co.snr | Property owner index (SNR) | cedula, nit | Yes
co.certificado_tradicion | Property title certificate (SNR) [Deprecated] | custom | Yes
co.garantias_mobiliarias | Movable collateral registry | cedula | Yes
co.cambio_estrato | Socioeconomic stratum certification | cedula | Yes

Vehicles and Traffic

Source | Description | Inputs | Browser
co.simit | Traffic fines and violations (SIMIT) | cedula, placa | Yes
co.runt | Vehicle registry (RUNT) | vin, placa, cedula | Yes
co.runt_conductor | Driver information (RUNT) | cedula | Yes
co.runt_soat | Mandatory insurance status (SOAT) | placa | Yes
co.runt_rtm | Technical inspection status (RTM) | placa | Yes
co.comparendos_transito | Detailed traffic violations | cedula, placa | Yes
co.fasecolda | Vehicle reference prices (insurance) | custom | Yes
co.recalls | Vehicle safety recalls (SIC) | custom | Yes
co.retencion_vehiculos | Impounded vehicles | placa | Yes
co.pico_y_placa | Driving restrictions (13 cities) | placa | No
co.peajes | Toll road tariffs | custom | No
co.combustible | Fuel prices by city/station | custom | No
co.estaciones_ev | EV charging stations | custom | No
co.siniestralidad | Road crash hotspots (ANSV) | custom | No
co.vehiculos | National vehicle fleet data | placa, custom | No

Housing and Utilities

Source | Description | Inputs | Browser
co.mi_casa_ya | Housing subsidies (Mi Casa Ya) [Deprecated] | cedula | Yes
co.tarifas_energia | Electricity tariffs (SUI) | custom | No

Tourism

Source | Description | Inputs | Browser
co.rnt_turismo | National tourism registry (RNT) | nit | No

Health

Source | Description | Inputs | Browser
co.licencias_salud | Health service providers (REPS) | nit | No

Professional Councils

Source | Description | Inputs | Browser
co.copnia | Engineering (COPNIA) | cedula, nit | Yes
co.conaltel | Electrical technology (CONALTEL) | cedula | Yes
co.consejo_mecanica | Mechanical/Electronic engineering | cedula | Yes
co.cpae | Business administration (CPAE) | cedula | Yes
co.cpip | Petroleum engineering (CPIP) | cedula | Yes
co.cpiq | Chemical engineering (CPIQ) | cedula | Yes
co.cpnaa | Architecture (CPNAA) | cedula, pasaporte | Yes
co.cpnt | Topography (CPNT) | cedula | Yes
co.cpbiol | Biology (CPBiol) | cedula | Yes
co.veterinario | Veterinary medicine (COMVEZCOL) | cedula | Yes
co.urna | Law professionals (CSJ) | cedula, nit | Yes

United States

Source | Description | Inputs | Browser
us.ofac | OFAC SDN sanctions list (US Treasury) | cedula, nit, pasaporte, custom | No
us.nhtsa_vin | VIN decode (NHTSA vPIC) | vin | No
us.nhtsa_recalls | Vehicle safety recalls (NHTSA) | custom | No
us.nhtsa_complaints | Vehicle safety complaints (NHTSA) | custom | No
us.epa_fuel_economy | EPA fuel economy ratings | custom | No

Ecuador

Source | Description | Inputs | Browser
ec.sri_ruc | Tax registry RUC (SRI) | custom | No
ec.ant_citaciones | Traffic fines (ANT) | cedula, placa, custom | No
ec.cne_padron | Voter registry (CNE) [Deprecated] | cedula | Yes
ec.funcion_judicial | Judicial processes (Función Judicial) | cedula, custom | Yes
ec.supercias | Company registry (Superintendencia) | custom | Yes
ec.senescyt | Professional degrees (SENESCYT) | cedula, custom | Yes

Peru

Source | Description | Inputs | Browser
pe.sunat_ruc | Tax registry RUC (SUNAT) | custom | Yes
pe.poder_judicial | Judicial case search (CEJ) | custom | Yes
pe.osce_sancionados | Sanctioned gov contractors (OSCE) | custom | Yes
pe.sunarp_vehicular | Vehicle registry (SUNARP) | placa | Yes
pe.servir_sanciones | Public servant sanctions (SERVIR) [Deprecated] | custom | Yes

Chile

Source | Description | Inputs | Browser
cl.sii_rut | Tax registry RUT (SII) | custom | Yes
cl.pjud | Judicial case search (PJUD) [Deprecated] | custom | Yes
cl.fiscalizacion | Traffic infractions [Deprecated] | placa | Yes
cl.superir | Insolvency/bankruptcy (Superir) | custom | Yes

Mexico

Source | Description | Inputs | Browser
mx.curp | Population registry CURP (RENAPO) | custom | Yes
mx.sat_efos | SAT blacklist EFOS/EDOS | custom | Yes
mx.siem | Business directory SIEM [Deprecated] | custom | Yes
mx.repuve | Stolen vehicle check (REPUVE) [Deprecated] | placa, vin | Yes

Argentina

Source | Description | Inputs | Browser
ar.afip_cuit | Tax registry CUIT/CUIL (AFIP) | custom | Yes
ar.pjn | Federal judiciary cases (PJN) | custom | Yes
ar.dnrpa | Vehicle registration (DNRPA) | placa | Yes

Brazil (1 source)

Source | Description | Inputs | Browser
br.cnpj | Business registry CNPJ (BrasilAPI) | nit, custom | No

Costa Rica (1 source)

Source | Description | Inputs | Browser
cr.cedula | Voter registry cédula (TSE) | cedula, custom | Yes

Dominican Republic (1 source)

Source | Description | Inputs | Browser
do.rnc | Tax registry RNC (DGII) | cedula, nit, custom | Yes

Paraguay (1 source)

Source | Description | Inputs | Browser
py.ruc | Tax registry RUC (SET/DNIT) | custom | Yes

Guatemala (1 source)

Source | Description | Inputs | Browser
gt.nit | Tax registry NIT (SAT) | nit, custom | Yes

Honduras (1 source)

Source | Description | Inputs | Browser
hn.rtn | Tax registry RTN (SAR) | custom | Yes

International

Source | Description | Inputs | Browser
intl.onu | UN Security Council sanctions list | cedula, nit, pasaporte, custom | No
intl.ship_tracking | Global vessel position tracking | custom | No

Installation

pip install openquery

Or with uv:

uv add openquery

System Dependencies

Playwright browsers are required for web scraping:

playwright install chromium

CAPTCHA Engines (pick one or more)

OpenQuery auto-detects installed OCR engines and builds an optimal solver chain:

Engine | Accuracy | Speed | Install
PaddleOCR (recommended) | 100% | ~130 ms | pip install "openquery[paddleocr]"
EasyOCR + Tesseract (voting) | 90% | ~500 ms | pip install "openquery[easyocr]" and brew install tesseract
Tesseract alone | 80% | ~390 ms | brew install tesseract (included by default)
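
One way to picture the auto-detection step is sketched below. It is an illustration under stated assumptions, not OpenQuery's code: the function names are hypothetical, the chain labels are made up, and it assumes the engines are detected through their usual Python import names (`paddleocr`, `easyocr`, `pytesseract`). The ordering simply follows the accuracy column in the table above.

```python
from importlib.util import find_spec


def detect_engines(is_installed=lambda name: find_spec(name) is not None):
    """Return the OCR engines whose Python packages are importable."""
    # Assumption: engine name -> import name of its Python package.
    candidates = {
        "paddleocr": "paddleocr",
        "easyocr": "easyocr",
        "tesseract": "pytesseract",
    }
    return {engine for engine, module in candidates.items() if is_installed(module)}


def build_solver_chain(engines):
    """Order the available solvers from highest to lowest listed accuracy."""
    chain = []
    if "paddleocr" in engines:
        chain.append("paddleocr")
    if {"easyocr", "tesseract"} <= engines:
        # Both engines present: combine them by voting.
        chain.append("easyocr+tesseract-voting")
    elif "easyocr" in engines:
        chain.append("easyocr")
    if "tesseract" in engines:
        chain.append("tesseract")
    return chain
```

Calling `build_solver_chain(detect_engines())` yields the chain for the current environment. Note that an importable `pytesseract` does not guarantee the `tesseract` binary itself is installed.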

For knowledge-based CAPTCHAs (Procuraduria), you need at least one LLM backend:

Backend | Cost | Setup
Ollama (recommended) | Free | ollama pull llama3.2:1b
HuggingFace Inference | Free | Set HF_TOKEN env var
Anthropic | Paid | Set ANTHROPIC_API_KEY env var
OpenAI | Paid | Set OPENAI_API_KEY env var
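
The fallback order in the table can be sketched as a small selection function. This is a hedged illustration only: `llm_fallback_chain` is a hypothetical name, and treating an `ollama` binary on PATH as proof that Ollama is usable is an assumption. The environment variable names come from the table above.

```python
import os
import shutil


def llm_fallback_chain(env=None, has_ollama=None):
    """List usable LLM backends in the order shown in the table above."""
    env = os.environ if env is None else env
    if has_ollama is None:
        # Assumption: a local `ollama` binary on PATH means Ollama is usable.
        has_ollama = shutil.which("ollama") is not None
    chain = []
    if has_ollama:
        chain.append("ollama")
    for backend, var in (
        ("huggingface", "HF_TOKEN"),
        ("anthropic", "ANTHROPIC_API_KEY"),
        ("openai", "OPENAI_API_KEY"),
    ):
        # Empty values are treated the same as unset variables.
        if env.get(var):
            chain.append(backend)
    return chain
```

An empty result means no backend is configured, in which case knowledge-based CAPTCHAs cannot be answered.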

Optional Extras

pip install "openquery[paddleocr]"   # PaddleOCR — best CAPTCHA accuracy (100%)
pip install "openquery[easyocr]"     # EasyOCR — good accuracy (85%), combines with Tesseract for 90%
pip install "openquery[huggingface]" # HuggingFace Inference API (OCR + QA)
pip install "openquery[serve]"       # FastAPI server + dashboard (fastapi, uvicorn)
pip install "openquery[redis]"       # Redis cache backend
pip install "openquery[captcha]"     # 2captcha paid CAPTCHA solving (last resort)
pip install "openquery[deepface]"    # Face verification (DeepFace + ArcFace)
pip install "openquery[passport]"    # Passport MRZ reading (passporteye)

Quick Start

CLI

# List available data sources
openquery sources

# Query Colombian traffic fines by cedula
openquery query co.simit --cedula 12345678

# Query Colombian vehicle registry by plate
openquery query co.runt --placa ABC123

# Query by VIN
openquery query co.runt --vin 5YJ3E1EA1PF000001

# Disciplinary records
openquery query co.procuraduria --cedula 12345678

# Criminal background
openquery query co.policia --cedula 12345678

# Health system enrollment
openquery query co.adres --cedula 12345678

# Pico y placa — is my plate restricted today?
openquery query co.pico_y_placa --placa ABC123

# Toll tariffs
openquery query co.peajes --custom peaje --extra '{"peaje": "ALVARADO"}'

# Fuel prices in Bogota
openquery query co.combustible --custom fuel --extra '{"municipio": "BOGOTA"}'

# EV charging stations in Medellin
openquery query co.estaciones_ev --custom ev --extra '{"ciudad": "Medellin"}'

# Road crash hotspots
openquery query co.siniestralidad --custom stats --extra '{"departamento": "CUNDINAMARCA"}'

# Vehicle fleet lookup by plate
openquery query co.vehiculos --placa ABC123

# Output raw JSON
openquery query co.simit --cedula 12345678 --json

# Generate audit evidence (screenshots + PDF report)
openquery query co.runt --placa ABC123 --audit --audit-dir ./evidence

# Source health status
openquery health

# Extract data from ID document photo
openquery ocr --type co.cedula cedula_photo.jpg

# Face verification (compare ID photo vs selfie)
openquery face-verify id_photo.jpg selfie.jpg
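
The commands above can also be driven from a script. The sketch below builds the same command lines programmatically. It assumes that `--json` prints a single JSON document on stdout and that it can be combined with the other flags; neither is confirmed here. The helper names `build_query_argv` and `run_query` are hypothetical.

```python
import json
import subprocess


def build_query_argv(source, *, extra=None, as_json=True, **inputs):
    """Build an `openquery query` command line from keyword inputs."""
    argv = ["openquery", "query", source]
    for flag, value in inputs.items():
        # e.g. cedula="12345678" becomes: --cedula 12345678
        argv += [f"--{flag}", str(value)]
    if extra is not None:
        argv += ["--extra", json.dumps(extra)]
    if as_json:
        argv.append("--json")
    return argv


def run_query(source, **kwargs):
    """Run the CLI and decode its output (needs `openquery` on PATH)."""
    proc = subprocess.run(
        build_query_argv(source, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: with --json, stdout holds one JSON document.
    return json.loads(proc.stdout)
```

For example, `build_query_argv("co.peajes", custom="peaje", extra={"peaje": "ALVARADO"})` reproduces the toll tariff command with `--json` appended.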

REST API

# Start the API server
openquery serve

# Or with custom host/port
openquery serve --host 127.0.0.1 --port 3000

Then query via HTTP:

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "source": "co.simit",
    "document_type": "cedula",
    "document_number": "12345678"
  }'

Response:

{
  "ok": true,
  "source": "co.simit",
  "queried_at": "2026-03-31T10:30:00Z",
  "cached": false,
  "latency_ms": 4523,
  "data": {
    "comparendos": 0,
    "multas": 0,
    "total_deuda": 0.0,
    "paz_y_salvo": true
  }
}
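
The same request can be issued from Python with the standard library. This is a minimal sketch based on the curl example and sample response above. The helper names are hypothetical, and the assumption that a failed query returns `"ok": false` in the same envelope is not confirmed by the text. No authentication header is shown because the expected header name is not documented here.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/v1/query"  # taken from the curl example


def build_request(source, document_type, document_number, api_url=API_URL):
    """Build the POST request shown in the curl example."""
    body = json.dumps(
        {
            "source": source,
            "document_type": document_type,
            "document_number": document_number,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(raw):
    """Return the `data` payload, or raise if the envelope reports a failure."""
    payload = json.loads(raw)
    # Assumption: failures use the same envelope with "ok": false.
    if not payload.get("ok"):
        raise RuntimeError(f"query failed for {payload.get('source')}")
    return payload["data"]


def query(source, document_type, document_number):
    """Send the request (needs a running `openquery serve` instance)."""
    req = build_request(source, document_type, document_number)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_response(resp.read())
```

With the server running, `query("co.simit", "cedula", "12345678")` would return the `data` object from the response.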

API Endpoints:

Method | Path | Description
POST | /api/v1/query | Query a data source
GET | /api/v1/sources | List available sources
GET | /api/v1/health | Health check and cache stats
GET | /api/v1/sources/health | Detailed per-source health report
POST | /api/v1/ocr/extract | Extract data from ID document image
POST | /api/v1/face/verify | Face verification (1:1 comparison)
GET | /dashboard | Web dashboard UI
GET | /docs | Interactive API documentation

Docker

docker compose up

This starts the API server with Redis caching on port 8000.

Configuration

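OpenQuery's own settings are environment variables with the OPENQUERY_ prefix. Third-party credentials (TWO_CAPTCHA_API_KEY, HF_TOKEN, ANTHROPIC_API_KEY, OPENAI_API_KEY) keep their standard, unprefixed names: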

Variable | Default | Description
OPENQUERY_API_KEY | (none) | API key for server authentication
OPENQUERY_CACHE_BACKEND | memory | Cache backend: memory, redis, sqlite
OPENQUERY_CACHE_TTL_DEFAULT | 3600 | Default cache TTL in seconds
OPENQUERY_REDIS_URL | redis://localhost:6379/0 | Redis connection URL
OPENQUERY_BROWSER_HEADLESS | true | Run browser in headless mode
OPENQUERY_BROWSER_TIMEOUT | 30.0 | Browser operation timeout in seconds
OPENQUERY_RATE_LIMIT_DEFAULT_RPM | 10 | Default requests per minute per source
OPENQUERY_LOG_LEVEL | INFO | Logging level
TWO_CAPTCHA_API_KEY | (none) | 2captcha.com API key (paid fallback)
HF_TOKEN | (none) | HuggingFace token (free OCR + QA)
ANTHROPIC_API_KEY | (none) | Anthropic API key (paid QA fallback)
OPENAI_API_KEY | (none) | OpenAI API key (paid QA fallback)
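
To illustrate how prefixed variables map onto typed settings, here is a minimal sketch that overlays the environment on the defaults from the table. It is not OpenQuery's settings loader: the `Settings` dataclass, `load_settings`, and the accepted boolean spellings are assumptions made for the example.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the table above.
    cache_backend: str = "memory"
    cache_ttl_default: int = 3600
    redis_url: str = "redis://localhost:6379/0"
    browser_headless: bool = True
    browser_timeout: float = 30.0
    rate_limit_default_rpm: int = 10
    log_level: str = "INFO"


def _parse_bool(text):
    # Assumption: these spellings count as true; anything else is false.
    return text.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=None, prefix="OPENQUERY_"):
    """Overlay prefixed environment variables on the documented defaults."""
    env = os.environ if env is None else env
    values = {}
    for field in fields(Settings):
        raw = env.get(prefix + field.name.upper())
        if raw is None:
            continue  # keep the default
        default = field.default
        # Check bool first, since bool is a subclass of int.
        if isinstance(default, bool):
            values[field.name] = _parse_bool(raw)
        else:
            values[field.name] = type(default)(raw)
    return Settings(**values)
```

For instance, setting `OPENQUERY_CACHE_BACKEND=redis` changes only `cache_backend` and leaves every other field at its default.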

Adding a New Source

Create a new source by implementing the BaseSource class:

# src/openquery/sources/us/nhtsa.py
from pydantic import BaseModel
from openquery.sources import register
from openquery.sources.base import BaseSource, DocumentType, QueryInput, SourceMeta


class NhtsaResult(BaseModel):
    manufacturer: str = ""
    model: str = ""
    year: int = 0
    recalls: list[dict] = []


@register
class NhtsaSource(BaseSource):
    def meta(self) -> SourceMeta:
        return SourceMeta(
            name="us.nhtsa",
            display_name="NHTSA Vehicle Safety",
            description="US vehicle safety recalls and VIN decoding",
            country="US",
            url="https://vpic.nhtsa.dot.gov/api/",
            supported_inputs=[DocumentType.VIN],
            requires_captcha=False,
            requires_browser=False,
            rate_limit_rpm=30,
        )

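    def query(self, input: QueryInput) -> NhtsaResult:
        import httpx
        resp = httpx.get(
            f"https://vpic.nhtsa.dot.gov/api/vehicles/decodevin/{input.document_number}",
            params={"format": "json"},
            timeout=30.0,
        )
        resp.raise_for_status()
        # Assumption: vPIC returns "Results" as a list of {"Variable", "Value"} pairs.
        fields = {r.get("Variable"): r.get("Value") for r in resp.json().get("Results", [])}
        return NhtsaResult(
            manufacturer=fields.get("Make") or "",
            model=fields.get("Model") or "",
            year=int(fields.get("Model Year") or 0),
        )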

The @register decorator automatically makes the source available in the CLI, API, and source listing.

Architecture

openquery/
├── core/
│   ├── browser.py    # Playwright browser management
│   ├── captcha.py    # Multi-engine CAPTCHA solvers (PaddleOCR, EasyOCR, Tesseract, voting)
│   ├── llm.py        # LLM QA chain (Ollama, HuggingFace, Anthropic, OpenAI)
│   ├── audit.py      # Evidence capture (screenshots, network logs, PDF reports)
│   ├── cache.py      # Caching backends (memory, Redis, SQLite)
│   └── rate_limit.py # Token-bucket rate limiting
├── sources/          # Data source plugins, organized by country
│   ├── base.py       # BaseSource ABC — implement this to add sources
│   ├── co/           # Colombia
│   ├── ec/           # Ecuador
│   ├── pe/           # Peru
│   ├── cl/           # Chile
│   ├── mx/           # Mexico
│   ├── ar/           # Argentina
│   ├── us/           # United States
│   └── intl/         # International
├── models/           # Pydantic response models, organized by country
├── server/           # FastAPI REST API
└── commands/         # Typer CLI commands
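
As an illustration of the token-bucket technique named for rate_limit.py, here is a minimal sketch. It is not the project's implementation: the class name `TokenBucket`, the choice of capacity equal to the per-minute rate, and the injectable clock are assumptions.

```python
import time


class TokenBucket:
    """Token bucket refilled continuously at `rpm` tokens per minute."""

    def __init__(self, rpm=10, clock=time.monotonic):
        self.capacity = float(rpm)  # assumption: burst size equals rpm
        self.rate = rpm / 60.0  # tokens added per second
        self.tokens = self.capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        """Consume one token if available; return False when rate-limited."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A per-source limiter could keep one bucket per source name, for example in a dictionary keyed by `co.simit`, `co.runt`, and so on.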

Development

git clone https://github.com/dacrypt/openquery.git
cd openquery
uv sync --all-extras
playwright install chromium

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

See CONTRIBUTING.md for detailed guidelines.

Documentation

Guide | Description
Getting Started | Installation, first query, engine setup
Sources Guide | All 102 sources across 8 countries with field reference
CAPTCHA Guide | OCR engines, voting, LLM backends, benchmarks
Audit Guide | Evidence capture, PDF reports, compliance
API Guide | REST endpoints, authentication, deployment
Adding Sources | Step-by-step guide to create new source plugins
Test Results | Real query results against live government services
Competitors | Competitive landscape analysis (15 tools compared)
Changelog | Version history

License

MIT
