Query public data sources worldwide via scraping and APIs

OpenQuery

Query public data sources worldwide through a unified CLI and REST API.

OpenQuery provides a plugin-based framework for scraping government websites, public registries, and open data APIs. It handles the hard parts — browser automation, CAPTCHA solving, WAF bypass, caching, and rate limiting — so you can focus on the data.

Features

  • Unified interface — one CLI and one API endpoint for all data sources
  • Browser automation — Playwright-based scraping for JavaScript-heavy sites
  • Multi-engine CAPTCHA solving — PaddleOCR (100%), EasyOCR+Tesseract voting (90%), with cloud and paid fallbacks
  • LLM-powered knowledge CAPTCHAs — Ollama (local), HuggingFace, Anthropic, OpenAI fallback chain
  • Audit & evidence — screenshots, network logs, and PDF evidence reports for compliance
  • WAF bypass — browser-context API calls preserve session cookies
  • Caching — in-memory, Redis, or SQLite backends with configurable TTL
  • Rate limiting — per-source token-bucket to respect server limits
  • REST API — FastAPI server with auto-generated OpenAPI docs
  • Document OCR — extract structured data from ID documents (cedula, INE, DNI, carnet, passport)
  • Face verification — 1:1 face comparison with liveness detection (DeepFace/ArcFace)
  • Health monitoring — per-source circuit breaker with automatic failover
  • Dashboard — web UI for source browsing, querying, and health monitoring
  • Extensible — add new data sources by implementing a single class
  • Country-organized — sources grouped by country code (co, us, etc.)
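The per-source token-bucket idea from the feature list can be sketched as follows. This is a minimal illustration, not OpenQuery's actual implementation: the `TokenBucket` class, the `allow` helper, and the injectable `clock` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict, Optional


class TokenBucket:
    """Token-bucket limiter: `rpm` requests per minute, bursts up to `capacity`."""

    def __init__(self, rpm: float, capacity: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rpm / 60.0  # tokens added per second
        self.capacity = capacity if capacity is not None else rpm
        self.tokens = self.capacity  # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per source name (hypothetical registry), so a busy source
# cannot consume another source's budget.
buckets: Dict[str, TokenBucket] = {}


def allow(source: str, rpm: float = 10) -> bool:
    """Return True if a request to `source` may proceed right now."""
    if source not in buckets:
        buckets[source] = TokenBucket(rpm)
    return buckets[source].try_acquire()
```

A caller would check `allow("co.simit")` before each request and wait or fail fast when it returns False. Passing a fake `clock` makes the refill logic easy to test without sleeping.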

Built-in Sources — 102 sources across 8 countries

Colombia (73 sources)

Background Checks and Justice

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.policia | Criminal background (Policía Nacional) | cedula | Yes |
| co.procuraduria | Disciplinary records (Procuraduría) | cedula | Yes |
| co.contraloria | Fiscal responsibility (Contraloría) | cedula, nit, pasaporte | Yes |
| co.rnmc | Police corrective measures (RNMC) | cedula, pasaporte | Yes |
| co.consulta_procesos | Judicial processes (Rama Judicial) | cedula, nit | Yes |
| co.tutelas | Constitutional protection actions (Tutelas) | cedula, nit | Yes |
| co.jep | Transitional justice (JEP) | cedula | Yes |
| co.inpec | Prison population (INPEC) | cedula | Yes |

Identity and Civil Registry

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.estado_cedula | Cédula status (Registraduría) | cedula | Yes |
| co.estado_tramite_cedula | ID card processing status | cedula | Yes |
| co.defuncion | Cédula validity (alive/deceased) | cedula | Yes |
| co.puesto_votacion | Voting station lookup | cedula | Yes |
| co.registro_civil | Civil registry certificate | cedula | Yes |
| co.nombre_completo | Full name lookup by document | cedula | Yes |
| co.libreta_militar | Military service status | cedula | Yes |
| co.migracion_ppt | PPT temporary protection permit | custom | Yes |
| co.estado_cedula_extranjeria | Foreign ID card status (Migración) | custom | Yes |
| co.validar_policia | Police officer validation | custom | Yes |

Compliance and AML

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.pep | Politically Exposed Persons (SIGEP) | cedula | No |
| co.proveedores_ficticios | DIAN fictitious providers | nit | No |
| co.rne | Do Not Call registry (RNE/CRC) | custom | No |

Social Security

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.adres | Health system enrollment (EPS/BDUA) | cedula | Yes |
| co.colpensiones | Pension affiliation (Colpensiones) | cedula | Yes |
| co.fopep | Pensioners payroll (FOPEP) | cedula | Yes |
| co.ruaf | Unified affiliates registry (SISPRO) | cedula | Yes |
| co.rethus | Health workforce registry (RETHUS) | cedula | Yes |
| co.soi | Social security payments (SOI/PILA) | cedula, nit | Yes |
| co.seguridad_social | Integrated social security status | cedula, nit | Yes |
| co.afiliados_compensado | Compensation fund affiliation | cedula | Yes |
| co.sisben | Socioeconomic classification (SISBEN) | cedula | Yes |

Companies and Commerce

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.dian_rut | Tax registry status (DIAN RUT) | cedula, nit | Yes |
| co.rues | Business registry (RUES/Confecámaras) | cedula, nit | Yes |
| co.secop | Public procurement (SECOP) | nit | No |
| co.cufe_dian | Electronic invoice verification (CUFE) | custom | Yes |
| co.einforma | Business intelligence (eInforma) | nit | Yes |
| co.camara_comercio_medellin | Medellín Chamber of Commerce | nit, custom | Yes |
| co.directorio_empresas | Business directory (datos.gov.co) | nit, custom | No |
| co.empresas_google | Business search (Google Maps) | custom | Yes |
| co.supersociedades | Insolvency proceedings (Ley 1116) | nit, cedula, custom | Yes |

Property and Real Estate

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.snr | Property owner index (SNR) | cedula, nit | Yes |
| co.certificado_tradicion | Property title certificate (SNR) | custom | Yes |
| co.garantias_mobiliarias | Movable collateral registry | cedula | Yes |
| co.cambio_estrato | Socioeconomic stratum certification | cedula | Yes |

Vehicles and Traffic

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.simit | Traffic fines and violations (SIMIT) | cedula, placa | Yes |
| co.runt | Vehicle registry (RUNT) | vin, placa, cedula | Yes |
| co.runt_conductor | Driver information (RUNT) | cedula | Yes |
| co.runt_soat | Mandatory insurance status (SOAT) | placa | Yes |
| co.runt_rtm | Technical inspection status (RTM) | placa | Yes |
| co.comparendos_transito | Detailed traffic violations | cedula, placa | Yes |
| co.fasecolda | Vehicle reference prices (insurance) | custom | Yes |
| co.recalls | Vehicle safety recalls (SIC) | custom | Yes |
| co.retencion_vehiculos | Impounded vehicles | placa | Yes |
| co.pico_y_placa | Driving restrictions (13 cities) | placa | No |
| co.peajes | Toll road tariffs | custom | No |
| co.combustible | Fuel prices by city/station | custom | No |
| co.estaciones_ev | EV charging stations | custom | No |
| co.siniestralidad | Road crash hotspots (ANSV) | custom | No |
| co.vehiculos | National vehicle fleet data | placa, custom | No |

Housing and Utilities

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.mi_casa_ya | Housing subsidies (Mi Casa Ya) | cedula | Yes |
| co.tarifas_energia | Electricity tariffs (SUI) | custom | No |

Tourism

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.rnt_turismo | National tourism registry (RNT) | nit | No |

Health

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.licencias_salud | Health service providers (REPS) | nit | No |

Professional Councils (11 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| co.copnia | Engineering (COPNIA) | cedula, nit | Yes |
| co.conaltel | Electrical technology (CONALTEL) | cedula | Yes |
| co.consejo_mecanica | Mechanical/Electronic engineering | cedula | Yes |
| co.cpae | Business administration (CPAE) | cedula | Yes |
| co.cpip | Petroleum engineering (CPIP) | cedula | Yes |
| co.cpiq | Chemical engineering (CPIQ) | cedula | Yes |
| co.cpnaa | Architecture (CPNAA) | cedula, pasaporte | Yes |
| co.cpnt | Topography (CPNT) | cedula | Yes |
| co.cpbiol | Biology (CPBiol) | cedula | Yes |
| co.veterinario | Veterinary medicine (COMVEZCOL) | cedula | Yes |
| co.urna | Law professionals (CSJ) | cedula, nit | Yes |

United States (5 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| us.ofac | OFAC SDN sanctions list (US Treasury) | cedula, nit, pasaporte, custom | No |
| us.nhtsa_vin | VIN decode (NHTSA vPIC) | vin | No |
| us.nhtsa_recalls | Vehicle safety recalls (NHTSA) | custom | No |
| us.nhtsa_complaints | Vehicle safety complaints (NHTSA) | custom | No |
| us.epa_fuel_economy | EPA fuel economy ratings | custom | No |

Ecuador (6 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| ec.sri_ruc | Tax registry RUC (SRI) | custom | No |
| ec.ant_citaciones | Traffic fines (ANT) | cedula, placa, custom | No |
| ec.cne_padron | Voter registry (CNE) | cedula | Yes |
| ec.funcion_judicial | Judicial processes (Función Judicial) | cedula, custom | Yes |
| ec.supercias | Company registry (Superintendencia) | custom | Yes |
| ec.senescyt | Professional degrees (SENESCYT) | cedula, custom | Yes |

Peru (5 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| pe.sunat_ruc | Tax registry RUC (SUNAT) | custom | Yes |
| pe.poder_judicial | Judicial case search (CEJ) | custom | Yes |
| pe.osce_sancionados | Sanctioned gov contractors (OSCE) | custom | Yes |
| pe.sunarp_vehicular | Vehicle registry (SUNARP) | placa | Yes |
| pe.servir_sanciones | Public servant sanctions (SERVIR) | custom | Yes |

Chile (4 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| cl.sii_rut | Tax registry RUT (SII) | custom | Yes |
| cl.pjud | Judicial case search (PJUD) | custom | Yes |
| cl.fiscalizacion | Traffic infractions | placa | Yes |
| cl.superir | Insolvency/bankruptcy (Superir) | custom | Yes |

Mexico (4 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| mx.curp | Population registry CURP (RENAPO) | custom | Yes |
| mx.sat_efos | SAT blacklist EFOS/EDOS | custom | Yes |
| mx.siem | Business directory SIEM | custom | Yes |
| mx.repuve | Stolen vehicle check (REPUVE) | placa, vin | Yes |

Argentina (3 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| ar.afip_cuit | Tax registry CUIT/CUIL (AFIP) | custom | Yes |
| ar.pjn | Federal judiciary cases (PJN) | custom | Yes |
| ar.dnrpa | Vehicle registration (DNRPA) | placa | Yes |

International (2 sources)

| Source | Description | Inputs | Browser |
| --- | --- | --- | --- |
| intl.onu | UN Security Council sanctions list | cedula, nit, pasaporte, custom | No |
| intl.ship_tracking | Global vessel position tracking | custom | No |

Installation

pip install openquery

Or with uv:

uv add openquery

System Dependencies

Playwright browsers are required for web scraping:

playwright install chromium

CAPTCHA Engines (pick one or more)

OpenQuery auto-detects installed OCR engines and builds an optimal solver chain:

| Engine | Accuracy | Speed | Install |
| --- | --- | --- | --- |
| PaddleOCR (recommended) | 100% | ~130ms | pip install "openquery[paddleocr]" |
| EasyOCR + Tesseract (voting) | 90% | ~500ms | pip install "openquery[easyocr]" + brew install tesseract |
| Tesseract alone | 80% | ~390ms | brew install tesseract (included by default) |
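To illustrate the auto-detection and voting ideas, here is a minimal sketch using only the standard library. It is not OpenQuery's solver code. The `ENGINE_MODULES` mapping, `detect_engines`, and `vote` are hypothetical names, and the tie-breaking rule (earliest reading wins) is an assumption made for this example.

```python
import importlib.util
import shutil
from collections import Counter
from typing import List

# Hypothetical mapping: engine label -> importable module that signals it is installed.
# Listed best-first, following the accuracy column above.
ENGINE_MODULES = {"paddleocr": "paddleocr", "easyocr": "easyocr"}


def detect_engines() -> List[str]:
    """Return the engines that appear to be installed, best-first."""
    found = [name for name, module in ENGINE_MODULES.items()
             if importlib.util.find_spec(module) is not None]
    # Tesseract is a system binary, so look for it on PATH instead.
    if shutil.which("tesseract"):
        found.append("tesseract")
    return found


def vote(candidates: List[str]) -> str:
    """Majority vote over OCR readings; ties go to the earliest candidate."""
    cleaned = [c.strip() for c in candidates if c and c.strip()]
    if not cleaned:
        return ""
    counts = Counter(cleaned)
    best = max(counts.values())
    # Walk the readings in their original order so ties stay deterministic.
    return next(c for c in cleaned if counts[c] == best)
```

With readings such as `["AB12", "AB12 ", "A812"]`, two engines agree after whitespace is stripped, so `vote` picks `"AB12"`. When every engine disagrees, the first reading wins, so it makes sense to order engines by expected accuracy.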

For knowledge-based CAPTCHAs (Procuraduria), you need at least one LLM backend:

| Backend | Cost | Setup |
| --- | --- | --- |
| Ollama (recommended) | Free | ollama pull llama3.2:1b |
| HuggingFace Inference | Free | Set HF_TOKEN env var |
| Anthropic | Paid | Set ANTHROPIC_API_KEY env var |
| OpenAI | Paid | Set OPENAI_API_KEY env var |
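A fallback chain over these backends could look roughly like the sketch below. It only illustrates the ordering and error handling. The function names are hypothetical, the backends are plain callables standing in for real client calls, and whether Ollama is reachable is passed in as a flag rather than probed.

```python
import os
from typing import Callable, List, Mapping, Sequence, Tuple

# Env vars named in the table above, in the order the fallback chain lists them.
ENV_BACKENDS = [
    ("huggingface", "HF_TOKEN"),
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
]


def configured_backends(env: Mapping[str, str] = os.environ,
                        ollama_available: bool = False) -> List[str]:
    """List backends that look usable, local and free options first."""
    names = ["ollama"] if ollama_available else []
    names += [name for name, var in ENV_BACKENDS if env.get(var)]
    return names


def ask_with_fallback(question: str,
                      backends: Sequence[Tuple[str, Callable[[str], str]]]) -> Tuple[str, str]:
    """Try each (name, ask) pair in order; return (backend_name, answer) on first success."""
    errors = []
    for name, ask in backends:
        try:
            answer = ask(question).strip()
        except Exception as exc:  # a failing backend should not abort the chain
            errors.append(f"{name}: {exc}")
            continue
        if answer:
            return name, answer
        errors.append(f"{name}: empty answer")
    raise RuntimeError("all LLM backends failed: " + "; ".join(errors))
```

Collecting the per-backend errors makes it easier to tell a missing API key from a genuine outage when the whole chain fails.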

Optional Extras

pip install "openquery[paddleocr]"   # PaddleOCR — best CAPTCHA accuracy (100%)
pip install "openquery[easyocr]"     # EasyOCR — good accuracy (85%), combines with Tesseract for 90%
pip install "openquery[huggingface]" # HuggingFace Inference API (OCR + QA)
pip install "openquery[serve]"       # FastAPI server + dashboard (fastapi, uvicorn)
pip install "openquery[redis]"       # Redis cache backend
pip install "openquery[captcha]"     # 2captcha paid CAPTCHA solving (last resort)
pip install "openquery[deepface]"    # Face verification (DeepFace + ArcFace)
pip install "openquery[passport]"    # Passport MRZ reading (passporteye)

Quick Start

CLI

# List available data sources
openquery sources

# Query Colombian traffic fines by cedula
openquery query co.simit --cedula 12345678

# Query Colombian vehicle registry by plate
openquery query co.runt --placa ABC123

# Query by VIN
openquery query co.runt --vin 5YJ3E1EA1PF000001

# Disciplinary records
openquery query co.procuraduria --cedula 12345678

# Criminal background
openquery query co.policia --cedula 12345678

# Health system enrollment
openquery query co.adres --cedula 12345678

# Pico y placa — is my plate restricted today?
openquery query co.pico_y_placa --placa ABC123

# Toll tariffs
openquery query co.peajes --custom peaje --extra '{"peaje": "ALVARADO"}'

# Fuel prices in Bogota
openquery query co.combustible --custom fuel --extra '{"municipio": "BOGOTA"}'

# EV charging stations in Medellin
openquery query co.estaciones_ev --custom ev --extra '{"ciudad": "Medellin"}'

# Road crash hotspots
openquery query co.siniestralidad --custom stats --extra '{"departamento": "CUNDINAMARCA"}'

# Vehicle fleet lookup by plate
openquery query co.vehiculos --placa ABC123

# Output raw JSON
openquery query co.simit --cedula 12345678 --json

# Generate audit evidence (screenshots + PDF report)
openquery query co.runt --placa ABC123 --audit --audit-dir ./evidence

# Source health status
openquery health

# Extract data from ID document photo
openquery ocr --type co.cedula cedula_photo.jpg

# Face verification (compare ID photo vs selfie)
openquery face-verify id_photo.jpg selfie.jpg

REST API

# Start the API server
openquery serve

# Or with custom host/port
openquery serve --host 127.0.0.1 --port 3000

Then query via HTTP:

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "source": "co.simit",
    "document_type": "cedula",
    "document_number": "12345678"
  }'

Response:

{
  "ok": true,
  "source": "co.simit",
  "queried_at": "2026-03-31T10:30:00Z",
  "cached": false,
  "latency_ms": 4523,
  "data": {
    "comparendos": 0,
    "multas": 0,
    "total_deuda": 0.0,
    "paz_y_salvo": true
  }
}
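The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. The helper names are hypothetical, and treating `"ok": false` as an error is an assumption, since the failure shape is not shown here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used in the curl example


def build_query_request(source: str, document_type: str, document_number: str,
                        base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/v1/query request without sending it."""
    body = json.dumps({
        "source": source,
        "document_type": document_type,
        "document_number": document_number,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(source: str, document_type: str, document_number: str,
              timeout: float = 60.0) -> dict:
    """Send the query and return the `data` object from the response."""
    request = build_query_request(source, document_type, document_number)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.loads(response.read().decode("utf-8"))
    # Assumption: a false "ok" flag signals a failed query.
    if not result.get("ok"):
        raise RuntimeError(f"query failed: {result}")
    return result["data"]


if __name__ == "__main__":
    # Requires a running server (see `openquery serve`).
    print(run_query("co.simit", "cedula", "12345678"))
```

Browser-backed sources can take several seconds per query, as the `latency_ms` value above suggests, so a generous timeout is a reasonable default.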

API Endpoints:

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/query | Query a data source |
| GET | /api/v1/sources | List available sources |
| GET | /api/v1/health | Health check and cache stats |
| GET | /api/v1/sources/health | Detailed per-source health report |
| POST | /api/v1/ocr/extract | Extract data from ID document image |
| POST | /api/v1/face/verify | Face verification (1:1 comparison) |
| GET | /dashboard | Web dashboard UI |
| GET | /docs | Interactive API documentation |

Docker

docker compose up

This starts the API server on port 8000 with Redis caching enabled.

Configuration

OpenQuery's own settings are read from environment variables with the OPENQUERY_ prefix; third-party API keys keep their conventional names:

| Variable | Default | Description |
| --- | --- | --- |
| OPENQUERY_API_KEY | (none) | API key for server authentication |
| OPENQUERY_CACHE_BACKEND | memory | Cache backend: memory, redis, sqlite |
| OPENQUERY_CACHE_TTL_DEFAULT | 3600 | Default cache TTL in seconds |
| OPENQUERY_REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| OPENQUERY_BROWSER_HEADLESS | true | Run browser in headless mode |
| OPENQUERY_BROWSER_TIMEOUT | 30.0 | Browser operation timeout in seconds |
| OPENQUERY_RATE_LIMIT_DEFAULT_RPM | 10 | Default requests per minute per source |
| OPENQUERY_LOG_LEVEL | INFO | Logging level |
| TWO_CAPTCHA_API_KEY | (none) | 2captcha.com API key (paid fallback) |
| HF_TOKEN | (none) | HuggingFace token (free OCR + QA) |
| ANTHROPIC_API_KEY | (none) | Anthropic API key (paid QA fallback) |
| OPENAI_API_KEY | (none) | OpenAI API key (paid QA fallback) |
Adding a New Source

Create a new source by subclassing BaseSource and decorating the class with @register:

# src/openquery/sources/us/nhtsa.py
from pydantic import BaseModel
from openquery.sources import register
from openquery.sources.base import BaseSource, DocumentType, QueryInput, SourceMeta


class NhtsaResult(BaseModel):
    manufacturer: str = ""
    model: str = ""
    year: int = 0
    recalls: list[dict] = []


@register
class NhtsaSource(BaseSource):
    def meta(self) -> SourceMeta:
        return SourceMeta(
            name="us.nhtsa",
            display_name="NHTSA Vehicle Safety",
            description="US vehicle safety recalls and VIN decoding",
            country="US",
            url="https://vpic.nhtsa.dot.gov/api/",
            supported_inputs=[DocumentType.VIN],
            requires_captcha=False,
            requires_browser=False,
            rate_limit_rpm=30,
        )

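    def query(self, input: QueryInput) -> NhtsaResult:
        import httpx

        resp = httpx.get(
            f"https://vpic.nhtsa.dot.gov/api/vehicles/decodevin/{input.document_number}",
            params={"format": "json"},
        )
        resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
        # vPIC returns {"Results": [{"Variable": ..., "Value": ...}, ...]}.
        # The variable names below are assumed from vPIC's public response format.
        fields = {r["Variable"]: r["Value"] for r in resp.json().get("Results", [])}
        return NhtsaResult(
            manufacturer=fields.get("Manufacturer Name") or "",
            model=fields.get("Model") or "",
            year=int(fields.get("Model Year") or 0),
        )  # recalls stays empty here; it would need a separate recalls lookup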

The @register decorator automatically makes the source available in the CLI, API, and source listing.

Architecture

openquery/
├── core/
│   ├── browser.py    # Playwright browser management
│   ├── captcha.py    # Multi-engine CAPTCHA solvers (PaddleOCR, EasyOCR, Tesseract, voting)
│   ├── llm.py        # LLM QA chain (Ollama, HuggingFace, Anthropic, OpenAI)
│   ├── audit.py      # Evidence capture (screenshots, network logs, PDF reports)
│   ├── cache.py      # Caching backends (memory, Redis, SQLite)
│   └── rate_limit.py # Token-bucket rate limiting
├── sources/          # Data source plugins, organized by country
│   ├── base.py       # BaseSource ABC — implement this to add sources
│   ├── co/           # Colombia (73 sources)
│   ├── ec/           # Ecuador (6 sources)
│   ├── pe/           # Peru (5 sources)
│   ├── cl/           # Chile (4 sources)
│   ├── mx/           # Mexico (4 sources)
│   ├── ar/           # Argentina (3 sources)
│   ├── us/           # United States (5 sources)
│   └── intl/         # International (2 sources)
├── models/           # Pydantic response models, organized by country
├── server/           # FastAPI REST API
└── commands/         # Typer CLI commands

Development

git clone https://github.com/dacrypt/openquery.git
cd openquery
uv sync --all-extras
playwright install chromium

# Run tests
uv run pytest

# Lint
uv run ruff check src/ tests/

See CONTRIBUTING.md for detailed guidelines.

Documentation

| Guide | Description |
| --- | --- |
| Getting Started | Installation, first query, engine setup |
| Sources Guide | All 102 sources across 8 countries with field reference |
| CAPTCHA Guide | OCR engines, voting, LLM backends, benchmarks |
| Audit Guide | Evidence capture, PDF reports, compliance |
| API Guide | REST endpoints, authentication, deployment |
| Adding Sources | Step-by-step guide to create new source plugins |
| Test Results | Real query results against live government services |
| Competitors | Competitive landscape analysis (15 tools compared) |
| Changelog | Version history |

License

MIT
