gov-websites-collector

Requires Python 3.10+. License: MIT.

Collect US business and professional license data from government websites across all 50 states + DC.

A Python library and CLI tool that scrapes Secretary of State business registrations, professional licensing boards, and other government databases. Supports HTTP scraping, browser automation (Playwright), and anti-detection browsing (Camoufox) for sites with bot protection.

Features

  • 🏛️ 51 state collectors — all 50 states + DC
  • 📋 Business entities — LLC, Corporation, Nonprofit registrations from Secretary of State
  • 🪪 Professional licenses — Real estate, contractors, medical, and more
  • 🌐 HTTP + Browser — HTTP-first with Playwright/Camoufox fallback for JS-heavy sites
  • 🔄 Proxy support — Configurable HTTP/SOCKS5 proxy rotation
  • Async — Built on httpx with async/await throughout
  • 🛡️ Anti-detection — Camoufox integration bypasses Cloudflare/Incapsula
  • 📊 Structured data — Pydantic models with consistent schema across states
  • 🖥️ CLI included — gov-collect command for terminal use

Installation

# Core (HTTP scraping only)
pip install gov-websites-collector

# With browser automation (for JS-rendered sites)
pip install gov-websites-collector[browser]
playwright install chromium

# With Camoufox (anti-detection browser for protected sites)
pip install gov-websites-collector[camoufox]

# Everything
pip install gov-websites-collector[all]

Quick Start

Python API

import asyncio
from gov_collector import collect

# Search California real estate licenses
results = asyncio.run(collect("CA", "Smith"))
for r in results:
    print(f"{r.data.holder_name} - {r.data.license_number} ({r.data.status})")

# Search with a proxy (for states with bot protection)
results = asyncio.run(collect("OR", "Portland", proxy="http://user:pass@host:port"))

# Use Camoufox for anti-detection browsing
results = asyncio.run(collect("FL", "Realty", use_camoufox=True))

# Search a specific category
results = asyncio.run(collect("TX", "Smith", category="businesses"))

Advanced Usage

import asyncio
from gov_collector import get_collector, SearchParams, DataCategory

async def main():
    params = SearchParams(
        query="Smith",
        state="CA",
        max_results=50,
    )
    
    async with get_collector("CA", proxy=None, timeout=30.0) as collector:
        async for result in collector.collect(params):
            if result.category == "licenses":
                lic = result.data
                print(f"License: {lic.license_number} - {lic.holder_name}")
            elif result.category == "businesses":
                biz = result.data
                print(f"Business: {biz.name} - {biz.status}")

asyncio.run(main())

CLI

# Search California licenses
gov-collect search --state CA --query "Smith" --category licenses

# Search Texas businesses (JSON output)
gov-collect search --state TX --query "Acme" --category businesses --format json

# With proxy and Camoufox
gov-collect search --state FL --query "Realty" --camoufox --proxy "http://user:pass@host:port"

# List all available states
gov-collect states

# List states that support license lookups
gov-collect states --category licenses

# Show info about a specific state
gov-collect info CA

Supported States

HTTP-only (no browser needed)

| State | Categories | Source |
|---|---|---|
| AL | Licenses | Alabama Real Estate Commission |
| AZ | Licenses | Arizona Dept of Real Estate |
| CA | Licenses, Businesses* | DRE + bizfile Online* |
| CO | Licenses, Businesses | DORA + Secretary of State |
| DE | Licenses | Delaware Professional Regulation |
| GA | Licenses | Georgia Real Estate Commission |
| ID | Businesses | Idaho Secretary of State API |
| IN | Licenses, Businesses | MyLicense + Secretary of State |
| KY | Businesses | Kentucky Secretary of State |
| LA | Licenses | Louisiana Real Estate Commission |
| ME | Licenses | Maine Real Estate Commission |
| MN | Businesses | Minnesota Secretary of State API |
| MS | Licenses | Mississippi Real Estate Commission |
| ND | Businesses | North Dakota FirstStop API |
| NJ | Licenses | New Jersey MyLicense |
| NY | Businesses | New York Dept of State |
| SC | Businesses | South Carolina Business Filings |
| TX | Licenses, Businesses | TREC + Comptroller |

Browser required (Playwright or Camoufox)

| State | Categories | Notes |
|---|---|---|
| AK | Licenses | Commerce license search |
| CT | Licenses, Businesses | eLicense + Concord SOTS |
| FL | Licenses, Businesses | SunBiz + DBPR |
| HI | Licenses, Businesses | PVL (Cloudflare) + HBE |
| IA | Businesses | Secretary of State |
| IL | Businesses | Illinois Secretary of State |
| MT | Businesses | Montana Secretary of State |
| NH | Businesses | New Hampshire QuickStart |
| OR | Businesses | Oregon Secretary of State (needs proxy) |
| TN | Licenses, Businesses | Verify TN + Secretary of State |
| VA | Licenses, Businesses | DPOR + SCC |

* CA business search requires browser (JavaScript SPA)

Additional states

All remaining states have collector modules but may be blocked by CAPTCHAs, WAFs, or other anti-bot measures. See STATE_STATUS.md for detailed status of each state.

Configuration

Proxy

Many government sites block datacenter IPs or use Cloudflare/Incapsula. A residential proxy significantly improves success rates.

import asyncio
import os
from gov_collector import collect

# Via function argument
results = asyncio.run(collect("OR", "Smith", proxy="http://user:pass@host:port"))

# Via environment variable
os.environ["GOV_COLLECTOR_PROXY"] = "http://user:pass@host:port"
results = asyncio.run(collect("OR", "Smith"))

# CLI
gov-collect search --state OR --query "Smith" --proxy "http://user:pass@host:port"

# Or via env var
export GOV_COLLECTOR_PROXY="http://user:pass@host:port"
gov-collect search --state OR --query "Smith"

Browser Automation

For states with JavaScript-rendered pages:

# Playwright (standard browser)
results = asyncio.run(collect("FL", "Smith", use_browser=True))

# Camoufox (anti-detection — recommended for protected sites)
results = asyncio.run(collect("FL", "Smith", use_camoufox=True))
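
When it is not known in advance whether a site needs a browser, calling code can escalate through the three modes itself. The sketch below assumes the order plain HTTP, then Playwright, then Camoufox, and treats an empty result list as a reason to escalate. Both are assumptions of this example, not documented library behavior. collect_with_fallback is a hypothetical helper, and fake_collect stands in for collect so the block runs offline.

```python
import asyncio
from typing import Any, Awaitable, Callable

# Escalation order assumed for this sketch.
MODES: list[dict[str, bool]] = [
    {},                      # plain HTTP
    {"use_browser": True},   # Playwright
    {"use_camoufox": True},  # Camoufox
]


async def collect_with_fallback(
    collect_fn: Callable[..., Awaitable[list[Any]]],
    state: str,
    query: str,
    **kwargs: Any,
) -> list[Any]:
    """Try each mode in order and return the first non-empty result list."""
    last_error: Exception | None = None
    for mode in MODES:
        try:
            results = await collect_fn(state, query, **{**kwargs, **mode})
        except Exception as exc:  # real code should catch narrower errors
            last_error = exc
            continue
        if results:
            return results
    if last_error is not None:
        raise last_error
    return []


calls: list[dict[str, Any]] = []


# Stand-in for gov_collector.collect: only the Camoufox mode succeeds.
async def fake_collect(state: str, query: str, **kwargs: Any) -> list[str]:
    calls.append(kwargs)
    if kwargs.get("use_camoufox"):
        return ["record"]
    if kwargs.get("use_browser"):
        raise TimeoutError("challenge page")
    return []


results = asyncio.run(collect_with_fallback(fake_collect, "FL", "Smith"))
```

A query that legitimately has no matches will pay for all three attempts under this scheme, so it suits targeted lookups better than bulk runs.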

Rate Limiting

Built-in rate limiting prevents overwhelming government servers:

# Default: 1 second between requests
results = asyncio.run(collect("CA", "Smith", rate_limit=1.0))

# Slower for sensitive sites
results = asyncio.run(collect("CA", "Smith", rate_limit=2.0))
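
For readers curious how a minimum interval between requests can be enforced in async code, here is one possible shape. It is a generic sketch, not the library's implementation: MinIntervalLimiter and demo are hypothetical names, and the short interval is chosen only so the example finishes quickly.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Allow one acquisition per `interval` seconds, shared across tasks."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0  # monotonic time of the next permitted request

    async def wait(self) -> None:
        # The lock serializes callers so concurrent tasks queue up in turn.
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
                now = time.monotonic()
            self._next_allowed = now + self.interval


async def demo(interval: float, requests: int) -> float:
    limiter = MinIntervalLimiter(interval)
    start = time.monotonic()
    for _ in range(requests):
        await limiter.wait()  # a real collector would send its request here
    return time.monotonic() - start


# Three requests at a 0.05 s interval: the first is immediate, two must wait.
elapsed = asyncio.run(demo(interval=0.05, requests=3))
```

Using time.monotonic rather than wall-clock time keeps the spacing correct if the system clock is adjusted during a long run.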

Data Models

All results use Pydantic models with a consistent schema:

CollectorResult

Top-level wrapper containing:

  • category — "licenses", "businesses", or "properties"
  • state — Two-letter state code
  • data — One of License, Business, or Property
  • collected_at — Timestamp
  • source — Data source name

License

  • license_number, license_type, status
  • holder_name, holder (Person)
  • business_name, address
  • issue_date, expiration_date

Business

  • name, entity_type, status
  • filing_number, formation_date
  • address, registered_agent
  • officers

Property

  • parcel_id, address
  • owner_name, assessed_value, market_value
  • year_built, square_feet

See models.py for full field definitions.

API Reference

collect(state, query, **kwargs)

High-level async function that returns a list of results.

| Parameter | Type | Default | Description |
|---|---|---|---|
| state | str | required | Two-letter state code |
| query | str | required | Search term |
| category | str \| None | None | "licenses", "businesses", or "properties" |
| proxy | str \| None | None | Proxy URL |
| timeout | float | 30.0 | Request timeout (seconds) |
| rate_limit | float | 1.0 | Min seconds between requests |
| use_browser | bool | False | Enable Playwright |
| use_camoufox | bool | False | Enable Camoufox |
| max_results | int | 100 | Max results to return |

get_collector(state, **kwargs)

Returns a collector instance for advanced usage. Use it as an async context manager (async with), as shown in Advanced Usage.

list_states()

Returns a list of StateInfo objects for all registered collectors.

Contributing

  1. Fork the repo
  2. Create a feature branch
  3. Add or fix a state collector in gov_collector/states/
  4. Test: gov-collect search --state XX --query "test" -v
  5. Submit a PR

License

MIT — see LICENSE.
