Collect US business and professional license data from government websites across all 50 states
gov-websites-collector
Collect US business and professional license data from government websites across all 50 states + DC.
A Python library and CLI tool that scrapes Secretary of State business registrations, professional licensing boards, and other government databases. Supports HTTP scraping, browser automation (Playwright), and anti-detection browsing (Camoufox) for sites with bot protection.
Features
- 🏛️ 51 state collectors — all 50 states + DC
- 📋 Business entities — LLC, Corporation, Nonprofit registrations from Secretary of State
- 🪪 Professional licenses — Real estate, contractors, medical, and more
- 🌐 HTTP + Browser — HTTP-first with Playwright/Camoufox fallback for JS-heavy sites
- 🔄 Proxy support — Configurable HTTP/SOCKS5 proxy rotation
- ⚡ Async — Built on httpx with async/await throughout
- 🛡️ Anti-detection — Camoufox integration bypasses Cloudflare/Incapsula
- 📊 Structured data — Pydantic models with consistent schema across states
- 🖥️ CLI included — gov-collect command for terminal use
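The "HTTP-first with browser fallback" idea from the feature list can be sketched as a chain of fetchers tried in order. This is only an illustration of the pattern, not the library's implementation: `fetch_with_fallback`, `FetchBlocked`, and both stand-in fetchers are hypothetical names, and the URL is a placeholder.

```python
import asyncio
from typing import Awaitable, Callable

# A fetcher takes a URL and returns the page body.
Fetcher = Callable[[str], Awaitable[str]]


class FetchBlocked(Exception):
    """Raised by a fetcher when the site refuses or cannot serve the request."""


async def fetch_with_fallback(url: str, fetchers: list[Fetcher]) -> str:
    """Try each fetcher in order and return the first successful response."""
    errors: list[Exception] = []
    for fetch in fetchers:
        try:
            return await fetch(url)
        except FetchBlocked as exc:
            errors.append(exc)  # remember why this tier failed, then move on
    raise FetchBlocked(f"all {len(fetchers)} fetchers failed for {url}: {errors}")


# Hypothetical stand-ins: a plain HTTP tier that gets blocked,
# and a browser tier that succeeds.
async def http_fetch(url: str) -> str:
    raise FetchBlocked("HTTP 403 from WAF")


async def browser_fetch(url: str) -> str:
    return f"<html>rendered {url}</html>"


html = asyncio.run(
    fetch_with_fallback("https://example.gov/search", [http_fetch, browser_fetch])
)
print(html)
```

Ordering the cheap tier first keeps browser startup cost off the common path; the browser tier only runs when the HTTP tier reports a block.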
Installation
# Core (HTTP scraping only)
pip install gov-websites-collector
# With browser automation (for JS-rendered sites)
pip install "gov-websites-collector[browser]"
playwright install chromium
# With Camoufox (anti-detection browser for protected sites)
pip install "gov-websites-collector[camoufox]"
# Everything (quotes keep shells such as zsh from expanding the brackets)
pip install "gov-websites-collector[all]"
Quick Start
Python API
import asyncio
from gov_collector import collect
# Search California real estate licenses
results = asyncio.run(collect("CA", "Smith"))
for r in results:
    print(f"{r.data.holder_name} — {r.data.license_number} ({r.data.status})")
# Search with a proxy (for states with bot protection)
results = asyncio.run(collect("OR", "Portland", proxy="http://user:pass@host:port"))
# Use Camoufox for anti-detection browsing
results = asyncio.run(collect("FL", "Realty", use_camoufox=True))
# Search specific category
results = asyncio.run(collect("TX", "Smith", category="businesses"))
Advanced Usage
import asyncio
from gov_collector import get_collector, SearchParams, DataCategory
async def main():
    params = SearchParams(
        query="Smith",
        state="CA",
        max_results=50,
    )
    # The collector is an async context manager; results stream in one at a time
    async with get_collector("CA", proxy=None, timeout=30.0) as collector:
        async for result in collector.collect(params):
            if result.category == "licenses":
                lic = result.data
                print(f"License: {lic.license_number} — {lic.holder_name}")
            elif result.category == "businesses":
                biz = result.data
                print(f"Business: {biz.name} — {biz.status}")

asyncio.run(main())
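Because results arrive as an async stream, a caller may want to bucket them by category and stop early. The sketch below shows one way to do that without touching the network. `FakeResult` and `fake_stream` are hypothetical stand-ins for the objects yielded by `collector.collect(params)`; only the `category` and `data` attributes are taken from the example above.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, AsyncIterator


@dataclass
class FakeResult:
    """Hypothetical stand-in exposing the two attributes read above."""
    category: str
    data: Any


async def fake_stream() -> AsyncIterator[FakeResult]:
    # Stand-in for `collector.collect(params)`; yields results one at a time.
    for i in range(5):
        yield FakeResult("licenses" if i % 2 == 0 else "businesses", f"record-{i}")


async def group_by_category(
    stream: AsyncIterator[FakeResult], limit: int
) -> dict[str, list[Any]]:
    """Bucket streamed results by category, stopping after `limit` results."""
    grouped: dict[str, list[Any]] = defaultdict(list)
    count = 0
    async for result in stream:
        if count >= limit:
            break  # stop consuming once the cap is reached
        grouped[result.category].append(result.data)
        count += 1
    return dict(grouped)


grouped = asyncio.run(group_by_category(fake_stream(), limit=4))
print(grouped)
```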
CLI
# Search California licenses
gov-collect search --state CA --query "Smith" --category licenses
# Search Texas businesses (JSON output)
gov-collect search --state TX --query "Acme" --category businesses --format json
# With proxy and Camoufox
gov-collect search --state FL --query "Realty" --camoufox --proxy "http://user:pass@host:port"
# List all available states
gov-collect states
# List states that support license lookups
gov-collect states --category licenses
# Show info about a specific state
gov-collect info CA
Supported States
HTTP-only (no browser needed)
| State | Categories | Source |
|---|---|---|
| AL | Licenses | Alabama Real Estate Commission |
| AZ | Licenses | Arizona Dept of Real Estate |
| CA | Licenses, Businesses* | DRE + bizfile Online* |
| CO | Licenses, Businesses | DORA + Secretary of State |
| DE | Licenses | Delaware Professional Regulation |
| GA | Licenses | Georgia Real Estate Commission |
| ID | Businesses | Idaho Secretary of State API |
| IN | Licenses, Businesses | MyLicense + Secretary of State |
| KY | Businesses | Kentucky Secretary of State |
| LA | Licenses | Louisiana Real Estate Commission |
| ME | Licenses | Maine Real Estate Commission |
| MN | Businesses | Minnesota Secretary of State API |
| MS | Licenses | Mississippi Real Estate Commission |
| ND | Businesses | North Dakota FirstStop API |
| NJ | Licenses | New Jersey MyLicense |
| NY | Businesses | New York Dept of State |
| SC | Businesses | South Carolina Business Filings |
| TX | Licenses, Businesses | TREC + Comptroller |
Browser required (Playwright or Camoufox)
| State | Categories | Notes |
|---|---|---|
| AK | Licenses | Commerce license search |
| CT | Licenses, Businesses | eLicense + Concord SOTS |
| FL | Licenses, Businesses | SunBiz + DBPR |
| HI | Licenses, Businesses | PVL (Cloudflare) + HBE |
| IA | Businesses | Secretary of State |
| IL | Businesses | Illinois Secretary of State |
| MT | Businesses | Montana Secretary of State |
| NH | Businesses | New Hampshire QuickStart |
| OR | Businesses | Oregon Secretary of State (needs proxy) |
| TN | Licenses, Businesses | Verify TN + Secretary of State |
| VA | Licenses, Businesses | DPOR + SCC |
* CA business search requires browser (JavaScript SPA)
Additional states
All remaining states have collector modules but may be blocked by CAPTCHAs, WAFs, or other anti-bot measures. See STATE_STATUS.md for detailed status of each state.
Configuration
Proxy
Many government sites block datacenter IPs or use Cloudflare/Incapsula. A residential proxy significantly improves success rates.
# Via function argument
results = asyncio.run(collect("OR", "Smith", proxy="http://user:pass@host:port"))
# Via environment variable
import os
os.environ["GOV_COLLECTOR_PROXY"] = "http://user:pass@host:port"
results = asyncio.run(collect("OR", "Smith"))
# CLI
gov-collect search --state OR --query "Smith" --proxy "http://user:pass@host:port"
# Or via env var
export GOV_COLLECTOR_PROXY="http://user:pass@host:port"
gov-collect search --state OR --query "Smith"
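A caller that wants to handle both configuration routes explicitly can resolve the proxy before invoking the library. The sketch below assumes an explicit argument should win over `GOV_COLLECTOR_PROXY`; the README does not state the library's own precedence, and `resolve_proxy` and `ALLOWED_SCHEMES` are hypothetical names.

```python
import os
from typing import Optional
from urllib.parse import urlsplit

# Schemes assumed acceptable here, based on the "HTTP/SOCKS5" feature note.
ALLOWED_SCHEMES = {"http", "https", "socks5"}


def resolve_proxy(explicit: Optional[str] = None) -> Optional[str]:
    """Return the proxy URL to use, or None when no proxy is configured.

    Assumed precedence: explicit argument first, then the environment variable.
    """
    proxy = explicit or os.environ.get("GOV_COLLECTOR_PROXY")
    if not proxy:
        return None
    parts = urlsplit(proxy)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        # Report only the scheme so credentials never end up in logs.
        raise ValueError(f"unsupported or malformed proxy URL (scheme={parts.scheme!r})")
    return proxy


chosen = resolve_proxy("http://user:pass@127.0.0.1:8080")
print(urlsplit(chosen).hostname)
```

Validating the URL up front turns a typo in the proxy string into an immediate error instead of a confusing connection failure later.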
Browser Automation
For states with JavaScript-rendered pages:
# Playwright (standard browser)
results = asyncio.run(collect("FL", "Smith", use_browser=True))
# Camoufox (anti-detection — recommended for protected sites)
results = asyncio.run(collect("FL", "Smith", use_camoufox=True))
Rate Limiting
Built-in rate limiting prevents overwhelming government servers:
# Default: 1 second between requests
results = asyncio.run(collect("CA", "Smith", rate_limit=1.0))
# Slower for sensitive sites
results = asyncio.run(collect("CA", "Smith", rate_limit=2.0))
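To make the `rate_limit` setting concrete, here is a minimal sketch of a minimum-interval limiter of the kind the description suggests. It is an illustration under assumptions, not the library's code: `MinIntervalLimiter` is a hypothetical name, and the demo uses a 0.1 second interval so it finishes quickly.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Ensure at least `interval` seconds between consecutive requests."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0  # monotonic time when the next request may start

    async def wait(self) -> None:
        # The lock serializes callers so concurrent tasks queue up in turn.
        async with self._lock:
            now = time.monotonic()
            delay = self._next_allowed - now
            if delay > 0:
                await asyncio.sleep(delay)
                now = time.monotonic()
            self._next_allowed = now + self.interval


async def demo() -> float:
    limiter = MinIntervalLimiter(interval=0.1)
    start = time.monotonic()
    for _ in range(3):
        await limiter.wait()  # a real collector would issue its request here
    return time.monotonic() - start


elapsed = asyncio.run(demo())
print(f"3 requests took {elapsed:.2f}s")
```

The first call passes immediately and each later call waits out the remainder of the interval, so three requests take roughly two intervals in total.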
Data Models
All results use Pydantic models with a consistent schema:
CollectorResult
Top-level wrapper containing:
- category: "licenses", "businesses", or "properties"
- state: Two-letter state code
- data: One of License, Business, or Property
- collected_at: Timestamp
- source: Data source name
License
- license_number, license_type, status
- holder_name, holder (Person)
- business_name, address
- issue_date, expiration_date
Business
- name, entity_type, status
- filing_number, formation_date
- address, registered_agent
- officers
Property
- parcel_id, address
- owner_name, assessed_value, market_value
- year_built, square_feet
See models.py for full field definitions.
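The wrapper-plus-payload shape described above can be sketched with plain dataclasses. This is a simplified illustration, not the contents of models.py: the library uses Pydantic, the field types and optionality here are assumptions, several listed fields (such as `holder` and `officers`) are omitted, and the sample values are placeholders.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional, Union


@dataclass
class License:
    license_number: str
    license_type: str
    status: str
    holder_name: str
    business_name: Optional[str] = None  # optionality is an assumption
    address: Optional[str] = None
    issue_date: Optional[str] = None
    expiration_date: Optional[str] = None


@dataclass
class Business:
    name: str
    entity_type: str
    status: str
    filing_number: Optional[str] = None
    formation_date: Optional[str] = None


@dataclass
class CollectorResult:
    category: str  # "licenses", "businesses", or "properties"
    state: str     # two-letter state code
    data: Union[License, Business]
    source: str
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# Placeholder record, not real license data.
result = CollectorResult(
    category="licenses",
    state="CA",
    data=License("00000000", "Salesperson", "Active", "Jane Example"),
    source="example source",
)
row = asdict(result)  # nested dataclasses become nested dicts
print(row["data"]["holder_name"])
```

Keeping the same wrapper for every state means downstream code can branch on `category` alone, as the Advanced Usage example does.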
API Reference
collect(state, query, **kwargs)
High-level async function that returns a list of results.
| Parameter | Type | Default | Description |
|---|---|---|---|
| state | `str` | required | Two-letter state code |
| query | `str` | required | Search term |
| category | `str \| None` | `None` | "licenses", "businesses", or "properties" |
| proxy | `str \| None` | `None` | Proxy URL |
| timeout | `float` | `30.0` | Request timeout (seconds) |
| rate_limit | `float` | `1.0` | Min seconds between requests |
| use_browser | `bool` | `False` | Enable Playwright |
| use_camoufox | `bool` | `False` | Enable Camoufox |
| max_results | `int` | `100` | Max results to return |
get_collector(state, **kwargs)
Returns a collector instance for advanced usage with async with.
list_states()
Returns list of StateInfo objects for all registered collectors.
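Since `collect` is an async function, several states can be queried concurrently with a cap on parallelism. The sketch below shows the pattern with a stand-in coroutine so it runs without network access. `collect_many` and `fake_collect` are hypothetical; in real use the stand-in would be replaced by a function with the same `(state, query)` call shape.

```python
import asyncio
from typing import Awaitable, Callable, Union

CollectFn = Callable[[str, str], Awaitable[list]]


async def collect_many(
    collect_fn: CollectFn,
    states: list[str],
    query: str,
    max_concurrent: int = 3,
) -> dict[str, Union[list, BaseException]]:
    """Query several states at once, keeping per-state errors instead of raising."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def one(state: str) -> list:
        async with semaphore:  # at most `max_concurrent` searches in flight
            return await collect_fn(state, query)

    outcomes = await asyncio.gather(
        *(one(s) for s in states), return_exceptions=True
    )
    return dict(zip(states, outcomes))


async def fake_collect(state: str, query: str) -> list:
    # Hypothetical stand-in: one state is "blocked", the others return a row.
    await asyncio.sleep(0.01)
    if state == "OR":
        raise RuntimeError("blocked by WAF")
    return [f"{state}:{query}"]


results = asyncio.run(collect_many(fake_collect, ["CA", "OR", "TX"], "Smith"))
for state, outcome in results.items():
    print(state, outcome)
```

Using `return_exceptions=True` lets one blocked state fail without discarding the results already gathered from the others.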
Contributing
- Fork the repo
- Create a feature branch
- Add or fix a state collector in gov_collector/states/
- Test: gov-collect search --state XX --query "test" -v
- Submit a PR
License
MIT — see LICENSE.