Harvest businesses from the Google Places API (New) and filter false positives with a configurable LLM backend.
placeharvest
Harvest businesses from the Google Places API (New) and strip false positives with a configurable LLM backend. The motivating use case is "all golf simulators in the US and Indonesia," but nothing is hard-coded to golf — the search terms, the regions, and the meaning of a "false positive" are all inputs.
Two stages with a durable NDJSON file in between, so re-running the (free) filter never re-spends the (billed) Google API:
fetch ─▶ data/raw/<region>.ndjson ─▶ filter ─▶ data/filtered/<region>.ndjson ─▶ report
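The hand-off between the two stages can be pictured with a minimal sketch: the fetch stage appends one JSON object per line, and the filter stage only ever reads that file, so re-running it makes no Google API calls. The helper names and the record fields (`id`, `name`) are illustrative assumptions, not placeharvest's actual schema.

```python
import json
import tempfile
from pathlib import Path


def append_ndjson(path, records):
    """Append one JSON object per line; lines already on disk are left untouched."""
    with open(path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Yield one dict per non-blank line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical records; the real field names may differ.
raw_path = Path(tempfile.mkdtemp()) / "id.ndjson"
append_ndjson(raw_path, [{"id": "a1", "name": "Swing Lab"}])
append_ndjson(raw_path, [{"id": "b2", "name": "Tee Box Indoor"}])
loaded = list(read_ndjson(raw_path))
```

Because each line is a complete record, an interrupted write loses at most the last line, which is one reason an append-only NDJSON file suits a resumable fetch.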
Install
pip install placeharvest # core (fetch + cli)
pip install "placeharvest[anthropic]" # + Anthropic SDK for api/anthropic filtering (default)
pip install "placeharvest[openai]" # + OpenAI SDK for api/openai filtering
pip install "placeharvest[all]" # both SDKs
cli filter mode needs no extra — it shells out to a separately-installed
claude or codex binary.
Credentials (two independent domains)
| Stage | Reads | When |
|---|---|---|
| fetch | GOOGLE_MAPS_API_KEY | always (the only secret the fetcher reads) |
| filter api/anthropic | ANTHROPIC_API_KEY | default |
| filter api/openai | OPENAI_API_KEY | --provider openai |
| filter cli/anthropic | ANTHROPIC_API_KEY, or a logged-in claude session (--no-cli-bare) | --mode cli |
| filter cli/openai | CODEX_API_KEY (per-invocation), or a logged-in codex session | --mode cli --provider openai |
Put them in a .env file (see .env.example) or export them.
The filter matrix (mode × provider)
| mode | provider | What runs | Auth |
|---|---|---|---|
| cli | anthropic | claude -p subprocess (headless Claude Code) | ANTHROPIC_API_KEY or logged-in session |
| cli | openai | codex exec subprocess (headless Codex) | CODEX_API_KEY or logged-in session |
| api | anthropic | Anthropic Messages API via SDK | ANTHROPIC_API_KEY |
| api | openai | OpenAI Responses API via SDK | OPENAI_API_KEY |
api mode is the default — self-contained, no external binary, right for CI.
cli mode exists for users who already run Claude Code or Codex on a paid plan
and want filtering to ride that session. Impossible combos fail at startup with a
message naming the exact missing piece.
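A startup check along these lines could produce the "exact missing piece" message. This is a sketch under assumptions, not placeharvest's code: `MATRIX` and `check_combo` are hypothetical names, and the rule that cli mode tolerates a missing key (because a logged-in session may stand in for it) is inferred from the table above.

```python
import shutil

# (mode, provider) -> (binary needed on PATH or None, env var that satisfies auth)
MATRIX = {
    ("cli", "anthropic"): ("claude", "ANTHROPIC_API_KEY"),
    ("cli", "openai"): ("codex", "CODEX_API_KEY"),
    ("api", "anthropic"): (None, "ANTHROPIC_API_KEY"),
    ("api", "openai"): (None, "OPENAI_API_KEY"),
}


def check_combo(mode, provider, env, which=shutil.which):
    """Return a list of human-readable problems; an empty list means the combo is usable."""
    if (mode, provider) not in MATRIX:
        return [f"unsupported combination: mode={mode!r} provider={provider!r}"]
    binary, key = MATRIX[(mode, provider)]
    problems = []
    if binary is not None and which(binary) is None:
        problems.append(f"{binary!r} binary not found on PATH (needed for --mode cli)")
    # Assumption: in cli mode a logged-in session can replace the key,
    # so only api mode treats a missing key as fatal.
    if mode == "api" and not env.get(key):
        problems.append(f"{key} is not set (needed for --mode api --provider {provider})")
    return problems
```

Calling `check_combo("api", "openai", os.environ)` before any network work keeps the failure cheap: nothing is fetched or billed until the list comes back empty.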
The golf example, end to end
# 0. Estimate cost before spending anything.
placeharvest fetch --profile examples/golf_us_id.yaml --region indonesia --dry-run
# 1. Cheap run first (Indonesia is sparse) to validate end-to-end.
placeharvest fetch --profile examples/golf_us_id.yaml --region indonesia \
--out data/raw/id.ndjson
# 2. Filter false positives. Target description drives "what is a real match".
placeharvest filter --profile examples/golf_us_id.yaml \
--in data/raw/id.ndjson --out data/filtered/id.ndjson
# 3. Summarize + export keep.csv / uncertain.csv grouped by country.
placeharvest report --in data/filtered/id.ndjson --csv-dir data/exports/id
# 4. Then the expensive US run (resumable if interrupted).
placeharvest fetch --profile examples/golf_us_id.yaml --region us \
--out data/raw/us.ndjson --resume
Override anything from the profile on the command line:
placeharvest filter --in data/raw/id.ndjson --out data/filtered/id.ndjson \
--mode api --provider openai --model gpt-5.1 --batch-size 25 \
--target "indoor golf simulator venues; exclude courses, ranges, mini golf, shops"
Library use
from placeharvest import (
    PlacesClient, load_region, resolve_queries, run_fetch,
    make_backend, run_filter, build_report,
)
region = load_region("examples/regions/indonesia.yaml")
queries = ["golf simulator", "indoor golf"]
with PlacesClient(api_key="...") as client:
    run_fetch(client, region, queries, "data/raw/id.ndjson")
backend = make_backend("api", "anthropic", "claude-sonnet-4-6")
run_filter(backend, "data/raw/id.ndjson", "data/filtered/id.ndjson",
           target="indoor golf simulator venues; exclude courses and shops")
How it works
Fetch. A thin client against the raw places:searchText REST endpoint (no
maintained pip package exposes nextPageToken on the new Text Search endpoint).
Each region is a bounding box tiled into overlapping circles; every search term
runs against every tile; results dedup on place.id. The field mask is the cost
lever — rating/userRatingCount are included (cheap, help filtering) but
reviews are excluded from the bulk pull.
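One way to tile a bounding box into overlapping circles, sketched under assumptions: centres sit on a square grid spaced `radius * sqrt(2)` apart, so each circle circumscribes its grid cell and neighbouring circles overlap. The function names, the grid spacing, and the flat metres-per-degree conversion are illustrative choices, not necessarily what placeharvest does; the sketch ignores the antimeridian and the poles.

```python
import math

METERS_PER_DEG_LAT = 111_320.0  # rough average, good enough for a sketch


def tile_bbox(south, west, north, east, radius_m):
    """Cover a lat/lon bounding box with (lat, lon, radius_m) circles."""
    step_m = radius_m * math.sqrt(2)  # side of the square inscribed in each circle
    dlat = step_m / METERS_PER_DEG_LAT
    tiles = []
    lat = south + dlat / 2
    while lat - dlat / 2 < north:
        # A degree of longitude shrinks with the cosine of the latitude.
        dlon = step_m / (METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = west + dlon / 2
        while lon - dlon / 2 < east:
            tiles.append((lat, lon, radius_m))
            lon += dlon
        lat += dlat
    return tiles


def dedup_by_id(places):
    """Keep the first record seen for each place id, preserving order."""
    seen = {}
    for place in places:
        seen.setdefault(place["id"], place)
    return list(seen.values())
```

Overlap between neighbouring circles means the same venue is often returned by several tiles, which is why the dedup step on the place id is not optional.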
Adaptive subdivision. The API caps results at 60 per (query, tile) (20 ×
3 pages). When a tile returns a full 60 it's saturated, so it's split into four
half-radius sub-tiles and re-searched, up to --max-depth (default 3). This
approaches completeness on dense metros without guessing density up front.
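The split-and-re-search loop can be sketched as a short recursion. Assumptions to note: `search` is a hypothetical callable that returns at most 60 place dicts for a tile, and the sub-tile centres are offset so that the four half-radius circles cover the square inscribed in the parent circle; the package may place its sub-tiles differently.

```python
import math

PAGE_CAP = 60  # 20 results x 3 pages per (query, tile)
METERS_PER_DEG_LAT = 111_320.0  # rough average


def split_tile(lat, lon, radius_m):
    """Return four half-radius sub-tiles around (lat, lon)."""
    off_m = radius_m / (2 * math.sqrt(2))  # centre of each quadrant of the inscribed square
    dlat = off_m / METERS_PER_DEG_LAT
    dlon = off_m / (METERS_PER_DEG_LAT * math.cos(math.radians(lat)))
    return [
        (lat + sy * dlat, lon + sx * dlon, radius_m / 2)
        for sy in (-1, 1)
        for sx in (-1, 1)
    ]


def harvest(search, tile, max_depth=3, depth=0):
    """Search one tile; if it comes back full, recurse into its four sub-tiles."""
    results = search(tile)
    found = {place["id"]: place for place in results}
    if len(results) >= PAGE_CAP and depth < max_depth:
        for sub in split_tile(*tile):
            found.update(harvest(search, sub, max_depth, depth + 1))
    return found
```

Keeping results in a dict keyed by id means a venue found by both a parent tile and a sub-tile is stored once.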
Filter. The NDJSON is walked in batches (default 50). Each batch is sent to
the configured LLM with a system prompt parameterized by your --target. The
model returns a strict JSON verdict per place — keep / reject / uncertain
(three-way, so borderline cases aren't silently dropped). The runner validates
the contract defensively: count mismatches retry then split, hallucinated ids are
dropped, omitted ids become uncertain, invalid JSON retries at a smaller batch.
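The defensive part of that contract might look like the sketch below. The reply shape (a JSON array of `{"id": ..., "verdict": ...}` objects) and the name `reconcile` are assumptions for illustration, as is mapping an unrecognised verdict to uncertain; the retry and batch-splitting logic is left to the caller.

```python
import json

VALID = {"keep", "reject", "uncertain"}


def reconcile(batch_ids, raw_reply):
    """Map every id in the batch to exactly one verdict.

    Raises ValueError on malformed output so the caller can retry with a smaller batch.
    """
    try:
        items = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError("model reply is not valid JSON") from exc
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of verdicts")

    wanted = set(batch_ids)
    verdicts = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        pid, verdict = item.get("id"), item.get("verdict")
        if not isinstance(pid, str) or pid not in wanted:
            continue  # hallucinated id: drop it
        # Assumption: an unrecognised verdict is treated as uncertain.
        verdicts[pid] = verdict if verdict in VALID else "uncertain"
    for pid in batch_ids:
        verdicts.setdefault(pid, "uncertain")  # omitted id: never silently dropped
    return verdicts
```

The invariant worth keeping is that the output always has exactly the ids that went in, whatever the model returned.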
Cost & coverage caveats (don't ignore these)
- Completeness is asymptotic. The 60-result ceiling plus "bias, not restrict" location semantics mean some venues are missed even with subdivision. No setting guarantees 100%.
- Cost scales with grid density and term count, and subdivision fans out on dense metros, so real cost can exceed the pre-subdivision --dry-run estimate. Watch the live counter.
- Caching terms: the dump is point-in-time; only place_id is legal to retain indefinitely.
- The filter is a heuristic over sparse fields: with no reviews in the bulk pull, some judgments ride on name + type + website alone, hence uncertain. For higher precision, add a targeted second pass that fetches reviews for uncertain places only and re-filters.
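To see how far subdivision can push cost past a pre-subdivision estimate, here is a back-of-the-envelope bound on request counts. It assumes every search uses all three pages and, in the worst case, that every tile saturates at every depth; `request_bounds` is a hypothetical helper and is not how --dry-run necessarily computes its figure.

```python
def request_bounds(n_tiles, n_terms, pages_per_search=3, max_depth=3):
    """Return (pre-subdivision, worst-case) request counts."""
    base = n_tiles * n_terms * pages_per_search
    # Worst case: each seed tile spawns 1 + 4 + 16 + ... + 4**max_depth searches.
    fanout = sum(4 ** depth for depth in range(max_depth + 1))
    return base, base * fanout


base, worst = request_bounds(n_tiles=16, n_terms=2)
print(base, worst)  # 96 8160
```

Real runs land far below the worst case because only dense tiles saturate, but the gap between the two numbers shows why the live counter matters more than the up-front estimate.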
License
MIT — see LICENSE.