placeharvest

Harvest businesses from the Google Places API (New) and strip false positives with a configurable LLM backend. The motivating use case is "all golf simulators in the US and Indonesia," but nothing is hard-coded to golf — the search terms, the regions, and the meaning of a "false positive" are all inputs.

Two stages with a durable NDJSON file in between, so re-running the filter (which makes no Google calls) never re-spends the billed Google API:

fetch  ─▶  data/raw/<region>.ndjson  ─▶  filter  ─▶  data/filtered/<region>.ndjson  ─▶  report
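To make the stage boundary concrete, here is a minimal sketch of how an NDJSON file can act as the durable hand-off: one JSON object per line, appended as results arrive, and re-read to learn which ids are already on disk. The helper names (`append_records`, `seen_ids`) and the record shape are illustrative assumptions, not placeharvest's actual internals.

```python
import json
import os
import tempfile

def append_records(path, records):
    """Append one JSON object per line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")

def seen_ids(path):
    """Collect the ids already on disk so a resumed run can skip them."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}

# Demo against a throwaway directory (hypothetical records).
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "id.ndjson")
append_records(raw_path, [{"id": "a", "name": "Indoor Golf Lab"}])
append_records(raw_path, [{"id": "b", "name": "Swing Bay"}])
already = seen_ids(raw_path)
```

Because the file is append-only and line-delimited, an interrupted run loses at most the line being written, and the filter stage can be re-run against the same file as often as needed.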

Install

pip install placeharvest                # core (fetch + cli)
pip install "placeharvest[anthropic]"   # + Anthropic SDK for api/anthropic filtering (default)
pip install "placeharvest[openai]"      # + OpenAI SDK for api/openai filtering
pip install "placeharvest[all]"         # both SDKs

cli filter mode needs no extra — it shells out to a separately-installed claude or codex binary.

Credentials (two independent domains)

| Stage | Reads | When |
| --- | --- | --- |
| fetch | GOOGLE_MAPS_API_KEY | always (the only secret the fetcher reads) |
| filter api/anthropic | ANTHROPIC_API_KEY | default |
| filter api/openai | OPENAI_API_KEY | --provider openai |
| filter cli/anthropic | ANTHROPIC_API_KEY, or a logged-in claude session (--no-cli-bare) | --mode cli |
| filter cli/openai | CODEX_API_KEY (per-invocation), or a logged-in codex session | --mode cli --provider openai |

Put them in a .env file (see .env.example) or export them.

The filter matrix (mode × provider)

| mode | provider | What runs | Auth |
| --- | --- | --- | --- |
| cli | anthropic | claude -p subprocess (headless Claude Code) | ANTHROPIC_API_KEY or logged-in session |
| cli | openai | codex exec subprocess (headless Codex) | CODEX_API_KEY or logged-in session |
| api | anthropic | Anthropic Messages API via SDK | ANTHROPIC_API_KEY |
| api | openai | OpenAI Responses API via SDK | OPENAI_API_KEY |

api mode is the default — self-contained, no external binary, right for CI. cli mode exists for users who already run Claude Code or Codex on a paid plan and want filtering to ride that session. Impossible combos fail at startup with a message naming the exact missing piece.
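A startup check of this kind could look roughly like the sketch below. It is not the package's code: `check_combo`, the lookup tables, and the message wording are hypothetical, and a logged-in CLI session is not detected here, only the presence of the binary.

```python
import os
import shutil

# Hypothetical lookup tables mirroring the mode x provider matrix.
CLI_BINARY = {"anthropic": "claude", "openai": "codex"}
API_KEY_VAR = {"anthropic": "ANTHROPIC_API_KEY", "openai": "OPENAI_API_KEY"}

def check_combo(mode, provider, env=None, which=shutil.which):
    """Return None if the combo can run, else a message naming what is missing."""
    env = os.environ if env is None else env
    if mode not in ("api", "cli"):
        return f"unknown mode {mode!r}; expected 'api' or 'cli'"
    if provider not in API_KEY_VAR:
        return f"unknown provider {provider!r}; expected 'anthropic' or 'openai'"
    if mode == "api":
        var = API_KEY_VAR[provider]
        return None if env.get(var) else f"api/{provider} needs {var} to be set"
    binary = CLI_BINARY[provider]
    # Assumption: only the binary is checked; session login state is not probed.
    if which(binary):
        return None
    return f"cli/{provider} needs the '{binary}' binary on PATH"
```

Passing `env` and `which` as parameters keeps the check testable without touching the real environment.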

The golf example, end to end

# 0. Estimate cost before spending anything.
placeharvest fetch --profile examples/golf_us_id.yaml --region indonesia --dry-run

# 1. Cheap run first (Indonesia is sparse) to validate end-to-end.
placeharvest fetch  --profile examples/golf_us_id.yaml --region indonesia \
    --out data/raw/id.ndjson

# 2. Filter false positives. Target description drives "what is a real match".
placeharvest filter --profile examples/golf_us_id.yaml \
    --in data/raw/id.ndjson --out data/filtered/id.ndjson

# 3. Summarize + export keep.csv / uncertain.csv grouped by country.
placeharvest report --in data/filtered/id.ndjson --csv-dir data/exports/id

# 4. Then the expensive US run (resumable if interrupted).
placeharvest fetch  --profile examples/golf_us_id.yaml --region us \
    --out data/raw/us.ndjson --resume

Override anything from the profile on the command line:

placeharvest filter --in data/raw/id.ndjson --out data/filtered/id.ndjson \
    --mode api --provider openai --model gpt-5.1 --batch-size 25 \
    --target "indoor golf simulator venues; exclude courses, ranges, mini golf, shops"

Library use

from placeharvest import (
    PlacesClient, load_region, resolve_queries, run_fetch,
    make_backend, run_filter, build_report,
)

region = load_region("examples/regions/indonesia.yaml")
queries = ["golf simulator", "indoor golf"]
with PlacesClient(api_key="...") as client:
    run_fetch(client, region, queries, "data/raw/id.ndjson")

backend = make_backend("api", "anthropic", "claude-sonnet-4-6")
run_filter(backend, "data/raw/id.ndjson", "data/filtered/id.ndjson",
           target="indoor golf simulator venues; exclude courses and shops")

How it works

Fetch. A thin client against the raw places:searchText REST endpoint (no maintained pip package exposes nextPageToken on the new Text Search endpoint). Each region is a bounding box tiled into overlapping circles; every search term runs against every tile; results are deduplicated on place.id. The field mask is the cost lever: rating and userRatingCount are included (cheap, and they help filtering), but reviews are excluded from the bulk pull.
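The fetch stage can be sketched as three small pieces: tiling, request construction, and deduplication. The function names, the grid spacing, the metres-per-degree constant, and the exact field list are assumptions for illustration; the endpoint URL, the `X-Goog-Api-Key` and `X-Goog-FieldMask` headers, and the `textQuery`, `locationBias`, `pageSize`, and `pageToken` body fields are those of the Places API (New) Text Search. The sketch only builds requests and does not send them.

```python
import math

M_PER_DEG_LAT = 111_320.0  # rough metres per degree of latitude (approximation)

def tile_bbox(south, west, north, east, radius_m):
    """Cover a bounding box with overlapping circles on a square grid.

    Centres sit radius*sqrt(2) apart, the widest square-grid spacing at which
    circles of that radius leave no gaps. Boxes crossing the antimeridian are
    not handled.
    """
    step_m = radius_m * math.sqrt(2)
    dlat = step_m / M_PER_DEG_LAT
    tiles = []
    lat = south + dlat / 2
    while lat - dlat / 2 < north:
        # A degree of longitude shrinks with latitude, so recompute per row.
        dlon = step_m / (M_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = west + dlon / 2
        while lon - dlon / 2 < east:
            tiles.append((lat, lon, radius_m))
            lon += dlon
        lat += dlat
    return tiles

# Assumed field list: cheap fields only, no reviews.
FIELD_MASK = ",".join([
    "places.id", "places.displayName", "places.formattedAddress",
    "places.types", "places.rating", "places.userRatingCount",
    "places.websiteUri", "nextPageToken",
])

def build_search_request(api_key, query, tile, page_token=None):
    """Return (url, headers, json_body) for one Text Search page."""
    lat, lon, radius_m = tile
    headers = {
        "Content-Type": "application/json",
        "X-Goog-Api-Key": api_key,
        "X-Goog-FieldMask": FIELD_MASK,
    }
    body = {
        "textQuery": query,
        "pageSize": 20,
        "locationBias": {
            "circle": {
                "center": {"latitude": lat, "longitude": lon},
                "radius": radius_m,
            }
        },
    }
    if page_token:
        body["pageToken"] = page_token
    return "https://places.googleapis.com/v1/places:searchText", headers, body

def dedup(places, seen):
    """Keep only places whose id has not been seen; updates `seen` in place."""
    fresh = []
    for place in places:
        if place["id"] not in seen:
            seen.add(place["id"])
            fresh.append(place)
    return fresh
```

Since neighbouring circles overlap and several terms hit the same tile, the same venue is expected to come back many times, which is why deduplication on the place id happens before anything is written.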

Adaptive subdivision. The API caps results at 60 per (query, tile) (20 × 3 pages). When a tile returns a full 60 it's saturated, so it's split into four half-radius sub-tiles and re-searched, up to --max-depth (default 3). This approaches completeness on dense metros without guessing density up front.
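The recursion described above can be sketched as follows, using planar coordinates and a stand-in search function in place of the real API. `harvest_tile`, `subdivide`, and `fake_search` are hypothetical names, and the placement of the four sub-tile centres (the parent's quadrant centres) is an assumption; the package may place them differently.

```python
import math

PAGE_CAP = 60  # 20 results x 3 pages per (query, tile)

def subdivide(tile):
    """Split a circle into four half-radius circles, one per quadrant.

    Assumed placement: four half-radius circles cannot cover the parent
    exactly, so this leans on overlap with neighbouring tiles.
    """
    x, y, r = tile
    h = r / 2
    return [(x + dx, y + dy, h) for dx in (-h, h) for dy in (-h, h)]

def harvest_tile(search, query, tile, max_depth=3, depth=0, found=None):
    """Search a tile; if it comes back saturated, recurse into sub-tiles."""
    found = {} if found is None else found
    results = search(query, tile)
    for place in results:
        found.setdefault(place["id"], place)  # dedup on id
    if len(results) >= PAGE_CAP and depth < max_depth:
        for sub in subdivide(tile):
            harvest_tile(search, query, sub, max_depth, depth + 1, found)
    return found

# Stand-in for the API: a dense 15 x 15 grid of venues, capped at 60 per call.
VENUES = [{"id": f"{x},{y}", "x": x, "y": y}
          for x in range(-7, 8) for y in range(-7, 8)]
calls = []

def fake_search(query, tile):
    calls.append(tile)
    cx, cy, r = tile
    inside = [v for v in VENUES if math.hypot(v["x"] - cx, v["y"] - cy) <= r]
    inside.sort(key=lambda v: math.hypot(v["x"] - cx, v["y"] - cy))
    return inside[:PAGE_CAP]

shallow = harvest_tile(fake_search, "golf simulator", (0.0, 0.0, 10.0), max_depth=0)
deep = harvest_tile(fake_search, "golf simulator", (0.0, 0.0, 10.0), max_depth=3)
```

In this toy setup the unsplit tile stops at the 60-result cap, while the subdivided run recovers venues the cap had hidden, at the price of four extra searches.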

Filter. The NDJSON is walked in batches (default 50). Each batch is sent to the configured LLM with a system prompt parameterized by your --target. The model returns a strict JSON verdict per place — keep / reject / uncertain (three-way, so borderline cases aren't silently dropped). The runner validates the contract defensively: count mismatches retry then split, hallucinated ids are dropped, omitted ids become uncertain, invalid JSON retries at a smaller batch.
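The defensive contract can be illustrated with a short sketch. The reply schema (a JSON list of objects with `id` and `verdict`), the function names, and the halving strategy are assumptions made for the example, not the package's actual implementation; the separate retry on a count mismatch is left out for brevity.

```python
import json

VALID = {"keep", "reject", "uncertain"}

def reconcile(batch_ids, raw_reply):
    """Map a model reply onto the batch; return None if it is not usable JSON."""
    try:
        verdicts = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(verdicts, list):
        return None
    wanted = set(batch_ids)
    out = {}
    for item in verdicts:
        if not isinstance(item, dict):
            continue
        pid, verdict = item.get("id"), item.get("verdict")
        # Ids the model invented (not in the batch) are dropped here.
        if isinstance(pid, str) and pid in wanted and verdict in VALID:
            out.setdefault(pid, verdict)
    # Ids the model omitted fall back to "uncertain" rather than vanishing.
    for pid in batch_ids:
        out.setdefault(pid, "uncertain")
    return out

def classify(batch_ids, ask_model):
    """Ask once; on an unusable reply, split the batch in half and recurse."""
    result = reconcile(batch_ids, ask_model(batch_ids))
    if result is not None:
        return result
    if len(batch_ids) == 1:
        return {batch_ids[0]: "uncertain"}
    mid = len(batch_ids) // 2
    return {**classify(batch_ids[:mid], ask_model),
            **classify(batch_ids[mid:], ask_model)}

# Demo reply with one hallucinated id ("zzz") and one omitted id ("c").
reply = json.dumps([
    {"id": "a", "verdict": "keep"},
    {"id": "b", "verdict": "reject"},
    {"id": "zzz", "verdict": "keep"},
])
verdicts = reconcile(["a", "b", "c"], reply)
```

The point of the three-way verdict shows up in the fallback: anything the model fails to answer cleanly lands in `uncertain` for later review instead of being counted as a reject.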

Cost & coverage caveats (don't ignore these)

  • Completeness is asymptotic. The 60-result ceiling plus "bias, not restrict" location semantics mean some venues are missed even with subdivision. No setting guarantees 100%.
  • Cost scales with grid density and term count, and subdivision fans out on dense metros, so real cost can exceed the pre-subdivision --dry-run estimate. Watch the live counter.
  • Caching terms: under Google's terms the dump is a point-in-time snapshot; only place_id may be retained indefinitely.
  • The filter is a heuristic over sparse fields — with no reviews in the bulk pull, some judgments ride on name + type + website alone, hence uncertain. For higher precision, add a targeted second pass that fetches reviews for uncertain places only and re-filters.
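To get a feel for how far subdivision can push cost past a pre-subdivision estimate, the worst case can be bounded with simple arithmetic. The sketch assumes the estimate counts up to three pages per (query, tile) and that every tile saturates at every depth, which real data will not do; `request_bounds` is a hypothetical helper, not part of the package.

```python
def request_bounds(base_tiles, terms, max_depth=3, pages_per_search=3):
    """Return (pre_subdivision_requests, worst_case_requests).

    Worst case: every tile is saturated at every level, so each base tile
    expands into 1 + 4 + 16 + ... + 4**max_depth searched tiles.
    """
    pre = base_tiles * terms * pages_per_search
    tiles_per_base = sum(4 ** k for k in range(max_depth + 1))
    return pre, pre * tiles_per_base

pre, worst = request_bounds(base_tiles=100, terms=2)
```

With the default depth of 3, each base tile can expand into at most 85 searched tiles, so the ceiling is 85 times the pre-subdivision figure. Actual runs sit far below that ceiling, since only dense metros saturate, but the gap is why the live counter matters.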

License

MIT — see LICENSE.
