Harvest businesses from the Google Places API (New) and filter false positives with a configurable LLM backend.

placeharvest

Harvest businesses from the Google Places API (New) and strip false positives with a configurable LLM backend. The motivating use case is "all golf simulators in the US and Indonesia," but nothing is hard-coded to golf — the search terms, the regions, and the meaning of a "false positive" are all inputs.

Fetch and filter are separate stages with a durable NDJSON file in between, so re-running the filter, which makes no Google API calls, never re-spends the billed fetch:

fetch  ─▶  data/raw/<region>.ndjson  ─▶  filter  ─▶  data/filtered/<region>.ndjson  ─▶  report
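The hand-off between stages is plain NDJSON: one JSON object per line, appended as results arrive. As a minimal sketch of that idea (the `id` and `name` fields and both helper names are assumptions for illustration, not the package's actual schema or API):

```python
import json
import os
import tempfile


def append_ndjson(path, records):
    """Append records to an NDJSON file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_ndjson(path):
    """Yield one dict per non-empty line; a missing file yields nothing."""
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo with made-up records in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "id.ndjson")
    append_ndjson(raw_path, [{"id": "a", "name": "Swing Lab"}])
    append_ndjson(raw_path, [{"id": "b", "name": "Par Room"}])
    loaded = list(read_ndjson(raw_path))
```

Because every line is a complete record, an interrupted run leaves a file that is still readable up to the last full line, which is what makes the intermediate file safe to re-read as many times as the filter is re-run.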

Install

pip install placeharvest                # core (fetch + cli)
pip install "placeharvest[anthropic]"   # + Anthropic SDK for api/anthropic filtering (default)
pip install "placeharvest[openai]"      # + OpenAI SDK for api/openai filtering
pip install "placeharvest[all]"         # both SDKs

cli filter mode needs no extra — it shells out to a separately-installed claude or codex binary.

Credentials (two independent domains)

Stage	Reads	When
fetch	`GOOGLE_MAPS_API_KEY`	always (the only secret the fetcher reads)
filter `api/anthropic`	`ANTHROPIC_API_KEY`	default
filter `api/openai`	`OPENAI_API_KEY`	`--provider openai`
filter `cli/anthropic`	`ANTHROPIC_API_KEY`, or a logged-in `claude` session (`--no-cli-bare`)	`--mode cli`
filter `cli/openai`	`CODEX_API_KEY` (per-invocation), or a logged-in `codex` session	`--mode cli --provider openai`

Put the keys in a .env file (see .env.example) or export them as environment variables.

The filter matrix (mode × provider)

mode	provider	What runs	Auth
`cli`	`anthropic`	`claude -p` subprocess (headless Claude Code)	`ANTHROPIC_API_KEY` or logged-in session
`cli`	`openai`	`codex exec` subprocess (headless Codex)	`CODEX_API_KEY` or logged-in session
`api`	`anthropic`	Anthropic Messages API via SDK	`ANTHROPIC_API_KEY`
`api`	`openai`	OpenAI Responses API via SDK	`OPENAI_API_KEY`

api mode is the default — self-contained, no external binary, right for CI. cli mode exists for users who already run Claude Code or Codex on a paid plan and want filtering to ride that session. Impossible combos fail at startup with a message naming the exact missing piece.
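A startup check of this kind could look roughly like the sketch below. The `REQUIREMENTS` table and `check_combo` function are hypothetical names that mirror the matrix above; they are not placeharvest's internals, and how the real tool detects a logged-in session is not specified here, so it is passed in as a plain flag.

```python
import shutil

# Hypothetical requirement table mirroring the mode x provider matrix.
REQUIREMENTS = {
    ("api", "anthropic"): {"env": "ANTHROPIC_API_KEY", "binary": None},
    ("api", "openai"): {"env": "OPENAI_API_KEY", "binary": None},
    ("cli", "anthropic"): {"env": "ANTHROPIC_API_KEY", "binary": "claude"},
    ("cli", "openai"): {"env": "CODEX_API_KEY", "binary": "codex"},
}


def check_combo(mode, provider, env, which=shutil.which, logged_in=False):
    """Return a list of human-readable problems; empty means usable."""
    req = REQUIREMENTS.get((mode, provider))
    if req is None:
        return [f"unknown combination: mode={mode!r} provider={provider!r}"]
    problems = []
    # cli mode needs the external binary on PATH.
    if req["binary"] and which(req["binary"]) is None:
        problems.append(f"{req['binary']} binary not found on PATH")
    # In cli mode a logged-in session can stand in for the key.
    key_optional = mode == "cli" and logged_in
    if not env.get(req["env"]) and not key_optional:
        problems.append(f"{req['env']} is not set")
    return problems
```

Calling `check_combo("api", "openai", os.environ)` before any work starts would surface a missing key immediately instead of partway through a batch.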

The golf example, end to end

# 0. Estimate cost before spending anything.
placeharvest fetch --profile examples/golf_us_id.yaml --region indonesia --dry-run

# 1. Cheap run first (Indonesia is sparse) to validate end-to-end.
placeharvest fetch  --profile examples/golf_us_id.yaml --region indonesia \
    --out data/raw/id.ndjson

# 2. Filter false positives. Target description drives "what is a real match".
placeharvest filter --profile examples/golf_us_id.yaml \
    --in data/raw/id.ndjson --out data/filtered/id.ndjson

# 3. Summarize + export keep.csv / uncertain.csv grouped by country.
placeharvest report --in data/filtered/id.ndjson --csv-dir data/exports/id

# 4. Then the expensive US run (resumable if interrupted).
placeharvest fetch  --profile examples/golf_us_id.yaml --region us \
    --out data/raw/us.ndjson --resume

Override anything from the profile on the command line:

placeharvest filter --in data/raw/id.ndjson --out data/filtered/id.ndjson \
    --mode api --provider openai --model gpt-5.1 --batch-size 25 \
    --target "indoor golf simulator venues; exclude courses, ranges, mini golf, shops"

Library use

from placeharvest import (
    PlacesClient, load_region, resolve_queries, run_fetch,
    make_backend, run_filter, build_report,
)

region = load_region("examples/regions/indonesia.yaml")
queries = ["golf simulator", "indoor golf"]
with PlacesClient(api_key="...") as client:
    run_fetch(client, region, queries, "data/raw/id.ndjson")

backend = make_backend("api", "anthropic", "claude-sonnet-4-6")
run_filter(backend, "data/raw/id.ndjson", "data/filtered/id.ndjson",
           target="indoor golf simulator venues; exclude courses and shops")

How it works

Fetch. A thin client against the raw places:searchText REST endpoint (no maintained pip package exposes nextPageToken on the new Text Search endpoint). Each region is a bounding box tiled into overlapping circles; every search term runs against every tile, and results are deduplicated on place.id. The field mask is the cost lever: rating and userRatingCount are included (cheap, and they help filtering), but reviews are excluded from the bulk pull.
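To make the tiling and the request shape concrete, here is a rough sketch. The grid layout, the 0.9 overlap factor, the exact field list in `FIELD_MASK`, and the function names are all assumptions for illustration; only the endpoint, the `X-Goog-FieldMask` header, and the `textQuery` / `locationBias` / `pageToken` body fields come from the Places API (New) itself. The flat-earth conversion is an approximation that is only reasonable for small tiles away from the poles.

```python
import math

SEARCH_URL = "https://places.googleapis.com/v1/places:searchText"
# Assumed field list: cheap fields only, no reviews. nextPageToken must be
# requested explicitly or pagination is impossible.
FIELD_MASK = ("places.id,places.displayName,places.types,places.websiteUri,"
              "places.rating,places.userRatingCount,nextPageToken")
KM_PER_DEG_LAT = 111.32  # approximate


def tile_bbox(south, west, north, east, radius_km, overlap=0.9):
    """Cover a bounding box with overlapping circles (lat, lon, radius_km).

    Centers sit on a square grid. A spacing of radius * sqrt(2) is the
    widest that leaves no gaps on a flat grid; `overlap` shrinks it a bit.
    """
    step_km = radius_km * math.sqrt(2) * overlap
    step_lat = step_km / KM_PER_DEG_LAT
    tiles = []
    lat = south + step_lat / 2
    while lat - step_lat / 2 < north:
        # Longitude degrees shrink with latitude, so widen the step.
        step_lon = step_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
        lon = west + step_lon / 2
        while lon - step_lon / 2 < east:
            tiles.append((lat, lon, radius_km))
            lon += step_lon
        lat += step_lat
    return tiles


def search_text_headers(api_key):
    """Headers for one Text Search call; the field mask drives the cost."""
    return {
        "Content-Type": "application/json",
        "X-Goog-Api-Key": api_key,
        "X-Goog-FieldMask": FIELD_MASK,
    }


def search_text_request(query, tile, page_token=None):
    """Build the JSON body for one (query, tile) page."""
    lat, lon, radius_km = tile
    body = {
        "textQuery": query,
        "locationBias": {
            "circle": {
                "center": {"latitude": lat, "longitude": lon},
                "radius": radius_km * 1000.0,  # the API expects meters
            }
        },
    }
    if page_token:
        body["pageToken"] = page_token
    return body


# Demo: tile a small box around Jakarta with 10 km circles.
tiles = tile_bbox(-6.4, 106.6, -6.0, 107.0, 10.0)
```

The body and headers can be passed to any HTTP client as a POST to `SEARCH_URL`; following `nextPageToken` until it is absent, and keying a dict on each place's `id`, gives the per-tile dedup described above.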

Adaptive subdivision. The API caps results at 60 per (query, tile) (20 × 3 pages). When a tile returns a full 60 it's saturated, so it's split into four half-radius sub-tiles and re-searched, up to --max-depth (default 3). This approaches completeness on dense metros without guessing density up front.
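The recursion can be sketched as follows. This is an illustration under stated assumptions, not the package's code: `harvest`, `subdivide`, and the `search` callback are hypothetical names, coordinates are treated as flat planar units to keep the geometry simple, and the diagonal placement of the four sub-tiles is a guess. Four half-radius circles cannot cover the parent circle exactly, so the real placement may differ.

```python
import math

MAX_RESULTS = 60  # 20 per page x 3 pages, the cap described above


def subdivide(tile):
    """Split a (y, x, radius) tile into four half-radius sub-tiles."""
    y, x, r = tile
    half = r / 2
    return [(y + dy * half, x + dx * half, half)
            for dy in (-1, 1) for dx in (-1, 1)]


def harvest(tile, search, max_depth=3, depth=0, found=None):
    """Search a tile; if it comes back full, recurse into sub-tiles."""
    found = {} if found is None else found
    results = search(tile)
    for place in results:
        found.setdefault(place["id"], place)  # dedup on id
    if len(results) >= MAX_RESULTS and depth < max_depth:
        for sub in subdivide(tile):
            harvest(sub, search, max_depth, depth + 1, found)
    return found


# Demo with a fake search: two dense clusters of 40 made-up points each.
points = ([{"id": f"p{i}", "y": 1.0, "x": 1.0} for i in range(40)]
          + [{"id": f"q{i}", "y": -1.0, "x": -1.0} for i in range(40)])
calls = []


def fake_search(tile):
    """Return at most MAX_RESULTS points inside the tile's circle."""
    y, x, r = tile
    calls.append(tile)
    hits = [p for p in points if math.hypot(p["y"] - y, p["x"] - x) <= r]
    return hits[:MAX_RESULTS]


found = harvest((0.0, 0.0, 2.0), fake_search)
```

In the demo the parent tile sees 80 points but can only return 60, so it is treated as saturated; the four sub-tiles each return fewer than 60 and the recursion stops after five searches in total.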

Filter. The NDJSON is walked in batches (default 50). Each batch is sent to the configured LLM with a system prompt parameterized by your --target. The model returns a strict JSON verdict per place — keep / reject / uncertain (three-way, so borderline cases aren't silently dropped). The runner validates the contract defensively: count mismatches retry then split, hallucinated ids are dropped, omitted ids become uncertain, invalid JSON retries at a smaller batch.
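The defensive handling described above can be sketched like this. The response shape (a JSON list of objects with `id` and `verdict` keys) and the names `reconcile`, `filter_batch`, and `ask_model` are assumptions for illustration; the actual contract and retry policy live in the package and may differ, for example by retrying at the same size before splitting.

```python
import json

VERDICTS = {"keep", "reject", "uncertain"}


def reconcile(batch_ids, raw_response):
    """Map every id in the batch to a verdict, tolerating model mistakes.

    Raises ValueError on unparseable output so the caller can retry.
    """
    try:
        items = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        raise ValueError("model returned invalid JSON") from exc
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of verdicts")
    expected = set(batch_ids)
    verdicts = {}
    for item in items:
        if not isinstance(item, dict):
            continue
        pid, verdict = item.get("id"), item.get("verdict")
        if not isinstance(pid, str) or pid not in expected:
            continue  # hallucinated id: drop it
        verdicts[pid] = verdict if verdict in VERDICTS else "uncertain"
    for pid in batch_ids:
        verdicts.setdefault(pid, "uncertain")  # omitted id
    return verdicts


def filter_batch(ids, ask_model):
    """Ask the model for verdicts; on bad output, split the batch in half."""
    try:
        return reconcile(ids, ask_model(ids))
    except ValueError:
        if len(ids) <= 1:
            return {pid: "uncertain" for pid in ids}
        mid = len(ids) // 2
        merged = filter_batch(ids[:mid], ask_model)
        merged.update(filter_batch(ids[mid:], ask_model))
        return merged
```

The key property is that every input id leaves with exactly one verdict: nothing the model invents gets in, and nothing it forgets is silently dropped.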

Cost & coverage caveats (don't ignore these)

Completeness is asymptotic. The 60-result ceiling plus "bias, not restrict" location semantics mean some venues are missed even with subdivision. No setting guarantees 100%.
Cost scales with grid density and term count, and subdivision fans out on dense metros, so real cost can exceed the pre-subdivision --dry-run estimate. Watch the live counter.
Caching terms: the dump is a point-in-time snapshot; only place_id may legally be retained indefinitely.
The filter is a heuristic over sparse fields — with no reviews in the bulk pull, some judgments ride on name + type + website alone, hence uncertain. For higher precision, add a targeted second pass that fetches reviews for uncertain places only and re-filters.
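To see why the real bill can exceed a pre-subdivision estimate, a back-of-the-envelope calculation helps. The sketch below assumes each page of each (query, tile) search counts as one request and that the estimate uses the full three pages; both are assumptions about how `--dry-run` counts, and the function names are hypothetical. No price is hard-coded: supply the current per-1000-request rate yourself.

```python
def estimate_requests(tiles, terms, pages_per_search=3):
    """Upper bound on requests before any subdivision."""
    return tiles * terms * pages_per_search


def worst_case_requests(tiles, terms, max_depth=3, pages_per_search=3):
    """Upper bound if every tile saturates at every subdivision level."""
    # Each level multiplies the tile count by four: 1 + 4 + 16 + ...
    fanout = sum(4 ** d for d in range(max_depth + 1))
    return tiles * terms * pages_per_search * fanout


def cost(requests, price_per_1000):
    """Convert a request count to money at a caller-supplied rate."""
    return requests / 1000 * price_per_1000


base = estimate_requests(tiles=16, terms=2)
worst = worst_case_requests(tiles=16, terms=2)
```

With 16 tiles and 2 terms the base estimate is 96 requests, while the theoretical ceiling at depth 3 is 85 times that. Real runs land far below the ceiling because only dense tiles saturate, but the gap shows why the live counter is worth watching.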

License

MIT — see LICENSE.

This version

0.1.0

Jun 22, 2026

