harvex
The data-harvesting foundation for the AI era — a zero-boilerplate framework for AI agents and vibecoding.
Let the AI (or your own vibecoding) write only what it is good at, namely how to fetch and how to parse, and leave everything else to the framework: concurrent execution, field consolidation, deduplicated writes, run metadata, HTTP retries, logging, alerting, data-health checks, scheduling, a read-only web UI for browsing data, LLM translation/enrichment, and a TUI control panel.
pip install harvex # core, zero heavy deps
pip install "harvex[web,llm,browser,tui]" # opt into extras as needed
Why "a harvesting foundation for the AI era"
When an LLM writes a scraper, the model is great at "how do I turn this page/endpoint into structured data" and bad at — and most likely to get wrong — the surrounding engineering: retry/backoff, concurrency isolation, incremental dedup, schema drift, scheduling, observability. harvex turns all of that into a stable foundation and gives the AI a narrow, rock-solid contract surface:
from harvex import BaseSource, SourceProfile


class GithubTrending(BaseSource):
    profile = SourceProfile(slug="gh_trending", name="GitHub Trending")

    def fetch(self):
        return self.ctx.http.get_json("https://api.example.com/trending")

    def parse(self, raw):
        for item in raw["items"]:
            yield {"title": item["name"], "stars": item["stars"]}
The AI only needs to produce a class like this, and harvex run drives the whole chain: fetch → validate → dedup → store → run metadata → health check. New fields won't blow up your main table (they fold into an extra column automatically), one failing source won't take down the round, and dirty data is rejected before it hits the database.
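To make the "new fields fold into an extra column" behaviour concrete, here is a minimal sketch of the idea using only the standard library. It is not harvex's implementation: `DECLARED_FIELDS`, `consolidate`, and the `extra` key name are hypothetical stand-ins, and the real framework validates with a pydantic `HarvestRecord` rather than the `isinstance` checks shown here.

```python
import json

# Hypothetical declared (standard) fields. In harvex these would come from
# your HarvestRecord subclass, which is not reproduced here.
DECLARED_FIELDS = {"title": str, "stars": int}


def consolidate(record: dict) -> dict:
    """Validate declared fields and fold all other keys into one `extra` JSON string."""
    row = {}
    for name, expected_type in DECLARED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing required field: {name}")
        value = record[name]
        if not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} should be {expected_type.__name__}")
        row[name] = value
    # Undeclared keys never become new columns; they are serialised together.
    leftovers = {k: v for k, v in record.items() if k not in DECLARED_FIELDS}
    row["extra"] = json.dumps(leftovers, sort_keys=True)
    return row


# A source starts emitting an unexpected "language" field.
row = consolidate({"title": "harvex", "stars": 42, "language": "Python"})
```

In this sketch the table shape stays fixed at `title`, `stars`, and `extra` no matter what a source starts returning, and a record with a wrongly typed declared field raises before anything is written.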
Design principles
- Zero heavy core deps: the core depends only on pydantic / httpx / tenacity. playwright, openai, web, and TUI are all extras installed on demand.
- Field consolidation as a contract: HarvestRecord (pydantic v2) enforces the "don't let the main table become a sparse matrix" discipline: undeclared fields fold automatically, and dirty data is caught before writing.
- Fault isolation: one failing source never breaks the whole round.
- SQLite first, Sink abstraction reserved: works out of the box, with a clean extension point.
- Scheduling decoupled from the web: scheduling is handled by the CLI plus the system's launchd/cron, rather than by a timer thread running inside the web process.
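The fault-isolation principle can be sketched roughly as follows. This is an illustration under assumptions, not the code in `core/runner`: `run_round`, `run_one`, the report shape, and the two toy sources are all hypothetical, and a thread pool is just one plausible way to get concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def run_round(sources: dict) -> dict:
    """Run every source callable concurrently and capture failures per source."""

    def run_one(fetch_and_parse):
        try:
            return {"ok": True, "rows": list(fetch_and_parse())}
        except Exception as exc:  # isolate: one bad source must not abort the round
            return {"ok": False, "error": repr(exc)}

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {slug: pool.submit(run_one, fn) for slug, fn in sources.items()}
        return {slug: fut.result() for slug, fut in futures.items()}


def good_source():
    # Toy source that yields one well-formed record.
    yield {"title": "harvex", "stars": 42}


def broken_source():
    # Toy source that fails the way a real one might.
    raise RuntimeError("upstream returned HTML instead of JSON")


report = run_round({"good": good_source, "broken": broken_source})
```

Because each source's exception is caught inside its own worker, `report["good"]` still carries its rows while `report["broken"]` records the error for logging or alerting.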
Layers
sources/*.py (you / AI write)   BaseSource subclass: fetch() + parse()
        ↓ raw → list[dict]
core/pipeline                   validate(pydantic) → consolidate(extra fold) → store → metadata → health
        ↓
storage/sqlite_sink             create/alter table + upsert dedup + per-round backup
        ↓
SQLite business DB + metadata DB
        ↓ (extras)
extras/web browse    extras/llm translate    extras/tui control panel

orchestration: core/runner (concurrency) + cli (harvex run / health / gen-launchd)
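The storage layer's "create/alter table + upsert dedup" step might look something like the sketch below, written against the standard-library `sqlite3` module. The table name `items`, the `dedup_key` column, and the helpers `ensure_column` and `upsert` are assumptions made for illustration; harvex's actual schema and sink API are not shown in this README. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS items ("
    " dedup_key TEXT PRIMARY KEY, title TEXT, stars INTEGER)"
)


def ensure_column(conn, table, column, sql_type):
    """Add a column if the table does not have it yet (the 'alter table' step)."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sql_type}")


def upsert(conn, row):
    """Insert a row, or update it in place when the dedup key already exists."""
    conn.execute(
        "INSERT INTO items (dedup_key, title, stars, extra)"
        " VALUES (:dedup_key, :title, :stars, :extra)"
        " ON CONFLICT(dedup_key) DO UPDATE SET"
        " title = excluded.title, stars = excluded.stars, extra = excluded.extra",
        row,
    )


ensure_column(conn, "items", "extra", "TEXT")
# The same item arrives in two rounds with a changed star count.
upsert(conn, {"dedup_key": "harvex", "title": "harvex", "stars": 41, "extra": "{}"})
upsert(conn, {"dedup_key": "harvex", "title": "harvex", "stars": 42, "extra": "{}"})
rows = conn.execute("SELECT dedup_key, stars FROM items").fetchall()
```

After both rounds the table holds a single row for the key, carrying the latest values, which is the deduplicated-write behaviour the diagram refers to.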
New project skeleton
my_project/
├── config.toml # source toggles / schedule / filters / notifications
├── .env.local # secrets (openai key, webhook url)
├── fields.py # your HarvestRecord subclass — standard fields
├── sources/ # one file per source, just fetch/parse
└── database/ logs/
A complete runnable template lives in templates/project/.
CLI
harvex list # list discovered sources
harvex run --all # run one round over all sources
harvex run gh_trending # run specific sources
harvex health # data-health check (zeroed-out / sharp drop)
harvex gen-launchd # generate a macOS launchd schedule
harvex gen-cron # generate a crontab line
harvex web # start the read-only browse UI (needs [web])
harvex tui # start the local control panel (needs [tui])
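The "zeroed-out / sharp drop" wording for `harvex health` suggests a comparison of the latest round's row count against earlier rounds. A minimal sketch of such a check follows. The function name `check_health`, the status strings, the median baseline, and the 0.5 threshold are all assumptions chosen for illustration, not harvex's documented behaviour.

```python
from statistics import median


def check_health(counts, drop_ratio=0.5):
    """Classify the latest round's row count against earlier rounds.

    counts: rows written per round, oldest first.
    drop_ratio: assumed threshold; below this fraction of the historical
    median, the latest round is flagged as a sharp drop.
    """
    if not counts:
        return "no-data"
    *history, latest = counts
    if latest == 0:
        return "zeroed-out"
    if history and latest < drop_ratio * median(history):
        return "sharp-drop"
    return "ok"


status_zero = check_health([100, 98, 0])
status_drop = check_health([100, 98, 30])
status_ok = check_health([100, 98, 97])
```

A median baseline is used here because a single unusually large or small earlier round does not shift it much; a real check could just as well use a moving average or per-source thresholds from `config.toml`.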
Development
uv venv && uv pip install -e ".[dev]"
uv run pytest
License
MIT