harvex

The data-harvesting foundation for the AI era — a zero-boilerplate framework for AI agents and vibecoding.

Let the AI (or your own vibecoding) write only the part it is good at, namely how to fetch and how to parse, and leave everything else to the framework: concurrent runs, field consolidation, deduplicated writes, run metadata, HTTP retries, logging, alerting, data-health checks, scheduling, a read-only web browse UI, LLM translation/enrichment, and a TUI control panel.

pip install harvex                              # core, zero heavy deps
pip install "harvex[web,llm,browser,tui]"       # opt into extras as needed

Why "a harvesting foundation for the AI era"

When an LLM writes a scraper, the model is good at turning a page or endpoint into structured data. It is weaker at the surrounding engineering, which is also where it most often gets things wrong: retry/backoff, concurrency isolation, incremental dedup, schema drift, scheduling, observability. harvex turns all of that into a stable foundation and gives the AI a narrow, stable contract surface:

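from harvex import BaseSource, SourceProfile

class GithubTrending(BaseSource):
    # The slug identifies the source on the CLI, e.g. `harvex run gh_trending`.
    profile = SourceProfile(slug="gh_trending", name="GitHub Trending")

    def fetch(self):
        # Return the raw payload; retries are left to the framework.
        return self.ctx.http.get_json("https://api.example.com/trending")

    def parse(self, raw):
        # Yield one plain dict per record; validation and storage happen downstream.
        for item in raw["items"]:
            yield {"title": item["name"], "stars": item["stars"]}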

The AI only needs to produce a class like this, and harvex run drives the whole chain: fetch → validate → dedup → store → run metadata → health check. New fields won't blow up your main table (they fold into an extra column automatically), one failing source won't take down the round, and dirty data is rejected before it hits the database.
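The "fold into an extra column" behaviour can be pictured with a small standard-library sketch. This is not harvex's implementation (the real contract is a pydantic `HarvestRecord`); `DECLARED_FIELDS`, `fold_extra`, and the JSON-encoded `extra` column are hypothetical names chosen for illustration.

```python
import json

# Hypothetical declared schema; in harvex this role is played by a HarvestRecord subclass.
DECLARED_FIELDS = {"title", "stars"}


def fold_extra(row: dict, declared: set = DECLARED_FIELDS) -> dict:
    """Keep declared fields as columns and fold everything else into one 'extra' value."""
    known = {k: v for k, v in row.items() if k in declared}
    unknown = {k: v for k, v in row.items() if k not in declared}
    # sort_keys keeps the stored JSON stable from run to run.
    known["extra"] = json.dumps(unknown, sort_keys=True) if unknown else None
    return known


# A source starts emitting a new "language" field; the table shape does not change.
row = fold_extra({"title": "harvex", "stars": 42, "language": "Python"})
print(row)
```

Because the undeclared field ends up inside `extra`, a source can start emitting new keys without forcing a schema change on the main table.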

Design principles

  • Zero heavy core deps: the core depends only on pydantic / httpx / tenacity. playwright, openai, web, and TUI are all extras installed on demand.
  • Field consolidation as a contract: HarvestRecord (pydantic v2) enforces the "don't let the main table become a sparse matrix" discipline — undeclared fields fold automatically, dirty data is caught before writing.
  • Fault isolation: one failing source never breaks the whole round.
  • SQLite first, Sink abstraction reserved: works out of the box, with a clean extension point.
  • Scheduling decoupled from the web: scheduling is done by the CLI plus system launchd/cron, instead of a timer thread running inside the web process.
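The fault-isolation principle can be sketched with the standard library alone. The code below is an illustration under assumptions, not harvex's runner: `run_source`, `run_round`, and the result dictionary shape are hypothetical, and the two sources are stand-in functions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_source(name, fetch):
    """Run one source and capture its exception instead of letting it propagate."""
    try:
        return {"source": name, "ok": True, "rows": fetch(), "error": None}
    except Exception as exc:  # any failure stays confined to this source
        return {"source": name, "ok": False, "rows": [], "error": repr(exc)}


def run_round(sources, max_workers=4):
    """Run all sources concurrently; results come back in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_source, name, fetch) for name, fetch in sources.items()]
        return [future.result() for future in futures]


def good():
    return [{"title": "a"}]


def bad():
    raise TimeoutError("upstream timed out")


results = run_round({"good": good, "bad": bad})
for result in results:
    print(result["source"], "ok" if result["ok"] else result["error"])
```

Catching the exception inside the worker, rather than around the whole round, is what lets the remaining sources finish and report normally.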

Layers

sources/*.py (you / AI write)   BaseSource subclass: fetch() + parse()
     ↓ raw → list[dict]
core/pipeline                   validate(pydantic) → consolidate(extra fold) → store → metadata → health
     ↓
storage/sqlite_sink             create/alter table + upsert dedup + per-round backup
     ↓
SQLite business DB + metadata DB
     ↓ (extras)
extras/web  browse    extras/llm  translate    extras/tui  control panel
orchestration: core/runner (concurrency) + cli (harvex run / health / gen-launchd)
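The "upsert dedup" step in the storage layer can be illustrated with plain `sqlite3`. This is a minimal sketch, not the actual `sqlite_sink`: the `items` table, its columns, and the choice of `title` as the dedup key are assumptions, and the `ON CONFLICT` clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the primary key doubles as the dedup key.
conn.execute("CREATE TABLE items (title TEXT PRIMARY KEY, stars INTEGER, extra TEXT)")


def upsert(conn, rows):
    """Insert new rows and update existing ones that share the same key."""
    conn.executemany(
        """
        INSERT INTO items (title, stars, extra)
        VALUES (:title, :stars, :extra)
        ON CONFLICT(title) DO UPDATE SET
            stars = excluded.stars,
            extra = excluded.extra
        """,
        rows,
    )
    conn.commit()


# Round 1 stores one row; round 2 sees it again with a new star count.
upsert(conn, [{"title": "harvex", "stars": 10, "extra": None}])
upsert(conn, [
    {"title": "harvex", "stars": 12, "extra": None},
    {"title": "other", "stars": 3, "extra": None},
])

rows = conn.execute("SELECT title, stars FROM items ORDER BY title").fetchall()
print(rows)
```

Running the same round twice leaves one row per key, with the latest values, instead of accumulating duplicates.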

New project skeleton

my_project/
├── config.toml         # source toggles / schedule / filters / notifications
├── .env.local          # secrets (openai key, webhook url)
├── fields.py           # your HarvestRecord subclass — standard fields
├── sources/            # one file per source, just fetch/parse
├── database/
└── logs/

A complete runnable template lives in templates/project/.

CLI

harvex list                 # list discovered sources
harvex run --all            # run one round over all sources
harvex run gh_trending      # run specific sources
harvex health               # data-health check (zeroed-out / sharp drop)
harvex gen-launchd          # generate a macOS launchd schedule
harvex gen-cron             # generate a crontab line
harvex web                  # start the read-only browse UI (needs [web])
harvex tui                  # start the local control panel (needs [tui])
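The two conditions named next to `harvex health` (zeroed-out and sharp drop) could be detected along the following lines. This is a sketch under assumptions: `check_health`, the use of row counts from earlier rounds as the baseline, and the 50% `drop_ratio` threshold are all hypothetical and may differ from what harvex actually checks.

```python
from statistics import mean


def check_health(history, current, drop_ratio=0.5):
    """Classify the latest row count against the counts of earlier rounds.

    history: row counts from previous rounds (may be empty)
    current: row count of the latest round
    drop_ratio: assumed threshold; below this fraction of the baseline counts as a sharp drop
    """
    if current == 0:
        return "zeroed_out"
    if history:
        baseline = mean(history)
        if baseline > 0 and current < baseline * drop_ratio:
            return "sharp_drop"
    return "ok"


previous_counts = [100, 110, 90]
for latest in (0, 40, 95):
    print(latest, check_health(previous_counts, latest))
```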

Development

uv venv && uv pip install -e ".[dev]"
uv run pytest

License

MIT
