An accountable research-intake organ: ingest from scattered sources behind clean adapters, with a provenance receipt on every item and a witnessed digest out
Gather
Research lives behind awkward access. Captions and comments on a video, papers behind an arXiv gate, a library buried in a repo, an API with a credential wall, a fact that exists only across scattered fragments and has to be put together. Most tools handle one of those and break on the rest, and when they do reach something, you cannot tell later whether a line was pulled straight from the source or pieced together along the way.
Gather is the research-intake organ that handles all of it cohesively, and records how. It is the one place in the constellation where network access, third-party tools, and credentials are allowed to live, isolated behind source adapters, so the rest stays clean. Every item it brings back carries a provenance receipt, and a run emits a witnessed digest that index, refine, and the crucible consume.
Reach anywhere, and say how
The aim is to pull information from anywhere, including the extremely difficult: gated APIs, auth and paywalls, JavaScript-walled pages, scanned PDFs, audio, obscure formats, and information that is not sitting in one place but has to be synthesized from fragments. Many of these ship today: alongside video, web, feed, and docs, there are adapters for arXiv papers, PDFs, authenticated JSON APIs, JavaScript-rendered pages (a headless browser), scanned images (OCR), and audio (transcription). Each records HOW it reached the content, so the accountability is in place before the harder reach is trusted.
Two adapters are honest about their reach in their receipts. The web adapter reads the static
HTML a server returns and does not run JavaScript, so a client-rendered page yields only its
shell, and http-get says exactly that; the browser adapter runs a real headless browser and
records browser-extract, so you know JavaScript was executed. The browser is the most exposed
edge: its host guard covers only the first navigation, and a rendered page then follows its own
redirects and sub-requests unguarded, so do not point it at untrusted URLs where internal
services are reachable (see the threat model in ARCHITECTURE.md).
That accountability is one rule: the receipt records how each item was obtained. A
transcript read from captions, a page read through a browser, text recognized from a scan,
speech transcribed from audio, and a fact synthesized from fragments are all valid items,
but they are not equally direct. The method on every item keeps that on the record
(yt-dlp, browser-extract, ocr, transcribe, synthesized), so a quote is never
confused with an inference, and what was hard to get is never dressed up as if it were
lying in the open.
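The distinction between a direct item and a derived one can be sketched as a small check. This is a minimal illustration, not Gather's code: the method names come from the text, but the sets `DIRECT` and `DERIVED`, the helpers `classify` and `check_ladder`, and the choice of which side each method sits on are assumptions made for the example. The rule it enforces is the one the module list states for the method ladder: a fetched item cannot carry a derivation chain, and an inferred one cannot lack it.

```python
# Method names are taken from the text; the grouping below is an assumption.
DIRECT = frozenset({"http-get", "yt-dlp", "browser-extract", "ocr", "transcribe"})
DERIVED = frozenset({"compiled", "synthesized"})


def classify(method: str) -> str:
    """Place a method on the ladder: 'direct' or 'derived' (hypothetical helper)."""
    if method in DIRECT:
        return "direct"
    if method in DERIVED:
        return "derived"
    raise ValueError(f"unknown method: {method!r}")


def check_ladder(method: str, derived_from: tuple[str, ...] = ()) -> None:
    """Reject a receipt whose method and derivation chain disagree."""
    rung = classify(method)
    if rung == "direct" and derived_from:
        raise ValueError(f"{method!r} is a fetch and cannot claim inputs")
    if rung == "derived" and not derived_from:
        raise ValueError(f"{method!r} is an inference and must name its inputs")


check_ladder("ocr")                                  # fetched, no inputs: accepted
check_ladder("synthesized", derived_from=("ab12",))  # inference with inputs: accepted

try:
    check_ladder("synthesized")                      # inference without inputs
except ValueError as err:
    print("rejected:", err)
```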
A derived item (one assembled or inferred from other items, rather than fetched) is the
sharp case, and the receipt is built for it. Its sha256 fingerprints the inference itself,
not its sources, because it is a new statement and can only witness itself; a derived_from
field records the content hash of each input, a re-checkable pointer back to the exact
source content. The digest seal folds in method and derived_from alongside the hash, so
relabelling an inference as a direct fetch, or quietly rewriting what it was built from,
breaks the seal exactly as altering the content does.
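A short sketch of why folding method and derived_from into the seal makes relabelling detectable. The `Receipt` class, its field layout, and the JSON canonicalisation in `seal` are assumptions made for illustration; Gather's actual receipt and seal format may differ.

```python
import hashlib
import json
from dataclasses import dataclass, replace


def content_hash(text: str) -> str:
    """sha256 of the content itself (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Receipt:
    # Illustrative fields only, not Gather's real schema.
    ref: str
    method: str
    sha256: str
    derived_from: tuple[str, ...] = ()


def seal(receipts: list[Receipt]) -> str:
    """Fold hash, method, and derived_from of every receipt into one digest."""
    payload = [
        {"ref": r.ref, "method": r.method, "sha256": r.sha256,
         "derived_from": list(r.derived_from)}
        for r in receipts
    ]
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


source_text = "The hat tile is an aperiodic monotile."
inference = "A single shape can force aperiodicity."

fetched = Receipt("page-1", "http-get", content_hash(source_text))
# The derived receipt hashes the inference itself and points at its input.
derived = Receipt("note-1", "synthesized", content_hash(inference),
                  derived_from=(fetched.sha256,))

original_seal = seal([fetched, derived])

# Relabelling the inference as a direct fetch changes the seal.
relabelled = replace(derived, method="http-get", derived_from=())
print(seal([fetched, relabelled]) == original_seal)

# Quietly rewriting what it was built from changes the seal too.
rewired = replace(derived, derived_from=(content_hash("something else"),))
print(seal([fetched, rewired]) == original_seal)
```

Both comparisons print False: the content hash is untouched in each case, yet the seal no longer matches, because the method and the input pointers are part of what was sealed.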
The honesty is mechanical where it can be. The method="synthesized" label is reachable
only through the Synthesizer seam: the bare gather.derive builder defaults to compiled
and refuses to stamp synthesized at all, so a bare call can never forge a synthesis. With no
edge wired in, the default NullSynthesizer performs a deterministic, extractive compilation:
it assembles inputs verbatim, labels them compiled, invents nothing. What the seam attests is
that the configured edge produced the text; that the edge is actually a model is the operator's
responsibility, the same trust as choosing the browser binary or the API token (point the seam at
cat and you get a verbatim echo labelled synthesized). And derived_from records the inputs
supplied to the edge, an upper bound: a model may ignore some or generate beyond them, so it
attests availability, not use.
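The seam can be pictured roughly as follows. Synthesizer, NullSynthesizer, and the compiled and synthesized labels are named in the text; the `synthesize` signature, the `(text, method)` return shape, and this stand-in `derive` function are assumptions for the sketch, not the package's real interface.

```python
from typing import Protocol


class Synthesizer(Protocol):
    # Assumed signature: the text names the seam but not its exact shape.
    def synthesize(self, inputs: list[str]) -> tuple[str, str]:
        """Return (text, method)."""
        ...


class NullSynthesizer:
    """Deterministic, extractive default: joins inputs verbatim, invents nothing."""

    def synthesize(self, inputs: list[str]) -> tuple[str, str]:
        return "\n".join(inputs), "compiled"


def derive(inputs: list[str], method: str = "compiled") -> tuple[str, str]:
    """Stand-in for a bare builder that refuses to stamp 'synthesized' itself."""
    if method == "synthesized":
        raise ValueError("'synthesized' is only reachable through a Synthesizer")
    return "\n".join(inputs), method


text, method = NullSynthesizer().synthesize(["fragment one", "fragment two"])
print(method)

try:
    derive(["fragment one"], method="synthesized")
except ValueError as err:
    print("refused:", err)
```

The point of the shape is that the label is earned by the route taken, not by an argument a caller can pass.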
The discipline
- One isolated impure edge. Each source is a small adapter behind a single Source shape: fetch(target) -> list[Item]. The adapter can use the network, a tool, a credential, a browser, whatever the source demands; the rest of Gather imports none of that. Awkward access is an adapter problem, not a system problem.
- A receipt on every item. Each Item carries a Provenance (source, ref, method, time, and a sha256 of the content). Re-hash the content and you can confirm it is what was obtained, unaltered.
- A witnessed digest out. A run folds its items' receipts into one re-checkable seal. Downstream organs consume the digest; the seal lets a reader confirm it was not altered.
- Scope to the work. A deterministic scope filter keeps what serves the theses and drops the rest, and records how many it dropped.
- A peer, not a feature. Gather is deliberately impure, so it is not part of index (which is zero-dependency, offline, and deterministic, and would forfeit exactly that if it grew a scraper). It composes through the digest seam, the way Forum does.
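The first two points of the discipline can be sketched together: one adapter shape, and a receipt that can be re-hashed. Item, Provenance, Source, and make_item are names from the text, but the field types, the `make_item` signature, the `verify` method, the `LocalTextSource` adapter, and the `file-read` method label are all assumptions made for this example.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class Provenance:
    source: str
    ref: str
    method: str
    time: str
    sha256: str


@dataclass(frozen=True)
class Item:
    content: str
    provenance: Provenance

    def verify(self) -> bool:
        """Re-hash the content and compare it with the receipt."""
        digest = hashlib.sha256(self.content.encode("utf-8")).hexdigest()
        return digest == self.provenance.sha256


def make_item(content: str, source: str, ref: str, method: str) -> Item:
    """Compute the receipt from the content (assumed signature)."""
    receipt = Provenance(
        source=source,
        ref=ref,
        method=method,
        time=datetime.now(timezone.utc).isoformat(),
        sha256=hashlib.sha256(content.encode("utf-8")).hexdigest(),
    )
    return Item(content, receipt)


class Source(Protocol):
    def fetch(self, target: str) -> list[Item]: ...


class LocalTextSource:
    """Hypothetical adapter whose only impure edge is the filesystem."""

    def fetch(self, target: str) -> list[Item]:
        path = Path(target)
        text = path.read_text(encoding="utf-8")
        return [make_item(text, "docs", str(path), "file-read")]


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.txt"
    note.write_text("Group theory of the Rubik's cube.", encoding="utf-8")
    items = LocalTextSource().fetch(str(note))

print(items[0].provenance.method, items[0].verify())  # file-read True
```

Everything outside `LocalTextSource.fetch` is pure: it never touches the filesystem or the network, which is the separation the list above describes.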
Install
pip install gather-engine
The distribution is gather-engine; it installs the gather command and the gather package
(import gather). The core is pure standard library; a few adapters call an external tool
(yt-dlp, pdftotext, a headless chromium, tesseract, whisper), which you install only if
you use that adapter.
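Before relying on one of the tool-backed adapters, it can help to check that its tool is on the PATH. The sketch below uses only `shutil.which` from the standard library; the `ADAPTER_TOOLS` mapping and the `missing_tools` helper are hypothetical and not part of the gather package. The browser and transcription tools are left out because their executable names vary by installation.

```python
import shutil

# Adapter-to-tool pairs as named in the text; this table itself is illustrative.
ADAPTER_TOOLS = {"video": "yt-dlp", "pdf": "pdftotext", "ocr": "tesseract"}


def missing_tools(adapters: list[str]) -> dict[str, str]:
    """Return {adapter: tool} for each requested adapter whose tool is not on PATH."""
    missing = {}
    for adapter in adapters:
        tool = ADAPTER_TOOLS.get(adapter)  # None for pure standard-library adapters
        if tool is not None and shutil.which(tool) is None:
            missing[adapter] = tool
    return missing


print(missing_tools(["web", "video", "pdf", "ocr"]))
```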
Watch it work
examples/demo.py parses an already-harvested video (a yt-dlp info.json plus its .vtt
captions) into items, each with a provenance receipt, scope-filters them, folds them into a
witnessed digest, then tampers with one receipt to show the seal catch it. All offline, no
install, nothing downloaded:
python examples/demo.py # one video parsed, scoped, digested, then a receipt catches tampering
python examples/pipeline.py # the whole organ: run -> store -> verify -> recall, offline
parsed 3 items from one video, each with a receipt:
metadata abc123 sha256=40d9839ffb0e... verify=True
transcript abc123 sha256=f798cd2c334c... verify=True
comment c1 sha256=301cf39d5091... verify=True
scope to ['tile','monotile']: kept 3, dropped 0
witnessed digest: 3 receipts, seal 7da7dc456b11..., verified True
after tampering one receipt, digest verifies: False <- caught
(The hash and seal prefixes above are illustrative; the load-bearing facts are the verify
results, which the test suite pins.)
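The scope line in that output (kept and dropped counts) can be reproduced with a few lines. This is a sketch under assumptions: the `scope_filter` function, its case-insensitive substring matching, and its keep-everything behaviour when no terms are given are illustrative choices, not Gather's confirmed semantics.

```python
def scope_filter(texts: list[str], terms: list[str]) -> tuple[list[str], int]:
    """Keep texts matching any term, in their original order; report the drop count."""
    lowered = [term.lower() for term in terms if term]
    if not lowered:
        # Assumption: with no scope terms, nothing is dropped.
        return list(texts), 0
    kept = [text for text in texts if any(term in text.lower() for term in lowered)]
    return kept, len(texts) - len(kept)


captions = [
    "An aperiodic monotile tiles the plane without repeating.",
    "Subscribe for more videos!",
    "The hat TILE is one such shape.",
]

kept, dropped = scope_filter(captions, ["tile", "monotile"])
print(f"kept {len(kept)}, dropped {dropped}")  # kept 2, dropped 1
```

Because the filter is a pure function of its inputs and preserves order, running it twice on the same items gives the same kept list and the same count.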
The gather CLI fetches live from each adapter; every fetch command takes the same --scope,
--json, and --store (the run and corpus commands are driven by a config file and
sub-actions instead). Of the commands shown here, docs, web, feed, and arxiv are pure standard
library and only video needs an external tool (yt-dlp); the harder adapters (pdf, browser, ocr,
transcribe) each shell out to their own external tool, never a Python dependency, as the module
list notes:
gather docs ./research-notes --scope "rubik,group theory" # local files, offline
gather web "https://example.com/article" --store ./corpus # static page, kept in a corpus
gather feed "https://example.com/feed.xml" --json # RSS or Atom
gather arxiv "aperiodic monotile" --store ./corpus # papers (abstracts + metadata)
gather video "https://youtu.be/<id>" --comments --scope "rubik,group theory"
gather corpus verify ./corpus # re-hash every stored body
Any fetch command takes --store DIR to persist what it gathered into a content-addressed corpus,
and gather corpus list|verify|digest|search DIR inspects it. verify re-hashes every stored
body against its receipt and exits non-zero if anything is missing or corrupt. search matches
its terms as case-insensitive substrings of title and body (so art also matches cartesian).
Write a corpus from one process at a time: the dedup is single-writer, and prune (which reads
the catalog then deletes unreferenced objects) must likewise run with no concurrent writer.
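A content-addressed store and its verify step can be sketched in a few functions. The MATCH, MISSING, and CORRUPT outcomes are named in the module list; the `put`, `verify`, and `matches` helpers and the one-file-per-hash layout are assumptions for illustration, not the on-disk format Gather uses.

```python
import hashlib
import tempfile
from pathlib import Path


def put(root: Path, body: str) -> str:
    """Store a body under its own sha256; an identical body is stored once."""
    data = body.encode("utf-8")
    digest = hashlib.sha256(data).hexdigest()
    path = root / digest
    if not path.exists():
        path.write_bytes(data)
    return digest


def verify(root: Path, digest: str) -> str:
    """Re-hash a stored body against the hash recorded in its receipt."""
    path = root / digest
    if not path.exists():
        return "MISSING"
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return "MATCH" if actual == digest else "CORRUPT"


def matches(term: str, text: str) -> bool:
    """Case-insensitive substring match, as described for corpus search."""
    return term.lower() in text.lower()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = put(root, "Cartesian coordinates")
    second = put(root, "Cartesian coordinates")   # same body, same object
    stored_objects = len(list(root.iterdir()))
    status_ok = verify(root, first)
    (root / first).write_bytes(b"tampered")       # simulate corruption
    status_bad = verify(root, first)
    (root / first).unlink()                       # simulate a lost body
    status_gone = verify(root, first)

print(status_ok, status_bad, status_gone)         # MATCH CORRUPT MISSING
print(matches("art", "Cartesian coordinates"))    # True
```

The check-then-write in `put` is not atomic, which is one way to see why a store of this kind wants a single writer.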
What's here
- gather.item: an Item and its Provenance receipt (with derived_from for inferences); make_item computes the receipt from the content.
- gather.source: the Source adapter shape (the isolated impure edge) and a Catalog of what was gathered.
- gather.scope: the scope-to-telos filter, deterministic and order-preserving.
- gather.digest: the witnessed, provenance-stamped digest with a re-checkable seal (folds in method and derived_from).
- gather.derive: the derive seam, building a derived item with derived_from; a Synthesizer seam whose NullSynthesizer default compiles verbatim (never fabricates a synthesis).
- gather.net: the single network transport (http_get, urllib.request, + pure decode_body). HTTP transport lives here and in adapter fetches, nowhere else; pure URL string-building (urllib.parse) may live in an adapter.
- gather.video: video intake via yt-dlp. Pure parsing, impure shell.
- gather.web: static web pages via http(s); pure HTML-to-text, no JavaScript.
- gather.feed: RSS and Atom feeds; pure parser handles both.
- gather.docs: local text files or a directory of them; the impure edge is the filesystem.
- gather.arxiv: papers from the arXiv API by id or query; pure parser, the Item carries the abstract and metadata.
- gather.pdf: text from a local PDF via pdftotext (an external tool, not a dependency); a best-effort reading, labelled as such.
- gather.store: a durable, content-addressed Corpus. Bodies are deduped by hash while every distinct receipt is kept (no provenance dropped); the catalog streams; verify re-hashes every stored body (MATCH/MISSING/CORRUPT); the run history is kept too.
- gather.run: the witnessed gather session. gather_run orchestrates fetch, scope, optional synthesis, digest, and store into one re-checkable RunRecord (its own seal plus the items' digest seal); the scope and synthesizer are composition seams that default to Null so the run stands alone.
- gather.recall: a Query over a stored corpus (substring scope terms, plus source/kind/method filters: OR within a filter, AND across) returning reconstructed items that are re-verified (missing or corrupt bodies are skipped and reported), so downstream organs draw scoped, trustworthy subsets.
- gather.credentials: the one place secrets enter, read from the environment by name, never logged, never put in a receipt or a URL.
- gather.api: an authenticated JSON-API adapter, the worked example of the credentials pattern (token from env, sent as a header, never witnessed).
- gather.browser: JavaScript-rendered pages via a headless browser; the browser-extract method records that JS was run.
- gather.ocr: text from a scanned image via tesseract; a machine reading, labelled ocr.
- gather.transcribe: a transcript from audio via a Whisper-style CLI; a machine transcription, labelled transcribe.
- gather.model: the real model edge for the synthesizer seam; shells to a model CLI (prompt on stdin), stamping a genuine synthesized inference, derived_from set.
- gather.provenance: the ProvenanceProvider seam, composing an external origin verdict (forged? re-encode? authentic?) per item; the Null default stands alone, a subprocess edge calls an external provenance organ. Verdicts are sealed into the run record.
- gather.method: the method ladder. Classifies a method as direct or derived, and make_item enforces it: a fetched item cannot carry a derivation chain and a synthesized one cannot lack it.
- gather.cli: a gather command (parse/docs/pdf offline, web/feed/video/arxiv/api/browser/ocr/transcribe live), every command takes --store DIR; plus run and corpus list/verify/digest/runs/search/stats/prune.
- gather.commands: the command implementations behind the CLI surface (split from cli so no module exceeds the size budget).
The core is pure standard library. A source adapter may pull in whatever its source
demands, isolated behind the Source shape.
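The recall rule in the module list (OR within a filter, AND across filters) is easy to misread, so here is a small sketch of it. The `Record` class, the `matches` function, and the "page" kind are hypothetical, and treating scope terms as OR'd with each other is an assumption; only the OR-within, AND-across rule comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Illustrative stand-in for a reconstructed corpus item.
    source: str
    kind: str
    method: str
    text: str


def matches(record: Record, *, terms=(), sources=(), kinds=(), methods=()) -> bool:
    """An empty filter passes everything; a non-empty one needs at least one hit."""
    text = record.text.lower()
    return (
        (not terms or any(term.lower() in text for term in terms))
        and (not sources or record.source in sources)
        and (not kinds or record.kind in kinds)
        and (not methods or record.method in methods)
    )


records = [
    Record("web", "page", "http-get", "Cartesian product of groups"),
    Record("video", "transcript", "yt-dlp", "Rubik's cube group theory"),
    Record("video", "comment", "yt-dlp", "great video"),
]

hits = [
    r for r in records
    if matches(r, terms=("group", "cube"), sources=("video",),
               kinds=("transcript", "comment"))
]
print([r.kind for r in hits])  # ['transcript']
```

The web page matches a term but fails the source filter, and the comment passes the source and kind filters but matches no term, so only the transcript satisfies every filter at once.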
ARCHITECTURE.md is the design map (the seams, the receipt, the corpus, the run, the threat model); CHANGELOG.md is the version history.
Roadmap
Shipped:
- The provenance receipt, the scope filter, the witnessed digest with a re-checkable seal, the catalog.
- Adapters behind one Source shape: video (yt-dlp), web (static http), feed (RSS/Atom), docs (local files), arXiv (papers), PDF (pdftotext), authenticated JSON APIs (env-isolated credentials).
- The derive seam: the Synthesizer shape with an honest compiling default and a real model edge (gather.model); a model produces synthesized, the default produces compiled, nothing fabricates.
- A durable, content-addressed corpus (--store DIR): bodies deduped by hash, the catalog streamed, and corpus verify re-hashing every stored body against its receipt.
- A witnessed gather run (gather run config.json): orchestrates many sources, scope, and optional synthesis into one re-checkable record, kept in the corpus run history.
- Recall over the corpus (gather corpus search): query by scope terms and source/kind/method, returning re-verifiable items and a scoped digest.
- Isolated credentials (env-only, never witnessed) with an authenticated-API adapter, and the method ladder enforced at construction (a fetch cannot claim inputs, a synthesis cannot lack them).
- The hard sources behind the same seam, as isolated external-tool edges: JavaScript pages (headless browser), scanned images (OCR), and audio (transcription).
- A real model edge for the synthesizer seam, and a provenance-composition seam that folds an external origin verdict per item into the witnessed run.
Gather reached its organic completion at 1.5.0: every planned source and seam is shipped, and the accountability claims hold end to end across a final whole-system review. The item below is a scale optimization, not missing function.
Possible future work (not required for the completion milestone):
- Corpus indexing so recall need not read every body at large scale.
License
Gather is fair-source: the code is open to read, run, and build on, with commercial use reserved so the project can fund its own development. Copyright stays with the author. See LICENSE for the exact terms.