Automated research paper tracking and knowledge synthesis

Swarm Notes Core Package

An autonomous, serverless, multi-agent system that tracks academic papers, extracts structured data from them, and weaves the results into a local, interconnected Markdown knowledge graph: a Second Brain for ML research.
It is built to eventually exchange feeds with other instances of itself, forming a decentralised Hive Mind.


Architecture

┌─────────────────────────────────────────────────────┐
│                  GitHub Actions CI                  │
│  (weekly schedule + workflow_dispatch)              │
└─────────────────────┬───────────────────────────────┘
                      │
         ┌────────────▼────────────┐
         │   Federation Agent      │  ← consumes external public_feed.json feeds
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │       Watcher           │  ← queries ArXiv (default) or Semantic Scholar by keyword
         └────────────┬────────────┘
                      │  RawPaper[]
         ┌────────────▼────────────┐
         │    Router (Skill        │  ← routes each paper to a domain skill
         │    Registry)            │    (NLP, Vision, TimeSeries, …)
         └────────────┬────────────┘
                      │  Skill
         ┌────────────▼────────────┐
         │    Analyst              │  ← pydantic-ai structured extraction
         │    (pydantic-ai)        │    with taxonomy injection
         └────────────┬────────────┘
                      │  PaperAnalysis
         ┌────────────▼────────────┐
         │    Vault Writer         │  ← writes .md to tmp_vault/
         │                         │    generates concept stubs
         │                         │    updates public_feed.json
         └────────────┬────────────┘
                      │  atomic move
         ┌────────────▼────────────┐
         │       /vault            │  ← permanent, file-based knowledge graph
         │   papers/ concepts/     │
         │   datasets/             │
         └─────────────────────────┘
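
The routing step in the diagram can be pictured with a small sketch. Everything here is hypothetical: the `Skill` dataclass, the keyword lists, the `GeneralSkill` fallback, and the `route` function are illustrations only, and the actual registry in router.py may select skills differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical domain skill: a name plus the keywords that trigger it."""
    name: str
    keywords: tuple[str, ...]


# Illustrative registry; the real skills and their triggers are assumptions.
REGISTRY = (
    Skill("NLPSkill", ("language model", "transformer", "translation")),
    Skill("VisionSkill", ("image", "segmentation", "diffusion")),
    Skill("TimeSeriesSkill", ("forecasting", "time series")),
)
FALLBACK = Skill("GeneralSkill", ())


def route(title: str, abstract: str) -> Skill:
    """Pick the skill whose keywords occur most often in the paper text."""
    text = f"{title} {abstract}".lower()
    best, best_hits = FALLBACK, 0
    for skill in REGISTRY:
        hits = sum(text.count(kw) for kw in skill.keywords)
        if hits > best_hits:
            best, best_hits = skill, hits
    return best
```

A paper that matches no keyword falls through to the fallback skill, so every paper still reaches the analyst.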

Directory Structure

research-cruise/
├── .github/
│   └── workflows/
│       └── autonomous-tracker.yml   # CI/CD pipeline
├── vault/
│   ├── papers/                      # One .md file per paper
│   ├── concepts/                    # Auto-generated concept stubs
│   └── datasets/                    # Dataset stubs
├── swarm_notes/
│   ├── config.py                    # Configuration & env vars
│   ├── vault_manager.py             # Staging pattern (tmp_vault → vault)
│   ├── watcher.py                   # Configurable paper-source watcher
│   ├── router.py                    # Skill registry router
│   ├── analyst.py                   # pydantic-ai extraction agent
│   ├── vault_writer.py              # Markdown writer + public_feed.json
│   ├── federation.py                # Hive Mind federation agent
│   └── main.py                      # Pipeline orchestrator
├── taxonomy.json                    # Controlled vocabulary (tags, domains)
├── public_feed.json                 # Rolling feed of last 20 papers (for federation)
└── requirements.txt
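
The staging pattern noted next to vault_manager.py (write to tmp_vault, then move into vault) might look roughly like this. The function name `publish_staged_notes` and its signature are assumptions, not the package's API; the sketch relies on `os.replace`, which is atomic only when source and destination are on the same filesystem.

```python
import os
from pathlib import Path


def publish_staged_notes(staging: Path, vault: Path) -> list[Path]:
    """Move every file under `staging` into `vault`, keeping relative paths."""
    published = []
    # Materialise the file list first so moving files does not disturb the walk.
    for src in sorted(p for p in staging.rglob("*") if p.is_file()):
        dest = vault / src.relative_to(staging)
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Atomic per file when both paths share a filesystem (assumption).
        os.replace(src, dest)
        published.append(dest)
    return published
```

Because notes only reach vault/ after they are fully written, a run that crashes mid-analysis leaves the permanent vault untouched.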

Quick Start

Prerequisites

  • Python 3.11+
  • An OpenAI-compatible API key

Local Run

# Install dependencies
pip install -r requirements.txt

# Set your API key
export LLM_API_KEY="sk-..."

# Optionally customise keywords
export PAPER_KEYWORDS="mamba,diffusion model,retrieval augmented generation"

# Optional: switch the watcher to Semantic Scholar
export PAPER_SOURCE="semantic_scholar"
export SEMANTIC_SCHOLAR_API_KEY="..."

# Run the pipeline
python -m swarm_notes.main

Configuration (Environment Variables)

| Variable | Default | Description |
| --- | --- | --- |
| LLM_API_KEY | (required) | API key for the LLM provider |
| LLM_MODEL | openai:gpt-4o-mini | pydantic-ai model string |
| PAPER_SOURCE | arxiv | Paper search backend: arxiv or semantic_scholar |
| PAPER_KEYWORDS | See config.py | Comma-separated search terms |
| PAPER_MAX_RESULTS_PER_KEYWORD | 5 | Papers fetched per keyword |
| PAPER_TOTAL_CAP | 20 | Hard cap on total papers per run |
| SEMANTIC_SCHOLAR_API_KEY | (empty) | Optional Semantic Scholar API key sent as x-api-key |
| FEDERATION_FEEDS | (empty) | Comma-separated external feed URLs |
| PUBLIC_FEED_MAX_ITEMS | 20 | Max entries kept in public_feed.json |
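
The two paper limits interact: each keyword contributes at most PAPER_MAX_RESULTS_PER_KEYWORD papers, and the combined list stops growing at PAPER_TOTAL_CAP. A minimal sketch of that behaviour, assuming duplicates across keywords are dropped by ID (the `apply_caps` helper and its input shape are hypothetical, not taken from watcher.py):

```python
def apply_caps(results_by_keyword: dict[str, list[str]],
               per_keyword: int = 5,
               total_cap: int = 20) -> list[str]:
    """Merge per-keyword ID lists, dropping duplicates and enforcing both caps."""
    seen: set[str] = set()
    merged: list[str] = []
    for ids in results_by_keyword.values():
        # Per-keyword cap: look at only the first `per_keyword` hits.
        for paper_id in ids[:per_keyword]:
            if paper_id in seen:
                continue
            # Total cap: stop as soon as the run-wide limit is reached.
            if len(merged) >= total_cap:
                return merged
            seen.add(paper_id)
            merged.append(paper_id)
    return merged
```

With the defaults, four or more keywords that each return five distinct papers already hit the total cap of 20.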

When PAPER_SOURCE=semantic_scholar, the watcher queries Semantic Scholar's Graph API and keeps only results that can be mapped back to an ArXiv identifier. That preserves compatibility with the rest of the pipeline, which still stores papers by arxiv_id.
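
As an illustration of that filtering step, the sketch below keeps only records that carry an ArXiv identifier. It assumes each record is a dict shaped like a Semantic Scholar Graph API paper object requested with the `externalIds` field, where the ArXiv ID sits under the `ArXiv` key; the `keep_arxiv_papers` helper and its output shape are hypothetical.

```python
def keep_arxiv_papers(records: list[dict]) -> list[dict]:
    """Return records that carry an ArXiv ID, exposing it as `arxiv_id`."""
    kept = []
    for record in records:
        # `externalIds` may be missing or null, so fall back to an empty dict.
        external = record.get("externalIds") or {}
        arxiv_id = external.get("ArXiv")
        if arxiv_id:
            kept.append({"arxiv_id": arxiv_id, "title": record.get("title", "")})
    return kept
```

Papers without an ArXiv mapping are silently dropped, so a Semantic Scholar query can yield fewer papers than PAPER_MAX_RESULTS_PER_KEYWORD.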

Legacy ARXIV_KEYWORDS, ARXIV_MAX_RESULTS_PER_KEYWORD, and ARXIV_TOTAL_CAP are still accepted for backward compatibility, but PAPER_* names are now canonical.
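
One way to implement that fallback is to check the canonical name first and the legacy name second. This is a sketch under assumptions: the `read_setting` helper is hypothetical, and config.py may resolve the names differently.

```python
import os


def read_setting(name: str, default: str, environ=None) -> str:
    """Read PAPER_<name>, fall back to legacy ARXIV_<name>, then to default."""
    env = os.environ if environ is None else environ
    for key in (f"PAPER_{name}", f"ARXIV_{name}"):
        value = env.get(key, "").strip()
        if value:
            return value
    return default
```

For example, `read_setting("TOTAL_CAP", "20")` returns the value of PAPER_TOTAL_CAP if it is set, otherwise ARXIV_TOTAL_CAP, otherwise "20".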

CI/CD Setup

1. Fork the repository

Click Fork on GitHub to create your own copy of this repository.

2. Add the required secret

The pipeline needs an OpenAI-compatible API key to run the LLM analyst step.

  1. Open your forked repository on GitHub.
  2. Go to Settings → Secrets and variables → Actions.
  3. Click New repository secret.
  4. Set Name to LLM_API_KEY and Secret to your API key (e.g. sk-...).
  5. Click Add secret.

Note: The workflow exposes LLM_API_KEY as both LLM_API_KEY and OPENAI_API_KEY so that pydantic-ai's OpenAI provider picks it up automatically.

3. (Optional) Override the model

By default the pipeline uses openai:gpt-4o-mini. To use a different model, add a second repository secret (or variable) named LLM_MODEL with the pydantic-ai model string, e.g. openai:gpt-4o or anthropic:claude-3-5-haiku.

You can also set LLM_MODEL in the workflow's env: block directly if you prefer not to use a secret.

4. Run the pipeline

  • Scheduled: the pipeline fires automatically every Monday at 06:00 UTC.
  • Manual: go to Actions → Autonomous Research Tracker → Run workflow. You can optionally override keywords, federation_feeds, and max_results in the dispatch form.

The Hive Mind (Federation)

Every successful run updates public_feed.json at the root of the repository with the metadata and summaries of the most recently processed papers (the last 20 by default, controlled by PUBLIC_FEED_MAX_ITEMS).
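
The rolling update can be sketched as follows. The feed schema is an assumption here: a newest-first list of dicts, each with an `arxiv_id` key. The `update_feed` helper is hypothetical and not taken from vault_writer.py.

```python
def update_feed(existing: list[dict], new_entries: list[dict],
                max_items: int = 20) -> list[dict]:
    """Put new entries first, drop older duplicates, and keep `max_items`."""
    merged: list[dict] = []
    seen: set[str] = set()
    for entry in [*new_entries, *existing]:
        key = entry["arxiv_id"]
        if key not in seen:
            seen.add(key)
            merged.append(entry)
    # Truncating from the end discards the oldest entries.
    return merged[:max_items]
```

A paper that is processed again replaces its older feed entry instead of appearing twice.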

To subscribe to another agent's feed, pass their raw public_feed.json URL:

export FEDERATION_FEEDS="https://raw.githubusercontent.com/alice/research-cruise/main/public_feed.json,https://raw.githubusercontent.com/bob/research-cruise/main/public_feed.json"
python -m swarm_notes.main

Or set federation_feeds in the workflow_dispatch inputs.

Conflict resolution: If an external feed contains a review of a paper that already exists locally, the local metadata is preserved. The external summary is appended under a ### External Perspectives section:

### External Perspectives

> "Transformers are over-engineered for this dataset." - @Agent_alice
> *(Retrieved 2024-01-15)*

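The append step could be implemented along these lines. The `append_external_perspective` helper is hypothetical, and the sketch assumes the External Perspectives section, when present, is the last section of the note.

```python
def append_external_perspective(note: str, summary: str,
                                agent: str, retrieved: str) -> str:
    """Append a quoted external summary, creating the section if missing."""
    heading = "### External Perspectives"
    quote = f'> "{summary}" - @{agent}\n> *(Retrieved {retrieved})*\n'
    body = note.rstrip("\n")
    if heading not in body:
        body += f"\n\n{heading}"
    # Local content above the heading is never rewritten, only extended.
    return f"{body}\n\n{quote}"
```

Calling the function twice with different agents yields one heading followed by two quoted blocks.
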
Vault File Format

Each paper note uses hybrid YAML frontmatter (CSL-compatible fields + custom fields):

---
# CSL-compatible fields
title: "Attention Is All You Need"
author:
  - literal: "Ashish Vaswani"
issued:
  date-parts:
    - [2017, 6, 12]
url: "https://arxiv.org/abs/1706.03762"

# Custom fields
arxiv_id: "1706.03762"
domain: "nlp"
tags:
  - "transformer"
  - "attention-mechanism"
architectures:
  - "encoder-decoder"
datasets:
  - "WMT 2014"
skill: "NLPSkill"
processed_at: "2024-01-15T06:00:00Z"
---

Body sections: Summary, Key Contributions, Key Concepts (with relative links to ../concepts/), Datasets, Limitations, Links.
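
The relative links in the Key Concepts section could be generated as in the sketch below. The slug rule and the one-file-per-concept naming are assumptions; the package may name its concept stubs differently.

```python
import re


def concept_slug(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def concept_link(name: str) -> str:
    """Build a Markdown link from a paper note to its concept stub."""
    return f"[{name}](../concepts/{concept_slug(name)}.md)"
```

Since paper notes live in vault/papers/, the `../concepts/` prefix resolves to vault/concepts/ in any Markdown viewer that follows relative links.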

Taxonomy

taxonomy.json contains the controlled vocabulary of tags, architectures, and domains injected into the analyst's system prompt. Restricting the LLM to known terms reduces hallucinated labels and keeps metadata consistent across papers. Edit taxonomy.json to add new terms.
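
Taxonomy injection can be pictured as building the system prompt from the vocabulary and then checking the model's output against it. This is a sketch only: the key names `domains`, `tags`, and `architectures`, the prompt wording, and both helper functions are assumptions about how taxonomy.json and analyst.py are organised.

```python
def build_system_prompt(taxonomy: dict[str, list[str]]) -> str:
    """Render the controlled vocabulary as part of a system prompt."""
    lines = [
        "Extract structured metadata from the paper.",
        "Use only the following controlled vocabulary:",
    ]
    for field in ("domains", "tags", "architectures"):
        allowed = ", ".join(sorted(taxonomy.get(field, [])))
        lines.append(f"- {field}: {allowed}")
    return "\n".join(lines)


def out_of_vocabulary(values: list[str], allowed: list[str]) -> list[str]:
    """Return the values the model produced that are not in the vocabulary."""
    allowed_set = set(allowed)
    return [v for v in values if v not in allowed_set]
```

A post-extraction check like `out_of_vocabulary` is one way to catch terms the model invented despite the prompt, for example to log them as candidates for taxonomy.json.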

License

MIT — see LICENSE.
