Automated research paper tracking and knowledge synthesis

research-cruise 🚀

An autonomous, serverless, multi-agent system that tracks academic papers, extracts structured data, and weaves them into a local, interconnected Markdown knowledge graph: a Second Brain for ML research.
Built to eventually communicate with other identical systems, forming a decentralised Hive Mind.


Architecture

┌────────────────────────────────────────────┐
│                  Triggers                  │
└─────────────────────┬──────────────────────┘
                      │
         ┌────────────▼────────────┐
         │   Federation Agent      │  ← consumes external public_feed.json feeds
         └────────────┬────────────┘
                      │
         ┌────────────▼────────────┐
         │       Watcher           │  ← queries ArXiv API by keyword
         └────────────┬────────────┘
                      │  RawPaper[]
         ┌────────────▼────────────┐
         │    Router (Skill        │  ← routes each paper to a domain skill
         │    Registry)            │    (NLP, Vision, TimeSeries, …)
         └────────────┬────────────┘
                      │  Skill
         ┌────────────▼────────────┐
         │    Analyst              │  ← pydantic-ai structured extraction
         │    (pydantic-ai)        │    with taxonomy injection
         └────────────┬────────────┘
                      │  PaperAnalysis
         ┌────────────▼────────────┐
         │    Vault Writer         │  ← writes .md to tmp_vault/
         │                         │    generates concept stubs
         │                         │    updates public_feed.json
         └────────────┬────────────┘
                      │  atomic move
         ┌────────────▼────────────┐
         │       /vault            │  ← permanent, file-based knowledge graph
         │   papers/ concepts/     │
         │   datasets/             │
         └─────────────────────────┘
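
The routing step in the diagram can be pictured as a registry lookup. The sketch below is only an illustration and is not taken from the project's router.py: the fields of `RawPaper`, the `SKILL_KEYWORDS` table, the keyword-overlap scoring, and the `GeneralSkill` fallback are all assumptions made for the example. Only `NLPSkill` appears elsewhere in this README; the other skill names are guesses modelled on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawPaper:
    """Hypothetical shape of a paper emitted by the Watcher."""
    title: str
    abstract: str


# Hypothetical registry: skill name -> keywords that hint at that domain.
SKILL_KEYWORDS: dict[str, set[str]] = {
    "NLPSkill": {"language", "translation", "transformer", "text"},
    "VisionSkill": {"image", "segmentation", "detection", "pixel"},
    "TimeSeriesSkill": {"forecasting", "temporal", "series", "seasonality"},
}

FALLBACK_SKILL = "GeneralSkill"  # assumed catch-all, not named in the README


def route(paper: RawPaper) -> str:
    """Return the skill whose keywords overlap most with the paper text."""
    words = set(f"{paper.title} {paper.abstract}".lower().split())
    scores = {name: len(words & keywords) for name, keywords in SKILL_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No keyword matched at all: hand the paper to the fallback skill.
    return best if scores[best] > 0 else FALLBACK_SKILL
```

A real router would likely use richer signals (for example arXiv categories), but the registry shape stays the same: adding a domain means adding one entry.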

Directory Structure

research-cruise/
├── .github/
│   └── workflows/
│       └── autonomous-tracker.yml   # CI/CD pipeline
├── vault/
│   ├── papers/                      # One .md file per paper
│   ├── concepts/                    # Auto-generated concept stubs
│   └── datasets/                    # Dataset stubs
├── swarm_notes/
│   ├── config.py                    # Configuration & env vars
│   ├── vault_manager.py             # Staging pattern (tmp_vault → vault)
│   ├── watcher.py                   # Configurable paper-source watcher
│   ├── router.py                    # Skill registry router
│   ├── analyst.py                   # pydantic-ai extraction agent
│   ├── vault_writer.py              # Markdown writer + public_feed.json
│   ├── federation.py                # Hive Mind federation agent
│   └── main.py                      # Pipeline orchestrator
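
The staging pattern mentioned for vault_manager.py (write to tmp_vault, then move into vault) could look roughly like the sketch below. The function names `stage_note` and `promote` are hypothetical and the real module may work differently. The sketch relies on `os.replace`, which swaps a file into place atomically only when the staging and vault directories are on the same filesystem.

```python
import os
from pathlib import Path


def stage_note(staging_dir: Path, relative_path: str, content: str) -> Path:
    """Write a note into the staging area (hypothetical helper)."""
    staged = staging_dir / relative_path
    staged.parent.mkdir(parents=True, exist_ok=True)
    staged.write_text(content, encoding="utf-8")
    return staged


def promote(staging_dir: Path, vault_dir: Path) -> list[Path]:
    """Move every staged file into the vault, preserving relative paths."""
    promoted: list[Path] = []
    # Materialise the listing first so moving files does not disturb iteration.
    staged_files = sorted(p for p in staging_dir.rglob("*") if p.is_file())
    for staged in staged_files:
        target = vault_dir / staged.relative_to(staging_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Per-file atomic swap, assuming both directories share a filesystem.
        os.replace(staged, target)
        promoted.append(target)
    return promoted
```

Calling `promote` only after the whole run has succeeded means a crashed run leaves the permanent vault untouched, which is the usual motivation for staging.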

Quick Start

Prerequisites

  • Python 3.11+
  • An LLM API key

Local Dev Run

# Install dependencies
uv sync

# Set your API key (export it in the shell, or put the same variables in a .env file)
export LLM_API_KEY="sk-..."
export PAPER_SOURCE="semantic_scholar"
export SEMANTIC_SCHOLAR_API_KEY="..."

# Prepare your configs in the configs/ folder
...

# Run the pipeline
python -m swarm_notes.main

Configuration (Environment Variables)

Use the example in the configs/ folder as a starting point for your own configuration.

CI/CD Setup

Add the required secret

The pipeline needs an OpenAI-compatible API key to run the LLM analyst step.

  1. Open your forked repository on GitHub.
  2. Go to Settings → Secrets and variables → Actions.
  3. Click New repository secret.
  4. Set Name to LLM_API_KEY and Secret to your API key (e.g. sk-...).
  5. Click Add secret.

Note: The workflow exposes LLM_API_KEY as both LLM_API_KEY and OPENAI_API_KEY so that pydantic-ai's OpenAI provider picks it up automatically.

The Hive Mind (Federation)

Every successful run updates public_feed.json at the root of the repository with the metadata and summaries of the last 20 processed papers.
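
A rolling feed like this can be maintained with a small merge-and-truncate step. The sketch below is an assumption about how that might work, not the project's vault_writer.py: it treats public_feed.json as a JSON list of objects keyed by a hypothetical `arxiv_id` field, and the function name `update_feed` is made up for the example. Only the limit of 20 comes from this README.

```python
import json
from pathlib import Path

FEED_LIMIT = 20  # from the README: the feed holds the last 20 processed papers


def update_feed(feed_path: Path, new_entries: list[dict]) -> list[dict]:
    """Merge new entries into the feed and keep only the most recent ones."""
    existing = []
    if feed_path.exists():
        existing = json.loads(feed_path.read_text(encoding="utf-8"))

    merged: dict[str, dict] = {}
    for entry in existing + new_entries:
        # Assumed key: "arxiv_id". Removing first moves a re-processed
        # paper to the end, so it counts as the most recent entry.
        merged.pop(entry["arxiv_id"], None)
        merged[entry["arxiv_id"]] = entry

    feed = list(merged.values())[-FEED_LIMIT:]
    feed_path.write_text(json.dumps(feed, indent=2), encoding="utf-8")
    return feed
```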

To subscribe to another agent's feed, pass their raw public_feed.json URL:

export FEDERATION_FEEDS="https://raw.githubusercontent.com/alice/research-cruise/main/public_feed.json,https://raw.githubusercontent.com/bob/research-cruise/main/public_feed.json"
python -m swarm_notes.main
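
On the Python side, a comma-separated variable like this has to be split and checked before any feed is fetched. The following is a minimal sketch of that parsing, assuming the variable holds plain http(s) URLs. The helper name `parse_feed_urls` is hypothetical, and the project's config.py may handle this differently.

```python
import os
from urllib.parse import urlparse


def parse_feed_urls(raw: str) -> list[str]:
    """Split a comma-separated URL list, dropping blanks and duplicates."""
    urls: list[str] = []
    for item in raw.split(","):
        url = item.strip()
        if not url:
            continue  # tolerate trailing or doubled commas
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
        if url not in urls:
            urls.append(url)
    return urls


# An unset variable simply yields an empty subscription list.
feeds = parse_feed_urls(os.environ.get("FEDERATION_FEEDS", ""))
```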

Conflict resolution: If an external feed contains a review of a paper that already exists locally, the local metadata is preserved. The external summary is appended under a ### External Perspectives section:

### External Perspectives

> "Transformers are over-engineered for this dataset." - @Agent_alice
> *(Retrieved 2024-01-15)*
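
The append-only behaviour described above might be implemented along these lines. This is a sketch under two assumptions: the External Perspectives section is always the last section of a note, and the agent handle and retrieval date are passed in as plain strings. The function name `append_external_perspective` is hypothetical.

```python
SECTION_HEADING = "### External Perspectives"


def append_external_perspective(
    note_text: str, summary: str, agent: str, retrieved: str
) -> str:
    """Append an external summary as a blockquote, leaving local content as-is."""
    # Collapse whitespace so a multi-line summary stays inside one quote line.
    one_line = " ".join(summary.split())
    quote = f'> "{one_line}" - @Agent_{agent}\n> *(Retrieved {retrieved})*'

    body = note_text.rstrip("\n")
    if SECTION_HEADING not in body:
        # First external review for this paper: create the section once.
        body += f"\n\n{SECTION_HEADING}"
    return f"{body}\n\n{quote}\n"
```

Because the function only ever adds text after the existing note, the local frontmatter and summary are never rewritten, which matches the stated rule that local metadata is preserved.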

Vault File Format

Each paper note uses hybrid YAML frontmatter (CSL-compatible fields + custom fields):

---
# CSL-compatible fields
title: "Attention Is All You Need"
author:
  - literal: "Ashish Vaswani"
issued:
  date-parts:
    - [2017, 6, 12]
url: "https://arxiv.org/abs/1706.03762"

# Custom fields
arxiv_id: "1706.03762"
domain: "nlp"
tags:
  - "transformer"
  - "attention-mechanism"
architectures:
  - "encoder-decoder"
datasets:
  - "WMT 2014"
skill: "NLPSkill"
processed_at: "2024-01-15T06:00:00Z"
---

Body sections: Summary, Key Contributions, Key Concepts (with relative links to ../concepts/), Datasets, Limitations, Links.
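
To read such a note programmatically, the frontmatter first has to be separated from the body. The sketch below does only that split, using the standard library. The function name `split_frontmatter` is hypothetical, and it assumes the note starts with a `---` line and that the next `---` line closes the block.

```python
def split_frontmatter(note_text: str) -> tuple[str, str]:
    """Return (frontmatter_yaml, body); frontmatter is "" if none is present."""
    lines = note_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", note_text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return frontmatter, body
    raise ValueError("frontmatter opened with '---' but was never closed")
```

The returned frontmatter string can then be handed to a YAML parser, for example PyYAML's `yaml.safe_load`, to get the CSL and custom fields as a dictionary.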

Taxonomy

taxonomy.json contains the controlled vocabulary of tags, architectures, and domains that is injected into the analyst's system prompt. Constraining the LLM to this vocabulary reduces hallucinated labels and keeps metadata consistent across papers. To add new terms, edit taxonomy.json.
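
As a rough illustration of taxonomy injection, the sketch below builds a prompt fragment from the vocabulary and flags any extracted term that falls outside it. The layout of taxonomy.json (top-level keys `domains`, `architectures`, and `tags`, each a list of strings) is an assumption, as are the function names. The project's analyst.py may structure this differently.

```python
import json
from pathlib import Path


def load_taxonomy(path: Path) -> dict[str, list[str]]:
    """Load the vocabulary; the key layout is assumed, not documented."""
    return json.loads(path.read_text(encoding="utf-8"))


def build_system_prompt(taxonomy: dict[str, list[str]]) -> str:
    """Render the allowed terms as a block for the analyst's system prompt."""
    lines = ["Use only the following controlled vocabulary when labelling a paper."]
    for field in ("domains", "architectures", "tags"):
        terms = ", ".join(sorted(taxonomy.get(field, [])))
        lines.append(f"{field}: {terms}")
    return "\n".join(lines)


def unknown_terms(values: list[str], allowed: list[str]) -> list[str]:
    """Return extracted values that are not in the vocabulary (case-insensitive)."""
    allowed_set = {term.lower() for term in allowed}
    return [value for value in values if value.lower() not in allowed_set]
```

Checking the model's output with something like `unknown_terms` after extraction is a useful second line of defence, since a prompt alone cannot guarantee the model stays within the list.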

License

MIT. See LICENSE.
