
Research Assistant

Intelligent research paper analysis pipeline with LLM-driven categorization

An intelligent pipeline for processing research papers using LLMs (Ollama or Gemini) with dynamic LLM-driven category generation, accurate PDF parsing, metadata extraction, multi-category relevance scoring, deduplication, and automated summarization.

Features

  • 🤖 Dynamic LLM-Driven Taxonomy: LLM generates categories from your research topic (no hardcoded categories!)
  • 📊 Multi-Category Scoring: Papers scored across ALL categories simultaneously for best-fit placement
  • 🎯 Flexible LLM Support: Use local Ollama models or the Google Gemini API
  • 🔧 Generic & Configurable: Runtime topic and directory configuration (no hardcoding)
  • 📄 Accurate PDF Parsing: PyMuPDF + OCR fallback (ocrmypdf + Tesseract)
  • 🔍 LLM-Based Metadata Extraction: Extract titles, authors, abstracts, and years using local or cloud LLMs
  • 🔄 Smart Deduplication: Exact (hash-based) and near-duplicate (MinHash-based) detection
  • 📝 Topic-Focused Summaries: Per-paper summaries with "how this helps your research"
  • 💾 Resumable: SQLite cache for embeddings and OCR outputs, index-based resume logic
  • 📤 Multiple Outputs: JSONL master index + CSV spreadsheet + Markdown summaries per category
  • ⏱️ Rate Limiting: Smart Gemini API rate limiting (10 RPM, 500 RPD) with warnings and interactive prompts
  • ✅ Comprehensive Testing: 220+ unit and integration tests with 77% coverage

Pipeline Flow (8 Passes)

graph TD
    A[๐Ÿ“ Input: PDF Directory + Topic] --> B[๐Ÿค– PASS 1: LLM Taxonomy Generation]
    B -->|Generate categories from topic ONLY| C[๏ฟฝ PASS 2: Inventory PDFs]
    C -->|Discover all PDFs| D[๐Ÿ” PASS 3: Metadata + Classification]
    D -->|Extract metadata + Multi-category scoring| E{Readable?}
    E -->|No| F[๏ฟฝ Move to need_human_element/]
    E -->|Yes| G{Topic Relevance?}
    G -->|< threshold| H[๏ฟฝ Move to quarantined/]
    G -->|>= threshold| I[๐Ÿ“ PASS 4: Move to Best Category]
    I -->|Highest scoring category| J[๐Ÿ”„ PASS 5: Deduplication]
    J -->|MinHash LSH| K{Duplicate?}
    K -->|Yes| L[๏ฟฝ Move to repeated/]
    K -->|No| M[๐Ÿ“ PASS 6: Update Manifests]
    M --> N[โœ๏ธ PASS 7: LLM Summarization]
    N -->|Topic-focused summaries| O[๐Ÿ’พ PASS 8: Generate Index]
    O --> P[๐Ÿ“Š index.csv]
    O --> Q[๐Ÿ“‹ index.jsonl]
    O --> R[๐Ÿ“ summaries/*.md]
    O --> S[๐Ÿ“œ manifests/*.json]
    O --> T[๐Ÿ—‚๏ธ categories.json]
    
    style B fill:#e1f5ff
    style D fill:#e1f5ff
    style N fill:#e1f5ff
    style F fill:#ffe1e1
    style H fill:#ffe1e1
    style L fill:#ffe1e1
    style P fill:#e1ffe1
    style Q fill:#e1ffe1
    style R fill:#e1ffe1
    style S fill:#e1ffe1
    style T fill:#e1ffe1

Architecture

research_assistant/
├── cli.py                  # Main CLI entry point (8-pass pipeline)
├── config.py               # Configuration and settings
├── core/
│   ├── taxonomy.py         # 🆕 LLM-based category generation from topic
│   ├── inventory.py        # Directory traversal and PDF discovery
│   ├── parser.py           # PDF text extraction (PyMuPDF + OCR)
│   ├── metadata.py         # LLM metadata extraction + multi-category scoring
│   ├── dedup.py            # MinHash near-duplicate detection
│   ├── embeddings.py       # Ollama embedding generation
│   ├── summarizer.py       # Topic-focused summary generation
│   ├── mover.py            # File moving with dynamic folder creation
│   ├── manifest.py         # Simplified category manifest tracking
│   └── outputs.py          # JSONL, CSV, and Markdown generation
├── utils/
│   ├── cache_manager.py    # SQLite-based caching
│   ├── llm_provider.py     # Unified Ollama/Gemini interface
│   ├── gemini_client.py    # Google Gemini API client
│   ├── hash.py             # Content hashing utilities
│   └── text.py             # Text normalization and processing
└── tests/                  # 220+ unit and integration tests
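
The piece that makes the Ollama/Gemini choice transparent to the rest of the pipeline is a single text-generation interface that both backends implement. A minimal sketch of that idea, with illustrative names (the real classes in utils/llm_provider.py may differ):

# Illustrative sketch of a unified provider interface; class and method
# names are assumptions, not the package's actual API.
from typing import Protocol

import requests


class LLMProvider(Protocol):
    def generate(self, prompt: str, temperature: float = 0.1) -> str: ...


class OllamaProvider:
    """Local backend talking to Ollama's /api/generate endpoint."""

    def __init__(self, model: str = "deepseek-r1:8b",
                 base_url: str = "http://localhost:11434") -> None:
        self.model, self.base_url = model, base_url

    def generate(self, prompt: str, temperature: float = 0.1) -> str:
        resp = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": self.model, "prompt": prompt,
                  "options": {"temperature": temperature}, "stream": False},
        )
        resp.raise_for_status()
        return resp.json()["response"]

# A GeminiProvider would implement the same generate() signature, so
# taxonomy, classification, and summarization code never needs to know
# which backend is active.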

Prerequisites

  • Python 3.12+
  • LLM Provider (choose one or both):
    • Ollama (local, free) with models:
      • deepseek-r1:8b (metadata extraction & classification)
      • nomic-embed-text (embeddings)
    • Google Gemini API (cloud, requires API key):
      • Set GEMINI_API_KEY environment variable
  • Tesseract (for OCR): brew install tesseract (macOS) or apt-get install tesseract-ocr (Linux)

Installation

From PyPI (Recommended)

# Install from PyPI
pip install research-assistant-llm

# Run interactive setup wizard (guides you through Ollama/Gemini setup)
research-assistant setup

# Or manual setup:
# Option 1: Use Ollama (local, free)
ollama pull deepseek-r1:8b
ollama pull nomic-embed-text

# Option 2: Use Gemini API (cloud-based)
export GEMINI_API_KEY="your_api_key_here"

From Source (Development)

# Clone repository
git clone https://github.com/rexmirak/research_assistant.git
cd research_assistant

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Install development dependencies
pip install -e ".[dev]"

API Key Setup

Gemini API (Cloud)

Option 1: Environment Variable (Recommended for CI/CD)

export GEMINI_API_KEY="your_api_key_here"
research-assistant process --llm-provider gemini --root-dir ./papers --topic "..."

Option 2: .env File (Convenient for local development)

# Create .env in your working directory
echo "GEMINI_API_KEY=your_api_key_here" > .env
research-assistant process --llm-provider gemini --root-dir ./papers --topic "..."

Option 3: Config File

# config.yaml
gemini:
  api_key: "${GEMINI_API_KEY}"  # References environment variable
  # OR
  api_key: "your_api_key_here"  # Direct (not recommended for version control)
research-assistant process --config-file config.yaml --root-dir ./papers --topic "..."

Get your Gemini API key: https://aistudio.google.com/app/apikey

Ollama (Local)

No API key needed! Just install Ollama and pull models:

# Install from https://ollama.com/download
ollama pull deepseek-r1:8b
ollama pull nomic-embed-text

research-assistant process --llm-provider ollama --root-dir ./papers --topic "..."

Quick Start

# View help
research-assistant --help
research-assistant process --help

# Basic usage with Gemini (recommended)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Prompt Injection Attacks in Large Language Models" \
  --llm-provider gemini \
  --workers 2

# With Ollama (local, requires models installed)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --llm-provider ollama \
  --workers 2

# Custom topic relevance threshold (default: 5/10)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --min-topic-relevance 7

# Resume from interrupted run (skips analyzed papers)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --resume

# Force regenerate categories (ignore cached taxonomy)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --force-regenerate-categories

# Dry-run (no file moves)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --dry-run

Configuration

Runtime configuration via CLI flags or config.yaml:

# config.yaml (optional)
llm_provider: gemini  # or 'ollama'

# Scoring thresholds
scoring:
  min_topic_relevance: 5  # Papers below this go to quarantined/ (1-10 scale)

# Deduplication
dedup:
  similarity_threshold: 0.95
  use_minhash: true
  num_perm: 128

# LLM providers
ollama:
  summarize_model: "deepseek-r1:8b"
  classify_model: "deepseek-r1:8b"
  embed_model: "nomic-embed-text"
  temperature: 0.1
  base_url: "http://localhost:11434"

gemini:
  api_key: null  # Set via GEMINI_API_KEY environment variable
  temperature: 0.1

# Rate limiting (Gemini API)
rate_limit:
  enabled: true
  rpm_limit: 10   # Requests per minute (Gemini free tier)
  rpd_limit: 500  # Requests per day (Gemini free tier)
  # Warnings at 50% (250 RPD) and 75% (375 RPD)
  # Interactive prompt at daily limit with options:
  #   1. Pause and resume tomorrow
  #   2. Switch to Ollama (local)
  #   3. Continue anyway (risky)

# Metadata enrichment
crossref:
  enabled: true
  email: "your.email@domain.com"  # Polite pool (optional)

# File organization
move:
  enabled: true
  track_manifest: true
  create_symlinks: false

# Processing
processing:
  workers: 2  # Parallel workers (recommend 2 for API rate limits)
  batch_size: 32
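
The dedup settings above drive PASS 5. A rough sketch of MinHash LSH near-duplicate detection using the datasketch library (an assumption on my part; the package's dedup.py may be implemented differently), wired to num_perm: 128 and similarity_threshold: 0.95:

# Sketch of PASS 5 near-duplicate detection with MinHash LSH.
from datasketch import MinHash, MinHashLSH


def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):  # word-level shingles for brevity
        m.update(token.encode("utf-8"))
    return m


lsh = MinHashLSH(threshold=0.95, num_perm=128)
papers = {
    "defense_mechanisms/smith2023.pdf": "we propose an input validation defense ...",
    "attack_vectors/smith2023_v2.pdf": "we propose an input validation defense ...",
}
for path, text in papers.items():
    sketch = minhash_of(text)
    if matches := lsh.query(sketch):  # near-duplicate already indexed?
        print(f"{path} -> repeated/ (duplicate of {matches[0]})")
    else:
        lsh.insert(path, sketch)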

Rate Limiting (Gemini API)

Automatic rate limiting prevents API failures and quota exhaustion (see the sketch after this list):

  • RPM Tracking: Enforces 10 requests per minute (Gemini free tier)
    • Automatically adds delays between requests to stay under the limit
    • Thread-safe implementation for parallel workers
  • RPD Tracking: Monitors the 500 requests per day limit
    • Warning at 50% usage (250 requests)
    • Warning at 75% usage (375 requests)
    • Interactive prompt at the limit with options:
      1. Pause: Stop processing, resume tomorrow (preserves progress)
      2. Switch to Ollama: Continue with the local LLM (no API costs)
      3. Continue anyway: Risk API errors (not recommended)
  • Persistent State: Tracks usage across runs in cache/rate_limit_state.json
  • Disable: Set rate_limit.enabled: false in the config to turn this off
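
A minimal sketch of the sliding-window RPM throttle described above (illustrative only; the package's real limiter also persists RPD counts to cache/rate_limit_state.json):

# Sliding-window throttle: at most rpm_limit requests per 60 seconds.
import threading
import time
from collections import deque


class RpmLimiter:
    def __init__(self, rpm_limit: int = 10) -> None:
        self.rpm_limit = rpm_limit
        self._stamps: deque[float] = deque()
        self._lock = threading.Lock()

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            # Drop timestamps that fell out of the 60 s window.
            while self._stamps and now - self._stamps[0] >= 60:
                self._stamps.popleft()
            if len(self._stamps) >= self.rpm_limit:
                # Sleep until the oldest request ages out of the window.
                time.sleep(60 - (now - self._stamps[0]))
            self._stamps.append(time.monotonic())


limiter = RpmLimiter(rpm_limit=10)
limiter.wait()  # call once before each Gemini request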

Example output:

โš ๏ธ  WARNING: 75% of daily Gemini quota used (375/500 requests)
Consider switching to Ollama to preserve remaining quota.

🛑 Daily Gemini API limit reached (500/500 requests)
Options:
  1. Pause processing and resume tomorrow
  2. Switch to Ollama (local, no API costs)
  3. Continue anyway (may fail)

Dynamic Category Generation

How it works:

  1. LLM generates categories from the topic ONLY (no papers analyzed yet)
    • Example topic: "Prompt Injection Attacks in Large Language Models"
    • The LLM generates 10-15 relevant categories with definitions
    • Cached in outputs/categories.json and cache/categories.json
  2. Multi-category scoring for each paper:
    • Paper scored against ALL categories simultaneously (1-10 scale)
    • Returns: topic_relevance, category_scores dict, reasoning
    • Paper placed in the highest-scoring category
  3. Topic relevance filtering:
    • Papers with topic_relevance < threshold → quarantined/
    • Configurable via --min-topic-relevance (default: 5/10)

Example Categories Generated:

{
  "attack_vectors": "Papers describing methods to perform prompt injection...",
  "defense_mechanisms": "Papers proposing techniques to defend against...",
  "detection_methods": "Papers focusing on identifying attacks...",
  "robustness_evaluation": "Papers developing metrics and benchmarks..."
}
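
Given scores over categories like those above, placement reduces to a threshold check plus an argmax. A small sketch with made-up scores (in the pipeline these numbers come back from the PASS 3 classification call):

# Placement rule: quarantine below the threshold, otherwise best-fit.
category_scores = {
    "attack_vectors": 9,
    "defense_mechanisms": 4,
    "detection_methods": 6,
    "robustness_evaluation": 3,
}
topic_relevance = 8
MIN_TOPIC_RELEVANCE = 5  # --min-topic-relevance

if topic_relevance < MIN_TOPIC_RELEVANCE:
    destination = "quarantined"  # filtered out of the taxonomy
else:
    destination = max(category_scores, key=category_scores.get)  # best fit
print(destination)  # attack_vectors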

Manifest System & Resume Logic

Manifest Structure (per category):

  • Tracks all papers in this category
  • Stores classification reasoning and scores
  • Enables resume functionality

Manifest Entry:

{
  "paper_id": "abc123def456...",
  "title": "Defending Against Prompt Injection Attacks",
  "path": "defense_mechanisms/smith2023.pdf",
  "content_hash": "sha256:...",
  "classification_reasoning": "Paper focuses on input validation...",
  "relevance_score": 9,
  "topic_relevance": 8,
  "analyzed": true
}

Resume Logic (sketched below):

  • Checks index.jsonl for papers with analyzed: true
  • Skips re-processing, loads from cache
  • More efficient than re-running entire pipeline
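
A minimal sketch of that resume check, assuming index.jsonl holds one JSON object per line with the paper_id and analyzed fields described under "Index Fields" below:

# Collect paper_ids that are already fully processed.
import json
from pathlib import Path


def already_analyzed(index_path: Path = Path("outputs/index.jsonl")) -> set[str]:
    done: set[str] = set()
    if index_path.exists():
        for line in index_path.read_text(encoding="utf-8").splitlines():
            entry = json.loads(line)
            if entry.get("analyzed"):
                done.add(entry["paper_id"])
    return done

# With --resume, any paper whose paper_id is in already_analyzed() is skipped.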

Output Structure

outputs/
├── categories.json          # 🆕 LLM-generated taxonomy with definitions
├── index.jsonl              # Full machine-readable index
├── index.csv                # Spreadsheet with all metadata
├── summaries/
│   ├── attack_vectors.md    # 🆕 Dynamic category names
│   ├── defense_mechanisms.md
│   ├── quarantined.md
│   └── ...
├── logs/
│   └── pipeline_YYYYMMDD_HHMMSS.log  # Detailed execution log
└── manifests/
    ├── attack_vectors.manifest.json  # 🆕 Dynamic categories
    ├── defense_mechanisms.manifest.json
    ├── quarantined.manifest.json
    ├── repeated.manifest.json
    └── need_human_element.manifest.json

Index Fields (JSONL/CSV)

New fields:

  • paper_id: Unique identifier (content hash)
  • title, authors, year, venue, doi, bibtex
  • category: Final category (best-fit from LLM scoring)
  • topic_relevance: 1-10 relevance to research topic
  • category_scores: JSON dict with scores for ALL categories
  • reasoning: LLM explanation for categorization
  • duplicate_of: Paper ID if duplicate
  • is_duplicate: Boolean flag
  • path: Current file path
  • summary_file: Link to markdown summary
  • analyzed: Boolean (true when processing complete)

Removed fields (from old system):

  • original_category - No longer tracked (papers start in flat directory)
  • status - Replaced by explicit category placement
  • include - Replaced by topic_relevance threshold
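
Because index.jsonl is plain JSON Lines, the fields above are easy to query for follow-up analysis. For example:

# List highly relevant, non-duplicate papers, most relevant first.
import json

with open("outputs/index.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]

keepers = [e for e in entries
           if not e.get("is_duplicate") and e.get("topic_relevance", 0) >= 7]
for e in sorted(keepers, key=lambda e: e["topic_relevance"], reverse=True):
    print(e["topic_relevance"], e["category"], e["title"])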

Advanced Usage

Custom topic relevance threshold

# Stricter filtering (only highly relevant papers)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --min-topic-relevance 7

# More permissive (include more papers)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --min-topic-relevance 3

Working with cached categories

# Use cached taxonomy (fast)
research-assistant process --root-dir ./papers --topic "..." --resume

# Force regenerate taxonomy (if topic changed)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --force-regenerate-categories

Parallel processing

# More workers (caution: rate limiter adds delays)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --workers 4

# Recommended for Gemini free tier (rate limiter enforces 10 RPM)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --workers 2

Troubleshooting

OCR failing

# Verify Tesseract installation
tesseract --version

# Install additional language packs if needed
brew install tesseract-lang

Ollama connection issues

# Check Ollama is running
ollama list

# Restart Ollama service
brew services restart ollama

Performance Tips

  • Parallel processing: Set --workers 2-4 for multiprocessing (rate limiter handles coordination)
  • Rate limit awareness: Gemini free tier enforces 10 RPM (automatically managed)
  • Cache warming: Run inventory + parsing first, then scoring/summarization
  • Selective OCR: Skip OCR for born-digital PDFs (auto-detected)
  • Batch embeddings: Automatically batched in groups of 64 (see the sketch after this list)
  • Resume capability: Use --resume to skip already-analyzed papers
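
For the batch-embeddings tip, a sketch of batched requests against Ollama's /api/embed endpoint, using the group size of 64 noted above (the package's embeddings.py and its SQLite cache may batch differently):

# Embed texts in batches of 64 via Ollama's /api/embed endpoint.
import requests


def embed_batch(texts: list[str], model: str = "nomic-embed-text",
                base_url: str = "http://localhost:11434") -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(texts), 64):
        resp = requests.post(f"{base_url}/api/embed",
                             json={"model": model, "input": texts[i:i + 64]})
        resp.raise_for_status()
        vectors.extend(resp.json()["embeddings"])
    return vectors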

Testing & Quality

# Run full test suite
pytest

# Run with coverage
pytest --cov=core --cov=utils --cov-report=html

# Run specific test file
pytest tests/test_metadata.py -v

# Type checking
mypy core/ utils/ --explicit-package-bases --ignore-missing-imports

# Linting
flake8 core/ utils/ tests/

# Security scanning
pip-audit --requirement requirements.txt
bandit -r core/ utils/ -ll

CI/CD: GitHub Actions runs all quality checks on Python 3.12 & 3.13

  • ✅ Linting (flake8)
  • ✅ Type checking (mypy)
  • ✅ Security scanning (pip-audit, bandit)
  • ✅ Tests (pytest)
  • ✅ Documentation checks
  • ✅ Build verification

License

MIT



Download files

Download the file for your platform.

Source Distribution

research_assistant_llm-0.1.1.tar.gz (84.5 kB)

Uploaded Source

Built Distribution


research_assistant_llm-0.1.1-py3-none-any.whl (57.9 kB)

Uploaded Python 3

File details

Details for the file research_assistant_llm-0.1.1.tar.gz.

File metadata

  • Download URL: research_assistant_llm-0.1.1.tar.gz
  • Size: 84.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for research_assistant_llm-0.1.1.tar.gz:

  • SHA256: 46e848dc66114955b75b07425b8f7d332f0d9694485b7d6a7769476bc7a354ef
  • MD5: 02a4fdac4d12da3d6152ba14019c0504
  • BLAKE2b-256: fae54fad9d36b30c2555e2dceb866ef15086b5ef8a63a5c91203da6cda3dda9b
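
To verify a download locally, compare its SHA256 against the value above, e.g. with a quick Python check:

# Integrity check of the downloaded sdist against the published SHA256.
import hashlib

with open("research_assistant_llm-0.1.1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == "46e848dc66114955b75b07425b8f7d332f0d9694485b7d6a7769476bc7a354ef"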


Provenance

The following attestation bundles were made for research_assistant_llm-0.1.1.tar.gz:

Publisher: publish.yml on rexmirak/research_assistant

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file research_assistant_llm-0.1.1-py3-none-any.whl.

File hashes

Hashes for research_assistant_llm-0.1.1-py3-none-any.whl:

  • SHA256: 40ea710fe2b98a3c32e4c7e99c442c46540ac7c31840e843fcbd62a729d4e735
  • MD5: b6e1be34c281ca17a5344cfd66a9e48d
  • BLAKE2b-256: b36f1b94744253a2f41ca05db52a7740d5a62aace605141bfd59a933f5abe855


Provenance

The following attestation bundles were made for research_assistant_llm-0.1.1-py3-none-any.whl:

Publisher: publish.yml on rexmirak/research_assistant

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
