
Research Assistant

Intelligent research paper analysis pipeline with LLM-driven categorization

An intelligent pipeline for processing research papers using LLMs (Ollama or Gemini) with dynamic LLM-driven category generation, accurate PDF parsing, metadata extraction, multi-category relevance scoring, deduplication, and automated summarization.

Features

  • 🤖 Dynamic LLM-Driven Taxonomy: LLM generates categories from your research topic (no hardcoded categories!)
  • 📊 Multi-Category Scoring: Papers scored across ALL categories simultaneously for best-fit placement
  • 🎯 Flexible LLM Support: Use local Ollama models or the Google Gemini API
  • 🔧 Generic & Configurable: Runtime topic and directory configuration (no hardcoding)
  • 📄 Accurate PDF Parsing: PyMuPDF + OCR fallback (ocrmypdf + Tesseract)
  • 🔍 LLM-Based Metadata Extraction: Extract titles, authors, abstracts, and years using local or cloud LLMs
  • 🔄 Smart Deduplication: Exact (hash-based) and near-duplicate (MinHash-based) detection
  • 📝 Topic-Focused Summaries: Per-paper summaries with "how this helps your research"
  • 💾 Resumable: SQLite cache for embeddings and OCR outputs, index-based resume logic
  • 📤 Multiple Outputs: JSONL master index + CSV spreadsheet + Markdown summaries per category
  • ⏱️ Rate Limiting: Smart Gemini API rate limiting (10 RPM, 500 RPD) with warnings and interactive prompts
  • ✅ Comprehensive Testing: 220+ unit and integration tests with 77% coverage

Pipeline Flow (8 Passes)

graph TD
    A[๐Ÿ“ Input: PDF Directory + Topic] --> B[๐Ÿค– PASS 1: LLM Taxonomy Generation]
    B -->|Generate categories from topic ONLY| C[๏ฟฝ PASS 2: Inventory PDFs]
    C -->|Discover all PDFs| D[๐Ÿ” PASS 3: Metadata + Classification]
    D -->|Extract metadata + Multi-category scoring| E{Readable?}
    E -->|No| F[๏ฟฝ Move to need_human_element/]
    E -->|Yes| G{Topic Relevance?}
    G -->|< threshold| H[๏ฟฝ Move to quarantined/]
    G -->|>= threshold| I[๐Ÿ“ PASS 4: Move to Best Category]
    I -->|Highest scoring category| J[๐Ÿ”„ PASS 5: Deduplication]
    J -->|MinHash LSH| K{Duplicate?}
    K -->|Yes| L[๏ฟฝ Move to repeated/]
    K -->|No| M[๐Ÿ“ PASS 6: Update Manifests]
    M --> N[โœ๏ธ PASS 7: LLM Summarization]
    N -->|Topic-focused summaries| O[๐Ÿ’พ PASS 8: Generate Index]
    O --> P[๐Ÿ“Š index.csv]
    O --> Q[๐Ÿ“‹ index.jsonl]
    O --> R[๐Ÿ“ summaries/*.md]
    O --> S[๐Ÿ“œ manifests/*.json]
    O --> T[๐Ÿ—‚๏ธ categories.json]
    
    style B fill:#e1f5ff
    style D fill:#e1f5ff
    style N fill:#e1f5ff
    style F fill:#ffe1e1
    style H fill:#ffe1e1
    style L fill:#ffe1e1
    style P fill:#e1ffe1
    style Q fill:#e1ffe1
    style R fill:#e1ffe1
    style S fill:#e1ffe1
    style T fill:#e1ffe1

Architecture

research_assistant/
├── cli.py                  # Main CLI entry point (8-pass pipeline)
├── config.py               # Configuration and settings
├── core/
│   ├── taxonomy.py         # 🆕 LLM-based category generation from topic
│   ├── inventory.py        # Directory traversal and PDF discovery
│   ├── parser.py           # PDF text extraction (PyMuPDF + OCR)
│   ├── metadata.py         # LLM metadata extraction + multi-category scoring
│   ├── dedup.py            # MinHash near-duplicate detection
│   ├── embeddings.py       # Ollama embedding generation
│   ├── summarizer.py       # Topic-focused summary generation
│   ├── mover.py            # File moving with dynamic folder creation
│   ├── manifest.py         # Simplified category manifest tracking
│   └── outputs.py          # JSONL, CSV, and Markdown generation
├── utils/
│   ├── cache_manager.py    # SQLite-based caching
│   ├── llm_provider.py     # Unified Ollama/Gemini interface
│   ├── gemini_client.py    # Google Gemini API client
│   ├── hash.py             # Content hashing utilities
│   └── text.py             # Text normalization and processing
└── tests/                  # 220+ unit and integration tests
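
The piece that makes the Ollama/Gemini choice transparent to the rest of the pipeline is a single text-generation interface that both backends implement. A minimal sketch of that idea, with illustrative names (the real classes in utils/llm_provider.py may differ):

# Illustrative sketch of a unified provider interface; class and method
# names are assumptions, not the package's actual API.
from typing import Protocol

import requests


class LLMProvider(Protocol):
    def generate(self, prompt: str, temperature: float = 0.1) -> str: ...


class OllamaProvider:
    """Local backend talking to Ollama's /api/generate endpoint."""

    def __init__(self, model: str = "deepseek-r1:8b",
                 base_url: str = "http://localhost:11434") -> None:
        self.model, self.base_url = model, base_url

    def generate(self, prompt: str, temperature: float = 0.1) -> str:
        resp = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": self.model, "prompt": prompt,
                  "options": {"temperature": temperature}, "stream": False},
        )
        resp.raise_for_status()
        return resp.json()["response"]

# A GeminiProvider would implement the same generate() signature, so
# taxonomy, classification, and summarization code never needs to know
# which backend is active.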

Prerequisites

  • Python 3.12+
  • LLM Provider (choose one or both):
    • Ollama (local, free) with models:
      • deepseek-r1:8b (metadata extraction & classification)
      • nomic-embed-text (embeddings)
    • Google Gemini API (cloud, requires API key):
      • Set GEMINI_API_KEY environment variable
  • Tesseract (for OCR): brew install tesseract (macOS) or apt-get install tesseract-ocr (Linux)

Installation

From PyPI (Recommended)

# Install from PyPI
pip install research-assistant-llm

# Run interactive setup wizard (guides you through Ollama/Gemini setup)
research-assistant setup

# Or manual setup:
# Option 1: Use Ollama (local, free)
ollama pull deepseek-r1:8b
ollama pull nomic-embed-text

# Option 2: Use Gemini API (cloud-based)
export GEMINI_API_KEY="your_api_key_here"

From Source (Development)

# Clone repository
git clone https://github.com/rexmirak/research_assistant.git
cd research_assistant

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Install development dependencies
pip install -e ".[dev]"

API Key Setup

Gemini API (Cloud)

Option 1: Environment Variable (Recommended for CI/CD)

export GEMINI_API_KEY="your_api_key_here"
research-assistant process --llm-provider gemini --root-dir ./papers --topic "..."

Option 2: .env File (Convenient for local development)

# Create .env in your working directory
echo "GEMINI_API_KEY=your_api_key_here" > .env
research-assistant process --llm-provider gemini --root-dir ./papers --topic "..."

Option 3: Config File

# config.yaml
gemini:
  api_key: "${GEMINI_API_KEY}"  # References environment variable
  # OR
  api_key: "your_api_key_here"  # Direct (not recommended for version control)
research-assistant process --config-file config.yaml --root-dir ./papers --topic "..."

Get your Gemini API key: https://aistudio.google.com/app/apikey

Ollama (Local)

No API key needed! Just install Ollama and pull models:

# Install from https://ollama.com/download
ollama pull deepseek-r1:8b
ollama pull nomic-embed-text

research-assistant process --llm-provider ollama --root-dir ./papers --topic "..."

Quick Start

# View help
research-assistant --help
research-assistant process --help

# Basic usage with Gemini (recommended)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Prompt Injection Attacks in Large Language Models" \
  --llm-provider gemini \
  --workers 2

# With Ollama (local, requires models installed)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --llm-provider ollama \
  --workers 2

# Custom topic relevance threshold (default: 5/10)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --min-topic-relevance 7

# Resume from interrupted run (skips analyzed papers)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --resume

# Force regenerate categories (ignore cached taxonomy)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --force-regenerate-categories

# Dry-run (no file moves)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --dry-run

Configuration

Runtime configuration via CLI flags or config.yaml:

# config.yaml (optional)
llm_provider: gemini  # or 'ollama'

# Scoring thresholds
scoring:
  min_topic_relevance: 5  # Papers below this go to quarantined/ (1-10 scale)

# Deduplication
dedup:
  similarity_threshold: 0.95
  use_minhash: true
  num_perm: 128

# LLM providers
ollama:
  summarize_model: "deepseek-r1:8b"
  classify_model: "deepseek-r1:8b"
  embed_model: "nomic-embed-text"
  temperature: 0.1
  base_url: "http://localhost:11434"

gemini:
  api_key: null  # Set via GEMINI_API_KEY environment variable
  temperature: 0.1

# Rate limiting (Gemini API)
rate_limit:
  enabled: true
  rpm_limit: 10   # Requests per minute (Gemini free tier)
  rpd_limit: 500  # Requests per day (Gemini free tier)
  # Warnings at 50% (250 RPD) and 75% (375 RPD)
  # Interactive prompt at daily limit with options:
  #   1. Pause and resume tomorrow
  #   2. Switch to Ollama (local)
  #   3. Continue anyway (risky)

# Metadata enrichment
crossref:
  enabled: true
  email: "your.email@domain.com"  # Polite pool (optional)

# File organization
move:
  enabled: true
  track_manifest: true
  create_symlinks: false

# Processing
processing:
  workers: 2  # Parallel workers (recommend 2 for API rate limits)
  batch_size: 32
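
The dedup settings above drive PASS 5. A rough sketch of MinHash LSH near-duplicate detection using the datasketch library (an assumption on my part; the package's dedup.py may be implemented differently), wired to num_perm: 128 and similarity_threshold: 0.95:

# Sketch of PASS 5 near-duplicate detection with MinHash LSH.
from datasketch import MinHash, MinHashLSH


def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):  # word-level shingles for brevity
        m.update(token.encode("utf-8"))
    return m


lsh = MinHashLSH(threshold=0.95, num_perm=128)
papers = {
    "defense_mechanisms/smith2023.pdf": "we propose an input validation defense ...",
    "attack_vectors/smith2023_v2.pdf": "we propose an input validation defense ...",
}
for path, text in papers.items():
    sketch = minhash_of(text)
    if matches := lsh.query(sketch):  # near-duplicate already indexed?
        print(f"{path} -> repeated/ (duplicate of {matches[0]})")
    else:
        lsh.insert(path, sketch)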

Rate Limiting (Gemini API)

Automatic rate limiting prevents API failures and quota exhaustion (see the sketch after this list):

  • RPM Tracking: Enforces 10 requests per minute (Gemini free tier)
    • Automatically adds delays between requests to stay under the limit
    • Thread-safe implementation for parallel workers
  • RPD Tracking: Monitors the 500 requests per day limit
    • Warning at 50% usage (250 requests)
    • Warning at 75% usage (375 requests)
    • Interactive prompt at the limit with options:
      1. Pause: Stop processing, resume tomorrow (preserves progress)
      2. Switch to Ollama: Continue with the local LLM (no API costs)
      3. Continue anyway: Risk API errors (not recommended)
  • Persistent State: Tracks usage across runs in cache/rate_limit_state.json
  • Disable: Set rate_limit.enabled: false in the config to turn this off
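
A minimal sketch of the sliding-window RPM throttle described above (illustrative only; the package's real limiter also persists RPD counts to cache/rate_limit_state.json):

# Sliding-window throttle: at most rpm_limit requests per 60 seconds.
import threading
import time
from collections import deque


class RpmLimiter:
    def __init__(self, rpm_limit: int = 10) -> None:
        self.rpm_limit = rpm_limit
        self._stamps: deque[float] = deque()
        self._lock = threading.Lock()

    def wait(self) -> None:
        with self._lock:
            now = time.monotonic()
            # Drop timestamps that fell out of the 60 s window.
            while self._stamps and now - self._stamps[0] >= 60:
                self._stamps.popleft()
            if len(self._stamps) >= self.rpm_limit:
                # Sleep until the oldest request ages out of the window.
                time.sleep(60 - (now - self._stamps[0]))
            self._stamps.append(time.monotonic())


limiter = RpmLimiter(rpm_limit=10)
limiter.wait()  # call once before each Gemini request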

Example output:

โš ๏ธ  WARNING: 75% of daily Gemini quota used (375/500 requests)
Consider switching to Ollama to preserve remaining quota.

🛑 Daily Gemini API limit reached (500/500 requests)
Options:
  1. Pause processing and resume tomorrow
  2. Switch to Ollama (local, no API costs)
  3. Continue anyway (may fail)

Dynamic Category Generation

How it works:

  1. LLM generates categories from the topic ONLY (no papers analyzed yet)
    • Example topic: "Prompt Injection Attacks in Large Language Models"
    • The LLM generates 10-15 relevant categories with definitions
    • Cached in outputs/categories.json and cache/categories.json
  2. Multi-category scoring for each paper:
    • Paper scored against ALL categories simultaneously (1-10 scale)
    • Returns: topic_relevance, category_scores dict, reasoning
    • Paper placed in the highest-scoring category
  3. Topic relevance filtering:
    • Papers with topic_relevance < threshold → quarantined/
    • Configurable via --min-topic-relevance (default: 5/10)

Example Categories Generated:

{
  "attack_vectors": "Papers describing methods to perform prompt injection...",
  "defense_mechanisms": "Papers proposing techniques to defend against...",
  "detection_methods": "Papers focusing on identifying attacks...",
  "robustness_evaluation": "Papers developing metrics and benchmarks..."
}
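
Given scores over categories like those above, placement reduces to a threshold check plus an argmax. A small sketch with made-up scores (in the pipeline these numbers come back from the PASS 3 classification call):

# Placement rule: quarantine below the threshold, otherwise best-fit.
category_scores = {
    "attack_vectors": 9,
    "defense_mechanisms": 4,
    "detection_methods": 6,
    "robustness_evaluation": 3,
}
topic_relevance = 8
MIN_TOPIC_RELEVANCE = 5  # --min-topic-relevance

if topic_relevance < MIN_TOPIC_RELEVANCE:
    destination = "quarantined"  # filtered out of the taxonomy
else:
    destination = max(category_scores, key=category_scores.get)  # best fit
print(destination)  # attack_vectors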

Manifest System & Resume Logic

Manifest Structure (per category):

  • Tracks all papers in this category
  • Stores classification reasoning and scores
  • Enables resume functionality

Manifest Entry:

{
  "paper_id": "abc123def456...",
  "title": "Defending Against Prompt Injection Attacks",
  "path": "defense_mechanisms/smith2023.pdf",
  "content_hash": "sha256:...",
  "classification_reasoning": "Paper focuses on input validation...",
  "relevance_score": 9,
  "topic_relevance": 8,
  "analyzed": true
}

Resume Logic (sketched below):

  • Checks index.jsonl for papers with analyzed: true
  • Skips re-processing, loads from cache
  • More efficient than re-running entire pipeline
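
A minimal sketch of that resume check, assuming index.jsonl holds one JSON object per line with the paper_id and analyzed fields described under "Index Fields" below:

# Collect paper_ids that are already fully processed.
import json
from pathlib import Path


def already_analyzed(index_path: Path = Path("outputs/index.jsonl")) -> set[str]:
    done: set[str] = set()
    if index_path.exists():
        for line in index_path.read_text(encoding="utf-8").splitlines():
            entry = json.loads(line)
            if entry.get("analyzed"):
                done.add(entry["paper_id"])
    return done

# With --resume, any paper whose paper_id is in already_analyzed() is skipped.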

Output Structure

outputs/
├── categories.json          # 🆕 LLM-generated taxonomy with definitions
├── index.jsonl              # Full machine-readable index
├── index.csv                # Spreadsheet with all metadata
├── summaries/
│   ├── attack_vectors.md    # 🆕 Dynamic category names
│   ├── defense_mechanisms.md
│   ├── quarantined.md
│   └── ...
├── logs/
│   └── pipeline_YYYYMMDD_HHMMSS.log  # Detailed execution log
└── manifests/
    ├── attack_vectors.manifest.json  # 🆕 Dynamic categories
    ├── defense_mechanisms.manifest.json
    ├── quarantined.manifest.json
    ├── repeated.manifest.json
    └── need_human_element.manifest.json

Index Fields (JSONL/CSV)

New fields:

  • paper_id: Unique identifier (content hash)
  • title, authors, year, venue, doi, bibtex
  • category: Final category (best-fit from LLM scoring)
  • topic_relevance: 1-10 relevance to research topic
  • category_scores: JSON dict with scores for ALL categories
  • reasoning: LLM explanation for categorization
  • duplicate_of: Paper ID if duplicate
  • is_duplicate: Boolean flag
  • path: Current file path
  • summary_file: Link to markdown summary
  • analyzed: Boolean (true when processing complete)

Removed fields (from old system):

  • original_category - No longer tracked (papers start in flat directory)
  • status - Replaced by explicit category placement
  • include - Replaced by topic_relevance threshold
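
Because index.jsonl is plain JSON Lines, the fields above are easy to query for follow-up analysis. For example:

# List highly relevant, non-duplicate papers, most relevant first.
import json

with open("outputs/index.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]

keepers = [e for e in entries
           if not e.get("is_duplicate") and e.get("topic_relevance", 0) >= 7]
for e in sorted(keepers, key=lambda e: e["topic_relevance"], reverse=True):
    print(e["topic_relevance"], e["category"], e["title"])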

Advanced Usage

Custom topic relevance threshold

# Stricter filtering (only highly relevant papers)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --min-topic-relevance 7

# More permissive (include more papers)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --min-topic-relevance 3

Working with cached categories

# Use cached taxonomy (fast)
research-assistant process --root-dir ./papers --topic "..." --resume

# Force regenerate taxonomy (if topic changed)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --force-regenerate-categories

Parallel processing

# More workers (caution: rate limiter adds delays)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --workers 4

# Recommended for Gemini free tier (rate limiter enforces 10 RPM)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --workers 2

Troubleshooting

OCR failing

# Verify Tesseract installation
tesseract --version

# Install additional language packs if needed
brew install tesseract-lang

Ollama connection issues

# Check Ollama is running
ollama list

# Restart Ollama service
brew services restart ollama

Performance Tips

  • Parallel processing: Set --workers 2-4 for multiprocessing (rate limiter handles coordination)
  • Rate limit awareness: Gemini free tier enforces 10 RPM (automatically managed)
  • Cache warming: Run inventory + parsing first, then scoring/summarization
  • Selective OCR: Skip OCR for born-digital PDFs (auto-detected)
  • Batch embeddings: Automatically batched in groups of 64 (see the sketch after this list)
  • Resume capability: Use --resume to skip already-analyzed papers
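
For the batch-embeddings tip, a sketch of batched requests against Ollama's /api/embed endpoint, using the group size of 64 noted above (the package's embeddings.py and its SQLite cache may batch differently):

# Embed texts in batches of 64 via Ollama's /api/embed endpoint.
import requests


def embed_batch(texts: list[str], model: str = "nomic-embed-text",
                base_url: str = "http://localhost:11434") -> list[list[float]]:
    vectors: list[list[float]] = []
    for i in range(0, len(texts), 64):
        resp = requests.post(f"{base_url}/api/embed",
                             json={"model": model, "input": texts[i:i + 64]})
        resp.raise_for_status()
        vectors.extend(resp.json()["embeddings"])
    return vectors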

Testing & Quality

# Run full test suite
pytest

# Run with coverage
pytest --cov=core --cov=utils --cov-report=html

# Run specific test file
pytest tests/test_metadata.py -v

# Type checking
mypy core/ utils/ --explicit-package-bases --ignore-missing-imports

# Linting
flake8 core/ utils/ tests/

# Security scanning
pip-audit --requirement requirements.txt
bandit -r core/ utils/ -ll

CI/CD: GitHub Actions runs all quality checks on Python 3.12 & 3.13

  • ✅ Linting (flake8)
  • ✅ Type checking (mypy)
  • ✅ Security scanning (pip-audit, bandit)
  • ✅ Tests (pytest)
  • ✅ Documentation checks
  • ✅ Build verification

License

MIT



Download files

Download the file for your platform.

Source Distribution

research_assistant_llm-0.1.1.tar.gz (84.5 kB)

Uploaded Source

Built Distribution


research_assistant_llm-0.1.1-py3-none-any.whl (57.9 kB)

Uploaded Python 3

File details

Details for the file research_assistant_llm-0.1.1.tar.gz.

File metadata

  • Download URL: research_assistant_llm-0.1.1.tar.gz
  • Size: 84.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for research_assistant_llm-0.1.1.tar.gz:

  • SHA256: 46e848dc66114955b75b07425b8f7d332f0d9694485b7d6a7769476bc7a354ef
  • MD5: 02a4fdac4d12da3d6152ba14019c0504
  • BLAKE2b-256: fae54fad9d36b30c2555e2dceb866ef15086b5ef8a63a5c91203da6cda3dda9b
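
To verify a download locally, compare its SHA256 against the value above, e.g. with a quick Python check:

# Integrity check of the downloaded sdist against the published SHA256.
import hashlib

with open("research_assistant_llm-0.1.1.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

assert digest == "46e848dc66114955b75b07425b8f7d332f0d9694485b7d6a7769476bc7a354ef"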


Provenance

The following attestation bundles were made for research_assistant_llm-0.1.1.tar.gz:

Publisher: publish.yml on rexmirak/research_assistant

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file research_assistant_llm-0.1.1-py3-none-any.whl.

File hashes

Hashes for research_assistant_llm-0.1.1-py3-none-any.whl:

  • SHA256: 40ea710fe2b98a3c32e4c7e99c442c46540ac7c31840e843fcbd62a729d4e735
  • MD5: b6e1be34c281ca17a5344cfd66a9e48d
  • BLAKE2b-256: b36f1b94744253a2f41ca05db52a7740d5a62aace605141bfd59a933f5abe855


Provenance

The following attestation bundles were made for research_assistant_llm-0.1.1-py3-none-any.whl:

Publisher: publish.yml on rexmirak/research_assistant

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
