# Research Assistant

Intelligent research paper analysis pipeline with LLM-driven categorization.
A pipeline for processing research papers with LLMs (Ollama or Gemini): dynamic LLM-driven category generation, accurate PDF parsing, metadata extraction, multi-category relevance scoring, deduplication, and automated summarization.
## Features
- **Dynamic LLM-Driven Taxonomy**: The LLM generates categories from your research topic (no hardcoded categories!)
- **Multi-Category Scoring**: Papers are scored across ALL categories simultaneously for best-fit placement
- **Flexible LLM Support**: Use local Ollama models or the Google Gemini API
- **Generic & Configurable**: Runtime topic and directory configuration (no hardcoding)
- **Accurate PDF Parsing**: PyMuPDF with an OCR fallback (ocrmypdf + Tesseract)
- **LLM-Based Metadata Extraction**: Extract titles, authors, abstracts, and years using local or cloud LLMs
- **Smart Deduplication**: Exact (hash-based) and near-duplicate (MinHash-based) detection
- **Topic-Focused Summaries**: Per-paper summaries with "how this helps your research"
- **Resumable**: SQLite cache for embeddings and OCR outputs, with index-based resume logic
- **Multiple Outputs**: JSONL master index + CSV spreadsheet + Markdown summaries per category
- **Rate Limiting**: Smart Gemini API rate limiting (10 RPM, 500 RPD) with warnings and interactive prompts
- **Comprehensive Testing**: 220+ unit and integration tests with 77% coverage
## Pipeline Flow (8 Passes)
```mermaid
graph TD
    A[Input: PDF Directory + Topic] --> B[PASS 1: LLM Taxonomy Generation]
    B -->|Generate categories from topic ONLY| C[PASS 2: Inventory PDFs]
    C -->|Discover all PDFs| D[PASS 3: Metadata + Classification]
    D -->|Extract metadata + multi-category scoring| E{Readable?}
    E -->|No| F[Move to need_human_element/]
    E -->|Yes| G{Topic relevance?}
    G -->|< threshold| H[Move to quarantined/]
    G -->|>= threshold| I[PASS 4: Move to Best Category]
    I -->|Highest-scoring category| J[PASS 5: Deduplication]
    J -->|MinHash LSH| K{Duplicate?}
    K -->|Yes| L[Move to repeated/]
    K -->|No| M[PASS 6: Update Manifests]
    M --> N[PASS 7: LLM Summarization]
    N -->|Topic-focused summaries| O[PASS 8: Generate Index]
    O --> P[index.csv]
    O --> Q[index.jsonl]
    O --> R[summaries/*.md]
    O --> S[manifests/*.json]
    O --> T[categories.json]
    style B fill:#e1f5ff
    style D fill:#e1f5ff
    style N fill:#e1f5ff
    style F fill:#ffe1e1
    style H fill:#ffe1e1
    style L fill:#ffe1e1
    style P fill:#e1ffe1
    style Q fill:#e1ffe1
    style R fill:#e1ffe1
    style S fill:#e1ffe1
    style T fill:#e1ffe1
```
## Architecture
```
research_assistant/
├── cli.py                 # Main CLI entry point (8-pass pipeline)
├── config.py              # Configuration and settings
├── core/
│   ├── taxonomy.py        # LLM-based category generation from topic
│   ├── inventory.py       # Directory traversal and PDF discovery
│   ├── parser.py          # PDF text extraction (PyMuPDF + OCR)
│   ├── metadata.py        # LLM metadata extraction + multi-category scoring
│   ├── dedup.py           # MinHash near-duplicate detection
│   ├── embeddings.py      # Ollama embedding generation
│   ├── summarizer.py      # Topic-focused summary generation
│   ├── mover.py           # File moving with dynamic folder creation
│   ├── manifest.py        # Category manifest tracking
│   └── outputs.py         # JSONL, CSV, and Markdown generation
├── utils/
│   ├── cache_manager.py   # SQLite-based caching
│   ├── llm_provider.py    # Unified Ollama/Gemini interface
│   ├── gemini_client.py   # Google Gemini API client
│   ├── hash.py            # Content hashing utilities
│   └── text.py            # Text normalization and processing
└── tests/                 # 100+ unit and integration tests
```
## Prerequisites
- Python 3.12+
- LLM provider (choose one or both):
  - Ollama (local) with models: `deepseek-r1:8b` (metadata extraction & classification) and `nomic-embed-text` (embeddings)
  - Google Gemini API (cloud, requires an API key): set the `GEMINI_API_KEY` environment variable
- Tesseract (for OCR): `brew install tesseract` (macOS) or `apt-get install tesseract-ocr` (Linux)
## Installation
### From PyPI (Recommended)
```bash
# Install from PyPI
pip install research-assistant-llm

# Run the interactive setup wizard (guides you through Ollama/Gemini setup)
research-assistant setup

# Or set up manually:
# Option 1: Use Ollama (local, free)
ollama pull deepseek-r1:8b
ollama pull nomic-embed-text

# Option 2: Use the Gemini API (cloud-based)
export GEMINI_API_KEY="your_api_key_here"
```
### From Source (Development)
```bash
# Clone the repository
git clone https://github.com/rexmirak/research_assistant.git
cd research_assistant

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Install development dependencies
pip install -e ".[dev]"
```
## API Key Setup
### Gemini API (Cloud)
**Option 1: Environment Variable (Recommended for CI/CD)**
```bash
export GEMINI_API_KEY="your_api_key_here"
research-assistant process --llm-provider gemini --root-dir ./papers --topic "..."
```
**Option 2: `.env` File (Convenient for local development)**
```bash
# Create .env in your working directory
echo "GEMINI_API_KEY=your_api_key_here" > .env
research-assistant process --llm-provider gemini --root-dir ./papers --topic "..."
```
**Option 3: Config File**
```yaml
# config.yaml
gemini:
  api_key: "${GEMINI_API_KEY}"    # References an environment variable
  # OR set it directly (not recommended for version control):
  # api_key: "your_api_key_here"
```

```bash
research-assistant process --config-file config.yaml --root-dir ./papers --topic "..."
```
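For intuition about the `"${GEMINI_API_KEY}"` reference style, here is a minimal sketch of resolving it from the environment when loading the config; `os.path.expandvars` is an illustrative assumption, not necessarily how the package's own loader works:

```python
# Minimal sketch: resolve "${GEMINI_API_KEY}" in config.yaml from the
# environment. expandvars-based resolution is an assumption for illustration.
import os
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# "${GEMINI_API_KEY}" -> the value of the environment variable
api_key = os.path.expandvars(cfg["gemini"]["api_key"])
```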
Get your Gemini API key: https://aistudio.google.com/app/apikey
### Ollama (Local)
No API key needed! Just install Ollama and pull models:
```bash
# Install from https://ollama.com/download
ollama pull deepseek-r1:8b
ollama pull nomic-embed-text
research-assistant process --llm-provider ollama --root-dir ./papers --topic "..."
```
## Quick Start
```bash
# View help
research-assistant --help
research-assistant process --help

# Basic usage with Gemini (recommended)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Prompt Injection Attacks in Large Language Models" \
  --llm-provider gemini \
  --workers 2

# With Ollama (local, requires models installed)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --llm-provider ollama \
  --workers 2

# Custom topic relevance threshold (default: 5/10)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --min-topic-relevance 7

# Resume an interrupted run (skips already-analyzed papers)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --resume

# Force-regenerate categories (ignore the cached taxonomy)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --force-regenerate-categories

# Dry run (no file moves)
research-assistant process \
  --root-dir /path/to/papers \
  --topic "Your research topic" \
  --dry-run
```
## Configuration
Runtime configuration via CLI flags or `config.yaml`:
```yaml
# config.yaml (optional)
llm_provider: gemini  # or 'ollama'

# Scoring thresholds
scoring:
  min_topic_relevance: 5  # Papers below this go to quarantined/ (1-10 scale)

# Deduplication
dedup:
  similarity_threshold: 0.95
  use_minhash: true
  num_perm: 128

# LLM providers
ollama:
  summarize_model: "deepseek-r1:8b"
  classify_model: "deepseek-r1:8b"
  embed_model: "nomic-embed-text"
  temperature: 0.1
  base_url: "http://localhost:11434"

gemini:
  api_key: null  # Set via the GEMINI_API_KEY environment variable
  temperature: 0.1

# Rate limiting (Gemini API)
rate_limit:
  enabled: true
  rpm_limit: 10   # Requests per minute (Gemini free tier)
  rpd_limit: 500  # Requests per day (Gemini free tier)
  # Warnings at 50% (250 RPD) and 75% (375 RPD)
  # Interactive prompt at the daily limit with options:
  #   1. Pause and resume tomorrow
  #   2. Switch to Ollama (local)
  #   3. Continue anyway (risky)

# Metadata enrichment
crossref:
  enabled: true
  email: "your.email@domain.com"  # Polite pool (optional)

# File organization
move:
  enabled: true
  track_manifest: true
  create_symlinks: false

# Processing
processing:
  workers: 2      # Parallel workers (recommend 2 for API rate limits)
  batch_size: 32
```
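As an illustration of the `dedup` settings above, here is a minimal near-duplicate check built on the `datasketch` library; the library choice and whitespace tokenization are assumptions for this sketch, not necessarily what `core/dedup.py` does:

```python
# Minimal MinHash LSH sketch mirroring similarity_threshold=0.95 and
# num_perm=128 from the config above.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():  # naive whitespace tokenization
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.95, num_perm=128)
lsh.insert("paper_a", minhash_of("Prompt Injection Attacks on Large Language Models"))

# Same tokens after lowercasing, so the Jaccard similarity is 1.0:
candidate = minhash_of("prompt injection attacks on LARGE language models")
print(lsh.query(candidate))  # ['paper_a'] -> treat as a near-duplicate
```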
## Rate Limiting (Gemini API)
Automatic rate limiting prevents API failures and quota exhaustion:
- **RPM tracking**: Enforces 10 requests per minute (Gemini free tier); a minimal limiter sketch follows the example output below
  - Automatically adds delays between requests to stay under the limit
  - Thread-safe implementation for parallel workers
- **RPD tracking**: Monitors the 500-requests-per-day limit
  - Warning at 50% usage (250 requests)
  - Warning at 75% usage (375 requests)
  - Interactive prompt at the limit with options:
    - Pause: stop processing and resume tomorrow (preserves progress)
    - Switch to Ollama: continue with the local LLM (no API costs)
    - Continue anyway: risk API errors (not recommended)
- **Persistent state**: Tracks usage across runs in `cache/rate_limit_state.json`
- **Disable**: Set `rate_limit.enabled: false` in the config
Example output:
```
WARNING: 75% of daily Gemini quota used (375/500 requests)
Consider switching to Ollama to preserve remaining quota.

Daily Gemini API limit reached (500/500 requests)
Options:
  1. Pause processing and resume tomorrow
  2. Switch to Ollama (local, no API costs)
  3. Continue anyway (may fail)
```
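For intuition, here is a minimal thread-safe RPM limiter in the spirit described above; this is a sketch of the behavior, not the package's actual implementation:

```python
# Sliding-window limiter: at most rpm_limit calls per 60 seconds.
import threading
import time
from collections import deque

class RpmLimiter:
    def __init__(self, rpm_limit: int = 10):
        self.rpm_limit = rpm_limit
        self.calls: deque[float] = deque()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        with self.lock:
            now = time.monotonic()
            # Discard timestamps that have left the 60-second window
            while self.calls and now - self.calls[0] >= 60:
                self.calls.popleft()
            if len(self.calls) >= self.rpm_limit:
                # Sleep until the oldest call ages out; holding the lock
                # intentionally serializes waiting workers
                time.sleep(60 - (now - self.calls[0]))
            self.calls.append(time.monotonic())

limiter = RpmLimiter(rpm_limit=10)
limiter.acquire()  # call once before every Gemini request
```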
## Dynamic Category Generation
How it works:
1. **The LLM generates categories from the topic ONLY** (no papers analyzed yet)
   - Example topic: "Prompt Injection Attacks in Large Language Models"
   - The LLM generates 10-15 relevant categories with definitions
   - Cached in `outputs/categories.json` and `cache/categories.json`
2. **Multi-category scoring** for each paper
   - Each paper is scored against ALL categories simultaneously (1-10 scale)
   - Returns `topic_relevance`, a `category_scores` dict, and `reasoning`
   - The paper is placed in the highest-scoring category
3. **Topic relevance filtering**
   - Papers with `topic_relevance < threshold` go to `quarantined/`
   - Configurable via `--min-topic-relevance` (default: 5/10)
Example Categories Generated:
```json
{
  "attack_vectors": "Papers describing methods to perform prompt injection...",
  "defense_mechanisms": "Papers proposing techniques to defend against...",
  "detection_methods": "Papers focusing on identifying attacks...",
  "robustness_evaluation": "Papers developing metrics and benchmarks..."
}
```
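A minimal sketch of how a scored paper lands in its best-fit category; the score values here are hypothetical:

```python
# Hypothetical multi-category scoring output for one paper, mirroring the
# generated categories above and the 1-10 scale described in the docs.
category_scores = {
    "attack_vectors": 9,
    "defense_mechanisms": 4,
    "detection_methods": 6,
    "robustness_evaluation": 3,
}
topic_relevance = 8
MIN_TOPIC_RELEVANCE = 5  # the --min-topic-relevance default

if topic_relevance < MIN_TOPIC_RELEVANCE:
    destination = "quarantined"  # filtered out as off-topic
else:
    destination = max(category_scores, key=category_scores.get)

print(destination)  # attack_vectors
```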
## Manifest System & Resume Logic
Manifest Structure (per category):
- Tracks all papers in this category
- Stores classification reasoning and scores
- Enables resume functionality
Example manifest entry:
```json
{
  "paper_id": "abc123def456...",
  "title": "Defending Against Prompt Injection Attacks",
  "path": "defense_mechanisms/smith2023.pdf",
  "content_hash": "sha256:...",
  "classification_reasoning": "Paper focuses on input validation...",
  "relevance_score": 9,
  "topic_relevance": 8,
  "analyzed": true
}
```
Resume logic (sketched below):
- Checks `index.jsonl` for papers with `analyzed: true`
- Skips re-processing and loads results from the cache
- More efficient than re-running the entire pipeline
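A minimal sketch of that resume check, assuming `outputs/index.jsonl` as the index path (the real pipeline may locate it differently):

```python
# Read index.jsonl and collect the IDs of papers already marked analyzed,
# so a resumed run can skip them. Illustrative sketch, not the package's code.
import json
from pathlib import Path

def already_analyzed(index_path: Path) -> set[str]:
    done: set[str] = set()
    if index_path.exists():
        with index_path.open() as f:
            for line in f:
                entry = json.loads(line)
                if entry.get("analyzed"):
                    done.add(entry["paper_id"])
    return done

skip_ids = already_analyzed(Path("outputs/index.jsonl"))
```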
## Output Structure
```
outputs/
├── categories.json                   # LLM-generated taxonomy with definitions
├── index.jsonl                       # Full machine-readable index
├── index.csv                         # Spreadsheet with all metadata
├── summaries/
│   ├── attack_vectors.md             # Dynamic category names
│   ├── defense_mechanisms.md
│   ├── quarantined.md
│   └── ...
├── logs/
│   └── pipeline_YYYYMMDD_HHMMSS.log  # Detailed execution log
└── manifests/
    ├── attack_vectors.manifest.json  # Dynamic categories
    ├── defense_mechanisms.manifest.json
    ├── quarantined.manifest.json
    ├── repeated.manifest.json
    └── need_human_element.manifest.json
```
## Index Fields (JSONL/CSV)
- `paper_id`: Unique identifier (content hash)
- `title`, `authors`, `year`, `venue`, `doi`, `bibtex`
- `category`: Final category (best fit from LLM scoring)
- `topic_relevance`: 1-10 relevance to the research topic
- `category_scores`: JSON dict with scores for ALL categories
- `reasoning`: LLM explanation for the categorization
- `duplicate_of`: Paper ID if duplicate
- `is_duplicate`: Boolean flag
- `path`: Current file path
- `summary_file`: Link to the Markdown summary
- `analyzed`: Boolean (true when processing is complete)
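Since `category_scores` is stored as a JSON dict inside a single column, here is a minimal sketch of loading and filtering the CSV index; pandas is an assumption here, any CSV reader works:

```python
# Load index.csv, decode the category_scores JSON column, and keep highly
# relevant non-duplicates. Illustrative sketch only.
import json
import pandas as pd

df = pd.read_csv("outputs/index.csv")
df["category_scores"] = df["category_scores"].apply(json.loads)

relevant = df[(df["topic_relevance"] >= 7) & (~df["is_duplicate"])]
print(relevant[["title", "category", "topic_relevance"]])
```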
## Advanced Usage
### Custom topic relevance threshold
```bash
# Stricter filtering (only highly relevant papers)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --min-topic-relevance 7

# More permissive (include more papers)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --min-topic-relevance 3
```
### Working with cached categories
```bash
# Use the cached taxonomy (fast)
research-assistant process --root-dir ./papers --topic "..." --resume

# Force-regenerate the taxonomy (if the topic changed)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --force-regenerate-categories
```
### Parallel processing
```bash
# More workers (caution: the rate limiter adds delays)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --workers 4

# Recommended for the Gemini free tier (rate limiter enforces 10 RPM)
research-assistant process \
  --root-dir ./papers \
  --topic "..." \
  --workers 2
```
## Troubleshooting
### OCR failing
```bash
# Verify the Tesseract installation
tesseract --version

# Install additional language packs if needed
brew install tesseract-lang
```
### Ollama connection issues
```bash
# Check that Ollama is running
ollama list

# Restart the Ollama service
brew services restart ollama
```
## Performance Tips
- **Parallel processing**: Set `--workers 2-4` for multiprocessing (the rate limiter handles coordination)
- **Rate limit awareness**: The Gemini free tier enforces 10 RPM (automatically managed)
- **Cache warming**: Run inventory + parsing first, then scoring/summarization
- **Selective OCR**: Skip OCR for born-digital PDFs (auto-detected; see the sketch after this list)
- **Batch embeddings**: Automatically batched in groups of 64
- **Resume capability**: Use `--resume` to skip already-analyzed papers
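A minimal sketch of the selective-OCR idea: extract text with PyMuPDF and invoke ocrmypdf only when a PDF looks scanned. The 100-character heuristic and the output filename are assumptions for illustration:

```python
# Extract text with PyMuPDF; if a PDF yields almost no text, assume it is
# scanned, OCR it with ocrmypdf, then re-extract. Illustrative sketch only.
import fitz  # PyMuPDF
import ocrmypdf

def extract_text(path: str) -> str:
    with fitz.open(path) as doc:
        text = "".join(page.get_text() for page in doc)
    if len(text.strip()) < 100:  # heuristic: likely a scanned PDF
        ocrmypdf.ocr(path, "ocr_output.pdf", skip_text=True)
        with fitz.open("ocr_output.pdf") as doc:
            text = "".join(page.get_text() for page in doc)
    return text
```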
## Testing & Quality
```bash
# Run the full test suite
pytest

# Run with coverage
pytest --cov=core --cov=utils --cov-report=html

# Run a specific test file
pytest tests/test_metadata.py -v

# Type checking
mypy core/ utils/ --explicit-package-bases --ignore-missing-imports

# Linting
flake8 core/ utils/ tests/

# Security scanning
pip-audit --requirement requirements.txt
bandit -r core/ utils/ -ll
```
**CI/CD**: GitHub Actions runs all quality checks on Python 3.12 and 3.13:

- Linting (flake8)
- Type checking (mypy)
- Security scanning (pip-audit, bandit)
- Tests (pytest)
- Documentation checks
- Build verification
## License
MIT
## Download Files
### File details: `research_assistant_llm-0.2.0.tar.gz`
#### File metadata
- Download URL: research_assistant_llm-0.2.0.tar.gz
- Size: 84.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
#### File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `b1f193f0cc5b95dfd5e8d225f42e170d7cb8f072f0a760be5868cac704e94808` |
| MD5 | `505662e05e3a28aeea6502b6194d0d78` |
| BLAKE2b-256 | `955773b8b7634781f6ed516eb1a465b9694a313bd364acf7d38dc6fd0c431910` |
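To check a downloaded file against the SHA256 digest above, a minimal standard-library sketch:

```python
# Verify the sdist's SHA256 against the digest published above.
import hashlib

with open("research_assistant_llm-0.2.0.tar.gz", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

expected = "b1f193f0cc5b95dfd5e8d225f42e170d7cb8f072f0a760be5868cac704e94808"
print("OK" if digest == expected else "MISMATCH")
```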
#### Provenance
The following attestation bundles were made for `research_assistant_llm-0.2.0.tar.gz`:

Publisher: `publish.yml` on rexmirak/research_assistant

- Statement type: `https://in-toto.io/Statement/v1`
- Predicate type: `https://docs.pypi.org/attestations/publish/v1`
- Subject name: `research_assistant_llm-0.2.0.tar.gz`
- Subject digest: `b1f193f0cc5b95dfd5e8d225f42e170d7cb8f072f0a760be5868cac704e94808`
- Sigstore transparency entry: 747156931
- Permalink: `rexmirak/research_assistant@c421caf854803c4f52dd7c93ac1a45cecc2cb9be`
- Branch / Tag: `refs/tags/v0.2.0`
- Owner: https://github.com/rexmirak
- Access: public
- Token Issuer: `https://token.actions.githubusercontent.com`
- Runner Environment: `github-hosted`
- Publication workflow: `publish.yml@c421caf854803c4f52dd7c93ac1a45cecc2cb9be`
- Trigger Event: `release`
### File details: `research_assistant_llm-0.2.0-py3-none-any.whl`
#### File metadata
- Download URL: research_assistant_llm-0.2.0-py3-none-any.whl
- Size: 58.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
#### File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `eaccc4699dc2b698b4cba5023b3707a7eeabf2cf54182e9b45bda0301108b6bc` |
| MD5 | `feb0dbb0bb65ea65fb5a3257b1e2b361` |
| BLAKE2b-256 | `5e4baccaa75e4ec0e58911bbeabf6daf20a586d3d414c1331cffc15a5d4171f4` |
#### Provenance
The following attestation bundles were made for `research_assistant_llm-0.2.0-py3-none-any.whl`:

Publisher: `publish.yml` on rexmirak/research_assistant

- Statement type: `https://in-toto.io/Statement/v1`
- Predicate type: `https://docs.pypi.org/attestations/publish/v1`
- Subject name: `research_assistant_llm-0.2.0-py3-none-any.whl`
- Subject digest: `eaccc4699dc2b698b4cba5023b3707a7eeabf2cf54182e9b45bda0301108b6bc`
- Sigstore transparency entry: 747156934
- Permalink: `rexmirak/research_assistant@c421caf854803c4f52dd7c93ac1a45cecc2cb9be`
- Branch / Tag: `refs/tags/v0.2.0`
- Owner: https://github.com/rexmirak
- Access: public
- Token Issuer: `https://token.actions.githubusercontent.com`
- Runner Environment: `github-hosted`
- Publication workflow: `publish.yml@c421caf854803c4f52dd7c93ac1a45cecc2cb9be`
- Trigger Event: `release`