ValiRef
AI-Powered Citation Validation for Academic Papers
Features • Installation • Usage • How It Works • Benchmark
Overview
ValiRef is a tool for detecting hallucinated citations in academic papers. With the rise of AI-assisted writing, Large Language Models (LLMs) sometimes produce plausible-sounding but non-existent or misattributed references. ValiRef helps researchers, reviewers, and publishers verify the authenticity of citations in PDF documents.
What ValiRef Detects
| Hallucination Type | Description | Example |
|---|---|---|
| 🔮 Fabrication | Completely fake paper that doesn't exist | A paper with a convincing title but no actual publication |
| 👤 Attribution Error | Real paper, wrong authors | Citing "Attention is All You Need" by someone other than Vaswani et al. |
| 📄 Irrelevance | Real paper, but claim doesn't match content | Citing a paper about NLP for a claim about computer vision |
| 🔄 Counterfactual | Real paper, opposite conclusion | Claiming a paper supports X when it actually argues against X |
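If you consume ValiRef results programmatically, the four categories map naturally onto a small enum. The sketch below is illustrative only; ValiRef's actual schema (likely in src/bench/schema.py) may label them differently.

```python
# Hypothetical labels for the four hallucination types above; not ValiRef's
# actual schema.
from enum import Enum

class HallucinationType(str, Enum):
    FABRICATION = "fabrication"              # paper does not exist
    ATTRIBUTION_ERROR = "attribution_error"  # real paper, wrong authors
    IRRELEVANCE = "irrelevance"              # real paper, claim doesn't match
    COUNTERFACTUAL = "counterfactual"        # real paper, opposite conclusion
```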
Features
- 🔍 Multi-Source Verification - Cross-references citations against ArXiv, Google Scholar, Semantic Scholar, OpenReview, OpenAlex, and DuckDuckGo
- 🤖 AI-Powered Detection - Uses DeepSeek LLM with ReAct reasoning to analyze search results
- ⚡ Async-First Architecture - Concurrent validation of multiple references for optimal performance
- 📊 Rich CLI Output - Beautiful terminal interface with progress bars, real-time metrics, and detailed reports
- 📈 Benchmark Suite - Built-in dataset generation and evaluation framework
- 🛡️ Resilient API Handling - Token bucket rate limiting + circuit breaker pattern for reliable external API calls
- 🎯 High Accuracy - 72%+ accuracy on 100-sample benchmark with confidence scoring and detailed reasoning
Installation
Prerequisites
- Python 3.12 or higher
- uv package manager (recommended) or pip
Install from PyPI (Recommended)
```bash
pip install valiref
```
Install from Source
```bash
# Clone the repository
git clone https://github.com/Gianthard-cyh/ValiRef.git
cd ValiRef

# Install dependencies
uv sync

# Set up environment variables
cp .env.example .env
# Edit .env and add your DeepSeek API key
```
Environment Configuration
Create a .env file with your API keys:
```
DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional: for enhanced search capabilities
SERPAPI_API_KEY=your_serpapi_key
SEMANTIC_SCHOLAR_API_KEY=your_semantic_scholar_key

# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=false
LANGCHAIN_API_KEY=your_langchain_key
LANGCHAIN_PROJECT=ValiRef
```
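If you script against the package instead of the CLI, these keys load the standard python-dotenv way. A generic sketch, not ValiRef's own startup code:

```python
# Generic .env loading with python-dotenv; ValiRef's actual configuration
# lives in src/core/config.py and may load these differently.
import os

from dotenv import load_dotenv

load_dotenv()  # loads values from ./.env into the process environment
api_key = os.environ["DEEPSEEK_API_KEY"]  # KeyError if the key is missing
```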
Usage
Validate References in a PDF
```bash
# Basic usage
uv run python -m src.cli validate paper.pdf

# With concurrent workers (default: 5)
uv run python -m src.cli validate paper.pdf --workers 10

# Output as JSON
uv run python -m src.cli validate paper.pdf --json

# Enable verbose logging
uv run python -m src.cli validate paper.pdf --verbose
```
Example Output
```
Validation Summary for paper.pdf
Total References: 12
Validated: 12
Duration: 15.34s

┌─────────────────────────────────────────────────────────────────────┐
│ ✅ Reference #1 - REAL REFERENCE                                     │
├─────────────────────────────────────────────────────────────────────┤
│ Title: Attention Is All You Need                                    │
│ Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, et al.          │
│ Confidence: 0.98                                                    │
│                                                                     │
│ Reasoning:                                                          │
│ Found exact match on ArXiv (arxiv.org/abs/1706.03762). Title,       │
│ authors, and venue (NIPS 2017) all match the citation.              │
│                                                                     │
│ Evidence / Sources:                                                 │
│ - https://arxiv.org/abs/1706.03762                                  │
└─────────────────────────────────────────────────────────────────────┘
```
How It Works
ValiRef employs a sophisticated multi-step validation pipeline:
```
┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐
│  PDF Input  │ →  │   Extract    │ →  │    Search    │ →  │  Validate   │
│             │    │  References  │    │ Multi-Source │    │  with LLM   │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘
                                                                  │
                                                                  ▼
                                                           ┌─────────────┐
                                                           │   Report    │
                                                           │   Results   │
                                                           └─────────────┘
```
1. Reference Extraction
- Parses PDF documents using PyMuPDF
- Uses an LLM to extract structured reference data from bibliography sections
- Handles various citation formats (APA, MLA, Chicago, etc.); see the extraction sketch below
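As a rough illustration of this step, raw text extraction with PyMuPDF can look like the following. The tail-slice heuristic and the default limit are assumptions tied to the EXTRACTION_CHAR_LIMIT setting; the LLM-based structuring of the bibliography is ValiRef-specific and not shown.

```python
# Raw text extraction with PyMuPDF; ValiRef's extract.py layers LLM-based
# reference parsing on top of text like this.
import fitz  # PyMuPDF


def extract_tail_text(pdf_path: str, char_limit: int = 20000) -> str:
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    doc.close()
    # Heuristic: bibliographies usually sit at the end of a paper.
    return text[-char_limit:]
```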
2. Multi-Source Search
Simultaneously queries multiple academic databases (see the fan-out sketch after this list):
- ArXiv - Preprint server with full-text access
- Google Scholar - Broad academic search
- Semantic Scholar - AI-powered academic search
- OpenReview - Peer-reviewed conference papers
- OpenAlex - Open academic graph
- DuckDuckGo - Web search fallback
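The fan-out itself is plain asyncio. A minimal sketch, with stub coroutines standing in for ValiRef's real search tools:

```python
# Illustrative concurrent fan-out across sources with asyncio.gather; the two
# search coroutines are stubs, not ValiRef's actual tools.
import asyncio


async def search_arxiv(query: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for a real HTTP call
    return [f"arxiv hit for {query!r}"]


async def search_semantic_scholar(query: str) -> list[str]:
    await asyncio.sleep(0.1)
    return [f"semantic scholar hit for {query!r}"]


async def search_all(query: str) -> dict[str, list[str] | BaseException]:
    sources = {
        "arxiv": search_arxiv,
        "semantic_scholar": search_semantic_scholar,
    }
    # return_exceptions=True means one failing source cannot sink the others.
    results = await asyncio.gather(
        *(fn(query) for fn in sources.values()), return_exceptions=True
    )
    return dict(zip(sources, results))


if __name__ == "__main__":
    print(asyncio.run(search_all("Attention Is All You Need")))
```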
3. AI Validation
The HallucinationDetector uses a ReAct (Reasoning + Acting) agent powered by the DeepSeek LLM (a wiring sketch follows this list):
- Analyzes search results from all sources
- Compares paper metadata (title, authors, abstract, venue)
- Evaluates claims against actual paper content
- Provides confidence scores with detailed reasoning
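One plausible wiring with LangChain and LangGraph is sketched below. The model name comes from this README; the tool, prompt, and agent setup are illustrative stand-ins rather than ValiRef's actual detector code.

```python
# An illustrative ReAct validation agent built with LangChain + LangGraph;
# not ValiRef's actual detector implementation.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def search_arxiv(query: str) -> str:
    """Search ArXiv for papers matching the query."""
    return "stub result"  # a real tool would call the ArXiv API here


llm = ChatOpenAI(
    model="deepseek-chat",
    base_url="https://api.deepseek.com",  # DeepSeek's OpenAI-compatible endpoint
    api_key="your_deepseek_api_key_here",
    temperature=0.1,  # low temperature for consistent reasoning
)
agent = create_react_agent(llm, [search_arxiv])
result = agent.invoke(
    {"messages": [("user", "Verify: 'Attention Is All You Need', Vaswani et al., NIPS 2017")]}
)
```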
Resilient API Architecture
ValiRef implements a production-grade resilience layer for external API calls:
```
┌─────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ SearchTool  │────▶│ ToolRequestQueue│────▶│  Token Bucket   │
│ (per source)│     │ (rate limiter)  │     │ (smooth flow)   │
└─────────────┘     └─────────────────┘     └─────────────────┘
                                                     │
                                                     ▼
                                            ┌─────────────────┐
                                            │ Circuit Breaker │
                                            │ (fail-fast for  │
                                            │  unhealthy APIs)│
                                            └─────────────────┘
```
Features:
- Token Bucket Rate Limiting - Smooth request flow with configurable burst capacity per source
- Circuit Breaker Pattern - Automatically stops requests to failing services (3 failures → OPEN, 15s recovery timeout)
- Real-time Metrics - Live display of API call statistics, active requests, and circuit states
- Graceful Degradation - Failed sources are marked unavailable but don't block other sources (see the sketch below)
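Both mechanisms are standard patterns. Here is a minimal, self-contained sketch using the thresholds quoted above (3 failures → OPEN, 15 s recovery); the class names are illustrative, not ValiRef's actual code.

```python
# Minimal token bucket + circuit breaker, illustrating the resilience layer
# described above; thresholds match the README, everything else is a sketch.
import asyncio
import time


class TokenBucket:
    """Allows short bursts while enforcing a steady average request rate."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # burst capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep((1 - self.tokens) / self.rate)


class CircuitBreaker:
    """Fails fast once a source looks unhealthy, then retries after a timeout."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 15.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # CLOSED: traffic flows normally
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.opened_at = None  # HALF-OPEN: let a probe request through
            self.failures = 0
            return True
        return False  # OPEN: fail fast

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
```

Each source gets its own bucket and breaker, so a rate-limited or failing API slows or disables only itself.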
Benchmark
ValiRef includes a comprehensive benchmark suite for evaluating hallucination detection performance.
Performance Results
On a 100-sample mixed dataset:
| Metric | Value |
|---|---|
| Accuracy | 72.0% |
| Precision | 1.0000 |
| Recall | 0.2800 (Counterfactual) / 1.0000 (Fabrication) |
| F1 Score | 0.4375 (Counterfactual) / 1.0000 (Fabrication) |
| Throughput | ~0.09 samples/sec |
| Duration | ~18 min (100 samples) |
Per-Type Performance
| Hallucination Type | Accuracy | Precision | Recall | F1 Score | Samples |
|---|---|---|---|---|---|
| Fabrication | 100% | 1.0000 | 1.0000 | 1.0000 | 19 |
| AttributionError | 100% | 1.0000 | 1.0000 | 1.0000 | 19 |
| Irrelevance | 74% | 1.0000 | 0.7368 | 0.8485 | 19 |
| Counterfactual | 28% | 1.0000 | 0.2800 | 0.4375 | 25 |
| Real Papers | 72% | 0.0000 | 0.0000 | 0.0000 | 18 |
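These are standard binary-classification metrics with "hallucinated" as the positive class, which is presumably why the Real Papers row, containing no positive ground-truth labels, reports zeros. For reference, a minimal computation with scikit-learn (not the benchmark suite's actual code):

```python
# Illustrative per-type metric computation, treating "hallucinated" as the
# positive class; toy labels, not benchmark data.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 1, 0, 0]  # 1 = hallucinated, 0 = real
y_pred = [1, 1, 0, 0, 1]  # detector output
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```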
Generate Benchmark Dataset
```bash
uv run python scripts/generate_dataset.py \
  --topic cs.CL \
  --count 1000 \
  --output data/dataset.csv
```
Dataset Composition
The benchmark dataset combines real ArXiv papers with synthetic hallucinations:
| Category | Description | Percentage |
|---|---|---|
| Real | Genuine papers from ArXiv | 50% |
| Fabrication | AI-generated fake papers | 12.5% |
| Attribution Error | Real papers with wrong authors | 12.5% |
| Irrelevance | Real papers with mismatched claims | 12.5% |
| Counterfactual | Real papers with inverted claims | 12.5% |
Running Tests
```bash
# Run unit tests (fast, no external APIs)
uv run pytest

# Run integration tests (slow, requires API keys)
uv run pytest -m integration

# Run a specific test
uv run pytest tests/core/test_tools.py -v
```
Architecture
```
valiref/
├── src/
│   ├── cli.py              # Typer-based CLI interface
│   ├── cli_callbacks.py    # Progress callbacks and Live display
│   ├── core/               # Core validation engine
│   │   ├── pipeline.py     # Async validation orchestration
│   │   ├── detector.py     # LLM-based hallucination detection
│   │   ├── extract.py      # PDF/text extraction
│   │   ├── tools.py        # Academic search tools with rate limiting
│   │   ├── search_queue.py # Token bucket + circuit breaker
│   │   ├── tool_monitor.py # Real-time metrics via blinker signals
│   │   ├── config.py       # Configuration management
│   │   └── logger.py       # Rich-based logging
│   ├── bench/              # Benchmark framework
│   │   ├── crawler.py      # ArXiv paper crawler
│   │   ├── dataset.py      # Hallucination injection
│   │   ├── bench.py        # Benchmark runner with live metrics
│   │   └── schema.py       # Pydantic data models
│   └── api/                # API interface (future)
├── scripts/
│   └── generate_dataset.py # Dataset generation script
├── tests/                  # Test suite
└── data/                   # Benchmark datasets
```
Configuration
Key settings in src/core/config.py:
| Setting | Default | Description |
|---|---|---|
| LLM_MODEL | deepseek-chat | LLM for validation |
| LLM_TEMPERATURE | 0.7 | Creativity vs. determinism |
| DETECTOR_TEMPERATURE | 0.1 | Lower for consistent reasoning |
| EXTRACTION_CHAR_LIMIT | 20000 | Max chars from PDF references |
| MAX_WORKERS | 5 | Concurrent validation threads |
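For orientation, settings of this shape are often declared with pydantic-settings so that environment variables override the defaults. The field names and defaults below come from the table; the wiring itself is an assumption, not ValiRef's actual config.py.

```python
# Hypothetical shape of the settings; field names and defaults mirror the
# table above, the pydantic-settings wiring is an assumption.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    LLM_MODEL: str = "deepseek-chat"
    LLM_TEMPERATURE: float = 0.7
    DETECTOR_TEMPERATURE: float = 0.1
    EXTRACTION_CHAR_LIMIT: int = 20000
    MAX_WORKERS: int = 5


settings = Settings()  # matching environment variables override the defaults
```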
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Development Setup
```bash
# Install dev dependencies
uv sync --dev

# Run linting
uv run ruff check .
uv run ruff format .

# Run tests
uv run pytest
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with LangChain and LangGraph
- Powered by DeepSeek LLM
- Academic search via ArXiv, Semantic Scholar, OpenReview, and OpenAlex
- CLI powered by Typer and Rich
Built with ❤️ for the research community