Extracts citations from PDFs, URLs, and local media files as CSL-JSON.


🔍 Citation Extractor

We're living in an era where AI can write beautifully, but can't cite properly.
Because every claim deserves a source, and every source deserves proper citation.


Python 3.12+ | License: MIT

🚨 Why This Matters

Large Language Models (LLMs) like ChatGPT, Claude, and Gemini are incredible at generating human-like text, but they have a fundamental flaw: they lack reliable citation mechanisms. When an LLM tells you about a scientific study, historical event, or technical concept, you're left wondering:

📚 Where did this information come from?
🔍 How can I verify these claims?
📝 How do I properly cite this in my research?

This creates a trust gap that undermines the reliability of AI-generated content, especially in academic, professional, and research contexts.

Citation Extractor exists to fill this gap.

While LLMs struggle with proper citations, this tool extracts structured, verifiable citation data from documents, URLs, and media files. It supplies the missing piece that lets AI-assisted writing be checked against its sources and cited properly.

🌟 Features

🎯 Universal Source Support

📄 Document Versatility: Handles .pdf, .docx, .djvu, .epub, and more.
🌐 Web & Media: Extracts citations directly from URLs and media files (.mp4, .mp3).

🧠 AI-Powered Intelligence

Smart Document Classification: Automatically detects if a source is a book, journal article, thesis, or chapter.
Advanced, Multilingual OCR: Accurately processes scanned documents, including those with vertical text layouts (e.g., Chinese, Japanese).
Smarter Language Detection: Intelligently skips blank cover pages to find the first page with text, ensuring the correct language is used for OCR.
Automatic OCR Error Correction: Proactively fixes common OCR mistakes (e.g., 郭庆沙 → 郭庆藩) before extraction for higher accuracy.
Flexible LLM Backend: Works with Ollama (local) or cloud APIs (Gemini, OpenAI).
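
The language-detection and OCR-correction bullets above can be pictured with a small, dependency-free sketch. It is not the tool's implementation: `first_text_page`, `apply_known_corrections`, the `min_chars` threshold, and the correction table are hypothetical names chosen for illustration, and the real tool may well rely on the LLM rather than a lookup table.

```python
# Illustrative sketch only; names and thresholds are assumptions, not the tool's API.

# Hypothetical table of known OCR confusions (the pair comes from the example above).
KNOWN_CORRECTIONS = {"郭庆沙": "郭庆藩"}


def first_text_page(page_texts, min_chars=5):
    """Return the index of the first page with at least `min_chars`
    non-whitespace characters, or None if every page looks blank."""
    for index, text in enumerate(page_texts):
        visible = "".join(text.split())
        if len(visible) >= min_chars:
            return index
    return None


def apply_known_corrections(text, corrections=KNOWN_CORRECTIONS):
    """Replace each known OCR misreading with its corrected form."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text


# Sample OCR output: two blank cover pages, then a title page with a misread name.
pages = ["", "   \n", "庄子集释 [清] 郭庆沙 撰"]
page_index = first_text_page(pages)
cleaned = apply_known_corrections(pages[page_index])
print(page_index, cleaned)
```

Language detection would then run on the page at `page_index` rather than on the blank cover.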

📚 Research-Grade Output

CSL-JSON Standard: Compatible with Zotero, Mendeley, and other reference managers that import CSL-JSON.
Multiple Citation Styles: Instantly format in Chicago, APA, MLA, or any other CSL style.
Rich, Structured Metadata: Captures author, title, date, DOI, ISBN, and even complex author details like historical dynasties ([清]).
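
For readers new to CSL-JSON, the sketch below shows roughly what a record for a journal article looks like and how to serialize it. The field names (`type`, `title`, `author`, `issued`, `DOI`) follow the CSL-JSON schema; the values are made-up placeholders, and the exact set of fields the tool emits is not specified here.

```python
import json

# A hand-written CSL-JSON item; values are placeholders, not real tool output.
csl_item = {
    "id": "example-2023",
    "type": "article-journal",
    "title": "An Example Article Title",
    "author": [{"family": "Doe", "given": "Jane"}],
    "container-title": "Journal of Examples",
    "issued": {"date-parts": [[2023, 10, 4]]},
    "DOI": "10.1234/example.doi",
}

# Reference managers generally import a JSON array of items.
csl_json = json.dumps([csl_item], ensure_ascii=False, indent=2)
print(csl_json)
```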

⚡ Optimized Performance

Smart Page Selection: Processes only the most relevant pages for speed.
Iterative Extraction: Stops as soon as all essential citation fields are found.
Batch Processing: Handles multiple documents efficiently.
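
The early-stopping idea behind iterative extraction can be sketched in a few lines. This is an illustration under assumptions, not the tool's code: the set of essential fields, the `extract_iteratively` helper, and the toy per-page extractor standing in for an LLM call are all hypothetical.

```python
# Illustrative sketch of early-stopping extraction; not the tool's actual code.

# Assumed set of "essential" fields; the real tool may use a different list.
ESSENTIAL_FIELDS = frozenset({"title", "author", "issued"})


def extract_iteratively(pages, extract_fields, essential=ESSENTIAL_FIELDS):
    """Merge fields page by page and stop once every essential field is found.

    `extract_fields` is a caller-supplied function (hypothetical here) that
    maps one page of text to a dict of citation fields.
    Returns the merged fields and the number of pages actually processed.
    """
    found = {}
    processed = 0
    for page in pages:
        processed += 1
        for key, value in extract_fields(page).items():
            found.setdefault(key, value)  # keep the first value seen
        if essential.issubset(found):
            break
    return found, processed


def toy_extractor(page):
    """Stand-in for an LLM call: parses 'key: value' lines."""
    fields = {}
    for line in page.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


pages = ["title: Example Book", "author: Jane Doe\nissued: 2020", "publisher: Unused"]
fields, pages_read = extract_iteratively(pages, toy_extractor)
print(fields, pages_read)
```

The third page is never read because the essential fields are complete after the second, which is where the time saving comes from on long documents.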

🚀 Quick Start

Installation

pip install cite-extractor

System Dependencies

# Ubuntu/Debian
sudo apt-get install tesseract-ocr mediainfo

# macOS
brew install tesseract mediainfo

# For local LLM support (optional)
# Install Ollama: https://ollama.ai/
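
Before a first run, it can help to confirm that the system tools installed above are on the PATH. The following is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper name, and it assumes the executables are called `tesseract` and `mediainfo`, as in the install commands.

```python
import shutil


def missing_tools(tools=("tesseract", "mediainfo")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("All system dependencies found.")
```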

First Citation

# Extract from a PDF
citation "path/to/research-paper.pdf"

# Extract from a URL
citation "https://www.nature.com/articles/s41586-023-06627-7"

# Extract from a document with vertical text
citation "path/to/vertical-text-document.pdf" --text-direction vertical

📖 Usage

Command Line Interface

# Basic usage
citation "document.pdf"

# Specify document type
citation "thesis.pdf" --type thesis

# Use different LLM
citation "paper.pdf" --llm gemini/gemini-1.5-flash

# Custom output directory
citation "book.pdf" --output-dir ./citations

# Specific page range for large documents
citation "book.pdf" --page-range "1-5, -3"

# Different citation style
citation "article.pdf" --citation-style apa
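
The `--page-range` value `"1-5, -3"` is not explained above. One plausible reading, and it is only an assumption, is "pages 1 through 5 plus the last 3 pages". The sketch below implements that reading; `parse_page_range` is a hypothetical helper, not part of the tool.

```python
def parse_page_range(spec, total_pages):
    """Expand a spec such as "1-5, -3" into sorted 1-based page numbers.

    Assumed semantics (not confirmed by the tool's documentation):
      "A-B" -> pages A through B inclusive
      "N"   -> the single page N
      "-N"  -> the last N pages of the document
    Pages outside 1..total_pages are dropped.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("-"):
            count = int(part[1:])
            start, end = total_pages - count + 1, total_pages
        elif "-" in part:
            first, last = part.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(part)
        pages.update(p for p in range(start, end + 1) if 1 <= p <= total_pages)
    return sorted(pages)


print(parse_page_range("1-5, -3", total_pages=200))
# [1, 2, 3, 4, 5, 198, 199, 200]
```

Front matter and the closing pages are where title pages and colophons usually sit, which would explain why a range of this shape is useful for books.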

Python API

from citation.main import CitationExtractor
from citation.citation_style import format_bibliography

# Initialize with your preferred LLM
extractor = CitationExtractor(llm_model="ollama/qwen3")

# Extract citation data
csl_data = extractor.extract_citation("research-paper.pdf")

if csl_data:
    # Format as bibliography
    bibliography, in_text = format_bibliography([csl_data], "chicago-author-date")
    
    print("📚 Bibliography:")
    print(bibliography)
    
    print("\n📝 In-text citation:")
    print(in_text)
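
To process several files with the same extractor, a loop like the following may be enough. It is a sketch under assumptions: it relies only on the `extract_citation` method shown above, presumes the result is a JSON-serializable dict (or a falsy value on failure), and the `extract_batch` helper and its output file naming are this example's own invention.

```python
import json
from pathlib import Path


def extract_batch(extractor, sources, output_dir):
    """Run `extractor.extract_citation` on each source and save the results.

    `extractor` is any object with an `extract_citation(path)` method that
    returns a CSL-JSON dict or a falsy value, as in the example above.
    Returns a dict mapping each source to its output file, or None on failure.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = {}
    for source in sources:
        csl_data = extractor.extract_citation(source)
        if not csl_data:
            results[source] = None  # nothing could be extracted
            continue
        target = out / (Path(source).stem + ".json")
        target.write_text(
            json.dumps(csl_data, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        results[source] = target
    return results
```

With the package installed, `extract_batch(CitationExtractor(llm_model="ollama/qwen3"), ["a.pdf", "b.pdf"], "./citations")` would process both files. The files written here may differ from what the CLI's `--output-dir` option produces.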

Advanced Configuration

# For non-English documents, let the tool auto-detect the language
citation "chinese-paper.pdf" --lang auto

# Or specify manually
citation "another-paper.pdf" --lang chi_sim+eng

# Verbose output for debugging
citation "document.pdf" --verbose
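
When the CLI is driven from a script, the options above can be assembled programmatically. The helper below is a hypothetical convenience, not part of the package: it only turns keyword arguments into flags and does not check them against what the `citation` command actually accepts.

```python
import shlex


def build_citation_command(source, **options):
    """Build an argv list for the `citation` CLI.

    Keyword names are turned into flags (`output_dir` -> `--output-dir`);
    a value of True emits the bare flag, as for `--verbose`.
    No validation is done against the flags the CLI actually accepts.
    """
    argv = ["citation", source]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv


argv = build_citation_command("chinese-paper.pdf", lang="chi_sim+eng", verbose=True)
print(shlex.join(argv))
# To execute: subprocess.run(argv, check=True)
```

The `chi_sim+eng` value follows Tesseract's convention of joining language codes with `+`, so the same helper works for other language combinations.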

🤝 Contributing

We're thrilled to have you join this mission! 🎉

This project addresses a fundamental need in our AI-driven world, and we believe it can make a real difference in how we handle information credibility. Whether you're a developer, researcher, or just someone who cares about proper attribution, there's a place for you here.

🚀 How to Contribute

🐛 Report Issues: File a bug report or a feature request
💡 Suggest Improvements: Share ideas for better citation extraction
🔧 Submit Code: Bug fixes, new features, or optimizations
📚 Improve Documentation: Help others understand and use the tool
🌍 Add Language Support: Extend OCR and extraction to new languages
🎨 Citation Styles: Add support for more academic citation styles

💻 Development Setup

git clone https://github.com/your-username/citation-extractor.git
cd citation-extractor

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .

🏆 Acknowledgments

This project stands on the shoulders of giants:

DSPy: For flexible LLM integration
Tesseract: For OCR capabilities
citeproc-py: For citation formatting
The Open Source Community: For making tools like this possible

📄 License

MIT License - feel free to use this in your projects, commercial or otherwise.

Made with ❤️ for the research community
Because every claim deserves a source, and every source deserves respect.

⭐ Star this repo if you find it useful! ⭐

Release history

0.10.8: Aug 3, 2025 (this version)
0.10.7: Jul 26, 2025
0.10.6: Jul 25, 2025
0.10.5: Jul 25, 2025
0.10.4: Jul 21, 2025
0.10.3: Jul 21, 2025
0.10.2: Jul 20, 2025
0.10.1: Jul 19, 2025
0.10.0: Jul 18, 2025

Download files

Source Distribution: cite_extractor-0.10.8.tar.gz (1.6 MB), uploaded Aug 3, 2025
Built Distribution: cite_extractor-0.10.8-py3-none-any.whl (74.2 kB), uploaded Aug 3, 2025, Python 3

File details

Details for the file cite_extractor-0.10.8.tar.gz.

File metadata

Download URL: cite_extractor-0.10.8.tar.gz
Upload date: Aug 3, 2025
Size: 1.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.13.1

File hashes

Hashes for cite_extractor-0.10.8.tar.gz
Algorithm	Hash digest
SHA256	`58f9ebdfb0adaf99f7c7771b1cd580f546925dc543ab842d0ae3a9cbcf41eb02`
MD5	`95e87c07309de339518a28511a8c6d95`
BLAKE2b-256	`08ec423ff6a4fae48ae7cb2f370f0bd472038c4b8baadd06b1df7cec6cb21892`


File details

Details for the file cite_extractor-0.10.8-py3-none-any.whl.

File metadata

Download URL: cite_extractor-0.10.8-py3-none-any.whl
Upload date: Aug 3, 2025
Size: 74.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.13.1

File hashes

Hashes for cite_extractor-0.10.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`61946d67f5ea987e1dda88d4b6231f4ac6adb5119eb5eee608704ae529be8499`
MD5	`bd9fa91a069d3d5926f6a2a27a8e7ae9`
BLAKE2b-256	`96c183e4e4b26ac7622291a7254c2e848ac2e9402fe27b5a0753e0398d2331cc`
