AI-powered scientific PDF extraction using Ollama
scixtract
Transform your academic PDFs into structured, searchable knowledge with cutting-edge AI
A library for extracting text from academic PDFs with AI, with knowledge tracking and search built in. It is optimized for scientific literature: it preserves chemical formulas, maintains citation integrity, and classifies content by section type.
Why scixtract?
- AI-First Approach: Leverages local Ollama models for privacy-preserving extraction
- Science-Optimized: Preserves chemical formulas, equations, and academic formatting
- Knowledge Graphs: Builds searchable networks of concepts and relationships
- High Performance: Batch processing with 95%+ test coverage
- Privacy-Focused: All processing happens locally - no data leaves your machine
Table of Contents
- Why scixtract?
- Quick Start
- Installation
- Features
- Usage
- API Reference
- Examples
- Contributing
- License
Installation
From PyPI (recommended)
pip install scixtract
From Source
git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e .
Development Installation
git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e ".[dev]"
Quick Start
Get up and running in under 5 minutes!
1. Setup Ollama (AI Engine)
# Install Ollama (macOS)
brew install ollama
# Start Ollama service
ollama serve &
# Install recommended model
ollama pull qwen2.5:32b-instruct-q4_K_M
# Or use the setup helper
scixtract-setup-ollama
2. Extract PDF with AI
# Basic extraction
scixtract extract paper.pdf
# With specific model
scixtract extract paper.pdf --model qwen2.5:32b-instruct-q4_K_M
# With bibliography integration
scixtract extract paper.pdf --bib-file references.bib --update-knowledge
3. Search Knowledge Base
# Search for keywords
scixtract knowledge --search "catalysis"
# Find related concepts
scixtract knowledge --related "ammonia"
# View statistics
scixtract knowledge --stats
Features
AI-Powered Processing
- Multi-pass analysis with keyword extraction → classification → enhancement
- Intelligent text fixing that preserves chemical formulas and citations
- Content classification (abstract, methods, results, discussion, etc.)
- Advanced prompting strategies optimized for academic papers
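The multi-pass flow listed above can be sketched as three calls per page. This is only an illustration of the ordering, not the library's internal implementation: `AIProcessor`, `PageAnalysis`, `analyze_pages`, and `OfflineStub` are hypothetical names, and the `str` return type of `classify_content_type` is an assumption. Only the three method names and the `technical_keywords` key come from this README.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


class AIProcessor(Protocol):
    """The three OllamaAIProcessor methods shown in this README."""

    def extract_keywords_and_concepts(self, text: str) -> dict[str, Any]: ...
    def classify_content_type(self, text: str, page_num: int, total_pages: int) -> str: ...
    def fix_text_spacing(self, text: str) -> str: ...


@dataclass
class PageAnalysis:
    """Hypothetical per-page record; not a scixtract class."""
    page_num: int
    content_type: str
    text: str
    keywords: list[str] = field(default_factory=list)


def analyze_pages(ai: AIProcessor, pages: list[str]) -> list[PageAnalysis]:
    """Run keyword extraction, classification, then text enhancement on each page."""
    total = len(pages)
    analyses = []
    for page_num, raw in enumerate(pages, start=1):
        found = ai.extract_keywords_and_concepts(raw)                  # pass 1: keywords
        content_type = ai.classify_content_type(raw, page_num, total)  # pass 2: classification
        cleaned = ai.fix_text_spacing(raw)                             # pass 3: enhancement
        keywords = list(found.get("technical_keywords", []))
        analyses.append(PageAnalysis(page_num, content_type, cleaned, keywords))
    return analyses


class OfflineStub:
    """Stand-in processor so the flow can be tried without an Ollama server."""

    def extract_keywords_and_concepts(self, text: str) -> dict[str, Any]:
        return {"technical_keywords": text.lower().split()[:3]}

    def classify_content_type(self, text: str, page_num: int, total_pages: int) -> str:
        return "abstract" if page_num == 1 else "body"

    def fix_text_spacing(self, text: str) -> str:
        return " ".join(text.split())
```

Passing a real `OllamaAIProcessor` instead of `OfflineStub` should work as long as its methods behave as shown in the Advanced Processing example.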
Knowledge Management
- SQLite database for fast, searchable knowledge indexing
- Cross-document concept networks and relationship mapping
- Author tracking and citation networks
- Knowledge graph export for visualization
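To make the idea of a SQLite-backed keyword index concrete, here is a minimal sketch using only Python's built-in `sqlite3` module. The table layout and the helper names (`open_index`, `add_keyword`, `search_keywords`) are assumptions for illustration; the actual scixtract schema is not documented here and may differ.

```python
import sqlite3

# Hypothetical single-table layout: one row per keyword occurrence.
SCHEMA = """
CREATE TABLE IF NOT EXISTS keywords (
    cite_key TEXT NOT NULL,
    keyword  TEXT NOT NULL,
    context  TEXT NOT NULL
)
"""


def open_index(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the index; the default keeps it in memory."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_keyword(conn: sqlite3.Connection, cite_key: str, keyword: str, context: str) -> None:
    """Store a keyword in lowercase together with the text around it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO keywords VALUES (?, ?, ?)",
            (cite_key, keyword.lower(), context),
        )


def search_keywords(conn: sqlite3.Connection, query: str, limit: int = 20) -> list[dict]:
    """Return rows whose keyword contains the query as a substring."""
    rows = conn.execute(
        "SELECT cite_key, keyword, context FROM keywords "
        "WHERE keyword LIKE ? ORDER BY cite_key LIMIT ?",
        (f"%{query.lower()}%", limit),
    ).fetchall()
    return [{"cite_key": r[0], "keyword": r[1], "context": r[2]} for r in rows]
```

The result dictionaries use the `cite_key` and `context` keys that appear in the Knowledge Tracking example, so the two can be read side by side.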
Academic Optimization
- Chemical formula preservation (NOₓ, NH₃, H₂O, etc.)
- Citation integrity maintenance
- Bibliography integration from BibTeX files
- Research context linking between processed content and bibliography
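One way to spot-check formula preservation after an extraction is to compare the subscripted tokens in the source text with those in the processed text. The sketch below is an independent check, not part of scixtract; `formulas_in` and `missing_formulas` are hypothetical helpers, and the check only recognizes formulas written with Unicode subscripts.

```python
import re

SUBSCRIPTS = set("₀₁₂₃₄₅₆₇₈₉ₓ")
# Runs of ASCII letters and Unicode subscript characters.
TOKEN = re.compile(r"[A-Za-z₀-₉ₓ]+")


def formulas_in(text: str) -> set[str]:
    """Collect tokens that contain at least one subscript character."""
    return {tok for tok in TOKEN.findall(text) if SUBSCRIPTS & set(tok)}


def missing_formulas(original: str, processed: str) -> set[str]:
    """Formulas present in the original text but absent after processing."""
    return formulas_in(original) - formulas_in(processed)
```

An empty result from `missing_formulas` suggests every subscripted formula survived; a non-empty one lists the formulas worth inspecting by hand.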
Multiple Output Formats
- Structured JSON with comprehensive metadata
- Enhanced Markdown with AI-generated summaries
- Keyword indices for fast searching
- Knowledge graphs for visualization
Professional Tools
- Command-line interface for batch processing
- Python API for integration
- Comprehensive testing with 95%+ coverage
- Type hints throughout
Usage
Command Line Interface
PDF Extraction
# Basic extraction
scixtract extract paper.pdf
# Advanced options
scixtract extract paper.pdf \
--model qwen2.5:32b-instruct-q4_K_M \
--output-dir ./extractions \
--bib-file references.bib \
--update-knowledge
# Batch processing
for pdf in papers/*.pdf; do
  scixtract extract "$pdf" --update-knowledge
done
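The same batch loop can be driven from Python when you want a list of the files that failed. This is a sketch under one assumption that the README does not confirm: that `scixtract extract` exits with a non-zero status on failure. `build_extract_command` and `extract_all` are hypothetical helpers; the flags are the ones shown above.

```python
import subprocess
from pathlib import Path


def build_extract_command(pdf, model=None, output_dir=None, update_knowledge=True):
    """Assemble the argument list for one `scixtract extract` call."""
    cmd = ["scixtract", "extract", str(pdf)]
    if model:
        cmd += ["--model", model]
    if output_dir:
        cmd += ["--output-dir", str(output_dir)]
    if update_knowledge:
        cmd.append("--update-knowledge")
    return cmd


def extract_all(folder, **options):
    """Run extraction for every PDF in `folder`; return the PDFs whose command failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(build_extract_command(pdf, **options))
        if result.returncode != 0:  # assumption: non-zero exit code signals failure
            failed.append(pdf)
    return failed
```

For example, `extract_all("papers", model="llama3.2")` would process each PDF in turn and return the ones to retry.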
Knowledge Management
# Search for concepts
scixtract knowledge --search "electrochemical conversion"
# Find related concepts
scixtract knowledge --related "NOx reduction"
# Export knowledge graph
scixtract knowledge --export-graph knowledge_graph.json
# View database statistics
scixtract knowledge --stats
Ollama Setup
# Check Ollama status
scixtract-setup-ollama --check-only
# List available models
scixtract-setup-ollama --list-models
# Complete setup with recommended model
scixtract-setup-ollama --model qwen2.5:32b-instruct-q4_K_M
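If you prefer to check from Python whether a model is already pulled, Ollama's HTTP API lists local models at `/api/tags`. The sketch below queries that endpoint directly with the standard library; it is independent of `scixtract-setup-ollama`, and `normalize`, `model_names`, and `is_model_installed` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def normalize(model: str) -> str:
    """Ollama lists untagged models with a ':latest' suffix."""
    return model if ":" in model else f"{model}:latest"


def model_names(tags_payload: dict) -> list[str]:
    """Pull the model names out of an /api/tags response body."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def is_model_installed(model: str, base_url: str = "http://localhost:11434",
                       timeout: float = 5.0) -> bool:
    """Return True if the local Ollama server reports the model as available."""
    try:
        with urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        return False  # server unreachable or reply was not valid JSON
    return normalize(model) in model_names(payload)
```

A `False` result does not distinguish "server not running" from "model not pulled"; run `ollama serve` first if in doubt.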
Python API
Basic Usage
from scixtract import AdvancedPDFProcessor
from pathlib import Path
# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen2.5:32b-instruct-q4_K_M",
    bib_file=Path("references.bib")
)
# Process PDF
result = processor.process_pdf(Path("paper.pdf"))
# Access results
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")
print(f"Keywords: {', '.join(result.all_keywords[:10])}")
print(f"Processing time: {result.metadata.processing_time:.1f}s")
Knowledge Tracking
from scixtract import KnowledgeTracker
# Initialize tracker
tracker = KnowledgeTracker()
# Add extraction result
tracker.add_extraction_result(result.to_dict(), "paper.pdf")
# Search knowledge base
results = tracker.search_keywords("catalysis")
for hit in results:
    print(f"{hit['cite_key']}: {hit['context']}")
# Get statistics
stats = tracker.get_document_stats()
print(f"Documents: {stats['document_count']}")
print(f"Keywords: {stats['unique_keywords']}")
Advanced Processing
from scixtract import OllamaAIProcessor
# Custom AI processor
ai = OllamaAIProcessor("custom-model")
# Extract keywords
keywords = ai.extract_keywords_and_concepts("Your text here")
print(keywords["technical_keywords"])
# Classify content
content_type = ai.classify_content_type("Abstract text", 1, 10)
print(f"Content type: {content_type}")
# Fix text spacing
fixed_text = ai.fix_text_spacing("Textwithnospaces")
print(f"Fixed: {fixed_text}")
API Reference
Core Classes
AdvancedPDFProcessor
Main processor for PDF extraction with AI enhancement.
processor = AdvancedPDFProcessor(
    model: str = "llama3.2",
    bib_file: Optional[Path] = None
)
result = processor.process_pdf(
    pdf_path: Path,
    bib_file: Optional[Path] = None
) -> ExtractionResult
KnowledgeTracker
Knowledge indexing and search system.
tracker = KnowledgeTracker(db_path: Optional[Path] = None)
tracker.add_extraction_result(result_data: Dict, file_path: str)
results = tracker.search_keywords(query: str, limit: int = 20)
stats = tracker.get_document_stats()
OllamaAIProcessor
AI processing engine using Ollama.
ai = OllamaAIProcessor(
    model: str = "llama3.2",
    base_url: str = "http://localhost:11434"
)
keywords = ai.extract_keywords_and_concepts(text: str)
content_type = ai.classify_content_type(text: str, page_num: int, total_pages: int)
fixed_text = ai.fix_text_spacing(text: str)
Data Models
ExtractionResult
Complete extraction result with metadata, pages, and analysis.
DocumentMetadata
Document metadata including title, authors, keywords, and processing info.
PageContent
Individual page content with classification and keywords.
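The README only describes these models in one line each, so the following is a rough sketch of shapes that would be consistent with the usage examples (`result.metadata.title`, `result.all_keywords`, `result.to_dict()`). The field sets and the de-duplication in `all_keywords` are assumptions; the real classes in scixtract may carry more or different fields.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class DocumentMetadata:
    """Stand-in with the metadata attributes used in the Basic Usage example."""
    title: str
    authors: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    processing_time: float = 0.0


@dataclass
class PageContent:
    """Stand-in for one page: its text, classification, and keywords."""
    page_num: int
    content_type: str
    text: str
    keywords: list[str] = field(default_factory=list)


@dataclass
class ExtractionResult:
    """Stand-in for the full result: metadata plus per-page content."""
    metadata: DocumentMetadata
    pages: list[PageContent] = field(default_factory=list)

    @property
    def all_keywords(self) -> list[str]:
        """Document and page keywords, de-duplicated in first-seen order."""
        merged = list(self.metadata.keywords)
        for page in self.pages:
            merged.extend(page.keywords)
        return list(dict.fromkeys(merged))

    def to_dict(self) -> dict:
        """Plain-dict form, e.g. for JSON output or for a knowledge tracker."""
        return asdict(self)
```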
Examples
See the examples/ directory for complete examples:
- basic_extraction.py - Simple PDF processing
- batch_processing.py - Process multiple PDFs
- knowledge_analysis.py - Knowledge base analysis
- custom_processing.py - Advanced customization
Recommended Models
Based on extensive testing with academic papers:
Pro Tip: Start with qwen2.5:32b-instruct-q4_K_M for the best balance of accuracy and performance.
Best Overall: qwen2.5:32b-instruct-q4_K_M
- Perfect JSON output - No parsing errors
- Excellent keyword extraction - High accuracy
- Academic content optimized - Understands research papers
- Size: 19GB
Lightweight: llama3.2
- Good performance - Reliable results
- Small size - Only 2GB
- Fast processing - Quick turnaround
- Occasional JSON issues - May need post-processing
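As an illustration of what such post-processing could look like, the sketch below pulls the first valid JSON object out of a model reply that wraps it in extra prose. It is a generic technique built on `json.JSONDecoder.raw_decode`, not scixtract's own recovery logic, and `first_json_object` is a hypothetical helper.

```python
import json
from typing import Optional


def first_json_object(reply: str) -> Optional[dict]:
    """Return the first decodable JSON object embedded in `reply`, or None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(reply, start)
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    return None
```

This recovers replies such as `Here you go: {"technical_keywords": ["catalysis"]} Hope that helps!`, but it cannot repair JSON that is itself malformed.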
High-End: qwen2:72b
- Superior accuracy - Best quality results
- Complex reasoning - Handles difficult papers
- Large size - 40GB storage required
- Slow processing - Higher compute requirements
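A small helper can turn the sizes above into a default choice based on free disk space. This is a sketch of one possible policy (prefer the recommended model, fall back to the lightweight one); `pick_model`, `free_disk_gb`, and the 2 GB headroom are assumptions, and the check ignores RAM and compute.

```python
import shutil
from typing import Optional

# Preference order and download sizes (GB) as listed above.
CANDIDATES = [
    ("qwen2.5:32b-instruct-q4_K_M", 19),
    ("llama3.2", 2),
]


def free_disk_gb(path: str = ".") -> float:
    """Free space on the volume holding `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def pick_model(free_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the first preferred model that fits, or None if none does."""
    for model, size_gb in CANDIDATES:
        if size_gb + headroom_gb <= free_gb:
            return model
    return None
```

Calling `pick_model(free_disk_gb())` gives a starting point; pass the result to `--model` or to `AdvancedPDFProcessor(model=...)`.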
Requirements
System Requirements
- Python: 3.10+ (3.11+ recommended)
- Memory: 8GB RAM minimum (16GB+ for large models)
- Storage: 20GB+ free space for models
- OS: macOS, Linux, Windows (WSL2)
Core Dependencies
- Ollama: For AI processing
- PyMuPDF: For PDF text extraction
- SQLite: For knowledge indexing (included with Python)
Optional Dependencies
- bibtexparser: For bibliography integration
- pdfplumber: Alternative PDF processing
- unstructured: Advanced document parsing
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Quick Contribution Guide
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Test your changes: pytest
- Commit your changes: git commit -m 'Add amazing feature'
- Push to branch: git push origin feature/amazing-feature
- Open a Pull Request
Development Setup
git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e ".[dev]"
pre-commit install
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=scixtract --cov-report=html
# Run specific test file
pytest tests/test_extractor.py -v
Code Quality
# Format code
black src/ tests/
isort src/ tests/
# Lint code
flake8 src/ tests/
mypy src/
# Security scan
bandit -r src/
License
This project is licensed under the MIT License - see the LICENSE file for details.
Roadmap
- Multi-language support for international papers
- Web interface for non-technical users
- Cloud deployment options
- Advanced visualization tools
- Integration with reference managers (Zotero, Mendeley)
- Collaborative features for research teams
Citation
If you use this software in your research, please cite:
@software{stamm2024_scixtract,
author = {Stamm, Reto},
title = {scixtract: AI-powered scientific PDF extraction using Ollama},
year = {2024},
url = {https://github.com/retostamm/scixtract},
version = {1.0.0}
}
Acknowledgments
- Research Context: Developed for NOx to Ammonia catalysis research at University of Limerick
- Supervision: Prof. Matthias Vandichel
- AI Engine: Built with Ollama for local AI processing
- PDF Processing: Powered by PyMuPDF
- Community: Thanks to all contributors and users!
Stats
- Test Coverage: 95%+
- Dependencies: Minimal and well-maintained
- CI/CD: Automated testing and deployment
- Documentation: Comprehensive guides and examples
Made with ❤️ for the research community
File details
Details for the file scixtract-1.0.0.tar.gz.
File metadata
- Download URL: scixtract-1.0.0.tar.gz
- Upload date:
- Size: 30.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 671513b1a2cf9395c0225dbf3ba24b39ec62dd478fcdf5764371deafc2cfa400 |
| MD5 | 29834b1dfb3624b72596b7d157dda8af |
| BLAKE2b-256 | 5b82af8468954e13d2d2e0089d97db2822ea9b53b183329b100ab19dafeb7090 |
Provenance
The following attestation bundles were made for scixtract-1.0.0.tar.gz:
Publisher: pypi_publish.yml on retospect/scixtract
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.0.0.tar.gz
- Subject digest: 671513b1a2cf9395c0225dbf3ba24b39ec62dd478fcdf5764371deafc2cfa400
- Sigstore transparency entry: 660755907
- Sigstore integration time:
- Permalink: retospect/scixtract@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Trigger Event: workflow_dispatch
File details
Details for the file scixtract-1.0.0-py3-none-any.whl.
File metadata
- Download URL: scixtract-1.0.0-py3-none-any.whl
- Upload date:
- Size: 29.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 775bc1a0abaafa6e620e53917ce7f919b93fb677bf59a0f98c1503b2d7f1a784 |
| MD5 | 3d1379de2ae167e75ae60a583d08f916 |
| BLAKE2b-256 | 0661021459630781035b66c6804cae0a03452a31054860e03e25d654837e6c90 |
Provenance
The following attestation bundles were made for scixtract-1.0.0-py3-none-any.whl:
Publisher: pypi_publish.yml on retospect/scixtract
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.0.0-py3-none-any.whl
- Subject digest: 775bc1a0abaafa6e620e53917ce7f919b93fb677bf59a0f98c1503b2d7f1a784
- Sigstore transparency entry: 660755908
- Sigstore integration time:
- Permalink: retospect/scixtract@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Trigger Event: workflow_dispatch