scixtract
AI-assisted scientific PDF text extraction using local Ollama models
PDF text extraction is messy and full of artifacts. Scixtract uses AI assistance to clean up text extracted from scientific PDFs, preserving important formatting such as chemical formulas and citations while removing common extraction artifacts.
Designed specifically for academic and scientific literature, scixtract provides clean, structured text output that maintains the integrity of your research content.
What scixtract does
- Cleans messy PDF text: Removes spacing artifacts, broken words, and formatting issues
- Preserves scientific content: Maintains chemical formulas (H₂O, CO₂), equations, and citations
- Local AI processing: Uses local AI models to fix text while preserving meaning
- Privacy-focused: All processing happens on your machine; no data is sent to external services
- Batch processing: Handle multiple PDFs
- Knowledge indexing: Build searchable databases of extracted content
Prerequisites
Before using scixtract, you need to install and set up Ollama:
1. Install Ollama
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Windows: Download from ollama.ai
2. Start Ollama service
ollama serve
3. Install a model
For scientific PDFs:
# Default: Good balance for most users (4.4GB)
ollama pull qwen2.5:7b
# Alternative: Larger, more accurate model (19GB)
ollama pull qwen2.5:32b-instruct-q4_K_M
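Before pulling models, it can help to confirm the Ollama server is actually reachable. The sketch below queries Ollama's REST endpoint `/api/tags`, which lists locally installed models; the URL assumes Ollama's default port (11434), and the helper name is illustrative, not part of scixtract.

```python
import json
import urllib.request
from urllib.error import URLError

OLLAMA_URL = "http://localhost:11434"

def list_local_models(base_url=OLLAMA_URL):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (URLError, OSError):
        return None

models = list_local_models()
if models is None:
    print(f"Ollama not reachable at {OLLAMA_URL} - did you run 'ollama serve'?")
else:
    print("Installed models:", ", ".join(models) or "(none)")
```

If the check fails, start the service with `ollama serve` and re-run it before pulling a model.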
Installation
Install scixtract from PyPI:
pip install scixtract
Quick Start
Basic PDF extraction
# Extract a single PDF
scixtract extract paper.pdf
# Use specific model
scixtract extract paper.pdf --model qwen2.5:7b
# Process multiple PDFs
scixtract extract papers/*.pdf
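Shell globs like `papers/*.pdf` do not recurse into subdirectories. For a nested folder tree, `find` can hand every PDF to scixtract one at a time; the `echo` below makes this a dry run (it only prints the commands), so remove it to actually process the files. The `papers_demo` tree is created here purely for illustration.

```shell
# Build a small demo tree (for illustration only)
mkdir -p papers_demo/reviews
touch papers_demo/a.pdf papers_demo/reviews/b.pdf

# Recursively list each PDF and show the command that would run for it
find papers_demo -name '*.pdf' -print0 | sort -z | xargs -0 -n1 echo scixtract extract
```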
Python API
from scixtract import AdvancedPDFProcessor
from pathlib import Path
# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen2.5:7b"
)
# Process PDF
result = processor.process_pdf(Path("paper.pdf"))
# Access cleaned text and metadata
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")
print(f"Keywords: {', '.join(result.all_keywords[:10])}")
print(f"Processing time: {result.metadata.processing_time:.1f}s")
# Get page content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:200]}...")
Knowledge management
Build a searchable database of your extracted content:
# Extract and add to knowledge base (with bibliography for author name recognition)
scixtract extract paper.pdf --bib-file references.bib --update-knowledge
# Search your knowledge base
scixtract knowledge --search "catalysis"
# View statistics
scixtract knowledge --stats
Output formats
Scixtract provides multiple output formats:
- JSON: Structured data with metadata, page content, and extracted keywords
- Markdown: Clean, readable text with AI-generated summaries
- Knowledge database: SQLite database for searching across multiple documents
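As an illustration, the JSON output might look roughly like the following. This is a hypothetical sketch assembled from the fields shown in the Python API (`metadata.title`, `authors`, `pages`, keywords, processing time); the actual schema may differ.

```json
{
  "metadata": {
    "title": "Example Paper Title",
    "authors": ["A. Author", "B. Author"],
    "processing_time": 42.5
  },
  "pages": [
    {
      "page_number": 1,
      "content": "Cleaned text of page 1...",
      "keywords": ["catalysis"]
    }
  ],
  "all_keywords": ["catalysis", "ammonia"]
}
```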
Model recommendations
Based on testing with scientific papers:
Default: qwen2.5:7b
- Good balance of performance and size
- Reliable JSON output
- Size: 4.4GB
High-performance option: qwen2.5:32b-instruct-q4_K_M
- Better accuracy for complex scientific content
- Larger model with more capabilities
- Size: 19GB
System requirements
- Python: 3.10 or higher
- Memory: 8GB RAM minimum (16GB+ recommended for large models)
- Storage: 20GB+ free space for AI models
- Ollama: Required for AI processing
Help and setup
Use the built-in setup helper:
# Check if Ollama is properly configured
scixtract-setup-ollama --check-only
# List available models
scixtract-setup-ollama --list-models
# Complete setup with default model
scixtract-setup-ollama --model qwen2.5:7b
License
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
Support
For technical documentation, API reference, and development information, see MAINTAINER_README.md.
Knowledge Tracking
from scixtract import KnowledgeTracker
# Initialize tracker
tracker = KnowledgeTracker()
# Add extraction result
tracker.add_extraction_result(result.to_dict(), "paper.pdf")
# Search knowledge base
results = tracker.search_keywords("catalysis")
for hit in results:
    print(f"{hit['cite_key']}: {hit['context']}")
# Get statistics
stats = tracker.get_document_stats()
print(f"Documents: {stats['document_count']}")
print(f"Keywords: {stats['unique_keywords']}")
Advanced Processing
from scixtract import OllamaAIProcessor
# Custom AI processor
ai = OllamaAIProcessor("custom-model")
# Extract keywords
keywords = ai.extract_keywords_and_concepts("Your text here")
print(keywords["technical_keywords"])
# Classify content
content_type = ai.classify_content_type("Abstract text", 1, 10)
print(f"Content type: {content_type}")
# Fix text spacing
fixed_text = ai.fix_text_spacing("Textwithnospaces")
print(f"Fixed: {fixed_text}")
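Scixtract delegates spacing repair to the language model, but the underlying idea can be pictured with a naive greedy longest-match splitter over a known vocabulary. This is a toy sketch, not scixtract's implementation; real extraction artifacts need the AI pass, which handles unknown words and scientific notation far better.

```python
def split_words(text: str, vocab: set[str], max_len: int = 20) -> str:
    """Greedily re-insert spaces using longest-prefix matches against vocab."""
    words, i, n = [], 0, len(text)
    while i < n:
        for j in range(min(n, i + max_len), i, -1):
            if text[i:j].lower() in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            # unknown character: emit it as-is and move on
            words.append(text[i])
            i += 1
    return " ".join(words)

vocab = {"text", "with", "no", "spaces"}
print(split_words("Textwithnospaces", vocab))  # -> Text with no spaces
```

Greedy matching fails on ambiguous splits, which is exactly why an AI model is used instead.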
API Reference
Core Classes
AdvancedPDFProcessor
Main processor for PDF extraction with AI enhancement.
processor = AdvancedPDFProcessor(
    model: str = "llama3.2",
    bib_file: Optional[Path] = None
)
result = processor.process_pdf(
    pdf_path: Path,
    bib_file: Optional[Path] = None
) -> ExtractionResult
KnowledgeTracker
Knowledge indexing and search system.
tracker = KnowledgeTracker(db_path: Optional[Path] = None)
tracker.add_extraction_result(result_data: Dict, file_path: str)
results = tracker.search_keywords(query: str, limit: int = 20)
stats = tracker.get_document_stats()
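The KnowledgeTracker API above can be pictured as a thin layer over a SQLite table mapping keywords to documents. The following is a minimal, hypothetical sketch of that idea using only the standard library; scixtract's actual schema and behavior may differ.

```python
import sqlite3

class MiniTracker:
    """Toy keyword index: one row per (cite_key, keyword, context)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS keywords "
            "(cite_key TEXT, keyword TEXT, context TEXT)"
        )

    def add(self, cite_key, keyword, context=""):
        self.conn.execute(
            "INSERT INTO keywords VALUES (?, ?, ?)", (cite_key, keyword, context)
        )

    def search(self, query, limit=20):
        # substring match on the keyword column, parameterized to avoid injection
        cur = self.conn.execute(
            "SELECT cite_key, keyword, context FROM keywords "
            "WHERE keyword LIKE ? LIMIT ?", (f"%{query}%", limit)
        )
        return cur.fetchall()

tracker = MiniTracker()
tracker.add("smith2023", "catalysis", "NOx reduction over Cu zeolites")
tracker.add("lee2022", "ammonia synthesis", "electrochemical route")
print(tracker.search("catalysis"))  # -> [('smith2023', 'catalysis', 'NOx reduction over Cu zeolites')]
```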
OllamaAIProcessor
AI processing engine using Ollama.
ai = OllamaAIProcessor(
    model: str = "llama3.2",
    base_url: str = "http://localhost:11434"
)
keywords = ai.extract_keywords_and_concepts(text: str)
content_type = ai.classify_content_type(text: str, page_num: int, total_pages: int)
fixed_text = ai.fix_text_spacing(text: str)
Data Models
ExtractionResult
Complete extraction result with metadata, pages, and analysis.
DocumentMetadata
Document metadata including title, authors, keywords, and processing info.
PageContent
Individual page content with classification and keywords.
Examples
See the examples/ directory for complete examples:
- basic_extraction.py - Simple PDF processing
- batch_processing.py - Process multiple PDFs
- knowledge_analysis.py - Knowledge base analysis
- custom_processing.py - Advanced customization
Recommended Models
Based on extensive testing with academic papers:
Pro Tip: Start with qwen2.5:32b-instruct-q4_K_M for the best balance of accuracy and performance.
Best Overall: qwen2.5:32b-instruct-q4_K_M
- Perfect JSON output: no parsing errors
- Excellent keyword extraction: high accuracy
- Optimized for academic content: understands research papers
- Size: 19GB
Lightweight: llama3.2
- Good performance: reliable results
- Small size: only 2GB
- Fast processing: quick turnaround
- Caveat: occasional JSON issues; may need post-processing
High-End: qwen2:72b
- Superior accuracy: best quality results
- Complex reasoning: handles difficult papers
- Drawback: large size; 40GB storage required
- Drawback: slow processing; higher compute requirements
Requirements
System Requirements
- Python: 3.10+ (3.11+ recommended)
- Memory: 8GB RAM minimum (16GB+ for large models)
- Storage: 20GB+ free space for models
- OS: macOS, Linux, Windows (WSL2)
Core Dependencies
- Ollama: For AI processing
- PyMuPDF: For PDF text extraction
- SQLite: For knowledge indexing (included with Python)
Optional Dependencies
- bibtexparser: For bibliography integration
- pdfplumber: Alternative PDF processing
- unstructured: Advanced document parsing
Contributing
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
Quick Contribution Guide
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Test your changes: pytest
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
Development Setup
git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e ".[dev]"
pre-commit install
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=scixtract --cov-report=html
# Run specific test file
pytest tests/test_extractor.py -v
Code Quality
# Format code
black src/ tests/
isort src/ tests/
# Lint code
flake8 src/ tests/
mypy src/
# Security scan
bandit -r src/
Roadmap
- Multi-language support for international papers
- Web interface for non-technical users
- Cloud deployment options
- Advanced visualization tools
- Integration with reference managers (Zotero, Mendeley)
- Collaborative features for research teams
Citation
If you use this software in your research, please cite:
@software{stamm2024_scixtract,
  author  = {Stamm, Reto},
  title   = {scixtract: AI-powered scientific PDF extraction using Ollama},
  year    = {2024},
  url     = {https://github.com/retostamm/scixtract},
  version = {1.0.3}
}
Acknowledgments
- Research Context: Developed for NOx-to-ammonia catalysis research at University of Limerick
- Supervision: Prof. Matthias Vandichel
- AI Engine: Built with Ollama for local AI processing
- PDF Processing: Powered by PyMuPDF
- Community: Thanks to all contributors and users!
Stats
- Test Coverage: 95%+
- Dependencies: Minimal and well-maintained
- CI/CD: Automated testing and deployment
- Documentation: Comprehensive guides and examples
For issues and questions, please visit the GitHub repository.
Built with Windsurf.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file scixtract-1.0.3.tar.gz.
File metadata
- Download URL: scixtract-1.0.3.tar.gz
- Upload date:
- Size: 30.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1ed820d90b90122fbd896d61c8051428dab6ede3502b728c9d8a5194b2cfbcf |
| MD5 | 182d700c95ff24d8a5783490451c2086 |
| BLAKE2b-256 | b16a06250833032a3eb9e5c11cc1dae9fdc16df47049acb79ca7e8fe686f9861 |
Provenance
The following attestation bundles were made for scixtract-1.0.3.tar.gz:
Publisher: pypi_publish.yml on retospect/scixtract
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.0.3.tar.gz
- Subject digest: c1ed820d90b90122fbd896d61c8051428dab6ede3502b728c9d8a5194b2cfbcf
- Sigstore transparency entry: 660762673
- Sigstore integration time:
- Permalink: retospect/scixtract@05b680aeb30806588146c76d34d78217ebfaeac8
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@05b680aeb30806588146c76d34d78217ebfaeac8
- Trigger Event: workflow_dispatch
File details
Details for the file scixtract-1.0.3-py3-none-any.whl.
File metadata
- Download URL: scixtract-1.0.3-py3-none-any.whl
- Upload date:
- Size: 29.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7c2081e04e1473e8f5398772224d0a539c8a290c5fa8efa7d51d64bd9e87cfae |
| MD5 | a946abd675293071a9e3d75d55d87586 |
| BLAKE2b-256 | 8bb22673b57dc5bf936256cbeaf45ab14fe97c5a7d70e27571ab4a03fd15ec5e |
Provenance
The following attestation bundles were made for scixtract-1.0.3-py3-none-any.whl:
Publisher: pypi_publish.yml on retospect/scixtract
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.0.3-py3-none-any.whl
- Subject digest: 7c2081e04e1473e8f5398772224d0a539c8a290c5fa8efa7d51d64bd9e87cfae
- Sigstore transparency entry: 660762676
- Sigstore integration time:
- Permalink: retospect/scixtract@05b680aeb30806588146c76d34d78217ebfaeac8
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@05b680aeb30806588146c76d34d78217ebfaeac8
- Trigger Event: workflow_dispatch