scixtract

🚀 AI-powered scientific PDF extraction using Ollama

Transform your academic PDFs into structured, searchable knowledge with cutting-edge AI

A comprehensive library for extracting text from academic PDFs using AI, with advanced knowledge tracking and search capabilities. Specifically optimized for scientific literature with features like chemical formula preservation, citation integrity, and intelligent content classification.

🎯 Why scixtract?

🤖 AI-First Approach: Leverages local Ollama models for privacy-preserving extraction
🔬 Science-Optimized: Preserves chemical formulas, equations, and academic formatting
📊 Knowledge Graphs: Builds searchable networks of concepts and relationships
⚡ High Performance: Batch processing, backed by a test suite with 95%+ coverage
🔒 Privacy-Focused: All processing happens locally - no data leaves your machine

📦 Installation

From PyPI (recommended)

pip install scixtract

From Source

git clone https://github.com/retospect/scixtract.git
cd scixtract
pip install -e .

Development Installation

git clone https://github.com/retospect/scixtract.git
cd scixtract
pip install -e ".[dev]"

⚡ Quick Start

Get up and running in under 5 minutes!

1. Setup Ollama (AI Engine)

# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve &

# Install recommended model
ollama pull qwen2.5:32b-instruct-q4_K_M

# Or use the setup helper
scixtract-setup-ollama

2. Extract PDF with AI

# Basic extraction
scixtract extract paper.pdf

# With specific model
scixtract extract paper.pdf --model qwen2.5:32b-instruct-q4_K_M

# With bibliography integration
scixtract extract paper.pdf --bib-file references.bib --update-knowledge

3. Search Knowledge Base

# Search for keywords
scixtract knowledge --search "catalysis"

# Find related concepts
scixtract knowledge --related "ammonia"

# View statistics
scixtract knowledge --stats

✨ Features

🤖 AI-Powered Processing

Multi-pass analysis with keyword extraction → classification → enhancement
Intelligent text fixing that preserves chemical formulas and citations
Content classification (abstract, methods, results, discussion, etc.)
Advanced prompting strategies optimized for academic papers
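
The multi-pass idea above can be illustrated with plain functions chained in order. This is a minimal sketch, not scixtract's implementation: the `PageDraft` class, the function names, and the simple heuristics are all hypothetical stand-ins for the model calls the library makes.

```python
from dataclasses import dataclass, field


@dataclass
class PageDraft:
    """Hypothetical container that each pass reads and updates."""
    text: str
    keywords: list[str] = field(default_factory=list)
    content_type: str = "unknown"


def extract_keywords(page: PageDraft) -> PageDraft:
    # Stand-in for a model call: keep distinct words longer than 6 characters.
    seen: list[str] = []
    for word in page.text.lower().split():
        word = word.strip(".,;:()")
        if len(word) > 6 and word not in seen:
            seen.append(word)
    page.keywords = seen
    return page


def classify(page: PageDraft) -> PageDraft:
    # Stand-in for a model call: label the page by its first line.
    stripped = page.text.strip()
    first_line = stripped.splitlines()[0].lower() if stripped else ""
    for label in ("abstract", "methods", "results", "discussion"):
        if first_line.startswith(label):
            page.content_type = label
            break
    return page


def enhance(page: PageDraft) -> PageDraft:
    # Stand-in for text fixing: collapse runs of whitespace.
    page.text = " ".join(page.text.split())
    return page


def run_passes(text: str) -> PageDraft:
    """Apply the three passes in the order named above."""
    page = PageDraft(text)
    for step in (extract_keywords, classify, enhance):
        page = step(page)
    return page
```

Keeping each pass as a separate function means a pass can be swapped or skipped without touching the others.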

📚 Knowledge Management

SQLite database for fast, searchable knowledge indexing
Cross-document concept networks and relationship mapping
Author tracking and citation networks
Knowledge graph export for visualization
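
To make the SQLite-backed indexing concrete, here is a small sketch using only the standard library. The table layout and function names are assumptions chosen for illustration; scixtract's actual schema is not documented here. The `cite_key` and `context` columns mirror the fields shown in the Knowledge Tracking example.

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    """Create a hypothetical keyword table with an index for lookups."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords ("
        "cite_key TEXT NOT NULL, keyword TEXT NOT NULL, context TEXT)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_keyword ON keywords(keyword)")


def add_keyword(conn: sqlite3.Connection, cite_key: str, keyword: str, context: str) -> None:
    # Store keywords lowercased so searches are case-insensitive.
    conn.execute(
        "INSERT INTO keywords VALUES (?, ?, ?)",
        (cite_key, keyword.lower(), context),
    )


def search(conn: sqlite3.Connection, query: str, limit: int = 20) -> list[tuple[str, str]]:
    """Return (cite_key, context) rows whose keyword contains the query."""
    rows = conn.execute(
        "SELECT cite_key, context FROM keywords "
        "WHERE keyword LIKE ? ORDER BY cite_key LIMIT ?",
        (f"%{query.lower()}%", limit),
    )
    return rows.fetchall()
```

An in-memory database (`sqlite3.connect(":memory:")`) is enough to try this out without creating a file.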

🔬 Academic Optimization

Chemical formula preservation (NOₓ, NH₃, H₂O, etc.)
Citation integrity maintenance
Bibliography integration from BibTeX files
Research context linking between processed content and bibliography
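
PDF extraction often flattens subscripts, turning NH₃ into NH3. The sketch below shows one way a post-processing step might restore digit subscripts in formula-like tokens. It is an illustrative heuristic, not the method scixtract uses, and the names `FORMULA` and `to_subscripts` are hypothetical. It does not handle letter subscripts such as the x in NOₓ, and it will also convert labels that look like formulas (for example "C2").

```python
import re

# Map ASCII digits to their Unicode subscript forms.
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

# A token made only of element-like pieces: capital, optional lowercase, optional digits.
FORMULA = re.compile(r"\b(?:[A-Z][a-z]?\d*)+\b")


def to_subscripts(text: str) -> str:
    """Replace digits with subscripts inside formula-like tokens only."""
    def fix(match: re.Match) -> str:
        token = match.group(0)
        if any(ch.isdigit() for ch in token):
            return token.translate(SUBSCRIPTS)
        return token  # Plain capitalized tokens without digits stay unchanged.

    return FORMULA.sub(fix, text)
```

Standalone numbers such as temperatures are left alone because the pattern requires each piece to start with a capital letter.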

📄 Multiple Output Formats

Structured JSON with comprehensive metadata
Enhanced Markdown with AI-generated summaries
Keyword indices for fast searching
Knowledge graphs for visualization
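
As a rough picture of what a keyword-based knowledge graph can look like, the sketch below links keywords that appear in the same document and counts how often each pair co-occurs. The node and edge layout is an assumption for illustration only; it is not the format scixtract exports.

```python
import json
from collections import Counter
from itertools import combinations


def cooccurrence_graph(doc_keywords: dict[str, list[str]]) -> dict:
    """Build a graph whose edge weights count documents sharing both keywords."""
    weights: Counter = Counter()
    for keywords in doc_keywords.values():
        # Sort so each unordered pair always has the same (a, b) orientation.
        for a, b in combinations(sorted(set(keywords)), 2):
            weights[(a, b)] += 1
    nodes = sorted({keyword for pair in weights for keyword in pair})
    edges = [
        {"source": a, "target": b, "weight": w}
        for (a, b), w in sorted(weights.items())
    ]
    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    docs = {
        "paper1": ["ammonia", "catalysis"],
        "paper2": ["catalysis", "ammonia", "nitrate"],
    }
    print(json.dumps(cooccurrence_graph(docs), indent=2))
```

Keywords that never share a document with another keyword do not appear as nodes in this simplified version.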

🛠 Professional Tools

Command-line interface for batch processing
Python API for integration
Comprehensive testing with 95%+ coverage
Type hints throughout

🛠 Usage

Command Line Interface

PDF Extraction

# Basic extraction
scixtract extract paper.pdf

# Advanced options
scixtract extract paper.pdf \
    --model qwen2.5:32b-instruct-q4_K_M \
    --output-dir ./extractions \
    --bib-file references.bib \
    --update-knowledge

# Batch processing
for pdf in papers/*.pdf; do
    scixtract extract "$pdf" --update-knowledge
done
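
The same batch loop can be driven from Python when you want to collect exit codes per file. This sketch only uses the `extract` subcommand and `--update-knowledge` flag shown above; the helper names `build_extract_commands` and `run_batch` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_extract_commands(folder: Path, update_knowledge: bool = True) -> list[list[str]]:
    """Return one scixtract command (as an argument list) per PDF in folder."""
    commands = []
    for pdf in sorted(folder.glob("*.pdf")):
        cmd = ["scixtract", "extract", str(pdf)]
        if update_knowledge:
            cmd.append("--update-knowledge")
        commands.append(cmd)
    return commands


def run_batch(folder: Path) -> dict[str, int]:
    """Run each command and return exit codes keyed by PDF file name."""
    results = {}
    for cmd in build_extract_commands(folder):
        # check=False so one failing PDF does not stop the rest of the batch.
        completed = subprocess.run(cmd, check=False)
        results[Path(cmd[2]).name] = completed.returncode
    return results
```

Passing arguments as a list avoids shell quoting problems with file names that contain spaces.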

Knowledge Management

# Search for concepts
scixtract knowledge --search "electrochemical conversion"

# Find related concepts
scixtract knowledge --related "NOx reduction"

# Export knowledge graph
scixtract knowledge --export-graph knowledge_graph.json

# View database statistics
scixtract knowledge --stats

Ollama Setup

# Check Ollama status
scixtract-setup-ollama --check-only

# List available models
scixtract-setup-ollama --list-models

# Complete setup with recommended model
scixtract-setup-ollama --model qwen2.5:32b-instruct-q4_K_M

Python API

Basic Usage

from scixtract import AdvancedPDFProcessor
from pathlib import Path

# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen2.5:32b-instruct-q4_K_M",
    bib_file=Path("references.bib")
)

# Process PDF
result = processor.process_pdf(Path("paper.pdf"))

# Access results
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")
print(f"Keywords: {', '.join(result.all_keywords[:10])}")
print(f"Processing time: {result.metadata.processing_time:.1f}s")

Knowledge Tracking

from scixtract import KnowledgeTracker

# Initialize tracker
tracker = KnowledgeTracker()

# Add the extraction result (the `result` object from Basic Usage)
tracker.add_extraction_result(result.to_dict(), "paper.pdf")

# Search knowledge base
matches = tracker.search_keywords("catalysis")
for match in matches:
    print(f"{match['cite_key']}: {match['context']}")

# Get statistics
stats = tracker.get_document_stats()
print(f"Documents: {stats['document_count']}")
print(f"Keywords: {stats['unique_keywords']}")

Advanced Processing

from scixtract import OllamaAIProcessor

# Custom AI processor
ai = OllamaAIProcessor("custom-model")

# Extract keywords
keywords = ai.extract_keywords_and_concepts("Your text here")
print(keywords["technical_keywords"])

# Classify content
content_type = ai.classify_content_type("Abstract text", 1, 10)
print(f"Content type: {content_type}")

# Fix text spacing
fixed_text = ai.fix_text_spacing("Textwithnospaces")
print(f"Fixed: {fixed_text}")

📚 API Reference

Core Classes

`AdvancedPDFProcessor`

Main processor for PDF extraction with AI enhancement.

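# Signature summary (stub form; bodies omitted)
from pathlib import Path
from typing import Optional

class AdvancedPDFProcessor:
    def __init__(
        self,
        model: str = "llama3.2",
        bib_file: Optional[Path] = None,
    ) -> None: ...

    def process_pdf(
        self,
        pdf_path: Path,
        bib_file: Optional[Path] = None,
    ) -> "ExtractionResult": ...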

`KnowledgeTracker`

Knowledge indexing and search system.

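# Signature summary (stub form; bodies omitted)
from pathlib import Path
from typing import Dict, Optional

class KnowledgeTracker:
    def __init__(self, db_path: Optional[Path] = None) -> None: ...
    def add_extraction_result(self, result_data: Dict, file_path: str): ...
    def search_keywords(self, query: str, limit: int = 20): ...
    def get_document_stats(self): ...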

`OllamaAIProcessor`

AI processing engine using Ollama.

# Signature summary (stub form; bodies omitted)
class OllamaAIProcessor:
    def __init__(
        self,
        model: str = "llama3.2",
        base_url: str = "http://localhost:11434",
    ) -> None: ...

    def extract_keywords_and_concepts(self, text: str): ...
    def classify_content_type(self, text: str, page_num: int, total_pages: int): ...
    def fix_text_spacing(self, text: str): ...
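
For readers curious what a call to a local Ollama server looks like at the HTTP level, the sketch below builds a request for Ollama's `/api/generate` endpoint using the same default model and base URL as above. Whether `OllamaAIProcessor` uses this endpoint internally is an assumption; the helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "llama3.2",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build (but do not send) a non-streaming JSON-mode generate request."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,   # Ask for a single JSON reply instead of a stream.
        "format": "json",  # Ask the model to emit valid JSON.
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Separating request construction from sending makes the payload easy to inspect without a running server.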

Data Models

`ExtractionResult`

Complete extraction result with metadata, pages, and analysis.

`DocumentMetadata`

Document metadata including title, authors, keywords, and processing info.

`PageContent`

Individual page content with classification and keywords.

💡 Examples

See the examples/ directory for complete examples:

basic_extraction.py - Simple PDF processing
batch_processing.py - Process multiple PDFs
knowledge_analysis.py - Knowledge base analysis
custom_processing.py - Advanced customization

🎯 Recommended Models

Based on extensive testing with academic papers:

💡 Pro Tip: Start with qwen2.5:32b-instruct-q4_K_M for the best balance of accuracy and performance

Best Overall: qwen2.5:32b-instruct-q4_K_M

✅ Reliable JSON output - No parsing errors in testing
✅ Excellent keyword extraction - High accuracy
✅ Academic content optimized - Understands research papers
📦 Size: 19GB

Lightweight: llama3.2

✅ Good performance - Reliable results
✅ Small size - Only 2GB
✅ Fast processing - Quick turnaround
⚠ Occasional JSON issues - May need post-processing
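
When a model wraps its JSON in extra prose, a small amount of post-processing usually recovers it. The following is a sketch of one possible approach using only the standard library; `extract_json_object` is a hypothetical helper, not part of scixtract.

```python
import json
from typing import Optional


def extract_json_object(raw: str) -> Optional[dict]:
    """Return the first parseable JSON object embedded in raw, or None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # Not valid from this brace; try the next opening brace.
            start = raw.find("{", start + 1)
    return None
```

For example, `extract_json_object('Here you go: {"technical_keywords": ["catalysis"]} Hope this helps')` returns the embedded dictionary and ignores the surrounding text.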

High-End: qwen2:72b

✅ Superior accuracy - Best quality results
✅ Complex reasoning - Handles difficult papers
❌ Large size - 40GB storage required
❌ Slow processing - Higher compute requirements

📋 Requirements

System Requirements

Python: 3.10+ (3.11+ recommended)
Memory: 8GB RAM minimum (16GB+ for large models)
Storage: 20GB+ free space for models
OS: macOS, Linux, Windows (WSL2)
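
A quick pre-flight check against the Python and storage figures above can be scripted with the standard library. This is a sketch only; `check_environment` and its parameters are hypothetical, and it does not check RAM because the standard library has no portable call for that.

```python
import shutil
import sys


def check_environment(
    path: str = ".",
    min_python: tuple = (3, 10),
    min_free_gb: float = 20.0,
) -> dict:
    """Report whether the interpreter and free disk space meet the minimums."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_ok": sys.version_info >= min_python,
        "free_gb": round(free_gb, 1),
        "storage_ok": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    print(check_environment())
```

Point `path` at the drive where Ollama stores its models if that differs from the current directory.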

Core Dependencies

Ollama: For AI processing
PyMuPDF: For PDF text extraction
SQLite: For knowledge indexing (included with Python)

Optional Dependencies

bibtexparser: For bibliography integration
pdfplumber: Alternative PDF processing
unstructured: Advanced document parsing

🤝 Contributing

We welcome contributions! 🎉 Please see CONTRIBUTING.md for guidelines.

Quick Contribution Guide

🍴 Fork the repository
🌿 Create a feature branch: git checkout -b feature/amazing-feature
✅ Test your changes: pytest
📝 Commit your changes: git commit -m 'Add amazing feature'
🚀 Push to branch: git push origin feature/amazing-feature
📬 Open a Pull Request

Development Setup

git clone https://github.com/retospect/scixtract.git
cd scixtract
pip install -e ".[dev]"
pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scixtract --cov-report=html

# Run specific test file
pytest tests/test_extractor.py -v

Code Quality

# Format code
black src/ tests/
isort src/ tests/

# Lint code
flake8 src/ tests/
mypy src/

# Security scan
bandit -r src/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🚀 Roadmap

Multi-language support for international papers
Web interface for non-technical users
Cloud deployment options
Advanced visualization tools
Integration with reference managers (Zotero, Mendeley)
Collaborative features for research teams

Citation

If you use this software in your research, please cite:

@software{stamm2024_scixtract,
  author = {Stamm, Reto},
  title = {scixtract: AI-powered scientific PDF extraction using Ollama},
  year = {2024},
  url = {https://github.com/retospect/scixtract},
  version = {1.0.0}
}

🙏 Acknowledgments

🎓 Research Context: Developed for NOx to Ammonia catalysis research at University of Limerick
👨‍🏫 Supervision: Prof. Matthias Vandichel
🤖 AI Engine: Built with Ollama for local AI processing
📄 PDF Processing: Powered by PyMuPDF
🌟 Community: Thanks to all contributors and users!

📈 Stats

🧪 Test Coverage: 95%+
📦 Dependencies: Minimal and well-maintained
🔄 CI/CD: Automated testing and deployment
📚 Documentation: Comprehensive guides and examples

Made with ❤️ for the research community

Release history

1.1.1 - Jan 18, 2026
1.1.0 - Jan 18, 2026
1.0.5 - Nov 2, 2025
1.0.3 - Nov 1, 2025
1.0.2 - Nov 1, 2025
1.0.1 - Nov 1, 2025
1.0.0 - Nov 1, 2025 (this version)
0.3.0 - Nov 1, 2025

Download files

Source Distribution

scixtract-1.0.0.tar.gz (30.3 kB)

Uploaded Nov 1, 2025 Source

Built Distribution

scixtract-1.0.0-py3-none-any.whl (29.7 kB)

Uploaded Nov 1, 2025 Python 3

File details

Details for the file scixtract-1.0.0.tar.gz.

File metadata

Download URL: scixtract-1.0.0.tar.gz
Upload date: Nov 1, 2025
Size: 30.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scixtract-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`671513b1a2cf9395c0225dbf3ba24b39ec62dd478fcdf5764371deafc2cfa400`
MD5	`29834b1dfb3624b72596b7d157dda8af`
BLAKE2b-256	`5b82af8468954e13d2d2e0089d97db2822ea9b53b183329b100ab19dafeb7090`
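
A downloaded file can be checked against the published SHA256 digest with Python's `hashlib`. The helper names below are hypothetical; the approach is a generic sketch rather than anything specific to scixtract or PyPI tooling.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest to a published hex string, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution listed above, the call would be `matches_published_hash("scixtract-1.0.0.tar.gz", "671513b1a2cf9395c0225dbf3ba24b39ec62dd478fcdf5764371deafc2cfa400")`.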

Provenance

The following attestation bundles were made for scixtract-1.0.0.tar.gz:

Publisher: pypi_publish.yml on retospect/scixtract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.0.0.tar.gz
- Subject digest: 671513b1a2cf9395c0225dbf3ba24b39ec62dd478fcdf5764371deafc2cfa400
- Sigstore transparency entry: 660755907
- Sigstore integration time: Nov 1, 2025
Source repository:
- Permalink: retospect/scixtract@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Trigger Event: workflow_dispatch

File details

Details for the file scixtract-1.0.0-py3-none-any.whl.

File metadata

Download URL: scixtract-1.0.0-py3-none-any.whl
Upload date: Nov 1, 2025
Size: 29.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scixtract-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`775bc1a0abaafa6e620e53917ce7f919b93fb677bf59a0f98c1503b2d7f1a784`
MD5	`3d1379de2ae167e75ae60a583d08f916`
BLAKE2b-256	`0661021459630781035b66c6804cae0a03452a31054860e03e25d654837e6c90`

Provenance

The following attestation bundles were made for scixtract-1.0.0-py3-none-any.whl:

Publisher: pypi_publish.yml on retospect/scixtract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.0.0-py3-none-any.whl
- Subject digest: 775bc1a0abaafa6e620e53917ce7f919b93fb677bf59a0f98c1503b2d7f1a784
- Sigstore transparency entry: 660755908
- Sigstore integration time: Nov 1, 2025
Source repository:
- Permalink: retospect/scixtract@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@56d7bb88cd8191f19d67355bcace69c9832e60d0
- Trigger Event: workflow_dispatch
