
AI-powered scientific PDF extraction using Ollama


scixtract


Transform your academic PDFs into structured, searchable knowledge with cutting-edge AI

A comprehensive library for extracting text from academic PDFs using AI, with advanced knowledge tracking and search capabilities. Specifically optimized for scientific literature with features like chemical formula preservation, citation integrity, and intelligent content classification.

Why scixtract?

  • AI-First Approach: Leverages local Ollama models for privacy-preserving extraction
  • Science-Optimized: Preserves chemical formulas, equations, and academic formatting
  • Knowledge Graphs: Builds searchable networks of concepts and relationships
  • High Performance: Batch processing with 95%+ test coverage
  • Privacy-Focused: All processing happens locally - no data leaves your machine


Installation

From PyPI (recommended)

pip install scixtract

From Source

git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e .

Development Installation

git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e ".[dev]"

Quick Start

Get up and running in under 5 minutes!

1. Setup Ollama (AI Engine)

# Install Ollama (macOS)
brew install ollama

# Start Ollama service
ollama serve &

# Install recommended model
ollama pull qwen2.5:32b-instruct-q4_K_M

# Or use the setup helper
scixtract-setup-ollama

2. Extract PDF with AI

# Basic extraction
scixtract extract paper.pdf

# With specific model
scixtract extract paper.pdf --model qwen2.5:32b-instruct-q4_K_M

# With bibliography integration
scixtract extract paper.pdf --bib-file references.bib --update-knowledge

3. Search Knowledge Base

# Search for keywords
scixtract knowledge --search "catalysis"

# Find related concepts
scixtract knowledge --related "ammonia"

# View statistics
scixtract knowledge --stats

Features

AI-Powered Processing

  • Multi-pass analysis with keyword extraction → classification → enhancement
  • Intelligent text fixing that preserves chemical formulas and citations
  • Content classification (abstract, methods, results, discussion, etc.)
  • Advanced prompting strategies optimized for academic papers
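
The multi-pass flow can be sketched in terms of the three OllamaAIProcessor methods documented in the API Reference of this README. This is an illustration only: analyse_page and StubProcessor are hypothetical names, the stub stands in for a real model so the snippet runs without Ollama, and the order and number of passes inside scixtract itself may differ.

```python
from typing import Any, Dict, Protocol


class PageAnalyser(Protocol):
    """Structural type covering the three methods this README documents."""

    def extract_keywords_and_concepts(self, text: str) -> Dict[str, Any]: ...
    def classify_content_type(self, text: str, page_num: int, total_pages: int) -> str: ...
    def fix_text_spacing(self, text: str) -> str: ...


def analyse_page(ai: PageAnalyser, text: str, page_num: int, total_pages: int) -> Dict[str, Any]:
    """Hypothetical helper: run the three passes in order on one page."""
    keywords = ai.extract_keywords_and_concepts(text)                     # pass 1: keywords
    content_type = ai.classify_content_type(text, page_num, total_pages)  # pass 2: classification
    enhanced = ai.fix_text_spacing(text)                                  # pass 3: enhancement
    return {"keywords": keywords, "content_type": content_type, "text": enhanced}


class StubProcessor:
    """Stand-in for a real model; returns simple rule-based answers."""

    def extract_keywords_and_concepts(self, text: str) -> Dict[str, Any]:
        return {"technical_keywords": [w for w in text.split() if len(w) > 6]}

    def classify_content_type(self, text: str, page_num: int, total_pages: int) -> str:
        return "abstract" if page_num == 1 else "body"

    def fix_text_spacing(self, text: str) -> str:
        return " ".join(text.split())


page = analyse_page(StubProcessor(), "Ammonia  synthesis via   catalysis", 1, 10)
print(page["content_type"], page["keywords"]["technical_keywords"])
```

Because analyse_page only depends on the method names, an OllamaAIProcessor instance should be usable in place of the stub, assuming its methods behave as described in the API Reference.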

Knowledge Management

  • SQLite database for fast, searchable knowledge indexing
  • Cross-document concept networks and relationship mapping
  • Author tracking and citation networks
  • Knowledge graph export for visualization
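
To illustrate the idea of a SQLite-backed keyword index, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, helper functions, and the sample cite keys are all assumptions made for this example; they are not scixtract's actual schema or data.

```python
import sqlite3
from typing import Iterable, List, Tuple


def create_index(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per (document, keyword) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS keywords ("
        " cite_key TEXT NOT NULL,"
        " keyword TEXT NOT NULL,"
        " context TEXT NOT NULL)"
    )


def add_keywords(conn: sqlite3.Connection, cite_key: str,
                 rows: Iterable[Tuple[str, str]]) -> None:
    # Store keywords lower-cased so searches are case-insensitive.
    conn.executemany(
        "INSERT INTO keywords (cite_key, keyword, context) VALUES (?, ?, ?)",
        [(cite_key, keyword.lower(), context) for keyword, context in rows],
    )
    conn.commit()


def search(conn: sqlite3.Connection, query: str, limit: int = 20) -> List[Tuple[str, str]]:
    # Substring match on the keyword column, ordered for stable output.
    cur = conn.execute(
        "SELECT cite_key, context FROM keywords"
        " WHERE keyword LIKE ? ORDER BY cite_key LIMIT ?",
        (f"%{query.lower()}%", limit),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
create_index(conn)
add_keywords(conn, "smith2020", [("Catalysis", "Cu-based catalysis of nitrate"),
                                 ("ammonia", "ammonia yield")])
add_keywords(conn, "lee2021", [("electrocatalysis", "electrocatalysis at low potential")])

hits = search(conn, "catalysis")
for cite_key, context in hits:
    print(f"{cite_key}: {context}")
```

A substring LIKE query is easy to follow but scans the whole table; a larger index would more likely rely on exact-match lookups or SQLite's full-text search extension.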

Academic Optimization

  • Chemical formula preservation (NOₓ, NH₃, H₂O, etc.)
  • Citation integrity maintenance
  • Bibliography integration from BibTeX files
  • Research context linking between processed content and bibliography
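
One concrete way formulas get damaged during text cleaning is Unicode compatibility normalization: NFKC folds subscript characters into plain digits and letters, while NFC leaves them intact. The sketch below demonstrates that difference with the standard unicodedata module. It illustrates the problem only and does not describe how scixtract itself preserves formulas.

```python
import unicodedata

# "NOx", "NH3" and "H2O" written with Unicode subscript characters
# (U+2093, U+2083, U+2082), spelled as escapes to keep the source ASCII-only.
formula_text = "NO\u2093 reduction yields NH\u2083 and H\u2082O"

# NFC only applies canonical composition, so the subscripts survive.
safe = unicodedata.normalize("NFC", formula_text)

# NFKC also applies compatibility mappings, which flatten the subscripts.
lossy = unicodedata.normalize("NFKC", formula_text)

print(safe == formula_text)
print(lossy)
```

If a cleaning step must normalize text, choosing NFC over NFKC is one way to keep subscripted formulas distinguishable from plain text.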

Multiple Output Formats

  • Structured JSON with comprehensive metadata
  • Enhanced Markdown with AI-generated summaries
  • Keyword indices for fast searching
  • Knowledge graphs for visualization

Professional Tools

  • Command-line interface for batch processing
  • Python API for integration
  • Comprehensive testing with 95%+ coverage
  • Type hints throughout

Usage

Command Line Interface

PDF Extraction

# Basic extraction
scixtract extract paper.pdf

# Advanced options
scixtract extract paper.pdf \
    --model qwen2.5:32b-instruct-q4_K_M \
    --output-dir ./extractions \
    --bib-file references.bib \
    --update-knowledge

# Batch processing
for pdf in papers/*.pdf; do
    scixtract extract "$pdf" --update-knowledge
done
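
The shell loop has a straightforward Python counterpart for cases where you want to collect failures instead of stopping at the first one. This sketch relies only on the scixtract extract ... --update-knowledge command shown in this README; build_commands and run_batch are hypothetical helper names, and the file paths are placeholders.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def build_commands(paths: Iterable[Path]) -> List[List[str]]:
    """Build one `scixtract extract` invocation per PDF, in sorted order."""
    pdfs = sorted(p for p in paths if p.suffix.lower() == ".pdf")
    return [["scixtract", "extract", str(p), "--update-knowledge"] for p in pdfs]


def run_batch(commands: List[List[str]]) -> List[str]:
    """Run every command and return the PDFs whose extraction failed."""
    failed = []
    for cmd in commands:
        completed = subprocess.run(cmd, check=False)  # keep going on errors
        if completed.returncode != 0:
            failed.append(cmd[2])  # cmd[2] is the PDF path
    return failed


# Placeholder paths; non-PDF files are skipped.
commands = build_commands(
    [Path("papers/b.pdf"), Path("papers/a.pdf"), Path("papers/notes.txt")]
)
for cmd in commands:
    print(" ".join(cmd))

# To actually run the batch (requires scixtract on PATH):
# failed = run_batch(build_commands(Path("papers").glob("*.pdf")))
```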

Knowledge Management

# Search for concepts
scixtract knowledge --search "electrochemical conversion"

# Find related concepts
scixtract knowledge --related "NOx reduction"

# Export knowledge graph
scixtract knowledge --export-graph knowledge_graph.json

# View database statistics
scixtract knowledge --stats

Ollama Setup

# Check Ollama status
scixtract-setup-ollama --check-only

# List available models
scixtract-setup-ollama --list-models

# Complete setup with recommended model
scixtract-setup-ollama --model qwen2.5:32b-instruct-q4_K_M
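
A similar status check can be approximated from Python through Ollama's HTTP API, where GET /api/tags lists the locally installed models. The sketch below is independent of scixtract-setup-ollama, whose internals are not described here; model_names and list_local_models are hypothetical helper names, and the sample payload is a trimmed illustration of the response shape.

```python
import json
import urllib.request
from typing import Any, Dict, List


def model_names(payload: Dict[str, Any]) -> List[str]:
    """Pull model names out of a decoded /api/tags response."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_local_models(base_url: str = "http://localhost:11434",
                      timeout: float = 2.0) -> List[str]:
    """Return installed model names, or an empty list if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts;
        # ValueError covers a response body that is not valid JSON.
        return []


# Trimmed example of the response shape, for illustration only.
sample = {"models": [{"name": "llama3.2:latest"},
                     {"name": "qwen2.5:32b-instruct-q4_K_M"}]}
print(model_names(sample))
```

Note that an empty list from list_local_models does not distinguish "Ollama is not running" from "no models installed"; a stricter check would report the two cases separately.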

Python API

Basic Usage

from scixtract import AdvancedPDFProcessor
from pathlib import Path

# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen2.5:32b-instruct-q4_K_M",
    bib_file=Path("references.bib")
)

# Process PDF
result = processor.process_pdf(Path("paper.pdf"))

# Access results
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")
print(f"Keywords: {', '.join(result.all_keywords[:10])}")
print(f"Processing time: {result.metadata.processing_time:.1f}s")

Knowledge Tracking

from scixtract import KnowledgeTracker

# Initialize tracker
tracker = KnowledgeTracker()

# Add extraction result (`result` is the ExtractionResult from Basic Usage above)
tracker.add_extraction_result(result.to_dict(), "paper.pdf")

# Search knowledge base (use a separate name so `result` is not overwritten)
hits = tracker.search_keywords("catalysis")
for hit in hits:
    print(f"{hit['cite_key']}: {hit['context']}")

# Get statistics
stats = tracker.get_document_stats()
print(f"Documents: {stats['document_count']}")
print(f"Keywords: {stats['unique_keywords']}")

Advanced Processing

from scixtract import OllamaAIProcessor

# Custom AI processor
ai = OllamaAIProcessor("custom-model")

# Extract keywords
keywords = ai.extract_keywords_and_concepts("Your text here")
print(keywords["technical_keywords"])

# Classify content
content_type = ai.classify_content_type("Abstract text", 1, 10)
print(f"Content type: {content_type}")

# Fix text spacing
fixed_text = ai.fix_text_spacing("Textwithnospaces")
print(f"Fixed: {fixed_text}")

API Reference

Core Classes

AdvancedPDFProcessor

Main processor for PDF extraction with AI enhancement.

processor = AdvancedPDFProcessor(
    model: str = "llama3.2",
    bib_file: Optional[Path] = None
)

result = processor.process_pdf(
    pdf_path: Path,
    bib_file: Optional[Path] = None
) -> ExtractionResult

KnowledgeTracker

Knowledge indexing and search system.

tracker = KnowledgeTracker(db_path: Optional[Path] = None)

tracker.add_extraction_result(result_data: Dict, file_path: str)
results = tracker.search_keywords(query: str, limit: int = 20)
stats = tracker.get_document_stats()

OllamaAIProcessor

AI processing engine using Ollama.

ai = OllamaAIProcessor(
    model: str = "llama3.2",
    base_url: str = "http://localhost:11434"
)

keywords = ai.extract_keywords_and_concepts(text: str)
content_type = ai.classify_content_type(text: str, page_num: int, total_pages: int)
fixed_text = ai.fix_text_spacing(text: str)

Data Models

ExtractionResult

Complete extraction result with metadata, pages, and analysis.

DocumentMetadata

Document metadata including title, authors, keywords, and processing info.

PageContent

Individual page content with classification and keywords.

Examples

See the examples/ directory for complete examples.

Recommended Models

Based on extensive testing with academic papers:

Pro Tip: Start with qwen2.5:32b-instruct-q4_K_M for the best balance of accuracy and performance.

Best Overall: qwen2.5:32b-instruct-q4_K_M

  • ✅ Perfect JSON output - No parsing errors
  • ✅ Excellent keyword extraction - High accuracy
  • ✅ Academic content optimized - Understands research papers
  • Size: 19GB

Lightweight: llama3.2

  • ✅ Good performance - Reliable results
  • ✅ Small size - Only 2GB
  • ✅ Fast processing - Quick turnaround
  • ⚠ Occasional JSON issues - May need post-processing
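
As an illustration of the kind of post-processing a smaller model's reply may need, the sketch below pulls the first decodable JSON object out of a response that wraps it in extra prose. first_json_object is a hypothetical helper built on the standard json module, and the sample reply is invented for the example; this is not scixtract's own recovery logic.

```python
import json
from typing import Any, Optional


def first_json_object(reply: str) -> Optional[Any]:
    """Return the first decodable JSON object found in `reply`, else None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(reply, start)
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    return None


# Invented reply with chatter around the JSON payload.
reply = 'Here is the result: {"technical_keywords": ["catalysis", "ammonia"]} Hope that helps.'
parsed = first_json_object(reply)
print(parsed)
```

Scanning from each opening brace keeps the helper tolerant of leading chatter and stray braces, at the cost of ignoring any later JSON objects in the same reply.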

High-End: qwen2:72b

  • ✅ Superior accuracy - Best quality results
  • ✅ Complex reasoning - Handles difficult papers
  • ❌ Large size - 40GB storage required
  • ❌ Slow processing - Higher compute requirements

Requirements

System Requirements

  • Python: 3.10+ (3.11+ recommended)
  • Memory: 8GB RAM minimum (16GB+ for large models)
  • Storage: 20GB+ free space for models
  • OS: macOS, Linux, Windows (WSL2)

Core Dependencies

  • Ollama: For AI processing
  • PyMuPDF: For PDF text extraction
  • SQLite: For knowledge indexing (included with Python)

Optional Dependencies

  • bibtexparser: For bibliography integration
  • pdfplumber: Alternative PDF processing
  • unstructured: Advanced document parsing

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Quick Contribution Guide

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Test your changes: pytest
  4. Commit your changes: git commit -m 'Add amazing feature'
  5. Push to branch: git push origin feature/amazing-feature
  6. Open a Pull Request

Development Setup

git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e ".[dev]"
pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scixtract --cov-report=html

# Run specific test file
pytest tests/test_extractor.py -v

Code Quality

# Format code
black src/ tests/
isort src/ tests/

# Lint code
flake8 src/ tests/
mypy src/

# Security scan
bandit -r src/

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

  • Multi-language support for international papers
  • Web interface for non-technical users
  • Cloud deployment options
  • Advanced visualization tools
  • Integration with reference managers (Zotero, Mendeley)
  • Collaborative features for research teams

Citation

If you use this software in your research, please cite:

@software{stamm2024_scixtract,
  author = {Stamm, Reto},
  title = {scixtract: AI-powered scientific PDF extraction using Ollama},
  year = {2024},
  url = {https://github.com/retostamm/scixtract},
  version = {0.3.0}
}

Acknowledgments

  • Research Context: Developed for NOx-to-ammonia catalysis research at the University of Limerick
  • Supervision: Prof. Matthias Vandichel
  • AI Engine: Built with Ollama for local AI processing
  • PDF Processing: Powered by PyMuPDF
  • Community: Thanks to all contributors and users!

Stats

  • Test Coverage: 95%+
  • Dependencies: Minimal and well-maintained
  • CI/CD: Automated testing and deployment
  • Documentation: Comprehensive guides and examples

Made with ❤️ for the research community
