
scixtract


AI-assisted scientific PDF text extraction using local Ollama models

PDF text extraction is messy and full of artifacts. Scixtract uses local AI assistance to clean up text extracted from scientific PDFs, preserving important formatting such as chemical formulas and citations while removing common extraction artifacts.

Designed specifically for academic and scientific literature, scixtract provides clean, structured text output that maintains the integrity of your research content.

What scixtract does

  • Cleans messy PDF text: Removes spacing artifacts, broken words, and formatting issues
  • Preserves scientific content: Maintains chemical formulas (H₂O, CO₂), equations, and citations
  • Local AI processing: Uses local AI models to fix text while preserving meaning
  • Privacy-focused: All processing happens on your machine - no data sent to external services
  • Batch processing: Handle multiple PDFs
  • Knowledge indexing: Build searchable databases of extracted content

Prerequisites

Before using scixtract, you need to install and set up Ollama:

1. Install Ollama

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Windows: Download from ollama.ai

2. Start Ollama service

ollama serve

3. Install a model

For scientific PDFs:

# Recommended: Balance of speed and accuracy (19GB)
ollama pull qwen2.5:32b-instruct-q4_K_M

# Alternative: Smaller model (2GB)
ollama pull llama3.2
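
Ollama serves a small HTTP API on localhost (port 11434 by default); its /api/tags endpoint lists the models you have pulled. A minimal sketch for checking from Python that the service is up before running scixtract (endpoint name per Ollama's REST API; verify against your Ollama version):

```python
import json
import urllib.request
import urllib.error

def list_ollama_models(base_url: str = "http://localhost:11434"):
    """Return names of locally installed Ollama models, or None when the
    service is not reachable at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except (urllib.error.URLError, OSError):
        return None

if __name__ == "__main__":
    models = list_ollama_models()
    if models is None:
        print("Ollama is not running - start it with `ollama serve`")
    else:
        print("Installed models:", ", ".join(models) or "(none)")
```

If the function returns None, start the service with `ollama serve` before invoking scixtract.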

Installation

Install scixtract from PyPI:

pip install scixtract

Quick Start

Basic PDF extraction

# Extract a single PDF
scixtract extract paper.pdf

# Use specific model
scixtract extract paper.pdf --model qwen2.5:32b-instruct-q4_K_M

# Process multiple PDFs
scixtract extract papers/*.pdf

Python API

from scixtract import AdvancedPDFProcessor
from pathlib import Path

# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen2.5:32b-instruct-q4_K_M"
)

# Process PDF
result = processor.process_pdf(Path("paper.pdf"))

# Access cleaned text
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")

# Get page content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:200]}...")

Knowledge management

Build a searchable database of your extracted content:

# Extract and add to knowledge base (with bibliography for author name recognition)
scixtract extract paper.pdf --bib-file references.bib --update-knowledge

# Search your knowledge base
scixtract knowledge --search "catalysis"

# View statistics
scixtract knowledge --stats

Output formats

Scixtract provides multiple output formats:

  • JSON: Structured data with metadata, page content, and extracted keywords
  • Markdown: Clean, readable text with AI-generated summaries
  • Knowledge database: SQLite database for searching across multiple documents
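
As an illustration of working with the JSON output, the sketch below parses a small sample document. The field names (metadata, pages, all_keywords) mirror the Python API shown above, but the exact JSON schema is an assumption here and should be checked against a real scixtract output file:

```python
import json

# Sample document with an assumed schema; field names mirror the
# Python API (result.metadata, result.pages, result.all_keywords).
sample = """
{
  "metadata": {"title": "Example Paper", "authors": ["A. Author", "B. Author"]},
  "pages": [{"page_number": 1, "content": "Abstract: NOx reduction ..."}],
  "all_keywords": ["catalysis", "NOx", "ammonia"]
}
"""

doc = json.loads(sample)
print("Title:", doc["metadata"]["title"])
print("Authors:", ", ".join(doc["metadata"]["authors"]))
print("Keywords:", ", ".join(doc["all_keywords"]))
```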

Model recommendations

Based on testing with scientific papers:

Recommended: qwen2.5:32b-instruct-q4_K_M

  • Good accuracy for scientific content
  • Reliable JSON output
  • Size: 19GB

Lightweight option: llama3.2

  • Adequate performance for most papers
  • Faster processing
  • Size: 2GB

System requirements

  • Python: 3.10 or higher
  • Memory: 8GB RAM minimum (16GB+ recommended for large models)
  • Storage: 20GB+ free space for AI models
  • Ollama: Required for AI processing

Help and setup

Use the built-in setup helper:

# Check if Ollama is properly configured
scixtract-setup-ollama --check-only

# List available models
scixtract-setup-ollama --list-models

# Complete setup with recommended model
scixtract-setup-ollama --model qwen2.5:32b-instruct-q4_K_M

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

Support

For technical documentation, API reference, and development information, see MAINTAINER_README.md.

Knowledge Tracking

from scixtract import KnowledgeTracker

# Initialize tracker
tracker = KnowledgeTracker()

# Add extraction result
tracker.add_extraction_result(result.to_dict(), "paper.pdf")

# Search knowledge base
hits = tracker.search_keywords("catalysis")
for hit in hits:
    print(f"{hit['cite_key']}: {hit['context']}")

# Get statistics
stats = tracker.get_document_stats()
print(f"Documents: {stats['document_count']}")
print(f"Keywords: {stats['unique_keywords']}")

Advanced Processing

from scixtract import OllamaAIProcessor

# Custom AI processor
ai = OllamaAIProcessor("custom-model")

# Extract keywords
keywords = ai.extract_keywords_and_concepts("Your text here")
print(keywords["technical_keywords"])

# Classify content
content_type = ai.classify_content_type("Abstract text", 1, 10)
print(f"Content type: {content_type}")

# Fix text spacing
fixed_text = ai.fix_text_spacing("Textwithnospaces")
print(f"Fixed: {fixed_text}")

📚 API Reference

Core Classes

AdvancedPDFProcessor

Main processor for PDF extraction with AI enhancement.

processor = AdvancedPDFProcessor(
    model: str = "llama3.2",
    bib_file: Optional[Path] = None
)

result = processor.process_pdf(
    pdf_path: Path,
    bib_file: Optional[Path] = None
) -> ExtractionResult

KnowledgeTracker

Knowledge indexing and search system.

tracker = KnowledgeTracker(db_path: Optional[Path] = None)

tracker.add_extraction_result(result_data: Dict, file_path: str)
results = tracker.search_keywords(query: str, limit: int = 20)
stats = tracker.get_document_stats()
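
The knowledge base is an SQLite file, so a keyword search ultimately boils down to an SQL query. The schema below is purely illustrative (scixtract's real table layout is internal), but it sketches the kind of lookup search_keywords performs:

```python
import sqlite3

# Hypothetical schema for illustration only -- the real table layout
# is internal to scixtract's KnowledgeTracker.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE keywords (cite_key TEXT, keyword TEXT, context TEXT)")
con.executemany(
    "INSERT INTO keywords VALUES (?, ?, ?)",
    [("smith2023", "catalysis", "NOx reduction over Cu zeolites"),
     ("doe2022", "adsorption", "surface binding energies")],
)
rows = con.execute(
    "SELECT cite_key, context FROM keywords WHERE keyword LIKE ? LIMIT 20",
    ("%catalysis%",),
).fetchall()
print(rows)
```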

OllamaAIProcessor

AI processing engine using Ollama.

ai = OllamaAIProcessor(
    model: str = "llama3.2",
    base_url: str = "http://localhost:11434"
)

keywords = ai.extract_keywords_and_concepts(text: str)
content_type = ai.classify_content_type(text: str, page_num: int, total_pages: int)
fixed_text = ai.fix_text_spacing(text: str)

Data Models

ExtractionResult

Complete extraction result with metadata, pages, and analysis.

DocumentMetadata

Document metadata including title, authors, keywords, and processing info.

PageContent

Individual page content with classification and keywords.
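
For orientation, the three models can be pictured as dataclasses along these lines. The field lists here are inferred from the usage examples above; the real classes in scixtract may carry more fields:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative shapes only -- inferred from the documented examples,
# not the actual scixtract source.
@dataclass
class PageContent:
    page_number: int
    content: str
    content_type: str = "body"      # classification, e.g. "abstract"
    keywords: List[str] = field(default_factory=list)

@dataclass
class DocumentMetadata:
    title: str
    authors: List[str]
    keywords: List[str] = field(default_factory=list)
    processing_time: float = 0.0

@dataclass
class ExtractionResult:
    metadata: DocumentMetadata
    pages: List[PageContent]
```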

💡 Examples

See the examples/ directory for complete examples.

🎯 Recommended Models

Based on extensive testing with academic papers:

💡 Pro Tip: Start with qwen2.5:32b-instruct-q4_K_M for the best balance of accuracy and performance.

Best Overall: qwen2.5:32b-instruct-q4_K_M

  • ✅ Perfect JSON output - no parsing errors
  • ✅ Excellent keyword extraction - high accuracy
  • ✅ Academic content optimized - understands research papers
  • 📦 Size: 19GB

Lightweight: llama3.2

  • ✅ Good performance - reliable results
  • ✅ Small size - only 2GB
  • ✅ Fast processing - quick turnaround
  • ⚠ Occasional JSON issues - may need post-processing

High-End: qwen2:72b

  • ✅ Superior accuracy - best quality results
  • ✅ Complex reasoning - handles difficult papers
  • ❌ Large size - 40GB storage required
  • ❌ Slow processing - higher compute requirements

📋 Requirements

System Requirements

  • Python: 3.10+ (3.11+ recommended)
  • Memory: 8GB RAM minimum (16GB+ for large models)
  • Storage: 20GB+ free space for models
  • OS: macOS, Linux, Windows (WSL2)

Core Dependencies

  • Ollama: For AI processing
  • PyMuPDF: For PDF text extraction
  • SQLite: For knowledge indexing (included with Python)

Optional Dependencies

  • bibtexparser: For bibliography integration
  • pdfplumber: Alternative PDF processing
  • unstructured: Advanced document parsing

๐Ÿค Contributing

We welcome contributions! ๐ŸŽ‰ Please see CONTRIBUTING.md for guidelines.

Quick Contribution Guide

  1. ๐Ÿด Fork the repository
  2. ๐ŸŒฟ Create a feature branch: git checkout -b feature/amazing-feature
  3. โœ… Test your changes: pytest
  4. ๐Ÿ“ Commit your changes: git commit -m 'Add amazing feature'
  5. ๐Ÿš€ Push to branch: git push origin feature/amazing-feature
  6. ๐Ÿ“ฌ Open a Pull Request

Development Setup

git clone https://github.com/retostamm/scixtract.git
cd scixtract
pip install -e ".[dev]"
pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scixtract --cov-report=html

# Run specific test file
pytest tests/test_extractor.py -v

Code Quality

# Format code
black src/ tests/
isort src/ tests/

# Lint code
flake8 src/ tests/
mypy src/

# Security scan
bandit -r src/


🚀 Roadmap

  • Multi-language support for international papers
  • Web interface for non-technical users
  • Cloud deployment options
  • Advanced visualization tools
  • Integration with reference managers (Zotero, Mendeley)
  • Collaborative features for research teams

Citation

If you use this software in your research, please cite:

@software{stamm2024_scixtract,
  author = {Stamm, Reto},
  title = {scixtract: AI-powered scientific PDF extraction using Ollama},
  year = {2024},
  url = {https://github.com/retostamm/scixtract},
  version = {1.0.2}
}

๐Ÿ™ Acknowledgments

  • ๐ŸŽ“ Research Context: Developed for NOx to Ammonia catalysis research at University of Limerick
  • ๐Ÿ‘จโ€๐Ÿซ Supervision: Prof. Matthias Vandichel
  • ๐Ÿค– AI Engine: Built with Ollama for local AI processing
  • ๐Ÿ“„ PDF Processing: Powered by PyMuPDF
  • ๐ŸŒŸ Community: Thanks to all contributors and users!

📈 Stats

  • 🧪 Test Coverage: 95%+
  • 📦 Dependencies: Minimal and well-maintained
  • 🔄 CI/CD: Automated testing and deployment
  • 📚 Documentation: Comprehensive guides and examples

Made with ❤️ for the research community

For issues and questions, please visit the GitHub repository.

