
AI-assisted scientific PDF text extraction using local Ollama models


scixtract


AI-powered PDF text extraction for scientific papers. Removes extraction artifacts and preserves formatting such as chemical formulas and citations.

Usage

# First run creates the directory structure
scixtract extract

# Put your PDFs in the pdf/ directory, then run extraction again
scixtract extract

# Clean markdown files appear in the md/ directory

Output: Markdown files with page numbers preserved and extraction artifacts removed.

Directory structure:

your-project/sources/
├── pdf/         # Input PDFs
├── md/          # Output markdown
└── working/     # Intermediate files
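The layout above makes it easy to see which inputs still lack an output. The following is a minimal sketch, not part of scixtract: it assumes each `pdf/<name>.pdf` produces `md/<name>.md`, and the helper name `pending_pdfs` is hypothetical.

```python
from pathlib import Path


def pending_pdfs(sources: Path) -> list[Path]:
    """Return PDFs under sources/pdf/ that have no matching sources/md/<stem>.md."""
    pdf_dir = sources / "pdf"
    md_dir = sources / "md"
    if not pdf_dir.is_dir():
        # The directory structure has not been created yet.
        return []
    return sorted(
        pdf
        for pdf in pdf_dir.glob("*.pdf")
        if not (md_dir / f"{pdf.stem}.md").exists()
    )
```

Calling `pending_pdfs(Path("sources"))` from the project directory lists the PDFs that have not been extracted yet.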

What it does

For each PDF, scixtract:

  1. Extracts text from the PDF using the unstructured library
  2. Processes each page with AI (qwen3:8b via Ollama):
    • Removes spacing artifacts and broken words
    • Fixes line breaks and hyphenation
    • Preserves chemical formulas (H₂O, CO₂)
    • Preserves citations and references
    • Maintains paragraph structure
  3. Extracts metadata: Title, authors, keywords
  4. Generates summary of the document
  5. Outputs:
    • md/filename.md - Clean markdown with page markers
    • working/filename_ai_extraction.json - Structured data
    • working/filename_ai_processed.md - Full processed text

Page numbers are preserved as [Page X] markers in the markdown.
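Because the page markers are plain text, the output can be split back into pages. This is a rough sketch that assumes each marker sits on its own line in exactly the form `[Page N]` with an integer `N`; the function name `split_pages` is hypothetical, and text before the first marker is dropped.

```python
import re

# Assumed marker form: "[Page 12]" alone on a line.
PAGE_MARKER = re.compile(r"^\[Page (\d+)\][ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> dict[int, str]:
    """Map each page number to the text that follows its [Page N] marker."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(markdown))
    for i, match in enumerate(matches):
        # A page runs until the next marker, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages[int(match.group(1))] = markdown[match.end():end].strip()
    return pages
```

For example, `split_pages(Path("md/paper.md").read_text(encoding="utf-8"))` would return one entry per page of a hypothetical `paper.md`.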

Prerequisites

Before using scixtract, you need to install and set up Ollama:

1. Install Ollama

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Windows: Download from ollama.ai

2. Start Ollama service

ollama serve

3. Install a model

For scientific PDFs:

# Default model (4.7GB)
ollama pull qwen3:8b
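To confirm from Python that the service is running and the model is installed, one option is to query Ollama's local HTTP API. The sketch below assumes Ollama listens on its default address `http://localhost:11434` and uses the `GET /api/tags` endpoint, which lists locally installed models. The helper names are hypothetical and this is not how scixtract itself performs the check.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed unchanged)


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if the wanted model tag (e.g. "qwen3:8b") is installed locally."""
    return wanted in model_names(tags_payload)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the installed-model list; raises OSError if Ollama is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With the service running, `has_model(fetch_tags(), "qwen3:8b")` reports whether the default model has been pulled. An `OSError` from `fetch_tags()` usually means `ollama serve` is not running.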

Installation

pip install scixtract

Single file processing

# Extract a single PDF
scixtract extract paper.pdf

# Specify output directory
scixtract extract paper.pdf --output-dir results/

Python API

from scixtract import AdvancedPDFProcessor
from pathlib import Path

# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen3:8b"
)

# Process PDF
result = processor.process_pdf(Path("paper.pdf"))

# Access extracted metadata
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")

# Get page content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:200]}...")
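The same calls can be looped over a directory. The sketch below relies only on the `process_pdf` method and the `result.metadata.title` attribute shown above. The helper `collect_titles` is hypothetical and takes the processor as an argument, so it makes no further assumptions about the scixtract API.

```python
from pathlib import Path
from typing import Any


def collect_titles(processor: Any, pdf_dir: Path) -> dict[str, str]:
    """Run processor.process_pdf on every PDF in pdf_dir; map filename to title."""
    titles: dict[str, str] = {}
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        result = processor.process_pdf(pdf)
        titles[pdf.name] = result.metadata.title
    return titles
```

A call such as `collect_titles(AdvancedPDFProcessor(model="qwen3:8b"), Path("sources/pdf"))` would process each PDF in turn. Expect this to be slow, since every page goes through the model.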

Text cleanup utility

Removes hyphenation artifacts and reflows paragraphs:

scixtract text-fix extracted.txt --output cleaned.txt
cat messy.txt | scixtract text-fix - > clean.txt
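To illustrate the kind of cleanup involved, here is a minimal regex-based sketch of de-hyphenation and paragraph reflow. It is not the `text-fix` implementation, and the function name `fix_text` is hypothetical. It assumes that a hyphen at the end of a line followed by a lowercase letter marks a broken word, so a genuine compound split at a line end (such as "well-" / "known") would be joined incorrectly.

```python
import re


def fix_text(text: str) -> str:
    """Join words hyphenated across line breaks, then reflow each paragraph."""
    # "extrac-\ntion" becomes "extraction": drop a hyphen that is followed
    # by a newline and a lowercase letter.
    text = re.sub(r"(\w)-\n(?=[a-z])", r"\1", text)
    # Blank lines separate paragraphs; newlines inside a paragraph become spaces.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```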

Knowledge management

# Extract and index
scixtract extract paper.pdf --bib-file references.bib --update-knowledge

# Search
scixtract knowledge --search "catalysis"
scixtract knowledge --stats

Output

  • Markdown: Clean text with page numbers preserved
  • JSON: Structured data with metadata and keywords
  • SQLite database: Searchable index across documents
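As a rough illustration of how a SQLite index can support keyword search across documents, the sketch below builds a small in-memory table and queries it. The schema (a `keywords` table with `document` and `keyword` columns) and both function names are hypothetical. The database scixtract actually writes may be organized differently.

```python
import sqlite3


def build_index(docs: dict[str, list[str]]) -> sqlite3.Connection:
    """Create an in-memory index from {document name: [keywords]} (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keywords (document TEXT, keyword TEXT)")
    conn.executemany(
        "INSERT INTO keywords VALUES (?, ?)",
        [(doc, kw.lower()) for doc, kws in docs.items() for kw in kws],
    )
    return conn


def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return documents having a keyword that contains term (case-insensitive)."""
    rows = conn.execute(
        "SELECT DISTINCT document FROM keywords "
        "WHERE keyword LIKE ? ORDER BY document",
        (f"%{term.lower()}%",),
    )
    return [row[0] for row in rows]
```

A substring match with `LIKE` keeps the example short. For a larger collection, a full-text index would probably be a better fit.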

System requirements

  • Python: 3.10 or higher
  • Memory: 8GB RAM minimum (16GB+ recommended for large models)
  • Storage: 20GB+ free space for AI models
  • Ollama: Required for AI processing
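The Python and storage requirements can be checked with the standard library before installing models. This is a small sketch with hypothetical function names. It reads "20GB+" as 20 decimal gigabytes and does not check RAM, since the standard library has no portable call for that.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)
MIN_FREE_BYTES = 20 * 10**9  # "20GB+" read as 20 decimal gigabytes (assumption)


def requirement_problems(python_version: tuple[int, int], free_bytes: int) -> list[str]:
    """Return a human-readable message for each unmet requirement."""
    problems = []
    if python_version < MIN_PYTHON:
        problems.append(f"Python {python_version[0]}.{python_version[1]} is older than 3.10")
    if free_bytes < MIN_FREE_BYTES:
        problems.append(f"only {free_bytes / 10**9:.1f} GB free, 20 GB or more recommended")
    return problems


def current_problems(path: str = ".") -> list[str]:
    """Check the running interpreter and the disk that holds path."""
    return requirement_problems(sys.version_info[:2], shutil.disk_usage(path).free)
```

An empty list from `current_problems()` means both checks passed.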

Help and setup

Use the built-in setup helper:

# Check if Ollama is properly configured
scixtract-setup-ollama --check-only

# List available models
scixtract-setup-ollama --list-models

# Complete setup with default model
scixtract-setup-ollama --model qwen3:8b

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE.txt file for details.

Support

For technical documentation, API reference, and development information, see MAINTAINER_README.md.

For issues and questions, please visit the GitHub repository.


Built with Windsurf.
