AI-assisted scientific PDF text extraction using local Ollama models

scixtract

AI-powered PDF text extraction for scientific papers. Removes extraction artifacts and preserves formatting such as chemical formulas and citations.

Usage

# First run creates directory structure
scixtract extract

# Put PDFs in pdf/ directory
# Run extraction
scixtract extract

# Clean markdown files appear in md/ directory

Output: Markdown files with page numbers preserved and extraction artifacts removed.

Directory structure:

your-project/sources/
├── pdf/         # Input PDFs
├── md/          # Output markdown
└── working/     # Intermediate files
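The first `scixtract extract` run creates this layout for you. If you want to prepare it ahead of time, for example in a script that stages PDFs, the following is a minimal sketch. The function name `ensure_layout` is hypothetical and not part of scixtract; only the three directory names come from the layout above.

```python
from pathlib import Path


def ensure_layout(root: Path) -> dict[str, Path]:
    """Create pdf/, md/ and working/ under root if missing; return their paths.

    Hypothetical helper: mirrors the documented layout, not scixtract's own code.
    """
    dirs = {name: root / name for name in ("pdf", "md", "working")}
    for path in dirs.values():
        # exist_ok makes the call safe to repeat on an existing project
        path.mkdir(parents=True, exist_ok=True)
    return dirs


if __name__ == "__main__":
    for name, path in ensure_layout(Path("your-project/sources")).items():
        print(f"{name}: {path}")
```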

What it does

For each PDF, scixtract:

  1. Extracts text from the PDF using the unstructured library
  2. Processes each page with AI (qwen3:8b via Ollama):
    • Removes spacing artifacts and broken words
    • Fixes line breaks and hyphenation
    • Preserves chemical formulas (H₂O, CO₂)
    • Preserves citations and references
    • Maintains paragraph structure
  3. Extracts metadata: title, authors, keywords
  4. Generates a summary of the document
  5. Outputs:
    • md/filename.md - Clean markdown with page markers
    • working/filename_ai_extraction.json - Structured data
    • working/filename_ai_processed.md - Full processed text

Page numbers are preserved as [Page X] markers in the markdown.
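Because the page markers are plain text, the markdown output can be split back into pages downstream. The sketch below assumes each marker sits on its own line in exactly the form `[Page X]` with an integer page number; check a generated file before relying on it. `split_pages` is a hypothetical helper, not a scixtract function.

```python
import re

# Assumption: a marker occupies its own line, e.g. "[Page 3]"
PAGE_MARKER = re.compile(r"^\[Page (\d+)\][ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> dict[int, str]:
    """Map each page number to the text that follows its [Page N] marker."""
    pages: dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(markdown))
    for i, match in enumerate(matches):
        # A page runs until the next marker, or to the end of the document
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages[int(match.group(1))] = markdown[match.end():end].strip()
    return pages
```

Text that appears before the first marker (a title block, for instance) is not assigned to any page in this sketch.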

Prerequisites

Before using scixtract, you need to install and set up Ollama:

1. Install Ollama

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Windows: Download from ollama.ai

2. Start Ollama service

ollama serve

3. Install a model

For scientific PDFs:

# Default model (4.7GB)
ollama pull qwen3:8b

Installation

pip install scixtract

Single file processing

# Extract a single PDF
scixtract extract paper.pdf

# Specify output directory
scixtract extract paper.pdf --output-dir results/

Python API

from scixtract import AdvancedPDFProcessor
from pathlib import Path

# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen3:8b"
)

# Process PDF
result = processor.process_pdf(Path("paper.pdf"))

# Access extracted metadata
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")

# Get page content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:200]}...")

Text cleanup utility

Removes hyphenation artifacts and reflows paragraphs:

scixtract text-fix extracted.txt --output cleaned.txt
cat messy.txt | scixtract text-fix - > clean.txt
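To illustrate the kind of cleanup involved, here is a simplified sketch of de-hyphenation and paragraph reflow. It is not scixtract's implementation; `fix_text` is a hypothetical function that assumes paragraphs are separated by blank lines.

```python
import re


def fix_text(text: str) -> str:
    """Join words hyphenated across line breaks and reflow each paragraph."""
    # "cata-\nlysis" -> "catalysis"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Paragraphs are separated by one or more blank lines
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse all whitespace inside a paragraph to single spaces
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

A rule this simple also joins genuine compounds that happen to break at the hyphen ("well-" / "known" becomes "wellknown"), so a real tool needs a dictionary check or similar heuristic.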

Knowledge management

# Extract and index
scixtract extract paper.pdf --bib-file references.bib --update-knowledge

# Search
scixtract knowledge --search "catalysis"
scixtract knowledge --stats

Output

  • Markdown: Clean text with page numbers preserved
  • JSON: Structured data with metadata and keywords
  • SQLite database: Searchable index across documents
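As a rough picture of what a searchable cross-document index looks like, the sketch below stores per-document metadata in SQLite and searches it with `LIKE`. The table name, columns and functions are hypothetical; scixtract's actual database schema is not documented here and may differ.

```python
import sqlite3


def build_index(conn: sqlite3.Connection, docs: list[tuple[str, str, str]]) -> None:
    """Store (name, title, keywords) rows; the schema is illustrative only."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(name TEXT PRIMARY KEY, title TEXT, keywords TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?, ?)", docs)
    conn.commit()


def search(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return names of documents whose title or keywords contain the term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT name FROM documents "
        "WHERE title LIKE ? OR keywords LIKE ? ORDER BY name",
        (pattern, pattern),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, [
        ("a.pdf", "Zeolite catalysis", "zeolite, catalysis"),
        ("b.pdf", "Battery anodes", "lithium, graphite"),
    ])
    print(search(conn, "catalysis"))
```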

System requirements

  • Python: 3.10 or higher
  • Memory: 8GB RAM minimum (16GB+ recommended for large models)
  • Storage: 20GB+ free space for AI models
  • Ollama: Required for AI processing

Help and setup

Use the built-in setup helper:

# Check if Ollama is properly configured
scixtract-setup-ollama --check-only

# List available models
scixtract-setup-ollama --list-models

# Complete setup with default model
scixtract-setup-ollama --model qwen3:8b

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE.txt file for details.

Support

For technical documentation, API reference, and development information, see MAINTAINER_README.md.

For issues and questions, please visit the GitHub repository.


Built with Windsurf.
