AI-assisted scientific PDF text extraction using local Ollama models

These details have not been verified by PyPI

Project description

scixtract

AI-powered PDF text extraction for scientific papers. It removes extraction artifacts and preserves formatting such as chemical formulas and citations.

Usage

# First run creates directory structure
scixtract extract

# Put PDFs in pdf/ directory
# Run extraction
scixtract extract

# Clean markdown files appear in md/ directory

Output: Markdown files with page numbers preserved and extraction artifacts removed.

Directory structure:

your-project/sources/
├── pdf/         # Input PDFs
├── md/          # Output markdown
└── working/     # Intermediate files
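The first `scixtract extract` run creates this layout. When scripting around it (for example, staging PDFs before the first run), the same tree can be prepared by hand. This is a minimal sketch: the helper name `ensure_layout` is hypothetical and not part of scixtract, and the directory names are taken from the tree above.

```python
from pathlib import Path

# Subdirectory names taken from the directory tree above.
SUBDIRS = ("pdf", "md", "working")


def ensure_layout(root: Path) -> dict[str, Path]:
    """Create pdf/, md/ and working/ under root if missing; return their paths."""
    paths = {name: root / name for name in SUBDIRS}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling `ensure_layout(Path("your-project/sources"))` is safe to repeat, since existing directories are left untouched.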

What it does

For each PDF, scixtract:

1. Extracts text from the PDF using the unstructured library
2. Processes each page with AI (qwen3:8b via Ollama):
   - Removes spacing artifacts and broken words
   - Fixes line breaks and hyphenation
   - Preserves chemical formulas (H₂O, CO₂)
   - Preserves citations and references
   - Maintains paragraph structure
3. Extracts metadata: title, authors, keywords
4. Generates a summary of the document
5. Writes the output files:
   - md/filename.md - Clean markdown with page markers
   - working/filename_ai_extraction.json - Structured data
   - working/filename_ai_processed.md - Full processed text

Page numbers are preserved as [Page X] markers in the markdown.
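Because the page markers survive in the output, downstream scripts can recover per-page text from a markdown file. The sketch below assumes the markers appear literally as `[Page 3]`; the exact layout scixtract writes is not documented here, so the pattern may need adjusting. `split_pages` is a hypothetical helper, not part of the package.

```python
import re

# Assumption: markers look exactly like "[Page 3]".
PAGE_MARKER = re.compile(r"\[Page (\d+)\]")


def split_pages(markdown: str) -> dict[int, str]:
    """Map each page number to the text that follows its [Page X] marker."""
    # With one capturing group, split() yields:
    # [preamble, "1", text1, "2", text2, ...]
    parts = PAGE_MARKER.split(markdown)
    pages = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = text.strip()
    return pages
```

Text before the first marker (a title block, for instance) is ignored by this sketch.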

Prerequisites

Before using scixtract, you need to install and set up Ollama:

1. Install Ollama

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.ai/install.sh | sh

Windows: Download from ollama.ai

2. Start Ollama service

ollama serve

3. Install a model

For scientific PDFs:

# Default model (4.7GB)
ollama pull qwen3:8b
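To confirm from Python that the service is running and the model is installed, one option is to query Ollama's local HTTP API, which lists installed models at `/api/tags`. This is a sketch under the assumption that Ollama listens on its default address `http://localhost:11434`; the function names are illustrative and not part of scixtract.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed unchanged)


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Query the local Ollama server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def installed_models(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [entry["name"] for entry in payload.get("models", [])]


def has_model(payload: dict, model: str = "qwen3:8b") -> bool:
    """True if the given model name appears in the response."""
    return model in installed_models(payload)
```

A typical check is `has_model(fetch_tags())`; a `URLError` at that point usually means `ollama serve` is not running.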

Installation

pip install scixtract

Single file processing

# Extract a single PDF
scixtract extract paper.pdf

# Specify output directory
scixtract extract paper.pdf --output-dir results/
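To apply the single-file command to a folder of PDFs from a script, the CLI can be driven with `subprocess`. The sketch uses only the `extract` subcommand and `--output-dir` flag shown above. It assumes that scixtract exits with a non-zero status on failure, which this page does not confirm; the helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(pdf: Path, output_dir: Path) -> list[str]:
    """Build the single-file invocation shown above for one PDF."""
    return ["scixtract", "extract", str(pdf), "--output-dir", str(output_dir)]


def extract_all(pdf_dir: Path, output_dir: Path) -> list[Path]:
    """Run scixtract on every PDF in pdf_dir; return the files that failed."""
    failed = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Assumption: a non-zero exit status signals a failed extraction.
        result = subprocess.run(build_command(pdf, output_dir))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.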

Python API

from scixtract import AdvancedPDFProcessor
from pathlib import Path

# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen3:8b"
)

# Process PDF
result = processor.process_pdf(Path("paper.pdf"))

# Access extracted metadata
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")

# Get page content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:200]}...")
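The page objects can also be joined back into a single string for further processing. The sketch below relies only on the `page_number` and `content` attributes used in the example above. The marker layout it writes is an assumption and may differ from what scixtract itself produces, and the stand-in objects exist only so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def pages_to_text(pages) -> str:
    """Join pages into one string, prefixing each with a [Page X] marker."""
    chunks = [f"[Page {p.page_number}]\n\n{p.content.strip()}" for p in pages]
    return "\n\n".join(chunks)


# Stand-in objects with the two attributes used above;
# real pages would come from result.pages.
demo = [
    SimpleNamespace(page_number=1, content="Intro text "),
    SimpleNamespace(page_number=2, content="Methods text"),
]
print(pages_to_text(demo))
```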

Text cleanup utility

Removes hyphenation artifacts and reflows paragraphs:

scixtract text-fix extracted.txt --output cleaned.txt
cat messy.txt | scixtract text-fix - > clean.txt
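To illustrate what this kind of cleanup involves, here is a minimal regex-based sketch of dehyphenation and paragraph reflow. It is not the implementation behind `text-fix`, whose rules are not described here, and `fix_text` is a hypothetical name.

```python
import re


def fix_text(raw: str) -> str:
    """Join words hyphenated across line breaks, then reflow each paragraph."""
    # "implemen-\ntation" -> "implementation"
    # (only when a lowercase letter follows the line break)
    joined = re.sub(r"(\w)-\n(?=[a-z])", r"\1", raw)
    # Paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", joined)
    # Collapse remaining line breaks and runs of spaces inside each paragraph.
    reflowed = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(reflowed)
```

A rule this simple also merges genuine compounds that happen to break at the hyphen (such as "well-" followed by "known"), which is one reason a real tool needs more careful heuristics.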

Knowledge management

# Extract and index
scixtract extract paper.pdf --bib-file references.bib --update-knowledge

# Search
scixtract knowledge --search "catalysis"
scixtract knowledge --stats

Output

Markdown: Clean text with page numbers preserved
JSON: Structured data with metadata and keywords
SQLite database: Searchable index across documents
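The JSON schema is not documented on this page, so a cautious first step is to inspect a file's top-level structure before relying on specific keys. This sketch assumes nothing about field names; `describe_json` is a hypothetical helper.

```python
import json
from pathlib import Path


def describe_json(path: Path) -> dict[str, str]:
    """Return top-level key -> value type name, without assuming a schema."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        # Not an object at the top level: report the root type only.
        return {"<root>": type(data).__name__}
    return {key: type(value).__name__ for key, value in data.items()}
```

It can be pointed at any file matching the `working/filename_ai_extraction.json` pattern listed earlier.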

System requirements

Python: 3.10 or higher
Memory: 8GB RAM minimum (16GB+ recommended for large models)
Storage: 20GB+ free space for AI models
Ollama: Required for AI processing
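The Python version and disk space requirements can be checked with the standard library before installing. This sketch reads "20GB" as decimal gigabytes and checks the current directory by default, which may not be the volume where Ollama stores its models. RAM is not checked because the standard library has no portable call for it. `check_requirements` is a hypothetical helper.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)
MIN_FREE_BYTES = 20 * 10**9  # "20GB+ free space", read as decimal gigabytes


def check_requirements(python_version=sys.version_info, free_bytes=None, path="."):
    """Return a list of problems; an empty list means both checks passed."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or higher is required")
    if free_bytes < MIN_FREE_BYTES:
        problems.append("less than 20 GB of free disk space")
    return problems
```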

Help and setup

Use the built-in setup helper:

# Check if Ollama is properly configured
scixtract-setup-ollama --check-only

# List available models
scixtract-setup-ollama --list-models

# Complete setup with default model
scixtract-setup-ollama --model qwen3:8b

License

This project is licensed under the GNU General Public License v3.0 - see the LICENSE.txt file for details.

Support

For technical documentation, API reference, and development information, see MAINTAINER_README.md.

For issues and questions, please visit the GitHub repository.

Built with Windsurf.

Project details

Release history

Version	Released
1.1.1	Jan 18, 2026
1.1.0	Jan 18, 2026 (this version)
1.0.5	Nov 2, 2025
1.0.3	Nov 1, 2025
1.0.2	Nov 1, 2025
1.0.1	Nov 1, 2025
1.0.0	Nov 1, 2025
0.3.0	Nov 1, 2025

Download files

Source Distribution

scixtract-1.1.0.tar.gz (28.6 kB)

Uploaded Jan 18, 2026 Source

Built Distribution

scixtract-1.1.0-py3-none-any.whl (29.9 kB)

Uploaded Jan 18, 2026 Python 3

File details

Details for the file scixtract-1.1.0.tar.gz.

File metadata

Download URL: scixtract-1.1.0.tar.gz
Upload date: Jan 18, 2026
Size: 28.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scixtract-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`ac01a17e9dda9c8bfd3ba3e97eee85d4d39c512f9320624d5433c891ed3f82de`
MD5	`ca81713ba2969a1fa6a3f3dfaabbe751`
BLAKE2b-256	`acf4b9211adf2be677613d1772a5e3c336c88ea2f5328a883caf970c377965cc`
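A downloaded archive can be checked against the SHA256 digest above with Python's `hashlib`. This is a minimal sketch: the function names are illustrative, and the default expected value is the digest listed for scixtract-1.1.0.tar.gz.

```python
import hashlib
from pathlib import Path

# SHA256 listed above for scixtract-1.1.0.tar.gz
EXPECTED = "ac01a17e9dda9c8bfd3ba3e97eee85d4d39c512f9320624d5433c891ed3f82de"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 of a file, read in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str = EXPECTED) -> bool:
    """True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify(Path("scixtract-1.1.0.tar.gz"))` should return `True` for an intact download of that file.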

Provenance

The following attestation bundles were made for scixtract-1.1.0.tar.gz:

Publisher: pypi_publish.yml on retospect/scixtract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.1.0.tar.gz
- Subject digest: ac01a17e9dda9c8bfd3ba3e97eee85d4d39c512f9320624d5433c891ed3f82de
- Sigstore transparency entry: 833739876
- Sigstore integration time: Jan 18, 2026
Source repository:
- Permalink: retospect/scixtract@0a82741f82fd0b66ac63dfe82fae70637d51dc85
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@0a82741f82fd0b66ac63dfe82fae70637d51dc85
- Trigger Event: workflow_dispatch

File details

Details for the file scixtract-1.1.0-py3-none-any.whl.

File metadata

Download URL: scixtract-1.1.0-py3-none-any.whl
Upload date: Jan 18, 2026
Size: 29.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scixtract-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4de72071097498640c964002086eb3fd83530beb734f6e72521589709ef92b17`
MD5	`f2513fd9032cc54a5e5348e505e19843`
BLAKE2b-256	`e93ef828725745a3cbc2eda80ec6f19f8f6e5e8f5109d135c791a546086f1937`

Provenance

The following attestation bundles were made for scixtract-1.1.0-py3-none-any.whl:

Publisher: pypi_publish.yml on retospect/scixtract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: scixtract-1.1.0-py3-none-any.whl
- Subject digest: 4de72071097498640c964002086eb3fd83530beb734f6e72521589709ef92b17
- Sigstore transparency entry: 833739877
- Sigstore integration time: Jan 18, 2026
Source repository:
- Permalink: retospect/scixtract@0a82741f82fd0b66ac63dfe82fae70637d51dc85
- Branch / Tag: refs/heads/main
- Owner: https://github.com/retospect
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: pypi_publish.yml@0a82741f82fd0b66ac63dfe82fae70637d51dc85
- Trigger Event: workflow_dispatch

scixtract 1.1.0
