AI-assisted scientific PDF text extraction using local Ollama models
scixtract
AI-powered PDF text extraction for scientific papers. Removes artifacts, preserves formatting like chemical formulas and citations.
Usage
# First run creates directory structure
scixtract extract
# Put PDFs in pdf/ directory
# Run extraction
scixtract extract
# Clean markdown files appear in md/ directory
Output: Markdown files with page numbers preserved and extraction artifacts removed.
Directory structure:
your-project/sources/
├── pdf/ # Input PDFs
├── md/ # Output markdown
└── working/ # Intermediate files
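The layout above can be reproduced or inspected from Python. This is a minimal sketch, not scixtract's own code: the helper names `ensure_layout` and `pending_pdfs` are hypothetical, and whether scixtract itself skips PDFs that already have markdown output is not stated here.

```python
from pathlib import Path

# Subdirectory names taken from the directory structure shown above.
SUBDIRS = ("pdf", "md", "working")


def ensure_layout(sources_root: Path) -> dict[str, Path]:
    """Create pdf/, md/ and working/ under sources_root if they are missing."""
    paths = {}
    for name in SUBDIRS:
        path = sources_root / name
        path.mkdir(parents=True, exist_ok=True)
        paths[name] = path
    return paths


def pending_pdfs(sources_root: Path) -> list[Path]:
    """Return PDFs in pdf/ that have no same-named markdown file in md/ yet."""
    pdf_dir = sources_root / "pdf"
    md_dir = sources_root / "md"
    return sorted(
        pdf for pdf in pdf_dir.glob("*.pdf")
        if not (md_dir / f"{pdf.stem}.md").exists()
    )
```

Calling `ensure_layout(Path("your-project/sources"))` and then `pending_pdfs(...)` gives a quick view of which inputs still lack output.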
What it does
For each PDF, scixtract:
- Extracts text from PDF using the unstructured library
- Processes each page with AI (qwen3:8b via Ollama):
  - Removes spacing artifacts and broken words
  - Fixes line breaks and hyphenation
  - Preserves chemical formulas (H₂O, CO₂)
  - Preserves citations and references
  - Maintains paragraph structure
- Extracts metadata: Title, authors, keywords
- Generates summary of the document
- Outputs:
  - md/filename.md - Clean markdown with page markers
  - working/filename_ai_extraction.json - Structured data
  - working/filename_ai_processed.md - Full processed text
Page numbers are preserved as [Page X] markers in the markdown.
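Because pages are delimited by [Page X] markers, the output markdown can be split back into pages downstream. The sketch below assumes each marker appears literally as `[Page 3]` and precedes the text of its page; `split_pages` is a hypothetical helper, not part of scixtract.

```python
import re

# Matches markers such as "[Page 3]", the format described above.
PAGE_MARKER = re.compile(r"\[Page (\d+)\]")


def split_pages(markdown: str) -> dict[int, str]:
    """Map each page number to the text that follows its marker."""
    parts = PAGE_MARKER.split(markdown)
    # With one capture group, re.split returns
    # [text_before_first_marker, number, text, number, text, ...].
    pages = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = text.strip()
    return pages
```

Text that appears before the first marker is ignored, and a document without markers yields an empty mapping.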
Prerequisites
Before using scixtract, you need to install and set up Ollama:
1. Install Ollama
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Windows: Download from ollama.ai
2. Start Ollama service
ollama serve
3. Install a model
For scientific PDFs:
# Default model (4.7GB)
ollama pull qwen3:8b
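To confirm from Python that the service is running and the model has been pulled, one option is to query Ollama's local HTTP API, which lists installed models at `/api/tags`. This is a sketch under assumptions: it uses Ollama's default address `http://localhost:11434`, and the helper names are hypothetical rather than part of scixtract.

```python
import json
import urllib.request

# Ollama's default local address; adjust if the service runs elsewhere.
OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask a running Ollama service which models are installed locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


def has_model(names: list[str], wanted: str = "qwen3:8b") -> bool:
    """Return True if the wanted model tag is among the installed models."""
    return wanted in names
```

With `ollama serve` running, `has_model(installed_models())` reports whether qwen3:8b is available. If the service is not reachable, `installed_models` raises `urllib.error.URLError`.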
Installation
pip install scixtract
Single file processing
# Extract a single PDF
scixtract extract paper.pdf
# Specify output directory
scixtract extract paper.pdf --output-dir results/
Python API
from scixtract import AdvancedPDFProcessor
from pathlib import Path
# Initialize processor
processor = AdvancedPDFProcessor(
    model="qwen3:8b"
)
# Process PDF
result = processor.process_pdf(Path("paper.pdf"))
# Access metadata
print(f"Title: {result.metadata.title}")
print(f"Authors: {', '.join(result.metadata.authors)}")
# Get page content
for page in result.pages:
    print(f"Page {page.page_number}: {page.content[:200]}...")
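If you want to build your own markdown from the API result, the pages can be joined with [Page X] markers. The sketch below assumes only what the example above shows, namely that each page exposes `page_number` and `content`. `FakePage` and `pages_to_markdown` are hypothetical names used for illustration, and the exact spacing scixtract writes around markers may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Anything with the two attributes used in the API example."""
    page_number: int
    content: str


@dataclass
class FakePage:
    """Stand-in for a processed page, used only for illustration."""
    page_number: int
    content: str


def pages_to_markdown(pages: Iterable[PageLike]) -> str:
    """Join page contents, prefixing each with a [Page X] marker."""
    blocks = [f"[Page {p.page_number}]\n\n{p.content.strip()}" for p in pages]
    return "\n\n".join(blocks) + "\n"
```

In real use, `pages_to_markdown(result.pages)` would take the place of the `FakePage` objects.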
Text cleanup utility
Removes hyphenation artifacts and reflows paragraphs:
scixtract text-fix extracted.txt --output cleaned.txt
cat messy.txt | scixtract text-fix - > clean.txt
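For intuition, removing hyphenation artifacts and reflowing paragraphs can be approximated with two regular-expression passes. This is an illustrative sketch only, not the implementation behind `text-fix`; the function name `fix_text` and the lowercase-letter heuristic are assumptions.

```python
import re


def fix_text(raw: str) -> str:
    """Join words hyphenated across line breaks, then reflow paragraphs."""
    # "extrac-\ntion" -> "extraction": a hyphen at the end of a line that is
    # followed by a lowercase letter is treated as a broken word.
    joined = re.sub(r"(\w)-\n(?=[a-z])", r"\1", raw)
    # Blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", joined)
    # Inside a paragraph, collapse remaining line breaks and runs of spaces.
    reflowed = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(reflowed)
```

The heuristic also merges genuine compounds that happen to break at the hyphen (for example "AI-" followed by "assisted" on the next line), which is one reason a real tool needs more careful rules.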
Knowledge management
# Extract and index
scixtract extract paper.pdf --bib-file references.bib --update-knowledge
# Search
scixtract knowledge --search "catalysis"
scixtract knowledge --stats
Output
- Markdown: Clean text with page numbers preserved
- JSON: Structured data with metadata and keywords
- SQLite database: Searchable index across documents
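To illustrate what a searchable index across documents can look like, here is a small sketch using Python's built-in `sqlite3` module. The table name, columns, and helper functions are hypothetical; scixtract's actual database schema is not described here and may differ.

```python
import sqlite3


def build_index(conn: sqlite3.Connection) -> None:
    """Create a hypothetical table with one row per document page."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "doc TEXT NOT NULL, page INTEGER NOT NULL, content TEXT NOT NULL, "
        "PRIMARY KEY (doc, page))"
    )


def add_page(conn: sqlite3.Connection, doc: str, page: int, content: str) -> None:
    """Insert or overwrite the text of one page."""
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (doc, page, content))


def search(conn: sqlite3.Connection, term: str) -> list[tuple[str, int]]:
    """Return (doc, page) pairs whose content contains the term."""
    rows = conn.execute(
        "SELECT doc, page FROM pages WHERE content LIKE ? ORDER BY doc, page",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

SQLite's `LIKE` is case-insensitive for ASCII by default, so "catalysis" also matches "Catalysis". Note that `%` and `_` inside the search term act as wildcards in this sketch.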
System requirements
- Python: 3.10 or higher
- Memory: 8GB RAM minimum (16GB+ recommended for large models)
- Storage: 20GB+ free space for AI models
- Ollama: Required for AI processing
Help and setup
Use the built-in setup helper:
# Check if Ollama is properly configured
scixtract-setup-ollama --check-only
# List available models
scixtract-setup-ollama --list-models
# Complete setup with default model
scixtract-setup-ollama --model qwen3:8b
License
This project is licensed under the GNU General Public License v3.0 - see the LICENSE.txt file for details.
Support
For technical documentation, API reference, and development information, see MAINTAINER_README.md.
For issues and questions, please visit the GitHub repository.
Built with Windsurf.