SciPreprocess
A modular, open-source pipeline for preprocessing scientific documents in multiple formats (PDF, DOCX, LaTeX, JATS XML, TXT) for LLM consumption and NLP analysis.
Features
- 📄 Multi-format support: PDF, DOCX, TEX, JATS XML, and plain text
- 🔍 OCR support: Extract text from scanned documents with Tesseract
- 🧹 Text cleaning: Remove citations, normalize unicode, clean special characters
- 🔤 NLP processing: Tokenization, lemmatization, stopword removal using spaCy or NLTK
- 📑 Section detection: Automatically identify paper sections (Abstract, Introduction, etc.)
- 🔗 Acronym handling: Detect and expand acronyms using scispacy
- 📊 Feature extraction: TF-IDF and semantic embeddings with sentence-transformers
- 🔎 Semantic search: FAISS indexing for efficient similarity search
- 🧩 Modular design: Use only the components you need
Installation
From PyPI (Recommended)
pip install scipreprocess
With Optional Dependencies
Install specific feature sets:
# PDF support
pip install "scipreprocess[pdf]"
# NLP features
pip install "scipreprocess[nlp]"
# Machine learning features
pip install "scipreprocess[ml]"
# OCR support
pip install "scipreprocess[ocr]"
# Everything
pip install "scipreprocess[all]"
Development Installation
For development or from source:
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess
pip install -e ".[all,dev]"
Post-Installation Setup
For NLP features, download required models:
# Download spaCy model
python -m spacy download en_core_web_sm
# Install scispacy model (optional but recommended)
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz
Quick Start
Basic Usage
from scipreprocess import preprocess_file
# Process a single document
doc_json, clean_text = preprocess_file("path/to/paper.pdf")
# Access the results
print(doc_json['metadata']['title'])
print(doc_json['abstract'])
print(doc_json['sections'])
print(doc_json['acronyms'])
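The returned `doc_json` is a plain dict, so it can be serialized directly for storage or for downstream LLM prompts. A minimal sketch (the stand-in dict below mirrors the structure shown under "Output Format"):

```python
import json

# Stand-in for the doc_json dict returned by preprocess_file;
# the real structure is documented under "Output Format".
doc_json = {
    "metadata": {"title": "Paper Title", "source_file": "paper.pdf", "pages": 12},
    "abstract": "Abstract text...",
    "sections": [{"heading": "Introduction", "text": "..."}],
    "acronyms": {"NLP": "Natural Language Processing"},
}

# Serialize the structured output for storage or later reuse
payload = json.dumps(doc_json, ensure_ascii=False, indent=2)
print(payload.splitlines()[0])  # → {
```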
Process Multiple Documents
from scipreprocess import preprocess_documents
# Process multiple documents
files = ["paper1.pdf", "paper2.docx", "paper3.tex"]
results = preprocess_documents(files)
# Access results
documents = results['documents']
tfidf_matrix = results['tfidf']['X']
vectorizer = results['tfidf']['vectorizer']
chunks = results['chunks']
embeddings = results['embeddings'] # if enabled
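The embeddings array pairs naturally with the FAISS indexing mentioned in the features list. As an illustration only, the same nearest-chunk lookup can be sketched with plain NumPy cosine similarity; the tiny 4-dimensional vectors below are stand-ins for real model output (all-MiniLM-L6-v2 produces 384-dimensional vectors):

```python
import numpy as np

# Stand-in for results['embeddings']: one vector per chunk
chunk_embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
])
# L2-normalize so that dot product equals cosine similarity
chunk_embeddings /= np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)

query = np.array([1.0, 0.05, 0.0, 0.0])
query /= np.linalg.norm(query)

# Rank chunks by similarity to the query and take the top 2
scores = chunk_embeddings @ query
top_k = np.argsort(-scores)[:2]
print(top_k.tolist())  # → [0, 2]
```

FAISS does the same search at scale; for small corpora, a brute-force dot product like this is often sufficient.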
Custom Configuration
from scipreprocess import PipelineConfig
from scipreprocess.pipeline import PreprocessingPipeline
# Configure the pipeline
config = PipelineConfig(
    use_ocr=True,
    use_spacy=True,
    use_semantic_embeddings=True,
    spacy_model='en_core_sci_sm',
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    chunk_target_sentences=(3, 8)
)
# Create pipeline with custom config
pipeline = PreprocessingPipeline(config)
doc_json, text = pipeline.preprocess_file("paper.pdf")
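The `chunk_target_sentences=(3, 8)` setting suggests chunks of roughly 3 to 8 consecutive sentences. A hypothetical sketch of such a grouping rule (the library's actual chunker lives in `sectioning.py` and may differ in detail):

```python
# Illustrative only: group consecutive sentences into chunks of between
# lo and hi sentences, folding a too-small tail into the previous chunk.
def chunk_sentences(sentences, lo=3, hi=8):
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if len(current) == hi:
            chunks.append(current)
            current = []
    if current:
        if chunks and len(current) < lo:
            chunks[-1].extend(current)  # avoid emitting a tiny trailing chunk
        else:
            chunks.append(current)
    return chunks

sentences = [f"Sentence {i}." for i in range(19)]
print([len(c) for c in chunk_sentences(sentences)])  # → [8, 8, 3]
```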
Pipeline Components
The pipeline is organized into modular components:
- parsers: Document ingestion (PDF, DOCX, TEX, XML, TXT)
- preprocessing: Text cleaning, tokenization, lemmatization
- acronyms: Acronym detection and expansion
- sectioning: Section splitting and chunking
- features: TF-IDF and semantic embeddings
- pipeline: Main orchestration
Architecture
scipreprocess/
├── config.py # Configuration dataclasses
├── models.py # Data models (ParsedDocument)
├── utils.py # Dependency management and helpers
├── parsers.py # Document parsers for each format
├── preprocessing.py # Text cleaning and NLP
├── acronyms.py # Acronym detection/expansion
├── sectioning.py # Section splitting and chunking
├── features.py # Feature extraction (TF-IDF, embeddings)
└── pipeline.py # Main pipeline orchestration
Output Format
The pipeline produces structured JSON for each document:
{
  "metadata": {
    "title": "Paper Title",
    "source_file": "path/to/file.pdf",
    "pages": 12
  },
  "abstract": "Abstract text...",
  "sections": [
    {"heading": "Introduction", "text": "..."},
    {"heading": "Methods", "text": "..."},
    ...
  ],
  "acronyms": {
    "NLP": "Natural Language Processing",
    "ML": "Machine Learning"
  },
  "figures": [],
  "tables": [],
  "equations": [],
  "references": []
}
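The `acronyms` map lends itself to simple post-processing. As a hypothetical example, each detected acronym can be expanded at its first occurrence (this helper is illustrative, not part of the library's API):

```python
import re

# Stand-in for a pipeline result, using the "acronyms" map shown above
doc_json = {
    "abstract": "NLP and ML methods are compared. NLP wins.",
    "acronyms": {"NLP": "Natural Language Processing", "ML": "Machine Learning"},
}

text = doc_json["abstract"]
for short, full in doc_json["acronyms"].items():
    # Expand only the first standalone occurrence of each acronym
    text = re.sub(rf"\b{re.escape(short)}\b", f"{full} ({short})", text, count=1)

print(text)
# → Natural Language Processing (NLP) and Machine Learning (ML) methods are compared. NLP wins.
```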
Dependencies
Required
unidecode: Unicode normalization
Optional
- PyMuPDF: PDF parsing
- python-docx: DOCX parsing
- lxml: XML parsing
- opencv-python + pytesseract: OCR support
- nltk: Basic NLP (tokenization, stopwords, lemmatization)
- spacy + scispacy: Advanced NLP and abbreviation detection
- pysbd: Sentence boundary detection
- scikit-learn: TF-IDF vectorization
- sentence-transformers: Semantic embeddings
- faiss: Similarity search
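Since every dependency above is optional, modular tools like this typically guard imports so that missing extras only disable the corresponding feature. A sketch of that pattern (illustrative; the library's actual dependency handling lives in `utils.py`):

```python
# Guarded optional import: absence of PyMuPDF disables PDF parsing
# instead of breaking the whole package.
try:
    import fitz  # PyMuPDF
    HAS_PDF = True
except ImportError:
    HAS_PDF = False

def parse_pdf(path: str) -> str:
    # Hypothetical guard: fail with an actionable message, not an ImportError
    if not HAS_PDF:
        raise RuntimeError("PDF support requires: pip install 'scipreprocess[pdf]'")
    ...
```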
Development
Setup Development Environment
# Clone the repository
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess
# Install in development mode with dev dependencies
pip install -e ".[all,dev]"
# Run tests
pytest
# Format code
black src/ tests/
# Lint code
ruff check src/ tests/
# Type checking
mypy src/
Documentation
- Examples: examples/basic_usage.py
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=scipreprocess --cov-report=html
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this pipeline in your research, please cite:
@software{scipreprocess,
  title = {SciPreprocess: A Modular Scientific Document Preprocessing Pipeline},
  author = {Anik, Tarikul Islam},
  year = {2025},
  url = {https://github.com/Tarikul-Islam-Anik/scipreprocess}
}
Acknowledgments
- Built with spaCy, scispacy, and sentence-transformers
- Inspired by the needs of scientific text processing and NLP research