
SciPreprocess

CI · License: MIT · Python 3.8+ · Code style: black

A modular, open-source pipeline for preprocessing scientific documents in multiple formats (PDF, DOCX, LaTeX, JATS XML, TXT) for LLM consumption and NLP analysis.

Features

  • 📄 Multi-format support: PDF, DOCX, TEX, JATS XML, and plain text
  • 🔍 OCR support: Extract text from scanned documents with Tesseract
  • 🧹 Text cleaning: Remove citations, normalize unicode, clean special characters
  • 🔤 NLP processing: Tokenization, lemmatization, stopword removal using spaCy or NLTK
  • 📑 Section detection: Automatically identify paper sections (Abstract, Introduction, etc.)
  • 🔗 Acronym handling: Detect and expand acronyms using scispacy
  • 📊 Feature extraction: TF-IDF and semantic embeddings with sentence-transformers
  • 🔎 Semantic search: FAISS indexing for efficient similarity search
  • 🧩 Modular design: Use only the components you need
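The semantic-search feature reduces to nearest-neighbour lookup over chunk embeddings. As a rough illustration of what the FAISS-backed index accelerates, here is the brute-force equivalent in plain NumPy (array sizes and data are toy values, not pipeline output):

```python
import numpy as np

# Brute-force cosine-similarity search: the operation a FAISS flat
# inner-product index speeds up. Shapes here are illustrative.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 384)).astype("float32")  # 100 chunk embeddings
query = rng.standard_normal((1, 384)).astype("float32")     # 1 query embedding

# Normalize so the inner product equals cosine similarity.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query /= np.linalg.norm(query, axis=1, keepdims=True)

scores = corpus @ query.T               # (100, 1) cosine similarities
top_k = np.argsort(-scores[:, 0])[:5]   # indices of the 5 closest chunks
print(top_k)
```

FAISS performs the same lookup with optimized (and optionally approximate) index structures, which matters once the corpus grows beyond a few thousand chunks.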

Installation

From PyPI (Recommended)

pip install scipreprocess

With Optional Dependencies

Install specific feature sets:

# PDF support
pip install "scipreprocess[pdf]"

# NLP features
pip install "scipreprocess[nlp]"

# Machine learning features
pip install "scipreprocess[ml]"

# OCR support
pip install "scipreprocess[ocr]"

# Everything
pip install "scipreprocess[all]"

Development Installation

For development or from source:

git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess
pip install -e ".[all,dev]"

Post-Installation Setup

For NLP features, download required models:

# Download spaCy model
python -m spacy download en_core_web_sm

# Install scispacy model (optional but recommended)
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

Quick Start

Basic Usage

from scipreprocess import preprocess_file

# Process a single document
doc_json, clean_text = preprocess_file("path/to/paper.pdf")

# Access the results
print(doc_json['metadata']['title'])
print(doc_json['abstract'])
print(doc_json['sections'])
print(doc_json['acronyms'])

Process Multiple Documents

from scipreprocess import preprocess_documents

# Process multiple documents
files = ["paper1.pdf", "paper2.docx", "paper3.tex"]
results = preprocess_documents(files)

# Access results
documents = results['documents']
tfidf_matrix = results['tfidf']['X']
vectorizer = results['tfidf']['vectorizer']
chunks = results['chunks']
embeddings = results['embeddings']  # if enabled
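The TF-IDF output is a standard scikit-learn vectorizer/matrix pair, so it plugs directly into the usual similarity tooling. A standalone sketch with toy texts standing in for the pipeline's cleaned documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the cleaned document texts the pipeline produces.
texts = [
    "deep learning for protein structure prediction",
    "convolutional networks for image classification",
    "crystal structure of a membrane protein",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)   # sparse matrix, shape (n_docs, n_terms)
sim = cosine_similarity(X)            # pairwise document similarity
print(sim.round(2))
```

With real pipeline output, `results['tfidf']['X']` takes the place of `X` here.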

Custom Configuration

from scipreprocess import PipelineConfig
from scipreprocess.pipeline import PreprocessingPipeline

# Configure the pipeline
config = PipelineConfig(
    use_ocr=True,
    use_spacy=True,
    use_semantic_embeddings=True,
    spacy_model='en_core_sci_sm',
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    chunk_target_sentences=(3, 8)
)

# Create pipeline with custom config
pipeline = PreprocessingPipeline(config)
doc_json, text = pipeline.preprocess_file("paper.pdf")

Pipeline Components

The pipeline is organized into modular components:

  • parsers: Document ingestion (PDF, DOCX, TEX, XML, TXT)
  • preprocessing: Text cleaning, tokenization, lemmatization
  • acronyms: Acronym detection and expansion
  • sectioning: Section splitting and chunking
  • features: TF-IDF and semantic embeddings
  • pipeline: Main orchestration

Architecture

scipreprocess/
├── config.py          # Configuration dataclasses
├── models.py          # Data models (ParsedDocument)
├── utils.py           # Dependency management and helpers
├── parsers.py         # Document parsers for each format
├── preprocessing.py   # Text cleaning and NLP
├── acronyms.py        # Acronym detection/expansion
├── sectioning.py      # Section splitting and chunking
├── features.py        # Feature extraction (TF-IDF, embeddings)
└── pipeline.py        # Main pipeline orchestration

Output Format

The pipeline produces structured JSON for each document:

{
    "metadata": {
        "title": "Paper Title",
        "source_file": "path/to/file.pdf",
        "pages": 12
    },
    "abstract": "Abstract text...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
        ...
    ],
    "acronyms": {
        "NLP": "Natural Language Processing",
        "ML": "Machine Learning"
    },
    "figures": [],
    "tables": [],
    "equations": [],
    "references": []
}
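Because the output is plain JSON, downstream prompts can be assembled with ordinary dict access. A stdlib-only sketch that flattens the structure above into prompt text (the `doc_json` literal below is the example document, not live pipeline output):

```python
doc_json = {
    "metadata": {"title": "Paper Title", "source_file": "path/to/file.pdf", "pages": 12},
    "abstract": "Abstract text...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
    ],
    "acronyms": {"NLP": "Natural Language Processing", "ML": "Machine Learning"},
}

# Flatten metadata, abstract, sections, and acronyms into one prompt string.
parts = [f"# {doc_json['metadata']['title']}", f"Abstract: {doc_json['abstract']}"]
for section in doc_json["sections"]:
    parts.append(f"## {section['heading']}\n{section['text']}")
glossary = ", ".join(f"{k} = {v}" for k, v in doc_json["acronyms"].items())
parts.append(f"Glossary: {glossary}")
prompt = "\n\n".join(parts)
print(prompt)
```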

Dependencies

Required

  • unidecode: Unicode normalization

Optional

  • PyMuPDF: PDF parsing
  • python-docx: DOCX parsing
  • lxml: XML parsing
  • opencv-python + pytesseract: OCR support
  • nltk: Basic NLP (tokenization, stopwords, lemmatization)
  • spacy + scispacy: Advanced NLP and abbreviation detection
  • pysbd: Sentence boundary detection
  • scikit-learn: TF-IDF vectorization
  • sentence-transformers: Semantic embeddings
  • faiss: Similarity search
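Since everything beyond unidecode is optional, it can be handy to probe which extras are importable in the current environment before running the pipeline. A stdlib-only sketch (the keys are the import names of the packages listed above, e.g. `fitz` for PyMuPDF, `sklearn` for scikit-learn):

```python
import importlib.util

# Import name -> feature it unlocks (import names, not PyPI names).
optional = {
    "fitz": "PDF parsing (PyMuPDF)",
    "docx": "DOCX parsing (python-docx)",
    "lxml": "XML parsing",
    "pytesseract": "OCR",
    "nltk": "basic NLP",
    "spacy": "advanced NLP",
    "sklearn": "TF-IDF",
    "sentence_transformers": "semantic embeddings",
    "faiss": "similarity search",
}

# find_spec returns None when a package is not installed.
available = {name: importlib.util.find_spec(name) is not None for name in optional}
for name, ok in available.items():
    print(f"{'OK ' if ok else '-- '}{name}: {optional[name]}")
```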

Development

Setup Development Environment

# Clone the repository
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess

# Install in development mode with dev dependencies
pip install -e ".[all,dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking
mypy src/

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scipreprocess --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this pipeline in your research, please cite:

@software{scipreprocess,
  title = {SciPreprocess: A Modular Scientific Document Preprocessing Pipeline},
  author = {Anik, Tarikul Islam},
  year = {2025},
  url = {https://github.com/Tarikul-Islam-Anik/scipreprocess}
}
