SciPreprocess

CI · License: MIT · Python 3.8+ · Code style: black

A modular, open-source pipeline for preprocessing scientific documents in multiple formats (PDF, DOCX, LaTeX, JATS XML, TXT) for LLM consumption and NLP analysis.

Features

  • 📄 Multi-format support: PDF, DOCX, TEX, JATS XML, and plain text
  • 🔍 OCR support: Extract text from scanned documents with Tesseract
  • 🧹 Text cleaning: Remove citations, normalize Unicode, strip special characters
  • 🔤 NLP processing: Tokenization, lemmatization, stopword removal using spaCy or NLTK
  • 📑 Section detection: Automatically identify paper sections (Abstract, Introduction, etc.)
  • 🔗 Acronym handling: Detect and expand acronyms using scispacy
  • 📊 Feature extraction: TF-IDF and semantic embeddings with sentence-transformers
  • 🔎 Semantic search: FAISS indexing for efficient similarity search
  • 🧩 Modular design: Use only the components you need

Installation

From PyPI (Recommended)

pip install scipreprocess

With Optional Dependencies

Install specific feature sets:

# PDF support
pip install "scipreprocess[pdf]"

# NLP features
pip install "scipreprocess[nlp]"

# Machine learning features
pip install "scipreprocess[ml]"

# OCR support
pip install "scipreprocess[ocr]"

# Everything
pip install "scipreprocess[all]"

Development Installation

For development or from source:

git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess
pip install -e ".[all,dev]"

Post-Installation Setup

For NLP features, download required models:

# Download spaCy model
python -m spacy download en_core_web_sm

# Install scispacy model (optional but recommended)
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

Quick Start

Basic Usage

from scipreprocess import preprocess_file

# Process a single document
doc_json, clean_text = preprocess_file("path/to/paper.pdf")

# Access the results
print(doc_json['metadata']['title'])
print(doc_json['abstract'])
print(doc_json['sections'])
print(doc_json['acronyms'])

Process Multiple Documents

from scipreprocess import preprocess_documents

# Process multiple documents
files = ["paper1.pdf", "paper2.docx", "paper3.tex"]
results = preprocess_documents(files)

# Access results
documents = results['documents']
tfidf_matrix = results['tfidf']['X']
vectorizer = results['tfidf']['vectorizer']
chunks = results['chunks']
embeddings = results['embeddings']  # if enabled
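The returned TF-IDF matrix and vectorizer can feed a similarity query directly. The sketch below uses scikit-learn on its own (independent of scipreprocess's API); the three documents are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Transformers improve natural language processing benchmarks.",
    "Convolutional networks dominate image classification.",
    "Language models are evaluated on NLP benchmarks.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)            # shape: (n_docs, n_terms)

# Vectorize a query with the same vocabulary, then rank by cosine similarity
query = vectorizer.transform(["natural language processing"])
scores = cosine_similarity(query, X).ravel()
best = int(scores.argmax())                   # index of the most similar doc
```

In a real workflow, `vectorizer` and `X` would come from `results['tfidf']` instead of being fit here.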

Custom Configuration

from scipreprocess import PipelineConfig
from scipreprocess.pipeline import PreprocessingPipeline

# Configure the pipeline
config = PipelineConfig(
    use_ocr=True,
    use_spacy=True,
    use_semantic_embeddings=True,
    spacy_model='en_core_sci_sm',
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    chunk_target_sentences=(3, 8)
)

# Create pipeline with custom config
pipeline = PreprocessingPipeline(config)
doc_json, text = pipeline.preprocess_file("paper.pdf")
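To make the `chunk_target_sentences=(3, 8)` option concrete, here is a minimal, purely illustrative chunker (not the library's actual algorithm): it flushes a chunk at the upper bound and merges a too-small tail into the previous chunk.

```python
def chunk_sentences(sentences, target=(3, 8)):
    """Group sentences into chunks sized within a (min, max) target."""
    lo, hi = target
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if len(current) >= hi:            # hit the upper bound: flush
            chunks.append(current)
            current = []
    if current:
        if chunks and len(current) < lo:  # tail too small: merge backwards
            chunks[-1].extend(current)
        else:
            chunks.append(current)
    return chunks

sents = [f"Sentence {i}." for i in range(10)]
print([len(c) for c in chunk_sentences(sents, target=(3, 8))])
```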

Pipeline Components

The pipeline is organized into modular components:

  • parsers: Document ingestion (PDF, DOCX, TEX, XML, TXT)
  • preprocessing: Text cleaning, tokenization, lemmatization
  • acronyms: Acronym detection and expansion
  • sectioning: Section splitting and chunking
  • features: TF-IDF and semantic embeddings
  • pipeline: Main orchestration
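As a rough sketch of what the acronyms component does conceptually: find "Long Form (LF)" pairs whose initials match. The actual module uses scispacy when available; this regex version is only illustrative.

```python
import re

def find_acronyms(text):
    """Detect 'Long Form (LF)' pairs where the acronym matches word initials."""
    pairs = {}
    for m in re.finditer(r"((?:[A-Za-z]+\s+){1,6})\(([A-Z]{2,})\)", text):
        words, acro = m.group(1).split(), m.group(2)
        if len(words) >= len(acro):
            candidate = words[-len(acro):]
            if all(w[0].upper() == c for w, c in zip(candidate, acro)):
                pairs[acro] = " ".join(candidate)
    return pairs

text = "We apply Natural Language Processing (NLP) and Machine Learning (ML)."
print(find_acronyms(text))
```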

Architecture

scipreprocess/
├── config.py          # Configuration dataclasses
├── models.py          # Data models (ParsedDocument)
├── utils.py           # Dependency management and helpers
├── parsers.py         # Document parsers for each format
├── preprocessing.py   # Text cleaning and NLP
├── acronyms.py        # Acronym detection/expansion
├── sectioning.py      # Section splitting and chunking
├── features.py        # Feature extraction (TF-IDF, embeddings)
└── pipeline.py        # Main pipeline orchestration

Output Format

The pipeline produces structured JSON for each document:

{
    "metadata": {
        "title": "Paper Title",
        "source_file": "path/to/file.pdf",
        "pages": 12
    },
    "abstract": "Abstract text...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
        ...
    ],
    "acronyms": {
        "NLP": "Natural Language Processing",
        "ML": "Machine Learning"
    },
    "figures": [],
    "tables": [],
    "equations": [],
    "references": []
}
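Since the result is a plain dict, downstream code can walk it directly. For example, collecting section headings and expanding acronyms in the abstract (the `doc_json` below is a hand-made stand-in for a real pipeline result):

```python
doc_json = {
    "metadata": {"title": "Paper Title", "pages": 12},
    "abstract": "We study NLP methods ...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
    ],
    "acronyms": {"NLP": "Natural Language Processing"},
}

headings = [s["heading"] for s in doc_json["sections"]]

# Replace each detected acronym with "Long Form (ACRO)"
expanded = doc_json["abstract"]
for acro, long_form in doc_json["acronyms"].items():
    expanded = expanded.replace(acro, f"{long_form} ({acro})")

print(headings)   # ['Introduction', 'Methods']
```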

Dependencies

Required

  • unidecode: Unicode normalization

Optional

  • PyMuPDF: PDF parsing
  • python-docx: DOCX parsing
  • lxml: XML parsing
  • opencv-python + pytesseract: OCR support
  • nltk: Basic NLP (tokenization, stopwords, lemmatization)
  • spacy + scispacy: Advanced NLP and abbreviation detection
  • pysbd: Sentence boundary detection
  • scikit-learn: TF-IDF vectorization
  • sentence-transformers: Semantic embeddings
  • faiss: Similarity search

Development

Setup Development Environment

# Clone the repository
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess

# Install in development mode with dev dependencies
pip install -e ".[all,dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking
mypy src/

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scipreprocess --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this pipeline in your research, please cite:

@software{scipreprocess,
  title = {SciPreprocess: A Modular Scientific Document Preprocessing Pipeline},
  author = {Anik, Tarikul Islam},
  year = {2025},
  url = {https://github.com/Tarikul-Islam-Anik/scipreprocess}
}

Acknowledgments
