
SciPreprocess

CI · License: MIT · Python 3.8+ · Code style: black

A modular, open-source pipeline for preprocessing scientific documents in multiple formats (PDF, DOCX, LaTeX, JATS XML, TXT) for LLM consumption and NLP analysis.

Features

  • 📄 Multi-format support: PDF, DOCX, TEX, JATS XML, and plain text
  • 🔍 OCR support: Extract text from scanned documents with Tesseract
  • 🧹 Text cleaning: Remove citations, normalize unicode, clean special characters
  • 🔤 NLP processing: Tokenization, lemmatization, stopword removal using spaCy or NLTK
  • 📑 Section detection: Automatically identify paper sections (Abstract, Introduction, etc.)
  • 🔗 Acronym handling: Detect and expand acronyms using scispacy
  • 📊 Feature extraction: TF-IDF and semantic embeddings with sentence-transformers
  • 🔎 Semantic search: FAISS indexing for efficient similarity search
  • 🧩 Modular design: Use only the components you need
  • 📊 Export formats: JSON (default) or CSV output with --format flag

Installation

From PyPI (Recommended)

pip install scipreprocess

With Optional Dependencies

Install specific feature sets:

# PDF support
pip install "scipreprocess[pdf]"

# NLP features
pip install "scipreprocess[nlp]"

# Machine learning features
pip install "scipreprocess[ml]"

# OCR support
pip install "scipreprocess[ocr]"

# Everything
pip install "scipreprocess[all]"

Development Installation

To develop locally or install from source:

git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess
pip install -e ".[all,dev]"

Post-Installation Setup

For NLP features, download required models:

# Download spaCy model
python -m spacy download en_core_web_sm

# Install scispacy model (optional but recommended)
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

Quick Start

Basic Usage

from scipreprocess import preprocess_file

# Process a single document
doc_json, clean_text = preprocess_file("path/to/paper.pdf")

# Access the results
print(doc_json['metadata']['title'])
print(doc_json['abstract'])
print(doc_json['sections'])
print(doc_json['acronyms'])
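The `acronyms` field maps short forms to their expansions. As a rough illustration of how such pairs can be detected (a minimal regex-based sketch, not the package's actual scispacy-based implementation):

```python
import re

def find_acronyms(text):
    """Find 'Long Form (ACR)' patterns and pair each acronym
    with the preceding words whose initials spell it out."""
    acronyms = {}
    for match in re.finditer(r"\(([A-Z]{2,6})\)", text):
        acr = match.group(1)
        # Take the words immediately before the parenthesis
        words = text[:match.start()].rstrip().split()
        candidate = words[-len(acr):]
        # Accept only if the initials line up with the acronym
        if "".join(w[0] for w in candidate).upper() == acr:
            acronyms[acr] = " ".join(candidate)
    return acronyms

text = "We apply Natural Language Processing (NLP) and Machine Learning (ML)."
print(find_acronyms(text))
```

The real pipeline is considerably more robust (it handles inflected long forms and out-of-order initials), but the input/output shape matches the `acronyms` dictionary shown above.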

Process Multiple Documents

from scipreprocess import preprocess_documents

# Process multiple documents
files = ["paper1.pdf", "paper2.docx", "paper3.tex"]
results = preprocess_documents(files)

# Access results
documents = results['documents']
tfidf_matrix = results['tfidf']['X']
vectorizer = results['tfidf']['vectorizer']
chunks = results['chunks']
embeddings = results['embeddings']  # if enabled
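The returned vectorizer and matrix can be reused directly for similarity queries. A minimal scikit-learn sketch (built here on toy strings standing in for real pipeline output, so shapes and scores are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for the cleaned text in results['documents']
docs = [
    "transformer models for scientific text classification",
    "ocr pipelines for scanned document extraction",
    "semantic embeddings for similarity search",
]

vectorizer = TfidfVectorizer()      # plays the role of results['tfidf']['vectorizer']
X = vectorizer.fit_transform(docs)  # plays the role of results['tfidf']['X']

# Rank documents against a free-text query
query = vectorizer.transform(["similarity search with embeddings"])
scores = cosine_similarity(query, X).ravel()
best = scores.argmax()
print(best, docs[best])
```

The same pattern applies to `results['embeddings']`, substituting dense sentence-transformer vectors (or a FAISS index) for the sparse TF-IDF matrix.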

Custom Configuration

from scipreprocess import PipelineConfig
from scipreprocess.pipeline import PreprocessingPipeline

# Configure the pipeline
config = PipelineConfig(
    use_ocr=True,
    use_spacy=True,
    use_semantic_embeddings=True,
    spacy_model='en_core_sci_sm',
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    chunk_target_sentences=(3, 8)
)

# Create pipeline with custom config
pipeline = PreprocessingPipeline(config)
doc_json, text = pipeline.preprocess_file("paper.pdf")

Command Line Interface

SciPreprocess includes a command-line interface for easy document processing:

Basic CLI Usage

# Process documents and output JSON (default)
scipreprocess document1.pdf document2.docx

# Process with OCR enabled
scipreprocess --ocr scanned_document.pdf

# Process with layout analysis
scipreprocess --layout complex_document.pdf

# Convert text to lowercase
scipreprocess --lower document.pdf

Export Formats

The CLI supports two output formats:

# JSON output (default)
scipreprocess document.pdf

# CSV output - one row per document
scipreprocess document.pdf --format csv

# Save to file
scipreprocess document.pdf --format csv --out results.csv

CLI Options

  • inputs: Paths to documents to process (required)
  • --backend {auto,docling,local}: Parser backend (default: auto)
  • --ocr: Enable OCR for scanned documents
  • --layout: Enable layout analysis
  • --lower: Convert text to lowercase
  • --format {json,csv}: Output format (default: json)
  • --out FILE: Output file path (default: stdout)

CSV Output Format

When using --format csv, the output contains one row per document with flattened nested data:

abstract,metadata.source_file,metadata.title,metadata.pages,sections
"Abstract text...","document.pdf","Paper Title",12,"[{""heading"": ""Introduction"", ""text"": ""..."", ...}]"
  • Nested dictionaries are flattened with dotted keys (e.g., metadata.title)
  • Arrays are JSON-stringified (e.g., sections, figures, tables)
  • Only document data is included (excludes tfidf, chunks, embeddings, index)
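The flattening rule is simple to reproduce in your own code if you need the same shape elsewhere; a minimal sketch (not the CLI's actual internals):

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; JSON-stringify lists."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row

doc = {
    "abstract": "Abstract text...",
    "metadata": {"source_file": "document.pdf", "title": "Paper Title", "pages": 12},
    "sections": [{"heading": "Introduction", "text": "..."}],
}
print(flatten(doc))
```

Applying `flatten` to the example document yields the dotted columns shown in the CSV header above (`abstract`, `metadata.source_file`, `metadata.title`, `metadata.pages`, `sections`).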

Pipeline Components

The pipeline is organized into modular components:

  • parsers: Document ingestion (PDF, DOCX, TEX, XML, TXT)
  • preprocessing: Text cleaning, tokenization, lemmatization
  • acronyms: Acronym detection and expansion
  • sectioning: Section splitting and chunking
  • features: TF-IDF and semantic embeddings
  • pipeline: Main orchestration
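As an illustration of what the sectioning component produces (a deliberately simplified sketch, not the package's actual heuristics, which also handle numbered and uppercase headings), splitting raw text on a few common heading names:

```python
import re

HEADING = re.compile(
    r"^(Abstract|Introduction|Methods|Results|Discussion|Conclusion|References)\s*$",
    re.MULTILINE,
)

def split_sections(text):
    """Split text into {'heading': ..., 'text': ...} records at known headings."""
    sections = []
    matches = list(HEADING.finditer(text))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append({"heading": m.group(1), "text": text[start:end].strip()})
    return sections

paper = "Introduction\nWe study X.\nMethods\nWe did Y.\nResults\nIt worked."
print(split_sections(paper))
```

The output mirrors the `sections` list in the JSON schema below, which is why the components compose cleanly: each stage consumes and emits the same document structure.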

Architecture

scipreprocess/
├── config.py          # Configuration dataclasses
├── models.py          # Data models (ParsedDocument)
├── utils.py           # Dependency management and helpers
├── parsers.py         # Document parsers for each format
├── preprocessing.py   # Text cleaning and NLP
├── acronyms.py        # Acronym detection/expansion
├── sectioning.py      # Section splitting and chunking
├── features.py        # Feature extraction (TF-IDF, embeddings)
└── pipeline.py        # Main pipeline orchestration

Output Format

The pipeline produces structured JSON for each document:

{
    "metadata": {
        "title": "Paper Title",
        "source_file": "path/to/file.pdf",
        "pages": 12
    },
    "abstract": "Abstract text...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
        ...
    ],
    "acronyms": {
        "NLP": "Natural Language Processing",
        "ML": "Machine Learning"
    },
    "figures": [],
    "tables": [],
    "equations": [],
    "references": []
}

Dependencies

Required

  • unidecode: Unicode normalization

Optional

  • PyMuPDF: PDF parsing
  • python-docx: DOCX parsing
  • lxml: XML parsing
  • opencv-python + pytesseract: OCR support
  • nltk: Basic NLP (tokenization, stopwords, lemmatization)
  • spacy + scispacy: Advanced NLP and abbreviation detection
  • pysbd: Sentence boundary detection
  • scikit-learn: TF-IDF vectorization
  • sentence-transformers: Semantic embeddings
  • faiss: Similarity search

Development

Setup Development Environment

# Clone the repository
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess

# Install in development mode with dev dependencies
pip install -e ".[all,dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking
mypy src/

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scipreprocess --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this pipeline in your research, please cite:

@software{scipreprocess,
  title = {SciPreprocess: A Modular Scientific Document Preprocessing Pipeline},
  author = {Anik, Tarikul Islam},
  year = {2025},
  url = {https://github.com/Tarikul-Islam-Anik/scipreprocess}
}
