
SciPreprocess

CI · License: MIT · Python 3.8+ · Code style: black

A modular, open-source pipeline for preprocessing scientific documents in multiple formats (PDF, DOCX, LaTeX, JATS XML, TXT) for LLM consumption and NLP analysis.

Features

  • 📄 Multi-format support: PDF, DOCX, TEX, JATS XML, and plain text
  • 🔍 OCR support: Extract text from scanned documents with Tesseract
  • 🧹 Text cleaning: Remove citations, normalize unicode, clean special characters
  • 🔤 NLP processing: Tokenization, lemmatization, stopword removal using spaCy or NLTK
  • 📑 Section detection: Automatically identify paper sections (Abstract, Introduction, etc.)
  • 🔗 Acronym handling: Detect and expand acronyms using scispacy
  • 📊 Feature extraction: TF-IDF and semantic embeddings with sentence-transformers
  • 🔎 Semantic search: FAISS indexing for efficient similarity search
  • 🧩 Modular design: Use only the components you need
  • 📊 Export formats: JSON (default) or CSV output with --format flag

Installation

From PyPI (Recommended)

pip install scipreprocess

With Optional Dependencies

Install specific feature sets:

# PDF support
pip install "scipreprocess[pdf]"

# NLP features
pip install "scipreprocess[nlp]"

# Machine learning features
pip install "scipreprocess[ml]"

# OCR support
pip install "scipreprocess[ocr]"

# Everything
pip install "scipreprocess[all]"

Development Installation

To develop locally or install from source:

git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess
pip install -e ".[all,dev]"

Post-Installation Setup

For NLP features, download required models:

# Download spaCy model
python -m spacy download en_core_web_sm

# Install scispacy model (optional but recommended)
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

Quick Start

Basic Usage

from scipreprocess import preprocess_file

# Process a single document
doc_json, clean_text = preprocess_file("path/to/paper.pdf")

# Access the results
print(doc_json['metadata']['title'])
print(doc_json['abstract'])
print(doc_json['sections'])
print(doc_json['acronyms'])
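The `acronyms` field maps short forms to their expansions. As a rough illustration of how such pairs can be detected (a minimal regex-based sketch, not the package's actual scispacy-based implementation):

```python
import re

def find_acronyms(text):
    """Find 'Long Form (ACR)' patterns and pair each acronym
    with the preceding words whose initials spell it out."""
    acronyms = {}
    for match in re.finditer(r"\(([A-Z]{2,6})\)", text):
        acr = match.group(1)
        # Take the words immediately before the parenthesis
        words = text[:match.start()].rstrip().split()
        candidate = words[-len(acr):]
        # Accept only if the initials line up with the acronym
        if "".join(w[0] for w in candidate).upper() == acr:
            acronyms[acr] = " ".join(candidate)
    return acronyms

text = "We apply Natural Language Processing (NLP) and Machine Learning (ML)."
print(find_acronyms(text))
```

The real pipeline is considerably more robust (it handles inflected long forms and out-of-order initials), but the input/output shape matches the `acronyms` dictionary shown above.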

Process Multiple Documents

from scipreprocess import preprocess_documents

# Process multiple documents
files = ["paper1.pdf", "paper2.docx", "paper3.tex"]
results = preprocess_documents(files)

# Access results
documents = results['documents']
tfidf_matrix = results['tfidf']['X']
vectorizer = results['tfidf']['vectorizer']
chunks = results['chunks']
embeddings = results['embeddings']  # if enabled
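The returned vectorizer and matrix can be reused directly for similarity queries. A minimal scikit-learn sketch (built here on toy strings standing in for real pipeline output, so shapes and scores are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for the cleaned text in results['documents']
docs = [
    "transformer models for scientific text classification",
    "ocr pipelines for scanned document extraction",
    "semantic embeddings for similarity search",
]

vectorizer = TfidfVectorizer()      # plays the role of results['tfidf']['vectorizer']
X = vectorizer.fit_transform(docs)  # plays the role of results['tfidf']['X']

# Rank documents against a free-text query
query = vectorizer.transform(["similarity search with embeddings"])
scores = cosine_similarity(query, X).ravel()
best = scores.argmax()
print(best, docs[best])
```

The same pattern applies to `results['embeddings']`, substituting dense sentence-transformer vectors (or a FAISS index) for the sparse TF-IDF matrix.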

Custom Configuration

from scipreprocess import PipelineConfig
from scipreprocess.pipeline import PreprocessingPipeline

# Configure the pipeline
config = PipelineConfig(
    use_ocr=True,
    use_spacy=True,
    use_semantic_embeddings=True,
    spacy_model='en_core_sci_sm',
    embedding_model='sentence-transformers/all-MiniLM-L6-v2',
    chunk_target_sentences=(3, 8)
)

# Create pipeline with custom config
pipeline = PreprocessingPipeline(config)
doc_json, text = pipeline.preprocess_file("paper.pdf")

Command Line Interface

SciPreprocess includes a command-line interface for easy document processing:

Basic CLI Usage

# Process documents and output JSON (default)
scipreprocess document1.pdf document2.docx

# Process with OCR enabled
scipreprocess --ocr scanned_document.pdf

# Process with layout analysis
scipreprocess --layout complex_document.pdf

# Convert text to lowercase
scipreprocess --lower document.pdf

Export Formats

The CLI supports two output formats:

# JSON output (default)
scipreprocess document.pdf

# CSV output - one row per document
scipreprocess document.pdf --format csv

# Save to file
scipreprocess document.pdf --format csv --out results.csv

CLI Options

  • inputs: Paths to documents to process (required)
  • --backend {auto,docling,local}: Parser backend (default: auto)
  • --ocr: Enable OCR for scanned documents
  • --layout: Enable layout analysis
  • --lower: Convert text to lowercase
  • --format {json,csv}: Output format (default: json)
  • --out FILE: Output file path (default: stdout)

CSV Output Format

When using --format csv, the output contains one row per document with flattened nested data:

abstract,metadata.source_file,metadata.title,metadata.pages,sections
"Abstract text...","document.pdf","Paper Title",12,"[{""heading"": ""Introduction"", ""text"": ""..."", ...}]"
  • Nested dictionaries are flattened with dotted keys (e.g., metadata.title)
  • Arrays are JSON-stringified (e.g., sections, figures, tables)
  • Only document data is included (excludes tfidf, chunks, embeddings, index)
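The flattening rule is simple to reproduce in your own code if you need the same shape elsewhere; a minimal sketch (not the CLI's actual internals):

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; JSON-stringify lists."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            row[name] = json.dumps(value)
        else:
            row[name] = value
    return row

doc = {
    "abstract": "Abstract text...",
    "metadata": {"source_file": "document.pdf", "title": "Paper Title", "pages": 12},
    "sections": [{"heading": "Introduction", "text": "..."}],
}
print(flatten(doc))
```

Applying `flatten` to the example document yields the dotted columns shown in the CSV header above (`abstract`, `metadata.source_file`, `metadata.title`, `metadata.pages`, `sections`).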

Pipeline Components

The pipeline is organized into modular components:

  • parsers: Document ingestion (PDF, DOCX, TEX, XML, TXT)
  • preprocessing: Text cleaning, tokenization, lemmatization
  • acronyms: Acronym detection and expansion
  • sectioning: Section splitting and chunking
  • features: TF-IDF and semantic embeddings
  • pipeline: Main orchestration
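As an illustration of what the sectioning component produces (a deliberately simplified sketch, not the package's actual heuristics, which also handle numbered and uppercase headings), splitting raw text on a few common heading names:

```python
import re

HEADING = re.compile(
    r"^(Abstract|Introduction|Methods|Results|Discussion|Conclusion|References)\s*$",
    re.MULTILINE,
)

def split_sections(text):
    """Split text into {'heading': ..., 'text': ...} records at known headings."""
    sections = []
    matches = list(HEADING.finditer(text))
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append({"heading": m.group(1), "text": text[start:end].strip()})
    return sections

paper = "Introduction\nWe study X.\nMethods\nWe did Y.\nResults\nIt worked."
print(split_sections(paper))
```

The output mirrors the `sections` list in the JSON schema below, which is why the components compose cleanly: each stage consumes and emits the same document structure.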

Architecture

scipreprocess/
├── config.py          # Configuration dataclasses
├── models.py          # Data models (ParsedDocument)
├── utils.py           # Dependency management and helpers
├── parsers.py         # Document parsers for each format
├── preprocessing.py   # Text cleaning and NLP
├── acronyms.py        # Acronym detection/expansion
├── sectioning.py      # Section splitting and chunking
├── features.py        # Feature extraction (TF-IDF, embeddings)
└── pipeline.py        # Main pipeline orchestration

Output Format

The pipeline produces structured JSON for each document:

{
    "metadata": {
        "title": "Paper Title",
        "source_file": "path/to/file.pdf",
        "pages": 12
    },
    "abstract": "Abstract text...",
    "sections": [
        {"heading": "Introduction", "text": "..."},
        {"heading": "Methods", "text": "..."},
        ...
    ],
    "acronyms": {
        "NLP": "Natural Language Processing",
        "ML": "Machine Learning"
    },
    "figures": [],
    "tables": [],
    "equations": [],
    "references": []
}

Dependencies

Required

  • unidecode: Unicode normalization

Optional

  • PyMuPDF: PDF parsing
  • python-docx: DOCX parsing
  • lxml: XML parsing
  • opencv-python + pytesseract: OCR support
  • nltk: Basic NLP (tokenization, stopwords, lemmatization)
  • spacy + scispacy: Advanced NLP and abbreviation detection
  • pysbd: Sentence boundary detection
  • scikit-learn: TF-IDF vectorization
  • sentence-transformers: Semantic embeddings
  • faiss: Similarity search

Development

Setup Development Environment

# Clone the repository
git clone https://github.com/Tarikul-Islam-Anik/scipreprocess.git
cd scipreprocess

# Install in development mode with dev dependencies
pip install -e ".[all,dev]"

# Run tests
pytest

# Format code
black src/ tests/

# Lint code
ruff check src/ tests/

# Type checking
mypy src/

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scipreprocess --cov-report=html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this pipeline in your research, please cite:

@software{scipreprocess,
  title = {SciPreprocess: A Modular Scientific Document Preprocessing Pipeline},
  author = {Anik, Tarikul Islam},
  year = {2025},
  url = {https://github.com/Tarikul-Islam-Anik/scipreprocess}
}
