Convert PDF documents to structured Markdown with a multi-stage pipeline.

CortexMark

A multi-stage pipeline that converts PDF documents into structured Markdown, cleans the output, and splits it into chunks. It runs on CPU and does not require an LLM or any cloud service.

Features

Dual-engine conversion: combines Docling for structure analysis and markitdown for raw text recovery
Multi-stage processing: PDF → raw Markdown → cleaned Markdown → chunks
Idempotent execution: unchanged files are skipped through a SHA-256 manifest
Parallel processing: thread/process pool support for multi-file workloads
Quality assurance: QA pipeline with GOLD/SILVER/BRONZE/FAIL badges, OCR quality grading (A–F), formula fidelity scoring
Scholarly metadata: extraction of title, authors, abstract, keywords, DOI, and more with BibTeX and APA7 output
Citation analysis: author-year and numeric citation detection, citation graph, Graphviz DOT export
Document classification: automatic detection of paper, textbook, syllabus, slides, report, or generic types
Topic classification: keyword-frequency scoring across RL, ML, NLP, CV, optimization, statistics, math, physics, economics
RAG export: JSONL/JSON output with SHA-256 IDs, token estimates, and chapter/section metadata
Multi-format output: HTML (standalone pages), plain text, YAML with front-matter
GitHub Pages generation: static site with document cards, navigation, and breadcrumbs
Figure extraction: image catalog from Markdown and HTML <img> references with file existence validation
Version diffing: unified diff format with JSON change statistics across file trees
Plugin architecture: custom pipeline hooks (pre_convert, post_convert, pre_clean, post_clean, pre_chunk, post_chunk, post_pipeline) via file-based discovery
Template rendering: deterministic source profile and section template population
Docker support: containerized execution with minimal setup
VS Code extension: 22 commands covering session management, a Markdown preview panel, a quality dashboard, analysis module integration, progress visualization, and a chat panel
Test suite: 755 pytest tests with a minimum coverage threshold of 70%
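
The plugin hooks listed above (pre_convert through post_pipeline) can be pictured as a small registry that passes a payload through every function attached to a hook. The sketch below is only an illustration of that idea: `HookRegistry`, `register`, and `run` are hypothetical names, not CortexMark's actual plugin API, and file-based discovery from `plugins/` is left out.

```python
from collections import defaultdict
from typing import Callable

# Hook names as listed in the feature description.
HOOKS = (
    "pre_convert", "post_convert",
    "pre_clean", "post_clean",
    "pre_chunk", "post_chunk",
    "post_pipeline",
)


class HookRegistry:
    """Hypothetical registry; not the real cortexmark.plugin interface."""

    def __init__(self) -> None:
        self._hooks: dict[str, list[Callable[[str], str]]] = defaultdict(list)

    def register(self, name: str, func: Callable[[str], str]) -> None:
        """Attach a function to a known hook name."""
        if name not in HOOKS:
            raise ValueError(f"unknown hook: {name}")
        self._hooks[name].append(func)

    def run(self, name: str, payload: str) -> str:
        """Pass the payload through each registered function, in order."""
        for func in self._hooks[name]:
            payload = func(payload)
        return payload
```

With this shape, a plugin that trims whitespace after cleaning would call `registry.register("post_clean", str.strip)`, and the pipeline would call `registry.run("post_clean", text)` at the matching point.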

Dual-Engine Approach

Engine	Strengths	Weaknesses
docling	Structural analysis for headings, formulas, and algorithmic blocks	May skip some paragraphs
markitdown	Extracts more raw text in difficult PDFs	Can turn formulas into table-like artifacts

The default mode, dual, uses Docling output as the structural backbone and fills missing paragraphs from markitdown output through fingerprint matching. Table artifacts and short fragments are filtered automatically.
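
To make the merge idea concrete, here is a minimal sketch of fingerprint matching under stated assumptions: paragraphs are compared by a hash of their normalized text, markitdown-only paragraphs are inserted after the nearest preceding paragraph that both engines produced, and anything short or pipe-heavy is treated as an artifact. The function names and thresholds are hypothetical; the package's real matching logic may differ.

```python
import hashlib
import re


def fingerprint(paragraph: str) -> str:
    """Hash a paragraph after lowercasing and dropping non-alphanumerics."""
    normalized = re.sub(r"[^a-z0-9]+", "", paragraph.lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_like_artifact(paragraph: str, min_chars: int = 40) -> bool:
    """Flag short fragments and pipe-heavy, table-like lines (assumed rule)."""
    stripped = paragraph.strip()
    return len(stripped) < min_chars or stripped.count("|") >= 2


def merge_dual(docling_paras: list[str], markitdown_paras: list[str]) -> list[str]:
    """Keep Docling order; add markitdown-only paragraphs after the nearest
    preceding paragraph that both engines recovered."""
    position = {fingerprint(p): i for i, p in enumerate(docling_paras)}
    inserts: dict[int, list[str]] = {}
    anchor = -1  # -1 means "before the first Docling paragraph"
    for para in markitdown_paras:
        fp = fingerprint(para)
        if fp in position:
            anchor = position[fp]
        elif not looks_like_artifact(para):
            inserts.setdefault(anchor, []).append(para)

    merged = list(inserts.get(-1, []))
    for i, para in enumerate(docling_paras):
        merged.append(para)
        merged.extend(inserts.get(i, []))
    return merged
```

Normalizing before hashing means that differences in whitespace, case, or punctuation between the two engines do not prevent a match.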

Migration from PhiniteLab PDF Pipeline

This release rebrands the public package and extension surfaces to CortexMark.

Old surface	New surface
PyPI package `phinitelab-pdf-pipeline`	PyPI package `cortexmark`
Python module `phinitelab_pdf_pipeline`	Python module `cortexmark`
CLI command `phinitelab-pdf-pipeline`	CLI command `cortexmark`
VS Code extension `phinitelab-pdf-pipeline-vscode`	VS Code extension `cortexmark-vscode`
Session store `.phinitelab-pdf-pipeline/`	Session store `.cortexmark/`

Notes:

This is a breaking rename of the public package, module, CLI, and extension IDs.
The VS Code extension now uses a new extension identity; existing users should install the new cortexmark-vscode package manually.
Existing workspace session data is read from the legacy .phinitelab-pdf-pipeline/sessions.json path and copied into .cortexmark/sessions.json automatically when needed.
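
The session migration described in the last note can be sketched in Python as follows. The extension itself is written in TypeScript, so this is an illustration of the behaviour rather than its code; interpreting "when needed" as "the new file does not exist yet" is an assumption, and `migrate_sessions` is a hypothetical name.

```python
import shutil
from pathlib import Path

# Paths taken from the migration table above.
LEGACY_STORE = Path(".phinitelab-pdf-pipeline") / "sessions.json"
NEW_STORE = Path(".cortexmark") / "sessions.json"


def migrate_sessions(workspace: Path) -> bool:
    """Copy the legacy session file to the new location if it is missing.

    Returns True when a copy was made. The legacy file is left in place.
    """
    legacy = workspace / LEGACY_STORE
    target = workspace / NEW_STORE
    if target.exists() or not legacy.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(legacy, target)
    return True
```

Copying instead of moving keeps the old store readable in case the previous extension is still installed.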

Installation

Requirements

Python 3.11 or newer
Optional: Poppler and Tesseract for OCR and advanced PDF handling

Install with pip (lightweight, CPU-only)

The default installation is lightweight and does not include Docling, PyTorch, or any GPU/CUDA packages. It provides the markitdown conversion engine:

pip install git+https://github.com/PhiniteLab/pdf-to-markdown-pipeline.git

This is sufficient for running the pipeline with --engine markitdown.

Install with Docling engine (CPU)

To use the docling or dual conversion engine on CPU, first install CPU-only PyTorch, then install the package with the [docling] extra:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install "cortexmark[docling] @ git+https://github.com/PhiniteLab/pdf-to-markdown-pipeline.git"

Note: Pre-installing CPU-only PyTorch prevents pip from downloading the much larger GPU-enabled build (~2 GB+) from PyPI.

Install with GPU support

If you have an NVIDIA GPU with CUDA, you can install with GPU support directly:

pip install "cortexmark[gpu] @ git+https://github.com/PhiniteLab/pdf-to-markdown-pipeline.git"

This pulls the default PyTorch from PyPI, which includes CUDA support on Linux. To target a specific CUDA version (e.g. CUDA 12.8):

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install "cortexmark[gpu] @ git+https://github.com/PhiniteLab/pdf-to-markdown-pipeline.git"

Which installation should I choose?

Scenario	Command
Lightweight / markitdown only	`pip install cortexmark`
Docling engine on CPU	Pre-install CPU torch, then `pip install "cortexmark[docling]"`
Docling engine with NVIDIA GPU	`pip install "cortexmark[gpu]"`

WSL / Linux notes

On WSL2 with GPU passthrough, the [gpu] extra works if NVIDIA drivers are properly configured on the Windows host.
On headless Linux servers without a GPU, always use the CPU installation path to avoid pulling unnecessary CUDA libraries.
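
After installing, it can help to check which engines the current environment can actually use. The sketch below relies only on the standard library and assumes the backends are importable as `docling` and `markitdown`; treating `dual` as available only when both are present is also an assumption, and `available_engines` is a hypothetical helper, not part of the package.

```python
from importlib.util import find_spec


def available_engines() -> list[str]:
    """List the --engine values whose backing packages can be imported."""
    has_docling = find_spec("docling") is not None
    has_markitdown = find_spec("markitdown") is not None

    engines = []
    if has_docling:
        engines.append("docling")
    if has_markitdown:
        engines.append("markitdown")
    if has_docling and has_markitdown:
        # Assumed: dual mode needs both backends.
        engines.append("dual")
    return engines
```

`find_spec` only locates the packages without importing them, so the check stays fast even when PyTorch is installed.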

Developer installation

git clone https://github.com/PhiniteLab/pdf-to-markdown-pipeline.git
cd pdf-to-markdown-pipeline
python3 -m venv .venv
source .venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install -e ".[dev]"

This installs the runtime dependencies together with Docling and the development toolchain, including pytest, pytest-cov, Ruff, Pyright, and pre-commit.

Install with Docker

docker compose up pipeline        # Run the pipeline
docker compose --profile test up  # Run the test profile

Usage

CLI command

After installation, you can use the cortexmark command:

# Run all stages in order
cortexmark

# Run only selected stages
cortexmark --stages convert clean

# Use a different config file
cortexmark --config configs/pipeline.yaml

# Select a conversion engine
cortexmark --engine docling      # Docling only
cortexmark --engine markitdown   # markitdown only
cortexmark --engine dual         # combined mode (default)

# Custom input directory or single file
cortexmark --input path/to/my.pdf

# Session-scoped output directories
cortexmark --session-name sample-session

# Disable idempotency (force reprocess)
cortexmark --no-manifest

Run cortexmark --help to view all available arguments.
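
When the pipeline is driven from another Python program, the flags above can be assembled into an argument list and passed to `subprocess`. This is a minimal sketch: `build_command` and `run_pipeline` are hypothetical helpers, and only the flags shown in this section are used.

```python
import subprocess
from pathlib import Path
from typing import Optional


def build_command(
    stages: Optional[list[str]] = None,
    config: Optional[Path] = None,
    engine: Optional[str] = None,
    input_path: Optional[Path] = None,
    session_name: Optional[str] = None,
    use_manifest: bool = True,
) -> list[str]:
    """Assemble a cortexmark argument list from the documented flags."""
    cmd = ["cortexmark"]
    if stages:
        cmd += ["--stages", *stages]
    if config is not None:
        cmd += ["--config", str(config)]
    if engine is not None:
        cmd += ["--engine", engine]
    if input_path is not None:
        cmd += ["--input", str(input_path)]
    if session_name is not None:
        cmd += ["--session-name", session_name]
    if not use_manifest:
        cmd.append("--no-manifest")
    return cmd


def run_pipeline(**options) -> int:
    """Run the CLI and return its exit code (needs cortexmark on PATH)."""
    return subprocess.run(build_command(**options), check=False).returncode
```

For example, `run_pipeline(stages=["convert", "clean"], engine="markitdown")` would be equivalent to running `cortexmark --stages convert clean --engine markitdown` in a shell.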

Makefile shortcuts

make help           # List available commands
make all            # Run the full pipeline
make convert        # Run only PDF → Markdown conversion
make clean          # Run only the cleaning stage
make chunk          # Run only chunk generation
make render         # Run only template rendering
make test           # Run the test suite
make lint           # Run Ruff lint and formatting checks
make format         # Apply automatic formatting fixes
make clean-outputs  # Remove all generated outputs

Run modules directly

Each module can also be executed independently:

# Core stages
python -m cortexmark.convert --config configs/pipeline.yaml
python -m cortexmark.clean --config configs/pipeline.yaml
python -m cortexmark.chunk --config configs/pipeline.yaml
python -m cortexmark.render_templates --config configs/pipeline.yaml

# Quality & analysis
python -m cortexmark.qa_pipeline --input outputs/cleaned_md
python -m cortexmark.ocr_quality --input outputs/raw_md
python -m cortexmark.formula_score --input outputs/raw_md

# Metadata & classification
python -m cortexmark.metadata --input outputs/cleaned_md
python -m cortexmark.citations --input outputs/cleaned_md
python -m cortexmark.doc_type --input outputs/cleaned_md
python -m cortexmark.topics --input outputs/cleaned_md
python -m cortexmark.figures --input outputs/cleaned_md

# Export & output
python -m cortexmark.rag_export --input outputs/chunks
python -m cortexmark.multi_format --input outputs/cleaned_md
python -m cortexmark.ghpages --input outputs/cleaned_md
python -m cortexmark.diff --old outputs/v1 --new outputs/v2

# Utilities
python -m cortexmark.parallel --help
python -m cortexmark.plugin --help
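
Among the modules above, rag_export is described as producing JSONL/JSON with SHA-256 IDs and token estimates. A record of that general shape could be built as in the sketch below. The field names, the choice to hash source plus text for the ID, and the four-characters-per-token heuristic are all assumptions, not the module's documented schema.

```python
import hashlib
import json


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about four characters per token (heuristic)."""
    return max(1, len(text) // 4)


def to_record(text: str, source: str, chapter: str, section: str) -> dict:
    """Build one RAG record; field names are illustrative assumptions."""
    return {
        "id": hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest(),
        "source": source,
        "chapter": chapter,
        "section": section,
        "token_estimate": estimate_tokens(text),
        "text": text,
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Deriving the ID from the content keeps it stable across runs, so a downstream vector store can detect unchanged chunks without comparing full text.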

Pipeline Stages

PDF files
   │
   ▼
┌──────────┐    raw_md/    ┌───────┐  cleaned_md/   ┌───────┐  chunks/
│ convert  │ ───────────►  │ clean │ ────────────►  │ chunk │ ──────────► output
│ (dual)   │               │       │                │       │
└──────────┘               └───────┘                └───────┘
                                                        │
                                                        ▼
                                                ┌───────────────┐
                                                │    render     │
                                                │  (templates)  │
                                                └───────────────┘

convert: PDF → raw Markdown. Docling handles structure (headings, formulas, algorithms), while markitdown fills text gaps via fingerprint-based deduplication.
clean: removes page numbers, repeated headers/footers, and broken line wraps. Normalizes heading hierarchy and table blocks.
chunk: splits cleaned Markdown into logical sections based on heading levels (default: H1 and H2). Files are numbered (e.g., chunk_001_Introduction.md).
render (optional): fills source profile and section template files deterministically from outline/content metadata.
analyze (optional): runs semantic chunking, cross-reference analysis, algorithm extraction, and notation glossary on cleaned Markdown.
validate (optional): runs formula validation, scientific QA checks, and citation context analysis. Produces quality reports under outputs/quality/ (or outputs/quality/<session-name>/ when session-scoped).
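
The clean and chunk stages described above can be sketched in a few lines. This is a simplified illustration, not the package's implementation: `clean_lines` and `split_into_chunks` are hypothetical names, the page-number pattern is an assumption, and headings inside fenced code are not special-cased.

```python
import re
from collections import Counter

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
# Assumed page-number shape: a bare number, optionally prefixed by "page".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def clean_lines(lines, min_repeated_header_count=3, max_repeated_header_length=80):
    """Drop bare page numbers and short lines repeated often enough to look
    like running headers or footers. Headings and blank lines are kept."""
    def candidate(text):
        return bool(text) and not text.startswith("#") and len(text) <= max_repeated_header_length

    counts = Counter(t for t in (line.strip() for line in lines) if candidate(t))
    repeated = {t for t, n in counts.items() if n >= min_repeated_header_count}
    return [
        line for line in lines
        if line.strip() not in repeated and not PAGE_NUMBER.match(line)
    ]


def slugify(title):
    """Turn a heading into a filename-safe token."""
    return re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_") or "untitled"


def split_into_chunks(markdown, split_levels=(1, 2)):
    """Split Markdown at headings whose level is in split_levels.

    Returns (filename, body) pairs; text before the first heading becomes
    a "preamble" chunk.
    """
    sections = []
    title, lines = "preamble", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) in split_levels:
            if any(item.strip() for item in lines):
                sections.append((title, lines))
            title, lines = match.group(2), [line]
        else:
            lines.append(line)
    if any(item.strip() for item in lines):
        sections.append((title, lines))
    return [
        (f"chunk_{i:03d}_{slugify(t)}.md", "\n".join(body).strip() + "\n")
        for i, (t, body) in enumerate(sections, start=1)
    ]
```

Deeper headings (H3 and below, with the default levels) stay inside the chunk of their parent section, which keeps each file a self-contained unit.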

Optional Analysis Modules

Module	Purpose	Output
`qa_pipeline`	Encoding errors, missing text, broken links, orphan headings, table integrity	Markdown/JSON report with GOLD/SILVER/BRONZE/FAIL badges
`ocr_quality`	Garble count, symbol-soup, repeat artifacts, common-word ratio	A–F confidence grade (0–1 score)
`formula_score`	Recovered equations, incomplete markers, balanced parentheses	Per-file fidelity percentage
`metadata`	Title, authors, abstract, keywords, DOI, journal, year, emails, funding	YAML front-matter, BibTeX, APA7
`citations`	Author-year and numeric `[1,2,3]` citation patterns	JSON graph, Graphviz DOT
`doc_type`	Document type (paper, textbook, syllabus, slides, report, generic)	Type, confidence (0–1), detection signals
`topics`	Topic distribution (RL, ML, NLP, CV, optimization, etc.)	Per-file and aggregated distribution
`figures`	Markdown `![alt](src)` and HTML `<img>` image references	JSON manifest, Markdown gallery
`diff`	File tree comparison with unified diff	JSON change statistics
`rag_export`	RAG-ready chunks with SHA-256 IDs, token estimates, entity types, formulas, cross-refs	JSONL or JSON array
`semantic_chunk`	Scientific-aware chunking: theorems, proofs, definitions, algorithms, examples	Numbered chunk files with entity metadata
`cross_ref`	Cross-reference resolution: definition sites, mention detection, kind normalization	JSON report with resolution rate, unresolved refs
`algorithm_extract`	Algorithm/pseudocode extraction: fenced blocks, header lines, input/output/step parsing	JSON report with algorithm structures
`notation_glossary`	Mathematical notation glossary: explicit definitions, list/table notations, common LaTeX symbols	JSON report, Markdown glossary table
`formula_validate`	Enhanced LaTeX formula validation: balanced delimiters, environment matching, command validation, complexity scoring	JSON report with per-formula issues
`citation_context`	Citation context extraction: purpose classification (7 categories), co-citation analysis, self-citation detection	JSON report with sentence-level context
`scientific_qa`	Scientific document QA: theorem-proof pairing, definition-before-use, notation consistency, algorithm validity, formula quality gate	JSON report with GOLD/SILVER/BRONZE/FAIL badges
`multi_format`	HTML, plain text, YAML with front-matter	Standalone pages per document
`ghpages`	GitHub Pages-compatible static site	HTML index with document cards
`parallel`	Thread/process pool abstraction with timing	`TaskResult` and `ParallelReport`
`plugin`	Custom hooks via file-based discovery in `plugins/`	Hook-based extensibility
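
As an illustration of the two citation styles the citations module targets, the following sketch detects numeric citations such as `[1,2,3]` and simple author-year citations with regular expressions. The patterns are assumptions chosen for this example and will miss many real-world variants; the module's own patterns may be broader.

```python
import re

# Numeric citations such as [1], [1,2,3], or [4-6].
NUMERIC = re.compile(r"\[(\d+(?:\s*[-,\u2013]\s*\d+)*)\]")
# Author-year citations such as (Sutton and Barto, 2018) or (Mnih et al., 2015).
AUTHOR_YEAR = re.compile(
    r"\(([A-Z][A-Za-z'\-]+(?: (?:and|&) [A-Z][A-Za-z'\-]+| et al\.)?),? (\d{4}[a-z]?)\)"
)


def numeric_citations(text: str) -> list[int]:
    """Return every bracketed reference number, expanding ranges like 5-7."""
    found = []
    for match in NUMERIC.finditer(text):
        for part in re.split(r"\s*,\s*", match.group(1)):
            bounds = [int(n) for n in re.split(r"\s*[-\u2013]\s*", part)]
            found.extend(range(bounds[0], bounds[-1] + 1))
    return found


def author_year_citations(text: str) -> list[tuple[str, str]]:
    """Return (authors, year) pairs for parenthesized author-year citations."""
    return [(m.group(1), m.group(2)) for m in AUTHOR_YEAR.finditer(text)]
```

The extracted numbers or author-year pairs are the raw material for a citation graph: each document becomes a node and each detected reference an edge.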

Configuration

All settings are controlled from configs/pipeline.yaml:

source_id: default

paths:
  data_raw: data/raw
  output_raw_md: outputs/raw_md
  output_cleaned_md: outputs/cleaned_md
  output_chunks: outputs/chunks
  output_quality: outputs/quality
  output_semantic_chunks: outputs/semantic_chunks

convert:
  engine: dual                         # docling | markitdown | dual
  docling:
    device: auto                       # auto | cpu | cuda
    num_threads: 1
    do_ocr: false
    do_table_structure: true
    table_structure_mode: accurate     # accurate | fast
  markitdown:
    enabled: true

clean:
  min_repeated_header_count: 3
  max_repeated_header_length: 80

chunk:
  split_levels: [1, 2]                 # Heading levels that trigger new chunks

render_templates:
  outline_file: 00_meta/outline.md
  language: en
  max_summary_chars: 240
  max_scope_items: 6
  max_tasks: 5

logging:
  level: INFO                          # DEBUG | INFO | WARNING | ERROR
  format: "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"

idempotency:
  enabled: true
  manifest_file: outputs/.manifest.json

Any script can receive an alternative config file through --config <path>.
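
The idempotency settings at the end of the configuration point to a manifest of SHA-256 digests. A minimal sketch of how such a manifest could decide what to skip is shown below. The manifest layout (a JSON object mapping file path to digest) and the helper names are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def load_manifest(manifest_path: Path) -> dict:
    """Load the manifest, or return an empty one if the file is missing."""
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def needs_processing(pdf: Path, manifest: dict) -> bool:
    """True when the file is new or its content changed since the last run."""
    return manifest.get(str(pdf)) != sha256_of(pdf)


def mark_done(pdf: Path, manifest: dict, manifest_path: Path) -> None:
    """Record the digest after a successful run and persist the manifest."""
    manifest[str(pdf)] = sha256_of(pdf)
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

Recording the digest only after a stage succeeds means a failed conversion is retried on the next run instead of being skipped.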

Project Structure

pdf-to-markdown-pipeline/
├── cortexmark/           # Python package
│   ├── run_pipeline.py                #   Orchestrator and CLI entry point
│   ├── convert.py                     #   PDF → Markdown (docling/markitdown/dual)
│   ├── clean.py                       #   Markdown cleanup and normalization
│   ├── chunk.py                       #   Split by heading levels into numbered chunks
│   ├── common.py                      #   Config, logging, manifest, path utilities
│   ├── render_templates.py            #   Deterministic template rendering
│   ├── parallel.py                    #   Thread/process pool helpers
│   ├── plugin.py                      #   Plugin base class and registry
│   ├── qa_pipeline.py                 #   Quality checks and badge scoring
│   ├── ocr_quality.py                 #   OCR text quality metrics (A–F grade)
│   ├── formula_score.py               #   Formula/equation fidelity scoring
│   ├── metadata.py                    #   Scholarly metadata (BibTeX, APA7, YAML)
│   ├── citations.py                   #   Citation graph extraction (DOT, JSON)
│   ├── doc_type.py                    #   Document type detection with templates
│   ├── topics.py                      #   Topic classification by keyword frequency
│   ├── figures.py                     #   Figure reference catalog
│   ├── diff.py                        #   Version-to-version diff reporting
│   ├── rag_export.py                  #   RAG-oriented JSONL/JSON export
│   ├── semantic_chunk.py              #   Scientific-aware chunking (theorem/proof/def/algo)
│   ├── cross_ref.py                   #   Cross-reference resolution and linking
│   ├── algorithm_extract.py           #   Algorithm/pseudocode extraction and analysis
│   ├── notation_glossary.py           #   Mathematical notation glossary builder
│   ├── formula_validate.py            #   Enhanced LaTeX formula validation
│   ├── citation_context.py            #   Citation context extraction and classification
│   ├── scientific_qa.py               #   Scientific document quality assurance checks
│   ├── multi_format.py                #   HTML / plain text / YAML export
│   └── ghpages.py                     #   GitHub Pages static site generation
├── configs/
│   └── pipeline.yaml                  # Central configuration file
├── tests/
│   └── test_pipeline_structure.py     # 755 tests (70% minimum coverage)
├── data/raw/                          # Source PDF files (user-provided)
│   ├── books/
│   ├── notes/
│   ├── manuscripts/
│   ├── reports/
│   ├── textbooks/chapters/
│   └── theses/
├── outputs/                           # Generated outputs
│   ├── raw_md/                        #   Raw Markdown from conversion
│   ├── cleaned_md/                    #   Cleaned and normalized Markdown
│   └── chunks/                        #   Chunked output sections
├── vscode-extension/                  # VS Code extension v0.3.0 (TypeScript)
│   ├── src/extension.ts               #   Activation, 22 commands, file watchers
│   ├── src/sessionManager.ts          #   Session persistence and events
│   ├── src/sessionTree.ts             #   Tree data provider (Sessions, Actions, Analysis, Outputs)
│   ├── src/pipelineRunner.ts          #   Subprocess spawning with progress bar & cancellation
│   ├── src/previewPanel.ts            #   Markdown preview WebView with QA badges & math
│   ├── src/dashboardPanel.ts          #   Quality metrics dashboard WebView
│   ├── src/chatView.ts                #   Chat panel (11 commands, EN + TR)
│   └── src/types.ts                   #   TypeScript interfaces
├── .github/workflows/ci.yml          # GitHub Actions CI (lint + test + typecheck)
├── pyproject.toml                     # Package metadata, dependencies, tool config
├── Makefile                           # Common developer commands
├── Dockerfile                         # Multi-stage container image
├── docker-compose.yml                 # Pipeline, test, and lint services
└── requirements.txt                   # Pinned runtime dependencies

Troubleshooting

Problem	Resolution
`ImportError: The 'docling' package is required`	Docling is not included in the default installation. Install it with `pip install "cortexmark[docling]"` or `pip install "cortexmark[gpu]"`.
`ModuleNotFoundError: No module named 'cortexmark'`	Make sure the project was installed with `pip install -e .`. Prefer `python -m cortexmark.convert` over `python cortexmark/convert.py`.
`FileNotFoundError: Config file not found`	Pass a valid config path with `--config`, for example `cortexmark --config configs/pipeline.yaml`.
Docling installation fails	Docling may require system libraries and a compiler toolchain. On Debian/Ubuntu, install `build-essential` and `poppler-utils`, or use Docker instead.
Memory issues on large PDFs	Set `num_threads: 1` and `device: cpu` under `convert.docling` in `configs/pipeline.yaml`. If needed, constrain Docker resources explicitly.
Formulas look corrupted	`engine: dual` usually gives the best output. Compare with `--engine docling` and `--engine markitdown` when debugging.
Idempotency does not skip unchanged files	Check that `idempotency.enabled` is `true`, that `--no-manifest` is not being passed, and that `outputs/.manifest.json` is writable. To rebuild from scratch, run `make clean-outputs`.
Tests fail after code changes	Run `make lint && make test && pyright cortexmark/` to check all quality gates.

Development Notes

Tests

make test                          # or: python -m pytest tests/ -v

Linting and formatting

make lint                          # Checks only
make format                        # Apply automatic fixes

Type checking

pyright cortexmark/   # standard mode, currently 0 errors, 0 warnings

Coverage

python -m pytest tests/ --cov=cortexmark --cov-report=term-missing

The minimum coverage threshold is 70%, enforced by pytest-cov.

Pre-commit hooks

pre-commit install
pre-commit run --all-files

Build

pip install build
python -m build                    # creates .tar.gz and .whl files under dist/

License

This project is licensed under the MIT License.

Project details

Maintainers

PythaLab

Release history

0.3.4 (Apr 18, 2026)
0.3.3 (Apr 18, 2026)
0.3.1 (Apr 14, 2026, this version)

Download files

Source Distribution

cortexmark-0.3.1.tar.gz (143.2 kB)

Uploaded Apr 14, 2026 Source

Built Distribution

cortexmark-0.3.1-py3-none-any.whl (109.5 kB)

Uploaded Apr 14, 2026 Python 3

File details

Details for the file cortexmark-0.3.1.tar.gz.

File metadata

Download URL: cortexmark-0.3.1.tar.gz
Upload date: Apr 14, 2026
Size: 143.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cortexmark-0.3.1.tar.gz
Algorithm	Hash digest
SHA256	`55a6b0788e84712cf7c35ba8122e3f3e1fd44d1ed24e638c14f9de456fdde8e3`
MD5	`5325484d2bdeb0c4628b582d312350cd`
BLAKE2b-256	`f1fd1cf8733bcf00ba2e30db11cb4f0ce057019ae4be3e0b11d69d3d50b816d8`

Provenance

The following attestation bundles were made for cortexmark-0.3.1.tar.gz:

Publisher: release.yml on PhiniteLab/pdf-to-markdown-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cortexmark-0.3.1.tar.gz
- Subject digest: 55a6b0788e84712cf7c35ba8122e3f3e1fd44d1ed24e638c14f9de456fdde8e3
- Sigstore transparency entry: 1297842479
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: PhiniteLab/pdf-to-markdown-pipeline@86371c140de3c9bf91fbb8a0865bff445676bf13
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/PhiniteLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@86371c140de3c9bf91fbb8a0865bff445676bf13
- Trigger Event: push

File details

Details for the file cortexmark-0.3.1-py3-none-any.whl.

File metadata

Download URL: cortexmark-0.3.1-py3-none-any.whl
Upload date: Apr 14, 2026
Size: 109.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cortexmark-0.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7f5c8a575a905ee8acd675ba151e6c071279c79d285b54705d6808aa6cef8f55`
MD5	`80a235da87217d7ce4607481ef7721f9`
BLAKE2b-256	`dda57ab5d340540aac38d4adb8cea4f45cc3a602dd51623ce63153e3e7cbedd5`

Provenance

The following attestation bundles were made for cortexmark-0.3.1-py3-none-any.whl:

Publisher: release.yml on PhiniteLab/pdf-to-markdown-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cortexmark-0.3.1-py3-none-any.whl
- Subject digest: 7f5c8a575a905ee8acd675ba151e6c071279c79d285b54705d6808aa6cef8f55
- Sigstore transparency entry: 1297842679
- Sigstore integration time: Apr 14, 2026
Source repository:
- Permalink: PhiniteLab/pdf-to-markdown-pipeline@86371c140de3c9bf91fbb8a0865bff445676bf13
- Branch / Tag: refs/tags/v0.3.2
- Owner: https://github.com/PhiniteLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@86371c140de3c9bf91fbb8a0865bff445676bf13
- Trigger Event: push
