Convert PDF documents to structured Markdown with a multi-stage pipeline.

CortexMark

A multi-stage pipeline that converts PDF documents into structured Markdown, cleans the output, and splits it into chunks. It runs on CPU-only hosts and does not require an LLM or any cloud service.

Features

Dual-engine conversion: combines Docling for structure analysis and markitdown for raw text recovery
Multi-stage processing: PDF → raw Markdown → cleaned Markdown → chunks
Idempotent execution: unchanged files are skipped through a SHA-256 manifest
Parallel processing: thread/process pool support for multi-file workloads
Quality assurance: QA pipeline with GOLD/SILVER/BRONZE/FAIL badges, OCR quality grading (A–F), formula fidelity scoring
Scholarly metadata: extraction of title, authors, abstract, keywords, DOI, and more with BibTeX and APA7 output
Citation analysis: author-year and numeric citation detection, citation graph, Graphviz DOT export
Document classification: automatic detection of paper, textbook, syllabus, slides, report, or generic types
Topic classification: keyword-frequency scoring across RL, ML, NLP, CV, optimization, statistics, math, physics, economics
RAG export: JSONL/JSON output with SHA-256 IDs, token estimates, and chapter/section metadata
Multi-format output: HTML (standalone pages), plain text, YAML with front-matter
GitHub Pages generation: static site with document cards, navigation, and breadcrumbs
Figure extraction: image catalog from Markdown and HTML <img> references with file existence validation
Version diffing: unified diff format with JSON change statistics across file trees
Plugin architecture: custom pipeline hooks (pre_convert, post_convert, pre_clean, post_clean, pre_chunk, post_chunk, post_pipeline) via file-based discovery
Template rendering: deterministic source profile and section template population
Docker support: containerized execution with minimal setup
VS Code extension: session management, Markdown preview panel, quality dashboard, analysis module integration, progress visualization, and a chat-oriented control surface
Test coverage: extensive pytest suite with an enforced minimum coverage threshold of 70%
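The idempotent-execution feature listed above can be illustrated with a minimal sketch. It assumes a flat JSON manifest that maps file paths to SHA-256 digests; the function names and the manifest layout are hypothetical and may differ from what CortexMark actually stores in `outputs/.manifest.json`.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def needs_processing(path: Path, manifest_file: Path) -> bool:
    """Return True for new or changed files and record their current digest."""
    # Assumed manifest layout: {"<file path>": "<sha256 hex digest>"}
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    current = file_sha256(path)
    if manifest.get(str(path)) == current:
        return False  # content unchanged since the last run: skip
    manifest[str(path)] = current
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return True
```

For brevity the sketch records the digest before any work is done; a pipeline would more likely update the manifest only after the file has been processed successfully.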

Dual-Engine Approach

Engine	Strengths	Weaknesses
docling	Structural analysis for headings, formulas, and algorithmic blocks	May skip some paragraphs
markitdown	Extracts more raw text in difficult PDFs	Can turn formulas into table-like artifacts

The default mode, dual, uses Docling output as the structural backbone and fills missing paragraphs from markitdown output through fingerprint matching. Table artifacts and short fragments are filtered automatically.
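The gap-filling idea described above can be sketched as follows. This is an illustration under stated assumptions, not CortexMark's implementation: the fingerprint (a lowercase alphanumeric prefix), the `min_chars` threshold, the pipe-prefix test for table artifacts, and both function names are hypothetical.

```python
import re


def fingerprint(paragraph: str, length: int = 40) -> str:
    """Reduce a paragraph to a short lowercase alphanumeric prefix."""
    return re.sub(r"[^a-z0-9]", "", paragraph.lower())[:length]


def fill_gaps(backbone: list[str], fallback: list[str], min_chars: int = 30) -> list[str]:
    """Insert fallback paragraphs missing from the backbone after their last matched neighbor."""
    merged = list(backbone)
    position = {fingerprint(p): i for i, p in enumerate(merged)}
    insert_at = 0
    for paragraph in fallback:
        key = fingerprint(paragraph)
        if key in position:
            insert_at = position[key] + 1  # already present: remember where we are
            continue
        if paragraph.lstrip().startswith("|") or len(paragraph) < min_chars:
            continue  # drop table-like artifacts and short fragments
        merged.insert(insert_at, paragraph)
        position = {fingerprint(p): i for i, p in enumerate(merged)}
        insert_at += 1
    return merged
```

Here `backbone` stands for the Docling paragraphs and `fallback` for the markitdown paragraphs, both already split on blank lines.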

Migration from PhiniteLab PDF Pipeline

This release rebrands the public package and extension surfaces to CortexMark.

Old surface	New surface
PyPI package `phinitelab-pdf-pipeline`	PyPI package `cortexmark`
Python module `phinitelab_pdf_pipeline`	Python module `cortexmark`
CLI command `phinitelab-pdf-pipeline`	CLI command `cortexmark`
VS Code extension `phinitelab-pdf-pipeline-vscode`	VS Code extension `cortexmark-pipeline-vscode`
Session store `.phinitelab-pdf-pipeline/`	Session store `.cortexmark/`

Notes:

This is a breaking rename for public package, module, CLI, and extension IDs.
The VS Code extension now uses a new extension identity; existing users should install the new cortexmark-pipeline-vscode package manually.
Existing workspace session data is read from the legacy .phinitelab-pdf-pipeline/sessions.json path and copied into .cortexmark/sessions.json automatically when needed.
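The session migration note above amounts to a copy-if-missing step. A minimal sketch, assuming the copy happens only when the new file does not exist yet (the function name and that exact condition are assumptions, not the extension's code):

```python
import shutil
from pathlib import Path

LEGACY_STORE = Path(".phinitelab-pdf-pipeline") / "sessions.json"
NEW_STORE = Path(".cortexmark") / "sessions.json"


def migrate_sessions(workspace: Path) -> bool:
    """Copy legacy session data into the new store; return True if a copy was made."""
    legacy = workspace / LEGACY_STORE
    target = workspace / NEW_STORE
    if target.exists() or not legacy.exists():
        return False  # nothing to migrate, or already migrated
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(legacy, target)  # copy rather than move: the legacy file stays in place
    return True
```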

Installation

Recommended installation paths

Choose the smallest installation that matches your workload.

Scenario	Command	What you get
Lightweight CPU setup	`pip install cortexmark`	Installs the CLI plus the `markitdown` engine
Layout-aware CPU setup	`pip install "cortexmark[docling]"`	Adds Docling for `docling` and `dual` modes
GPU-oriented setup	`pip install "cortexmark[gpu]"`	Same Docling-enabled workflow, intended for CUDA-capable hosts
Developer setup	`pip install -e ".[dev]"`	Runtime + tests + lint/type/build tooling
Docs build setup	`pip install -e ".[docs]"`	MkDocs + mkdocstrings for local docs builds

Requirements and dependency matrix

Item	Needed for	Required?	Notes
Python 3.11+	All installs	Yes	Supported baseline runtime
`cortexmark` package	CLI + modules	Yes	Provides the `cortexmark` command
`markitdown[pdf]`	`markitdown` engine and `dual` gap-fill	Installed by default	Lightweight CPU-friendly path
`docling`	`docling` engine and `dual` structural parsing	Optional	Install via `cortexmark[docling]` or `cortexmark[gpu]`
PyTorch	Docling runtime	Optional	Pre-install CPU PyTorch on CPU-only hosts to avoid large CUDA downloads
Poppler	Some PDF/OCR-adjacent workflows	Optional	Helpful, not mandatory for every document
Tesseract OCR	Scanned/image-heavy PDFs and OCR-style workflows	Optional	Only useful when OCR is needed
Docker	Containerized execution	Optional	Good for reproducible setups

CortexMark does not require an API key, LLM, or cloud service for its core pipeline.
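Because Docling is optional, a wrapper script may want to check for it before requesting the `docling` or `dual` engine. The sketch below uses the standard `importlib.util.find_spec` lookup; the `pick_engine` helper and its fallback policy are hypothetical and are not part of the `cortexmark` package, which reports a missing Docling install with an `ImportError` instead.

```python
from importlib.util import find_spec
from typing import Optional


def pick_engine(preferred: str = "dual", has_docling: Optional[bool] = None) -> str:
    """Return the preferred engine, falling back to markitdown when Docling is absent."""
    if has_docling is None:
        # find_spec returns None when the top-level package cannot be located
        has_docling = find_spec("docling") is not None
    if preferred in {"docling", "dual"} and not has_docling:
        return "markitdown"
    return preferred
```

The result can then be passed to the CLI as `--engine <value>`.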

Install with pip

pip install cortexmark

This installs the lightweight default runtime declared in pyproject.toml and is enough for:

cortexmark --engine markitdown
Markdown cleaning, chunking, export, quality reports, and downstream analysis on produced Markdown

Install with Docling on CPU

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install "cortexmark[docling]"

Use this when you want:

--engine docling
--engine dual
stronger layout recovery for complex academic PDFs

Install with GPU support

pip install "cortexmark[gpu]"

Or preinstall a specific CUDA-targeted PyTorch build first, then install the extra.

System tools

Optional system tools:

Ubuntu / Debian:

```bash
sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr
```

macOS:

```bash
brew install poppler tesseract
```

Windows (WSL):

```bash
sudo apt-get update
sudo apt-get install -y poppler-utils tesseract-ocr
```

These tools are not mandatory for every installation. They become useful when:

your PDFs are scanned or image-heavy,
your environment doctor asks for them,
or your preferred PDF workflow depends on them.

Developer installation

git clone https://github.com/PhiniteLab/pdf-to-markdown-pipeline.git
cd pdf-to-markdown-pipeline
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

For local documentation builds:

pip install -e ".[docs]"

What CortexMark processes

Primary input types

The full pipeline starts from PDF files:

a single .pdf
or a directory tree containing .pdf files

Examples:

cortexmark --input path/to/paper.pdf
cortexmark --input path/to/folder-of-pdfs

Downstream module input types

After conversion, most modules work on Markdown files produced by the pipeline.

Input type	Used by
`.pdf`	`convert`, full `cortexmark` pipeline entrypoint
`.md` trees	`clean`, `chunk`, `metadata`, `citations`, `doc_type`, `topics`, `figures`, `rag_export`, `semantic_chunk`, `cross_ref`, `algorithm_extract`, `notation_glossary`, `formula_validate`, `citation_context`, `scientific_qa`, `multi_format`, `ghpages`
`outline` / `syllabus` `.md` or `.txt` helper files	`render_templates`

Best-fit document categories

CortexMark is optimized for:

academic papers
lecture notes
textbooks and book chapters
theses and reports
math/theorem-heavy PDFs
algorithm/code-heavy technical documents
scanned or noisy PDFs where a fallback recovery path helps

Outputs you will get

By default, CortexMark writes under outputs/.

Output	Typical path	Description
Raw Markdown	`outputs/raw_md/`	First-pass PDF → Markdown conversion
Cleaned Markdown	`outputs/cleaned_md/`	Normalized Markdown with repeated noise reduced
Chunks	`outputs/chunks/`	Section-based chunk files such as `chunk_001_Introduction.md`
Semantic chunks	`outputs/semantic_chunks/`	Theorem/proof/definition-aware chunk artifacts
Quality reports	`outputs/quality/`	QA, citation, formula, cross-ref, notation, and scientific validation reports
Rendered templates	render-specific folders	Source profile and section/task templates
RAG exports	quality/export locations	JSON / JSONL records with chunk IDs and scholarly metadata
Static site exports	GitHub Pages / HTML outputs	HTML pages for browsing processed content

With --session-name, all of the above are isolated under sessions/<session-name>/....

Basic usage

Important: the published cortexmark package ships the CLI and Python modules, but it does not ship the repository's example configs/pipeline.yaml.

If you are working from a cloned repository, you can use the checked-in configs/pipeline.yaml. If you installed from PyPI into a fresh working directory, create your own config file first as shown in docs/getting-started/quickstart.md.

CLI command

After installation, use the cortexmark command:

# Run the default pipeline
cortexmark

# Use a specific file or directory
cortexmark --input path/to/paper.pdf
cortexmark --input path/to/folder-of-pdfs

# Choose a conversion engine
cortexmark --engine markitdown
cortexmark --engine docling
cortexmark --engine dual

# Run only selected stages
cortexmark --stages convert clean
cortexmark --stages analyze validate

# Isolate outputs inside a named session workspace
cortexmark --session-name experiment-1

Run cortexmark --help to view all arguments.
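When the CLI is driven from a script, the flags shown above can be assembled programmatically. The sketch below uses only the documented `--input`, `--engine`, `--stages`, and `--session-name` flags; the helper names `build_command` and `run_pipeline` are hypothetical.

```python
import subprocess
from typing import Optional, Sequence


def build_command(
    input_path: Optional[str] = None,
    engine: Optional[str] = None,
    stages: Optional[Sequence[str]] = None,
    session_name: Optional[str] = None,
) -> list[str]:
    """Assemble a cortexmark invocation as an argument list (no shell quoting needed)."""
    command = ["cortexmark"]
    if input_path:
        command += ["--input", input_path]
    if engine:
        command += ["--engine", engine]
    if stages:
        command += ["--stages", *stages]
    if session_name:
        command += ["--session-name", session_name]
    return command


def run_pipeline(**options) -> int:
    """Run the CLI and return its exit code; requires cortexmark on PATH."""
    return subprocess.run(build_command(**options), check=False).returncode
```

For example, `build_command(input_path="paper.pdf", stages=["convert", "clean"])` yields the same invocation as `cortexmark --input paper.pdf --stages convert clean`.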

Makefile shortcuts

make help                # List available commands
make all                 # Run convert → clean → chunk → render
make analyze             # Run semantic/cross-ref/algorithm/notation modules
make validate            # Run formula/scientific QA/citation validation modules
make benchmark-reference # Run the reference benchmark gate
make test                # Run pytest
make lint                # Run Ruff checks

Run modules directly

python -m cortexmark.convert --input data/raw/paper.pdf --engine docling
python -m cortexmark.clean --input outputs/raw_md --output-dir outputs/cleaned_md
python -m cortexmark.chunk --input outputs/cleaned_md --output-dir outputs/chunks
python -m cortexmark.cross_ref --input outputs/cleaned_md
python -m cortexmark.rag_export --input outputs/chunks
python -m cortexmark.reference_eval --benchmarks benchmarks/references --baseline benchmarks/references/baseline.json
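To give a feel for what a RAG export record might contain, here is a sketch that builds JSONL lines with a SHA-256 ID, a token estimate, and chapter/section metadata. The field names, the choice to hash the chunk text, and the four-characters-per-token heuristic are all assumptions; inspect the actual `rag_export` output for the real schema.

```python
import hashlib
import json


def make_record(text: str, chapter: str, section: str) -> dict:
    """Build one export record; every field name here is illustrative."""
    return {
        "id": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "text": text,
        "token_estimate": max(1, len(text) // 4),  # rough heuristic, not a tokenizer
        "chapter": chapter,
        "section": section,
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

Deriving the ID from the content keeps it stable across runs as long as the chunk text does not change.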

Pipeline stages

PDF files
   │
   ▼
┌──────────┐   raw_md/    ┌───────┐  cleaned_md/  ┌───────┐  chunks/
│ convert  │ ───────────► │ clean │ ────────────► │ chunk │ ──────────► output
│ (engine) │              │       │               │       │
└──────────┘              └───────┘               └───────┘
                                                       │
                                                       ▼
                                               ┌───────────────┐
                                               │    render     │
                                               │  templates    │
                                               └───────────────┘

convert — PDF → raw Markdown using markitdown, docling, or dual
clean — normalize repeated headers/footers, line wraps, and noisy formatting
chunk — split cleaned Markdown into logical sections
render (optional) — generate source-profile and section templates
analyze (optional) — semantic chunking, cross-reference analysis, algorithm extraction, notation glossary
validate (optional) — formula validation, citation context extraction, scientific QA
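The chunk stage can be pictured with a short sketch that starts a new chunk at every heading whose level is in `split_levels` and names files in the `chunk_001_Introduction.md` style. The slug rule and the `Preamble` label for text before the first heading are assumptions, and the sketch does not guard against `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str, split_levels=(1, 2)) -> list[tuple[str, str]]:
    """Return (filename, body) pairs, one per chunk."""
    chunks: list[tuple[str, list[str]]] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) in split_levels:
            chunks.append((match.group(2).strip(), [line]))  # heading opens a new chunk
        elif chunks:
            chunks[-1][1].append(line)  # deeper headings and body text stay in the current chunk
        else:
            chunks.append(("Preamble", [line]))  # text before the first heading
    named = []
    for index, (title, lines) in enumerate(chunks, start=1):
        slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_") or "Untitled"
        named.append((f"chunk_{index:03d}_{slug}.md", "\n".join(lines)))
    return named
```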

VS Code extension

The published VS Code extension is PhiniteLab.cortexmark-pipeline-vscode.

Install it from the VS Code Extensions view by searching for CortexMark Pipeline, then:

install the extension,
install the Python backend separately with pip install cortexmark (or cortexmark[docling]),
open your workspace,
run CortexMark: Environment Doctor,
create a session and add PDFs.

The extension documentation lives in:

vscode-extension/README.md
docs/vscode/setup.md
docs/vscode/commands.md

Portable path and environment configuration

CortexMark resolves runtime paths with a stable precedence order:

CLI arguments
environment variables
workspace or project .env
configs/pipeline.yaml
repo-relative defaults

Useful overrides include PROJECT_ROOT, DATA_DIR, OUTPUT_DIR, REPORT_DIR, LOG_DIR, CHECKPOINT_DIR, CACHE_DIR, MODEL_DIR, EXTERNAL_BIN_DIR, plus direct output overrides such as RAW_DATA_DIR, OUTPUT_RAW_MD, OUTPUT_CLEANED_MD, OUTPUT_CHUNKS, OUTPUT_SEMANTIC_CHUNKS, and MANIFEST_FILE.

A ready-to-copy template is available in .env.example.
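The precedence order above can be expressed as a first-match lookup. This is a sketch under simplifying assumptions: the `.env` file and the YAML config are taken as already-parsed flat mappings keyed by the same name, whereas the real `configs/pipeline.yaml` nests its path settings; `resolve_setting` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    cli_value: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    dotenv: Optional[Mapping[str, str]] = None,
    config: Optional[Mapping[str, str]] = None,
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first non-empty value: CLI, environment, .env, config, then default."""
    environ = os.environ if environ is None else environ
    candidates = (
        cli_value,
        environ.get(name),
        (dotenv or {}).get(name),
        (config or {}).get(name),
    )
    for candidate in candidates:
        if candidate:  # None and empty strings fall through to the next source
            return candidate
    return default
```

With `OUTPUT_DIR` set in both the environment and `.env`, the environment value wins unless a CLI argument is supplied.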

Docker

For a containerized workflow:

docker compose up pipeline
docker compose --profile test up

This is useful when you want a reproducible local environment or do not want to manage host dependencies manually.

Configuration

All settings are controlled from configs/pipeline.yaml:

source_id: default

paths:
  data_raw: data/raw
  output_raw_md: outputs/raw_md
  output_cleaned_md: outputs/cleaned_md
  output_chunks: outputs/chunks
  output_quality: outputs/quality
  output_semantic_chunks: outputs/semantic_chunks

convert:
  engine: dual                         # docling | markitdown | dual
  docling:
    device: auto                       # auto | cpu | cuda
    num_threads: 1
    do_ocr: false
    do_table_structure: true
    table_structure_mode: accurate     # accurate | fast
  markitdown:
    enabled: true

clean:
  min_repeated_header_count: 3
  max_repeated_header_length: 80

chunk:
  split_levels: [1, 2]                 # Heading levels that trigger new chunks

render_templates:
  outline_file: 00_meta/outline.md
  language: en
  max_summary_chars: 240
  max_scope_items: 6
  max_tasks: 5

logging:
  level: INFO                          # DEBUG | INFO | WARNING | ERROR
  format: "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
  date_format: "%Y-%m-%d %H:%M:%S"

idempotency:
  enabled: true
  manifest_file: outputs/.manifest.json

Any script can receive an alternative config file through --config <path>.
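As an illustration of how the two `clean` settings might interact, the sketch below treats any line that occurs at least `min_repeated_header_count` times and is at most `max_repeated_header_length` characters long as a repeated header or footer and removes it. The exact matching rule used by CortexMark's cleaner is not documented here, so this logic and the function name are assumptions.

```python
from collections import Counter


def drop_repeated_lines(markdown: str, min_count: int = 3, max_length: int = 80) -> str:
    """Remove short lines that repeat often enough to look like page headers or footers."""
    lines = markdown.splitlines()
    # Count non-blank lines after trimming surrounding whitespace
    counts = Counter(line.strip() for line in lines if line.strip())
    repeated = {
        text for text, count in counts.items()
        if count >= min_count and len(text) <= max_length
    }
    return "\n".join(line for line in lines if line.strip() not in repeated)
```

Raising `min_count` makes the filter more conservative, which helps when a document legitimately repeats short lines.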

Project Structure

pdf-to-markdown-pipeline/
├── cortexmark/           # Python package
│   ├── run_pipeline.py                #   Orchestrator and CLI entry point
│   ├── convert.py                     #   PDF → Markdown (docling/markitdown/dual)
│   ├── clean.py                       #   Markdown cleanup and normalization
│   ├── chunk.py                       #   Split by heading levels into numbered chunks
│   ├── common.py                      #   Config, logging, manifest, path utilities
│   ├── render_templates.py            #   Deterministic template rendering
│   ├── parallel.py                    #   Thread/process pool helpers
│   ├── plugin.py                      #   Plugin base class and registry
│   ├── qa_pipeline.py                 #   Quality checks and badge scoring
│   ├── ocr_quality.py                 #   OCR text quality metrics (A–F grade)
│   ├── formula_score.py               #   Formula/equation fidelity scoring
│   ├── metadata.py                    #   Scholarly metadata (BibTeX, APA7, YAML)
│   ├── citations.py                   #   Citation graph extraction (DOT, JSON)
│   ├── doc_type.py                    #   Document type detection with templates
│   ├── topics.py                      #   Topic classification by keyword frequency
│   ├── figures.py                     #   Figure reference catalog
│   ├── diff.py                        #   Version-to-version diff reporting
│   ├── rag_export.py                  #   RAG-oriented JSONL/JSON export
│   ├── semantic_chunk.py              #   Scientific-aware chunking (theorem/proof/def/algo)
│   ├── cross_ref.py                   #   Cross-reference resolution and linking
│   ├── algorithm_extract.py           #   Algorithm/pseudocode extraction and analysis
│   ├── notation_glossary.py           #   Mathematical notation glossary builder
│   ├── formula_validate.py            #   Enhanced LaTeX formula validation
│   ├── citation_context.py            #   Citation context extraction and classification
│   ├── scientific_qa.py               #   Scientific document quality assurance checks
│   ├── multi_format.py                #   HTML / plain text / YAML export
│   └── ghpages.py                     #   GitHub Pages static site generation
├── configs/
│   └── pipeline.yaml                  # Central configuration file
├── tests/
│   └── test_pipeline_structure.py     # Extensive pytest suite (70% minimum coverage)
├── data/raw/                          # Source PDF files (user-provided)
│   ├── books/
│   ├── notes/
│   ├── manuscripts/
│   ├── reports/
│   ├── textbooks/chapters/
│   └── theses/
├── outputs/                           # Generated outputs
│   ├── raw_md/                        #   Raw Markdown from conversion
│   ├── cleaned_md/                    #   Cleaned and normalized Markdown
│   └── chunks/                        #   Chunked output sections
├── vscode-extension/                  # VS Code extension v0.3.4 (TypeScript)
│   ├── src/extension.ts               #   Activation, command registration, file watchers
│   ├── src/sessionManager.ts          #   Session persistence and events
│   ├── src/sessionTree.ts             #   Tree data provider (Sessions, Actions, Analysis, Outputs)
│   ├── src/pipelineRunner.ts          #   Subprocess spawning with progress bar & cancellation
│   ├── src/previewPanel.ts            #   Markdown preview WebView with QA badges & math
│   ├── src/dashboardPanel.ts          #   Quality metrics dashboard WebView
│   ├── src/chatView.ts                #   Chat panel with command-driven workflows
│   └── src/types.ts                   #   TypeScript interfaces
├── .github/workflows/ci.yml           # GitHub Actions CI (lint + test + typecheck)
├── pyproject.toml                     # Package metadata, dependencies, tool config
├── Makefile                           # Common developer commands
├── Dockerfile                         # Multi-stage container image
├── docker-compose.yml                 # Pipeline, test, and lint services
└── requirements.txt                   # Pinned runtime dependencies

Troubleshooting

Problem	Resolution
`ImportError: The 'docling' package is required`	Docling is not included in the default installation. Install it with `pip install "cortexmark[docling]"` or `pip install "cortexmark[gpu]"`.
`ModuleNotFoundError: No module named 'cortexmark'`	Make sure the package is installed in the active environment: `pip install cortexmark` from PyPI, or `pip install -e .` from a cloned repository. Prefer `python -m cortexmark.convert` over `python cortexmark/convert.py`.
`FileNotFoundError: Config file not found`	Pass a valid config path with `--config`, for example `cortexmark --config configs/pipeline.yaml`.
Docling installation fails	Docling may require system libraries and a compiler toolchain. On Debian/Ubuntu, install `build-essential` and `poppler-utils`, or use Docker instead.
Memory issues on large PDFs	Set `num_threads: 1` and `device: cpu` under `convert.docling` in `configs/pipeline.yaml`. If needed, constrain Docker resources explicitly.
Formulas look corrupted	`engine: dual` usually gives the best output. Compare with `--engine docling` and `--engine markitdown` when debugging.
Idempotency does not skip unchanged files	Check that `outputs/.manifest.json` is writable. To rebuild from scratch, run `make clean-outputs`.
Tests fail after code changes	Run `make lint && make test && pyright cortexmark/` to check all quality gates.

Development Notes

Tests

make test                          # or: python -m pytest tests/ -v

Linting and formatting

make lint                          # Checks only
make format                        # Apply automatic fixes

Type checking

pyright cortexmark/   # standard mode, currently 0 errors, 0 warnings

Coverage

python -m pytest tests/ --cov=cortexmark --cov-report=term-missing

The minimum coverage threshold is 70%, enforced by pytest-cov.

Pre-commit hooks

pre-commit install
pre-commit run --all-files

Build

pip install build
python -m build                    # creates .tar.gz and .whl files under dist/

License

This project is licensed under the MIT License.

Project details

Maintainers: PythaLab

Release history

Version	Release date
0.3.4 (this version)	Apr 18, 2026
0.3.3	Apr 18, 2026
0.3.1	Apr 14, 2026

Download files

File	Distribution	Size	Uploaded
cortexmark-0.3.4.tar.gz	Source	172.3 kB	Apr 18, 2026
cortexmark-0.3.4-py3-none-any.whl	Built (Python 3)	133.5 kB	Apr 18, 2026

File details

Details for the file cortexmark-0.3.4.tar.gz.

File metadata

Download URL: cortexmark-0.3.4.tar.gz
Upload date: Apr 18, 2026
Size: 172.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cortexmark-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`672cf4080b350e792acaa1cff460dde92c2ca19f14bc35eaddbd437d0e47d3d8`
MD5	`154fbbd664bb85c316292543fd4fdb15`
BLAKE2b-256	`db99c7e54e81778e0d76656d2e93f0e2e7e0fc0b67511997efe4032d92eaf107`

Provenance

The following attestation bundles were made for cortexmark-0.3.4.tar.gz:

Publisher: release.yml on PhiniteLab/pdf-to-markdown-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cortexmark-0.3.4.tar.gz
- Subject digest: 672cf4080b350e792acaa1cff460dde92c2ca19f14bc35eaddbd437d0e47d3d8
- Sigstore transparency entry: 1338598411
- Sigstore integration time: Apr 18, 2026
Source repository:
- Permalink: PhiniteLab/pdf-to-markdown-pipeline@e39a901471927f5353b6a902e99efaaa6db7b1b6
- Branch / Tag: refs/tags/v0.3.4
- Owner: https://github.com/PhiniteLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e39a901471927f5353b6a902e99efaaa6db7b1b6
- Trigger Event: push

File details

Details for the file cortexmark-0.3.4-py3-none-any.whl.

File metadata

Download URL: cortexmark-0.3.4-py3-none-any.whl
Upload date: Apr 18, 2026
Size: 133.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cortexmark-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`31811b05ea284894000e7ac9725938600b7a517ec1e098372aab30f15110ee5f`
MD5	`e96bd7801e9b94ea4a5b8f94042ecb0e`
BLAKE2b-256	`d0cc7a01fbe4d53479a6210978482fa91b9ee59e333720fcc5d1524877785618`

Provenance

The following attestation bundles were made for cortexmark-0.3.4-py3-none-any.whl:

Publisher: release.yml on PhiniteLab/pdf-to-markdown-pipeline

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cortexmark-0.3.4-py3-none-any.whl
- Subject digest: 31811b05ea284894000e7ac9725938600b7a517ec1e098372aab30f15110ee5f
- Sigstore transparency entry: 1338598415
- Sigstore integration time: Apr 18, 2026
Source repository:
- Permalink: PhiniteLab/pdf-to-markdown-pipeline@e39a901471927f5353b6a902e99efaaa6db7b1b6
- Branch / Tag: refs/tags/v0.3.4
- Owner: https://github.com/PhiniteLab
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e39a901471927f5353b6a902e99efaaa6db7b1b6
- Trigger Event: push
