High-fidelity PDF-to-markdown pipeline for financial filings. Multilingual, jurisdiction-aware.
Reason this release was yanked:
Project deprecated and removed
Project description
Cartographer
High-fidelity PDF-to-markdown pipeline for financial filings. Multilingual, jurisdiction-aware, deterministic.
Cartographer takes annual reports, quarterly filings, and regulatory documents and produces structured markdown plus classified sections (Income Statement, Balance Sheet, Cash Flow, MD&A, Risk Factors, numbered Notes) that downstream pipelines can consume directly. It works across 9 languages and 10+ jurisdictions, from SEC 10-Ks to ESEF annual reports to HKEX filings.
Why it exists
Commercial PDF-to-markdown services for financial documents are expensive, opaque, and often miss the structural cues that matter most: note numbering hierarchies, table reconstruction from broken cells, multi-column layouts, and language-specific section headers. Cartographer is deterministic: regex and structural rules do the cleanup, and LLMs are used only as an optional fallback for image-heavy pages. Every transformation is auditable. In internal benchmarks across 6 jurisdictions, it outperformed a commercial baseline on deterministic structural metrics.
Quickstart
pip install cartographer-filings
from cartographer import Pipeline
pipe = Pipeline()
result = pipe.extract("annual_report.pdf")
print(result["metadata"])
# {"company": "Siemens AG", "currency": "EUR", "fiscal_year": 2025, "standard": "IFRS", ...}
print(f"{len(result['notes'])} notes across {result['stats']['sections_found']} classified sections")
print(result["sections"]["income_statement"][:500])
Pipeline
flowchart LR
PDF[PDF file] --> RAW[pymupdf4llm<br/>raw markdown]
RAW --> ENH[enhance<br/>tables • headings • noise]
ENH --> CLEAN[clean markdown]
CLEAN --> PARSE[parser<br/>sections + notes]
PARSE --> OUT[structured dict]
ENH -.optional.-> VIS[Qwen-VL<br/>image-heavy pages]
VIS -.-> CLEAN
Three deterministic stages plus one optional LLM fallback:
- `enhance`: reconstructs tables from `<br>`-stacked cells, promotes bold runs to headings, strips repeated page headers/footers, and normalises negative number formats (`(1,234)` → `-1,234`).
- `parser`: classifies sections (IS, BS, CF, MD&A, Risk, Audit), detects numbered notes with V5 type mapping, and extracts metadata (company, currency, fiscal year, reporting standard) across 9 languages.
- `vision`: only triggered on image-heavy pages where text extraction yields nothing useful. Uses Qwen-VL via SiliconFlow. Opt-in via API key.
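Two of the `enhance` rules listed above can be sketched in a few lines. This is a minimal illustration and not Cartographer's actual implementation: the function names `normalise_negatives` and `strip_repeated_lines`, the regex, and the `min_share` threshold are all assumptions made for this example.

```python
import re
from collections import Counter

# Accounting-style negatives such as "(1,234)" or "(1,234.56)".
# The pattern insists on proper thousands grouping, so a bare year like
# "(2025)" is left alone. Short values such as "(12)" would still be
# rewritten, so a real rule set probably needs more context than this.
_PAREN_NEGATIVE = re.compile(r"\((\d{1,3}(?:,\d{3})*(?:\.\d+)?)\)")


def normalise_negatives(text: str) -> str:
    """Rewrite parenthesised numbers as minus-prefixed numbers."""
    return _PAREN_NEGATIVE.sub(r"-\1", text)


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Drop lines that recur on many pages (likely running headers/footers)."""
    counts: Counter[str] = Counter()
    for page in pages:
        # Count each distinct non-blank line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    # A line must appear on at least two pages to count as repeated.
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if line.strip() not in repeated)
        for page in pages
    ]
```

Both helpers are pure functions over strings, which fits the README's point that every transformation should be auditable: the same input always yields the same output.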
Output schema
{
    "markdown": str,                  # Enhanced markdown, full document
    "metadata": {
        "company": str | None,
        "currency": str | None,       # ISO 4217 code
        "fiscal_year": int | None,
        "period_end": str | None,     # ISO date
        "standard": str | None,       # IFRS, US-GAAP, K-IFRS, ...
        "language": str | None,
    },
    "sections": {
        "income_statement": str | None,
        "balance_sheet": str | None,
        "cash_flow": str | None,
        "mda": str | None,
        "risk": str | None,
        "audit": str | None,
    },
    "notes": [
        {
            "note_number": int,
            "title": str,
            "content": str,
            "v5_type": str | None,    # Classified note type (e.g. U03, Related Parties)
            "start_char": int,
            "end_char": int,
        },
        ...
    ],
    "stats": {
        "pages": int,
        "raw_chars": int,
        "enhanced_chars": int,
        "sections_found": int,
        "notes_count": int,
        "vision_pages": int,
        "time_seconds": float,
    },
}
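A downstream consumer might walk a result shaped like the schema above as follows. This is a sketch under stated assumptions: `summarise` and `note_span` are hypothetical helpers, not part of the library; `sample` is a hand-written toy dict, not real output; and it assumes `start_char` / `end_char` are offsets into the top-level `markdown` string, which the schema does not state explicitly.

```python
from __future__ import annotations

from typing import Any


def summarise(result: dict[str, Any]) -> dict[str, Any]:
    """List which sections were found and group note numbers by v5_type."""
    present = sorted(k for k, v in result["sections"].items() if v is not None)
    notes_by_type: dict[str, list[int]] = {}
    for note in result["notes"]:
        key = note["v5_type"] or "unclassified"
        notes_by_type.setdefault(key, []).append(note["note_number"])
    return {"sections_present": present, "notes_by_type": notes_by_type}


def note_span(result: dict[str, Any], note: dict[str, Any]) -> str:
    """Slice a note out of the full markdown.

    Assumption: the offsets index into result["markdown"].
    """
    return result["markdown"][note["start_char"]:note["end_char"]]


# Toy stand-in for a pipeline result; only the keys used above are filled in.
_markdown = "# Notes\n\n1. Related Parties\nDetails.\n\n2. Leases\nMore."
sample = {
    "markdown": _markdown,
    "sections": {
        "income_statement": "| Revenue | 100 |",
        "balance_sheet": "| Assets | 250 |",
        "cash_flow": None, "mda": None, "risk": None, "audit": None,
    },
    "notes": [
        # "U03" is copied from the schema comment; pairing it with this
        # title is arbitrary and only for illustration.
        {"note_number": 1, "title": "Related Parties", "content": "Details.",
         "v5_type": "U03",
         "start_char": _markdown.index("1. Related"),
         "end_char": _markdown.index("2. Leases")},
        {"note_number": 2, "title": "Leases", "content": "More.",
         "v5_type": None,
         "start_char": _markdown.index("2. Leases"),
         "end_char": len(_markdown)},
    ],
}

print(summarise(sample))
```

Because every section value and `v5_type` can be `None`, consumers should check for missing values instead of indexing blindly.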
Vision fallback
Image-heavy pages (scanned filings, heavy-graphic annual reports) are only handled when you provide a SiliconFlow API key:
from cartographer import Pipeline
pipe = Pipeline(vision_api_key="sk-...") # or SILICONFLOW_API_KEY env var
result = pipe.extract("scanned_report.pdf")
Vision is a fallback, not a default path. The overwhelming majority of financial filings parse correctly through the deterministic pipeline alone.
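The opt-in behaviour described above could be modelled roughly like this. It is only a sketch: `resolve_vision_key`, `needs_vision`, `route_page`, and the 50-character threshold are hypothetical, and the library's real heuristic for "image-heavy" pages is not documented here. The only detail taken from the README is the `SILICONFLOW_API_KEY` environment variable.

```python
from __future__ import annotations

import os


def resolve_vision_key(explicit_key: str | None = None) -> str | None:
    """Prefer an explicit key, then fall back to the environment variable."""
    return explicit_key or os.environ.get("SILICONFLOW_API_KEY") or None


def needs_vision(page_text: str, image_count: int, min_chars: int = 50) -> bool:
    """Assumed heuristic: the page has images but almost no extractable text."""
    return image_count > 0 and len(page_text.strip()) < min_chars


def route_page(page_text: str, image_count: int, api_key: str | None) -> str:
    """Send a page to the vision fallback only when a key is configured."""
    if api_key and needs_vision(page_text, image_count):
        return "vision"
    return "deterministic"
```

With no key configured, every page stays on the deterministic path, which matches the statement that vision is a fallback and not a default.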
Supported languages
English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Chinese (partial).
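To give a feel for what language-specific section headers involve, here is a small keyword-based classifier. It is an illustrative sketch, not the parser's actual rule set: `SECTION_KEYWORDS` and `classify_heading` are hypothetical names, and the keyword lists cover only a handful of common statement titles in four of the supported languages.

```python
from __future__ import annotations

import unicodedata

# Illustrative keywords only (English, German, French, Portuguese).
SECTION_KEYWORDS: dict[str, list[str]] = {
    "income_statement": [
        "income statement",
        "gewinn- und verlustrechnung",
        "compte de résultat",
        "demonstração dos resultados",
    ],
    "balance_sheet": [
        "balance sheet",
        "statement of financial position",
        "bilanz",
        "bilan",
        "balanço",
    ],
    "cash_flow": [
        "cash flow",
        "kapitalflussrechnung",
        "flux de trésorerie",
        "fluxos de caixa",
    ],
}


def _fold(text: str) -> str:
    """Case-fold and strip diacritics so 'Trésorerie' matches 'tresorerie'."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def classify_heading(heading: str) -> str | None:
    """Return the first section whose keyword occurs in the heading, else None."""
    folded = _fold(heading)
    for section, keywords in SECTION_KEYWORDS.items():
        if any(_fold(keyword) in folded for keyword in keywords):
            return section
    return None
```

Plain substring matching keeps the rule easy to audit, at the cost of false positives on headings that merely mention a statement name.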
Jurisdictions tested
| Region | Standard | Examples |
|---|---|---|
| United States | US-GAAP | SEC 10-K, 10-Q |
| Germany | IFRS | ESEF filings, Bundesanzeiger |
| Italy | IFRS | CONSOB / SDIR |
| Portugal | IFRS | CMVM filings |
| Korea | K-IFRS | DART filings |
| Hong Kong | IFRS (HK) | HKEXnews |
| Australia | IFRS | ASX listings |
Development
git clone https://github.com/hugocondesa-debug/cartographer.git
cd cartographer
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/
See CONTRIBUTING.md for details.
Roadmap
- PyPI release (`pip install cartographer-filings`)
- Regression test suite fetching from primary sources (SEC, ESEF, HKEXnews)
- Thai, Arabic, additional CJK coverage
- Multi-column prospectus layouts
- Table-of-contents anchor refinement
- Full audit pass on `parser.py` to tighten ruff rules
Origin
Cartographer was extracted from Atlas, a personal financial data platform. It exists as a standalone library because the pipeline has utility beyond any single consumer — anyone processing annual reports or regulatory filings at scale will benefit.
License
MIT. See LICENSE.
Built by Hugo Condesa.
File details
Details for the file cartographer_filings-0.1.0.tar.gz.
File metadata
- Download URL: cartographer_filings-0.1.0.tar.gz
- Upload date:
- Size: 32.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b5b5c3eacfb309468b5c7edf762eb3238b1ee994ae3c896e499cb835ef7f96d6 |
| MD5 | 7dca5ed1253ad3924cdfc3add633dba9 |
| BLAKE2b-256 | b09cc3b2e1e9b9710342e1062303a20e40abd725c04ef101d24a44da87e2e923 |
File details
Details for the file cartographer_filings-0.1.0-py3-none-any.whl.
File metadata
- Download URL: cartographer_filings-0.1.0-py3-none-any.whl
- Upload date:
- Size: 31.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9a851ff8611adf9fd06da306292f7c500b2601f970ed8e735157b57e72a8050d |
| MD5 | 0f06379d5acc325ba563b1ea1996a877 |
| BLAKE2b-256 | 52a34cc47ac77d285afac20f82b1c428310244049c0c20e37058e59164ee5e04 |