
High-fidelity PDF-to-markdown pipeline for financial filings. Multilingual, jurisdiction-aware.

Reason this release was yanked:

Project deprecated and removed


Cartographer

Python 3.10+ · License: MIT · Code style: ruff

High-fidelity PDF-to-markdown pipeline for financial filings. Multilingual, jurisdiction-aware, deterministic.

Cartographer takes annual reports, quarterly filings, and regulatory documents and produces structured markdown plus classified sections (Income Statement, Balance Sheet, Cash Flow, MD&A, Risk Factors, numbered Notes) that downstream pipelines can consume directly. It works across 9 languages and 10+ jurisdictions, from SEC 10-Ks to ESEF annual reports to HKEX filings.

Why it exists

Commercial PDF-to-markdown services for financial documents are expensive, opaque, and often miss the structural cues that matter most: note numbering hierarchies, table reconstruction from broken cells, multi-column layouts, and language-specific section headers. Cartographer is deterministic: regex and structural rules do the cleanup, and LLMs are used only as an optional fallback for image-heavy pages. Every transformation is auditable. In internal benchmarks across 6 jurisdictions, it outperformed a commercial baseline on deterministic structural metrics.

Quickstart

pip install cartographer-filings

from cartographer import Pipeline

pipe = Pipeline()
result = pipe.extract("annual_report.pdf")

print(result["metadata"])
# {"company": "Siemens AG", "currency": "EUR", "fiscal_year": 2025, "standard": "IFRS", ...}

print(f"{len(result['notes'])} notes across {result['stats']['sections_found']} classified sections")
print(result["sections"]["income_statement"][:500])

Pipeline

flowchart LR
    PDF[PDF file] --> RAW[pymupdf4llm<br/>raw markdown]
    RAW --> ENH[enhance<br/>tables • headings • noise]
    ENH --> CLEAN[clean markdown]
    CLEAN --> PARSE[parser<br/>sections + notes]
    PARSE --> OUT[structured dict]
    ENH -.optional.-> VIS[Qwen-VL<br/>image-heavy pages]
    VIS -.-> CLEAN

Three deterministic stages plus one optional LLM fallback:

  • enhance — reconstructs tables from <br>-stacked cells, promotes bold runs to headings, strips repeated page headers/footers, normalises negative number formats ((1,234) → -1,234).
  • parser — classifies sections (IS, BS, CF, MD&A, Risk, Audit), detects numbered notes with V5 type mapping, extracts metadata (company, currency, fiscal year, reporting standard) across 9 languages.
  • vision — only triggered on image-heavy pages where text extraction yields nothing useful. Uses Qwen-VL via SiliconFlow. Opt-in via API key.
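
Two of the enhance rules listed above can be illustrated with a minimal sketch. This is not Cartographer's actual implementation: the function names `normalise_negatives` and `promote_bold_headings`, the regexes, and the default heading level are all assumptions made for illustration.

```python
import re

# Accounting-style negatives: "(1,234)" or "(1,234.56)".
# Assumption: only comma-grouped digits count as amounts, so "(2024)" is left
# alone, but a short footnote marker such as "(3)" would also be rewritten.
_PAREN_NEGATIVE = re.compile(r"\((\d{1,3}(?:,\d{3})*(?:\.\d+)?)\)")

# A line that consists solely of one bold run, e.g. "**Balance Sheet**".
_BOLD_LINE = re.compile(r"^\*\*([^*]+)\*\*\s*$")


def normalise_negatives(text: str) -> str:
    """Rewrite parenthesised amounts as minus-prefixed amounts."""
    return _PAREN_NEGATIVE.sub(r"-\1", text)


def promote_bold_headings(markdown: str, level: int = 2) -> str:
    """Turn lines that are a single bold run into markdown headings."""
    prefix = "#" * level
    out = []
    for line in markdown.splitlines():
        match = _BOLD_LINE.match(line.strip())
        out.append(f"{prefix} {match.group(1).strip()}" if match else line)
    return "\n".join(out)
```

Because both rules are pure string transformations, the same input always yields the same output, which is what makes each step auditable.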

Output schema

{
    "markdown": str,                  # Enhanced markdown, full document
    "metadata": {
        "company": str | None,
        "currency": str | None,       # ISO 4217 code
        "fiscal_year": int | None,
        "period_end": str | None,     # ISO date
        "standard": str | None,       # IFRS, US-GAAP, K-IFRS, ...
        "language": str | None,
    },
    "sections": {
        "income_statement": str | None,
        "balance_sheet": str | None,
        "cash_flow": str | None,
        "mda": str | None,
        "risk": str | None,
        "audit": str | None,
    },
    "notes": [
        {
            "note_number": int,
            "title": str,
            "content": str,
            "v5_type": str | None,    # Classified note type (e.g. U03, Related Parties)
            "start_char": int,
            "end_char": int,
        },
        ...
    ],
    "stats": {
        "pages": int,
        "raw_chars": int,
        "enhanced_chars": int,
        "sections_found": int,
        "notes_count": int,
        "vision_pages": int,
        "time_seconds": float,
    },
}
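
As a hedged illustration of consuming this structure downstream, the sketch below summarises a dict shaped like the schema above. The helper name `summarise_result` and the `"unclassified"` label are hypothetical and not part of the library.

```python
from collections import Counter
from typing import Any


def summarise_result(result: dict[str, Any]) -> dict[str, Any]:
    """Summarise a result dict that follows the output schema."""
    sections = result.get("sections", {})
    notes = result.get("notes", [])
    # A section counts as found when its value is a non-empty string.
    found = sorted(name for name, text in sections.items() if text)
    missing = sorted(name for name, text in sections.items() if not text)
    # Group notes by v5_type; None is reported under a hypothetical label.
    by_type = Counter(note.get("v5_type") or "unclassified" for note in notes)
    return {
        "sections_found": found,
        "sections_missing": missing,
        "notes_by_type": dict(by_type),
    }
```

Calling `summarise_result(pipe.extract("annual_report.pdf"))` would then give a quick view of which sections were classified and how the notes were typed.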

Vision fallback

Image-heavy pages (scanned filings, heavy-graphic annual reports) are only handled when you provide a SiliconFlow API key:

from cartographer import Pipeline

pipe = Pipeline(vision_api_key="sk-...")  # or SILICONFLOW_API_KEY env var
result = pipe.extract("scanned_report.pdf")

Vision is a fallback, not a default path. The overwhelming majority of financial filings parse correctly through the deterministic pipeline alone.
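
The opt-in behaviour described above can be sketched as follows. The README does not state how image-heavy pages are detected, so the character-count heuristic, the `min_chars` threshold, and all three function names are assumptions; only the `SILICONFLOW_API_KEY` variable name comes from the example above.

```python
import os
from typing import List, Optional


def resolve_vision_key(explicit_key: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit key, else fall back to the SILICONFLOW_API_KEY env var."""
    return explicit_key or os.environ.get("SILICONFLOW_API_KEY") or None


def needs_vision(page_text: str, min_chars: int = 50) -> bool:
    """Hypothetical heuristic: a page with almost no extracted text is image-heavy."""
    return len(page_text.strip()) < min_chars


def pages_for_vision(page_texts: List[str], api_key: Optional[str]) -> List[int]:
    """Return zero-based indices of pages to route to the vision fallback."""
    if not api_key:
        return []  # opt-in: without a key, no page is sent to the LLM
    return [i for i, text in enumerate(page_texts) if needs_vision(text)]
```

Keeping the key check at the top of `pages_for_vision` means the deterministic path is unaffected when no key is configured.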

Supported languages

English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Chinese (partial).
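
To illustrate what language-aware section detection can look like, here is a minimal sketch that matches headings against a small keyword table after accent folding. The keyword lists, the `classify_heading` name, and the substring-matching strategy are illustrative assumptions rather than the parser's real rules, and the sketch covers only a few of the Latin-script languages listed above.

```python
import unicodedata
from typing import Optional

# Hypothetical keyword table, stored accent-free and lowercase.
SECTION_KEYWORDS = {
    "income_statement": [
        "income statement", "statement of profit or loss",
        "gewinn- und verlustrechnung", "compte de resultat",
        "cuenta de resultados", "conto economico",
        "demonstracao dos resultados",
    ],
    "balance_sheet": [
        "balance sheet", "statement of financial position",
        "bilanz", "bilan", "balance de situacion",
        "stato patrimoniale", "balanco",
    ],
}


def _fold(text: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def classify_heading(heading: str) -> Optional[str]:
    """Return the first section whose keyword occurs in the folded heading."""
    folded = _fold(heading)
    for section, keywords in SECTION_KEYWORDS.items():
        if any(keyword in folded for keyword in keywords):
            return section
    return None
```

Substring matching keeps the sketch short but can produce false positives on longer compound words, so a production rule set would likely need word-boundary checks.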

Jurisdictions tested

| Region | Standard | Examples |
| --- | --- | --- |
| United States | US-GAAP | SEC 10-K, 10-Q |
| Germany | IFRS | ESEF filings, Bundesanzeiger |
| Italy | IFRS | CONSOB / SDIR |
| Portugal | IFRS | CMVM filings |
| Korea | K-IFRS | DART filings |
| Hong Kong | IFRS (HK) | HKEXnews |
| Australia | IFRS | ASX listings |

Development

git clone https://github.com/hugocondesa-debug/cartographer.git
cd cartographer
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/

See CONTRIBUTING.md for details.

Roadmap

  • PyPI release (pip install cartographer-filings)
  • Regression test suite fetching from primary sources (SEC, ESEF, HKEXnews)
  • Thai, Arabic, additional CJK coverage
  • Multi-column prospectus layouts
  • Table-of-contents anchor refinement
  • Full audit pass on parser.py to tighten ruff rules

Origin

Cartographer was extracted from Atlas, a personal financial data platform. It exists as a standalone library because the pipeline is useful beyond any single consumer: anyone processing annual reports or regulatory filings at scale can benefit.

License

MIT. See LICENSE.


Built by Hugo Condesa.

