Kogwistar-docparser
Graph Knowledge Doc Parser: a document parser using an LLM-driven engine with graph-knowledge awareness.
If you want the shortest path into the project, start with QUICKSTART.md.
Utilities and experiments for document ingestion, PDF splitting, OCR, and page-level parsing. Refactored out of the kogwistar project as a stand-alone ingestor.
Status
This repository is still being refactored and should be treated as work in progress.
The document ingestion pipeline is currently being extracted and consolidated into the main kogwistar repository. Until that refactor is complete, this repo should be considered an active staging area for parser and ingestion-related work.
What Is Here
- PDF splitting and image generation helpers in src/pdf2png.py
- Gemini-based OCR and page parsing flows in src/ocr.py
- SQLite-based ingestion telemetry in src/document_ingester_logger.py
- File discovery and filtering helpers in src/utils/file_loaders.py
- Experimental and regression-style tests under tests/
Workflow Surface
The reusable workflow-ingest code now has three layers:
- Python APIs, which are the primary contract for tests and orchestration
- CLI entrypoints, which are thin wrappers around those APIs
- composable subworkflows for OCR, page-index parsing, and recursive layerwise parsing
The reusable helpers live under src/workflow_ingest/ and are designed so the
same core logic can be called from tests, scripts, and higher-level workflow
code without duplicating orchestration.
CLI Commands
After poetry install, the repo exposes a workflow-ingest command family:
workflow-ingest --help
workflow-ingest ocr --help
workflow-ingest page-index --help
workflow-ingest layerwise --help
workflow-ingest demo --help
workflow-ingest ocr-smoke-assets --help
If you want the local checked-out ./kogwistar subtree to win over the GitHub
dependency during development, run the Bash bootstrap helper after install:
bash ./scripts/bootstrap-dev.sh
That script does two things:
- runs poetry install
- if ./kogwistar exists, installs it editable into the active environment
If you do not run the bootstrap helper, the repo keeps using the GitHub-sourced
kogwistar dependency declared in pyproject.toml.
Typical examples:
workflow-ingest ocr tests\.tmp_workflow_ingest_ocr\generated_smoke_assets\ocr_smoke_document.pdf --output-dir logs\ocr_run
workflow-ingest ocr tests\.tmp_workflow_ingest_ocr\generated_smoke_assets --output-dir logs\ocr_batch
workflow-ingest page-index tests\fixtures\page_index\sample_page_index.txt --output-dir logs\page_index
workflow-ingest layerwise tests\.tmp_workflow_ingest_ocr\manual_cases\ollama\glm-ocr_latest\image\artifacts\legacy_split_pages\ocr-manual-ollama-image --output-dir logs\layerwise
workflow-ingest ocr-smoke-assets --output-dir tests\.tmp_workflow_ingest_ocr\generated_smoke_assets
Reusable Artifacts
The workflow outputs are intentionally inspectable on disk:
- workflow-events.jsonl: readable step trail for the outer orchestration layer
- ocr-state.sqlite: authoritative OCR/render resume state
- ocr-progress.json: human-readable mirror of the current OCR state
- ocr-summary.json: final OCR run summary
- legacy_split_pages/<document>/page_N.json: legacy-compatible OCR page artifacts
- rendered_pages/<document>/page_N.png: rasterized page images
- page-index-summary.json: page-index run summary
- layerwise-summary.json: recursive layerwise parser summary
- layerwise-graph.json: legacy recursive layerwise graph payload
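Since ocr-state.sqlite is an ordinary SQLite file, its resume state can be inspected with the standard library. Its schema is an internal detail of the OCR workflow and is not documented here, so this sketch makes no assumptions about table names and only reports what a given run actually produced:

```python
import sqlite3

def list_ocr_state_tables(db_path: str) -> list[str]:
    """List the tables in an ocr-state.sqlite resume file.

    The schema is internal to the OCR workflow, so this helper only
    reports whatever tables are actually present in the file.
    """
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]
```

This is handy for a quick sanity check of what resume state an interrupted run left behind before re-running the CLI.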
Python APIs
If you want to embed the pipelines directly, the main helpers are:
- src.workflow_ingest.run_ocr_source_workflow(...)
- src.workflow_ingest.run_ocr_batch_workflow(...)
- src.workflow_ingest.parse_page_index_document(...)
- src.workflow_ingest.run_page_index_source_workflow(...)
- src.workflow_ingest.run_layerwise_source_workflow(...)
- src.workflow_ingest.run_demo_harness_workflow(...)
Those helpers are meant to stay stable and are what the CLI layer calls under the hood.
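A minimal embedding sketch follows. Only the helper names above are guaranteed by this README; the argument names used here are assumptions, so check the actual signatures in src/workflow_ingest before relying on them:

```python
def ocr_one_document(source_path: str, output_dir: str):
    """Run the OCR subworkflow on a single document.

    The argument names here are illustrative guesses; the README only
    guarantees that run_ocr_source_workflow exists, not its signature.
    """
    # Imported lazily so this module can be loaded without the project installed.
    from src.workflow_ingest import run_ocr_source_workflow

    return run_ocr_source_workflow(source_path, output_dir=output_dir)
```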
Setup
- Create and activate a Python 3.13 environment.
- Install dependencies with Poetry:
poetry install
- Create a local env file from the example:
Copy-Item .env.example .env
- Add your local GOOGLE_API_KEY and any optional file-list paths needed for your workflow.
Environment Variables
The project currently expects or optionally uses:
- GOOGLE_API_KEY: required for Gemini OCR and LLM-backed parsing flows
- LANGSMITH_TRACING: optional LangSmith tracing toggle
- ocr_file_list: optional allow-list file for OCR runs
- split_raw_file_list: optional allow-list file for PDF splitting runs
- answer_export_list: optional export list path used by local workflows
An example template is provided in .env.example.
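For scripts that need those variables outside of Poetry, a tiny KEY=VALUE loader is enough. This is a sketch only; the project itself may rely on python-dotenv or similar rather than this helper:

```python
import os

def load_dotenv_minimal(path: str = ".env") -> dict[str, str]:
    """Load KEY=VALUE pairs from a .env-style file into os.environ.

    Sketch only: skips blank lines, comments, and lines without '='.
    The real project may use python-dotenv instead of this helper.
    """
    values: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    os.environ.update(values)
    return values
```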
Provider Guide
The workflow layer is vendor-neutral, but the concrete OCR, parser, and embedding backends are selected by config.
OCR Provider Examples
- Google GenAI OCR:
KG_DOC_OCR_PROVIDER=gemini
KG_DOC_OCR_MODEL=gemini-2.5-flash
- Ollama OCR or vision-capable local model:
KG_DOC_OCR_PROVIDER=ollama
KG_DOC_OCR_MODEL=llava:latest
KG_DOC_OCR_BASE_URL=http://127.0.0.1:11434
- Vertex AI OCR:
KG_DOC_OCR_PROVIDER=vertex
KG_DOC_OCR_MODEL=gemini-2.5-pro
KG_DOC_OCR_PROJECT=my-project
KG_DOC_OCR_LOCATION=us-central1
Parser / LLM Provider Examples
The parser provider is the chat model used for semantic parsing, layer review, and structured extraction.
- LangChain Google GenAI:
KG_DOC_PARSER_PROVIDER=gemini
KG_DOC_PARSER_MODEL=gemini-2.5-flash
- ChatGPT / OpenAI REST:
KG_DOC_PARSER_PROVIDER=openai
KG_DOC_PARSER_MODEL=gpt-4.1-mini
KG_DOC_PARSER_API_KEY_ENV=OPENAI_API_KEY
- LangChain Ollama:
KG_DOC_PARSER_PROVIDER=ollama
KG_DOC_PARSER_MODEL=llama3.1
KG_DOC_PARSER_BASE_URL=http://127.0.0.1:11434
- LangChain Vertex AI:
KG_DOC_PARSER_PROVIDER=vertex
KG_DOC_PARSER_MODEL=gemini-2.5-pro
KG_DOC_PARSER_PROJECT=my-project
KG_DOC_PARSER_LOCATION=us-central1
Recipe Parsing Example
If you are parsing a cooking recipe, one practical split is:
- OCR on Gemini or another vision model
- parser on OpenAI, Ollama, or Vertex AI
For example:
KG_DOC_OCR_PROVIDER=gemini
KG_DOC_OCR_MODEL=gemini-2.5-flash
KG_DOC_PARSER_PROVIDER=openai
KG_DOC_PARSER_MODEL=gpt-4.1-mini
KG_DOC_PARSER_API_KEY_ENV=OPENAI_API_KEY
KG_DOC_EMBED_PROVIDER=fake
That setup can extract a recipe into structured graph data such as:
- ingredients
- steps
- tools
- timers
- inferred sections like prep, cook, and serve
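As a purely hypothetical illustration of that output shape (the real graph payload format is defined by the parser and engine, not by this README, and the field names below are invented):

```python
# Hypothetical recipe extraction result -- every key name here is
# illustrative, not the actual layerwise graph schema.
recipe = {
    "ingredients": ["2 eggs", "100 g flour", "150 ml milk"],
    "steps": ["whisk eggs and milk", "fold in flour", "fry per pancake"],
    "tools": ["whisk", "bowl", "frying pan"],
    "timers": [{"step": 2, "minutes": 2}],
    # inferred sections map onto step indices
    "sections": {"prep": [0, 1], "cook": [2], "serve": []},
}

assert set(recipe["sections"]) == {"prep", "cook", "serve"}
```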
Embedding Examples
- Fake deterministic CI embedding:
KG_DOC_EMBED_PROVIDER=fake
- OpenAI embeddings:
KG_DOC_EMBED_PROVIDER=openai
KG_DOC_EMBED_MODEL=text-embedding-3-small
- Vertex AI embeddings:
KG_DOC_EMBED_PROVIDER=vertex
KG_DOC_EMBED_MODEL=text-embedding-004
- Ollama embeddings:
KG_DOC_EMBED_PROVIDER=ollama
KG_DOC_EMBED_MODEL=nomic-embed-text
Notes:
- embedding_space in the workflow-ingest models is currently a metadata and routing-intent label.
- It does not yet imply that the engine uses a separate embedder per space.
- The current engine bootstrap still wires one embedding function per engine instance; the multi-space routing proposal remains a future Kogwistar core concern.
Running Tests
Some tests are integration-style and expect local document folders and API credentials to exist. That means not every test is portable in a clean checkout.
To run the test suite:
pytest -q
If you only want to work on isolated units, review the test files first and run a narrower subset.
Demo Harness
There is now a manual workflow-ingest demo harness that can run the end-to-end flow against:
- an in-process isolated FastAPI server
- a subprocess-hosted local server
- an already running external Kogwistar server
The legacy semantic-smoke test in
tests/test_semantic_layerwise_doc_parsing.py
also expects a live Kogwistar server at http://127.0.0.1:28110. It does not
start that server for you, so use the VS Code server launch config or start it
manually before running the Ollama case.
Example:
.venv\Scripts\python.exe scripts\run_workflow_ingest_demo.py --output-dir logs\workflow_ingest_demo
External live server example:
.venv\Scripts\python.exe scripts\run_workflow_ingest_demo.py --server-mode external_http --external-base-url http://127.0.0.1:28110
Demo artifacts are written into the chosen output directory:
- probe-events.jsonl: demo-friendly step and lifecycle probe events
- demo-summary.json: run summary, persistence result, and artifact pointers
- llm-cache/: workflow-native cached proposal/review call results
- engines/: local workflow and conversation graph storage for the run
- server-data/: isolated server-side persistence directory when the harness boots its own server
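Since probe-events.jsonl holds one JSON object per line, it can be replayed with nothing but the standard library. The exact event fields are a runtime detail, so this sketch returns raw dicts rather than assuming a schema:

```python
import json

def read_probe_events(path: str) -> list[dict]:
    """Read a probe-events.jsonl file, one JSON object per line.

    The event schema is a runtime detail of the demo harness, so this
    reader returns raw dicts instead of assuming particular fields.
    """
    events = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines defensively
                events.append(json.loads(line))
    return events
```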
Notes:
- The workflow-native layer proposal/review path uses deterministic file-backed caching to reduce repeated token cost and compute time.
- The legacy parser path also supports a redirected joblib cache via KG_DOC_PARSER_JOBLIB_CACHE_DIR.
- Probe logging is separate from CDC and conversation graph traces, so demos can show a short readable event trail without digging into runtime internals.
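The cache redirection can be set up from Python before launching a run. Only the environment variable name here comes from this README; the rest is plain stdlib:

```python
import os
import tempfile

# Point the legacy parser's joblib cache at a scratch directory so repeated
# runs can reuse cached results without polluting the repo tree.
cache_dir = tempfile.mkdtemp(prefix="kg_doc_parser_cache_")
os.environ["KG_DOC_PARSER_JOBLIB_CACHE_DIR"] = cache_dir
print(f"joblib cache redirected to {cache_dir}")
```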
OCR And Parsing Workflows
The newer workflow-first paths are designed as reusable subworkflows:
- OCR image/PDF ingest
  - resumable via ocr-state.sqlite
  - emits workflow-events.jsonl
  - keeps legacy OCR page artifacts on disk
- page-index parsing
  - heuristic mode for deterministic structure extraction
  - Ollama mode for local parser-backed parsing
- recursive layerwise parsing
  - wraps the legacy recursive parser in a reusable workflow runner
These can be invoked from Python directly or through the workflow-ingest
CLI family, depending on whether you want reusable orchestration or a quick
shell command.
Notes
- README.md, env handling, and ingestion boundaries are still being cleaned up as part of the ongoing refactor.
- Runtime outputs such as logs/, local .env, caches, and generated artifacts should remain uncommitted.
- If behavior diverges between this repo and kogwistar, prefer the direction of the ongoing migration and refactor work.
File details
Details for the file graph_knowledge_doc_parser-0.1.0.tar.gz.
File metadata
- Download URL: graph_knowledge_doc_parser-0.1.0.tar.gz
- Upload date:
- Size: 140.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 14e993415113bb99d4047c6e2889946d3746112f5f3bbabb48c5a2ebdd28a42d |
| MD5 | e9a7fb66e66c5b69c812a52bb50e510b |
| BLAKE2b-256 | 6ffe56b841669b6964abd83cf8c1c4c4a200c6b1a33fc3feb4e1dd7f5a90682f |
File details
Details for the file graph_knowledge_doc_parser-0.1.0-py3-none-any.whl.
File metadata
- Download URL: graph_knowledge_doc_parser-0.1.0-py3-none-any.whl
- Upload date:
- Size: 159.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 84a47687e1169dc2a22e4c3ffa4cb6409d3c9580a648d9bcde3450f2f0a8303a |
| MD5 | dd8ac03dc261c3c74a9fe00f06fffe9b |
| BLAKE2b-256 | f82cd58b996593872de07c8f345765018255de0edf1e53aecf9b61bac8f50362 |