DISK (Domain Incremental conStruction of Knowledge graph) - A tool for distilling text from documents, extracting entities and relations, and building domain knowledge graphs
Overview
DISK is a comprehensive toolkit for extracting knowledge from PDF documents and building domain knowledge graphs through text distillation, entity/relation extraction, and semantic merging. The system provides a modular pipeline that transforms unstructured PDF documents into structured knowledge representations.
Core Capabilities
- Document Distillation: Extract and validate text blocks, tables, and images from PDF documents
- Entity Extraction: Identify and extract domain entities with semantic embeddings
- Relation Extraction: Discover relationships between entities with contextual understanding
- Knowledge Graph Construction: Build and manage knowledge graphs with incremental updates
- Semantic Merging: Intelligently merge similar entities and relations using cosine similarity
Architecture
System Architecture
graph TB
subgraph "Input Layer"
PDF[PDF Document]
end
subgraph "Distillation Layer"
Distiller[PDF Distiller]
TextBlocks[Validated Text Blocks]
end
subgraph "Extraction Layer"
EntExtractor[Entity Extractor]
RelExtractor[Relation Extractor]
UnifiedExtractor[Unified Extractor]
Entities[(Entities + Embeddings)]
Relations[(Relations + Embeddings)]
end
subgraph "Processing Layer"
Merger[Semantic Merger]
Manager[KG Manager]
end
subgraph "Output Layer"
KG[Knowledge Graph]
Logs[Logs & Results]
end
subgraph "Configuration"
Config[LLM Config]
Embed[Embeddings Model]
end
PDF --> Distiller
Distiller --> TextBlocks
TextBlocks --> EntExtractor
TextBlocks --> RelExtractor
TextBlocks --> UnifiedExtractor
EntExtractor --> Entities
RelExtractor --> Relations
UnifiedExtractor --> Entities
UnifiedExtractor --> Relations
Entities --> Merger
Relations --> Merger
Merger --> Manager
Manager --> KG
Manager --> Logs
Config --> EntExtractor
Config --> RelExtractor
Config --> UnifiedExtractor
Embed --> EntExtractor
Embed --> RelExtractor
Embed --> UnifiedExtractor
Embed --> Merger
style PDF fill:#e1f5fe
style KG fill:#c8e6c9
style Distiller fill:#fff3e0
style Merger fill:#f3e5f5
style Manager fill:#e8f5e9
Module Structure
graph LR
subgraph DISK
DiskMain[disk.py<br/>Main Entry Point]
subgraph Core
Distiller[distiller/<br/>PDF Distillation]
Extractor[extractor/<br/>Information Extraction]
MergerMod[merger/<br/>Knowledge Merging]
ManagerMod[manager/<br/>KG Management]
end
subgraph Support
Models[models/<br/>Data Models]
Utils[utils/<br/>Utilities]
ConfigMod[config/<br/>Configuration]
end
end
DiskMain --> Distiller
DiskMain --> Extractor
DiskMain --> MergerMod
DiskMain --> ManagerMod
Extractor --> Models
MergerMod --> Models
ManagerMod --> Models
Distiller --> Utils
Extractor --> Utils
ManagerMod --> Utils
DiskMain --> ConfigMod
style DiskMain fill:#1976d2,color:#fff
style Distiller fill:#ffa726
style Extractor fill:#42a5f5
style MergerMod fill:#ab47bc
style ManagerMod fill:#66bb6a
Data Flow
sequenceDiagram
participant User
participant DISK
participant Distiller
participant Extractor
participant Merger
participant Manager
participant KG
User->>DISK: build_knowledge_graph(pdf_path)
DISK->>Distiller: extract_text_blocks(pdf)
Distiller-->>DISK: validated_text_blocks
loop For each text block
DISK->>Extractor: extract_entities(text)
Extractor-->>DISK: entities + embeddings
DISK->>Extractor: extract_relations(text)
Extractor-->>DISK: relations + embeddings
DISK->>Merger: merge(new, existing)
Merger-->>DISK: merged entities/relations
end
DISK->>Manager: add_entities(entities)
DISK->>Manager: add_relations(relations)
Manager->>KG: update_knowledge_graph
DISK-->>User: Knowledge Graph
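The sequence above amounts to an incremental loop: each text block is extracted and merged into the running result before the next block is read. The following is a minimal sketch of that control flow, not DISK's actual code. The function signature, the `Entity` and `Relation` stand-in types, and the toy extractor and merger callables are all assumptions made for illustration.

```python
from typing import Callable, Iterable

# Hypothetical stand-in types: an entity is a name, a relation is a
# (head, label, tail) triple. DISK's real models are not shown in the source.
Entity = str
Relation = tuple[str, str, str]


def build_knowledge_graph(
    text_blocks: Iterable[str],
    extract_entities: Callable[[str], list],
    extract_relations: Callable[[str], list],
    merge: Callable[[list, list], list],
) -> dict:
    """Process blocks one at a time, merging each block's output into the running result."""
    entities: list = []
    relations: list = []
    for block in text_blocks:
        entities = merge(extract_entities(block), entities)
        relations = merge(extract_relations(block), relations)
    return {"entities": entities, "relations": relations}


def merge_exact(new: list, existing: list) -> list:
    """Toy merger: keep existing items and append new ones not already present."""
    merged = list(existing)
    for item in new:
        if item not in merged:
            merged.append(item)
    return merged


def toy_entities(text: str) -> list:
    """Toy extractor: treat capitalized words as entities."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]


def toy_relations(text: str) -> list:
    """Toy extractor: link the first two entities found in a block."""
    ents = toy_entities(text)
    return [(ents[0], "related_to", ents[1])] if len(ents) >= 2 else []


kg = build_knowledge_graph(
    ["DISK builds graphs from PDF files.", "DISK merges entities."],
    toy_entities,
    toy_relations,
    merge_exact,
)
print(kg)
```

Passing the extractor and merger in as callables keeps the loop independent of any particular LLM or embeddings backend, which is one way to mirror the modular layout shown in the diagrams.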
Modules
Distillation Module (distiller/)
- pdf_distiller
  - extract paragraphs with intelligent validation
  - extract tables (to be improved)
  - extract images (to be improved)
  - filter out low-quality text blocks (references, incomplete sentences)
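The filtering step can be pictured with a few simple heuristics. This is only a sketch under assumptions: the source does not say how DISK validates blocks, so the function name `is_valid_block`, the word-count threshold, and the patterns below are illustrative, not the library's rules.

```python
import re

# Assumed heuristics; thresholds and patterns are illustrative only.
CITATION_LINE = re.compile(r"^\[\d+\]\s")          # e.g. "[12] J. Doe. ..."
SENTENCE_END = (".", "!", "?", ":", '"', ")")


def is_valid_block(text: str, min_words: int = 8) -> bool:
    """Return True if a text block looks like a complete, non-reference paragraph."""
    stripped = text.strip()
    if len(stripped.split()) < min_words:
        return False  # too short to carry a full statement
    if CITATION_LINE.match(stripped):
        return False  # looks like a bibliography entry
    if not stripped.endswith(SENTENCE_END):
        return False  # likely cut off mid-sentence
    return True


blocks = [
    "Knowledge graphs store entities and the relations that connect them.",
    "[12] J. Doe. A survey of knowledge graph construction methods. 2020.",
    "The extraction step relies on a large language model and",
    "Figure 3.",
]
kept = [b for b in blocks if is_valid_block(b)]
print(kept)
```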
Extraction Module (extractor/)
- entities_extractor
  - extract domain entities with labels and descriptions
  - generate semantic embeddings for each entity
- relations_extractor
  - extract relationships between entities
  - generate semantic embeddings for each relation
- extractor (unified)
  - extract both entities and relations in a single pass
  - optimized for incremental processing
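To make the single-pass idea concrete, the sketch below parses one model reply that contains both entities and relations. Everything here is assumed for illustration: the JSON schema, the `Entity` and `Relation` dataclasses, and `parse_unified_response` are hypothetical and are not taken from DISK's `models/` or `extractor/` packages.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    label: str
    description: str = ""
    embedding: list = field(default_factory=list)  # would be filled by an embeddings model


@dataclass
class Relation:
    head: str
    name: str
    tail: str
    embedding: list = field(default_factory=list)


def parse_unified_response(raw: str):
    """Parse one JSON reply (assumed schema) into entities and relations."""
    data = json.loads(raw)
    entities = [
        Entity(e["name"], e["label"], e.get("description", ""))
        for e in data.get("entities", [])
    ]
    known = {e.name for e in entities}
    # Keep only relations whose endpoints were extracted in the same pass.
    relations = [
        Relation(r["head"], r["name"], r["tail"])
        for r in data.get("relations", [])
        if r["head"] in known and r["tail"] in known
    ]
    return entities, relations


raw = json.dumps({
    "entities": [
        {"name": "DISK", "label": "Tool"},
        {"name": "PDF", "label": "Format"},
    ],
    "relations": [
        {"head": "DISK", "name": "reads", "tail": "PDF"},
        {"head": "DISK", "name": "uses", "tail": "LLM"},
    ],
})
entities, relations = parse_unified_response(raw)
print(entities, relations)
```

Dropping relations with unknown endpoints is a design choice of this sketch; a real extractor might instead create the missing entity.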
Processing Modules
- extract entities
- extract relationships
- semantic merging (merger/)
  - merge similar entities using cosine similarity
  - update relations after entity merging
  - configurable threshold (default: 0.8)
- construct knowledge graph (manager/)
  - incremental knowledge graph construction
  - deduplication of entities and relations
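The merging and deduplication steps listed above can be sketched as follows. This is a simplified illustration, not the `merger/` implementation: entities are reduced to a name-to-embedding mapping, relations to plain triples, and the function names are hypothetical. Only the cosine-similarity criterion and the 0.8 default threshold come from the description above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_entities(new, existing, threshold=0.8):
    """Merge new entities (name -> embedding) into existing ones.

    Returns the merged mapping and a rename table for absorbed entities.
    """
    merged = dict(existing)
    renames = {}
    for name, emb in new.items():
        best_name, best_score = None, threshold
        for other, other_emb in merged.items():
            score = cosine_similarity(emb, other_emb)
            if score >= best_score:
                best_name, best_score = other, score
        if best_name is None:
            merged[name] = emb          # nothing similar enough: keep as new entity
        elif best_name != name:
            renames[name] = best_name   # absorbed into the closest existing entity
    return merged, renames


def update_relations(relations, renames):
    """Rewrite relation endpoints after entity merging and drop duplicate triples."""
    seen, result = set(), []
    for head, label, tail in relations:
        triple = (renames.get(head, head), label, renames.get(tail, tail))
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result


existing = {"knowledge graph": [1.0, 0.0]}
new = {"KG": [0.9, 0.1], "PDF": [0.0, 1.0]}
merged, renames = merge_entities(new, existing)
relations = update_relations(
    [("KG", "extracted_from", "PDF"), ("knowledge graph", "extracted_from", "PDF")],
    renames,
)
print(merged, renames, relations)
```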
Config
Environment
# use uv to manage the environment
uv venv
uv sync
LLM Configuration
- Copy the example configuration file:
cp config.example.toml config.toml
- Edit config.toml to set your API keys and preferences:
[disk]
llm = "openai" # Choose provider: openai, qwen, ollama, etc.
[disk.embeddings]
model = "text-embedding-3-small"
api_key = "ai-..."
api_url = "https://api.openai.com/v1"
[model.openai]
api_url = "https://api.openai.com/v1"
api_key = "ai-..."
model = "gpt-4o"
[model.other]
api_url = "https://api.otherprovider.com/v1"
api_key = "sk-..."
model = "gpt-4o"
- Supported providers:
  - OpenAI (default)
  - Qwen (DashScope)
  - Kimi (Moonshot)
  - Ollama (Local)
  - All other providers that support OpenAI-compatible APIs
You can switch providers by changing the llm field in [disk] or using the runtime switch() function.
Contrast
merge
- itext2kg
[INFO] Wohoo! Entity was matched --- [poor deep semantic understanding in traditional ie models:Limitation] --merged--> [cosine similarity ignores deep semantic differences:Limitation]
File details
Details for the file disk_kg-1.0.0.tar.gz.
File metadata
- Download URL: disk_kg-1.0.0.tar.gz
- Upload date:
- Size: 1.9 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5c1e41a6ae91d373abe3b94139a3b90c194fad91ab13915b01bee2fb65f9136a |
| MD5 | 9501ca1c9b195ea5bcf1149ebcbbac42 |
| BLAKE2b-256 | 006c4581726c1b95bcf5cf4a7c041ef20983c7adf102b4ddb4cc129f88cf1a9f |
File details
Details for the file disk_kg-1.0.0-py3-none-any.whl.
File metadata
- Download URL: disk_kg-1.0.0-py3-none-any.whl
- Upload date:
- Size: 47.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43d1dd8e66e7d7548b2a31ca6e7dcb22efecd5aa717135b33a0471a7263bf33d |
| MD5 | 7631a6380d1137dff121227f104635ea |
| BLAKE2b-256 | 534d878871d6a40a49f85017685084a33e91d2d4fcedc2441d9244bf5347a318 |