LLM-powered CIDOC CRM v7.1.3 entity extraction from unstructured text — Pydantic models, Cypher emitters, and NetworkX integration

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

David.Spencer

These details have not been verified by PyPI

Project description

infoextract-cidoc (COLLIE)

Classful Ontology for Life-Events Information Extraction

A developer-friendly toolkit for working with the CIDOC CRM v7.1.3 in modern data workflows. infoextract-cidoc provides complete Pydantic models (99 classes, 322 properties), LangStruct-powered AI extraction, Markdown renderers, and Cypher emitters that bridge the gap between conceptual rigor and developer usability.

Why infoextract-cidoc?

Cultural heritage and information extraction projects often need a CRM-compliant backbone without the overhead of RDF stacks. infoextract-cidoc:

Keeps the conceptual rigor of CIDOC CRM
Provides lean, open-world Pydantic validation
Outputs formats directly usable by LLMs (Markdown) and LPGs (Cypher)
Prioritizes ergonomics and performance for real-world extraction pipelines
Zero RDF/OWL/JSON-LD dependencies

Quick Start

For a comprehensive getting started guide, see QUICKSTART.md

Installation

pip install infoextract-cidoc

# With GraphForge graph database integration:
pip install "infoextract-cidoc[graphforge]"

# Using uv:
uv add infoextract-cidoc

AI Extraction (New Pipeline)

from infoextract_cidoc.extraction import LangStructExtractor, resolve_extraction, map_to_crm_entities
from infoextract_cidoc.io.to_markdown import to_markdown, MarkdownStyle

# Extract from text (set GOOGLE_API_KEY or LANGSTRUCT_DEFAULT_MODEL env var)
extractor = LangStructExtractor()
lite_result = extractor.extract(
    "Albert Einstein was born on March 14, 1879, in Ulm, Germany. "
    "He won the Nobel Prize in Physics in 1921."
)

# Resolve to stable UUIDs
extraction_result = resolve_extraction(lite_result)

# Map to CIDOC CRM entities
crm_entities, crm_relations = map_to_crm_entities(extraction_result)

# Render to Markdown
for entity in crm_entities:
    print(to_markdown(entity, MarkdownStyle.CARD))

CRM Models (No AI Needed)

from infoextract_cidoc.models.generated.e_classes import EE22_HumanMadeObject
from infoextract_cidoc.io.to_markdown import to_markdown, MarkdownStyle
from infoextract_cidoc.io.to_cypher import generate_cypher_script

# Create a CRM entity (string IDs are automatically converted to UUIDs)
vase = EE22_HumanMadeObject(
    id="obj_001",
    label="Ancient Greek Vase",
    type=["E55:Vessel", "E55:Ceramic"]
)

# Render as Markdown
markdown = to_markdown(vase, MarkdownStyle.CARD)

# Generate Cypher for Neo4j/Memgraph
cypher = generate_cypher_script([vase])

CLI

# Extract entities from text
infoextract-cidoc extract --text "Marie Curie was born in Warsaw in 1867."

# Extract from file
infoextract-cidoc extract --file biography.txt --output ./output/

# Run complete workflow
infoextract-cidoc workflow --file biography.txt --all --output results/

# Run Einstein demo
infoextract-cidoc demo --einstein

Core Features

AI-Powered Information Extraction

LangStruct pipeline for single-pass entity and relationship extraction
Entity Resolution with stable UUID5 identifiers and deduplication
Relationship Resolution with broken link detection and logging
CRM Mapping to E21 Person, E5 Event, E53 Place, E22 Human-Made Object, E52 Time-Span
DSPy optimization support for fine-tuning extraction quality
Works with any LiteLLM-compatible model (Gemini, OpenAI, Anthropic, etc.)
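The resolution step listed above can be pictured with a small standard-library sketch. It is not the package's `resolve_extraction` implementation: the namespace constant, the `stable_id` and `resolve` helpers, the dict shapes, and the relation names are all assumptions made for illustration. It shows how UUID5 gives the same identifier to repeated mentions of one entity, and how a relation whose endpoint was never extracted can be set aside rather than silently dropped.

```python
import uuid

# Hypothetical namespace; the real package may use a different one.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/crm-sketch")


def stable_id(entity_type: str, label: str) -> uuid.UUID:
    """Derive a deterministic UUID5 from a normalized (type, label) pair."""
    key = f"{entity_type}:{' '.join(label.lower().split())}"
    return uuid.uuid5(NAMESPACE, key)


def resolve(entities, relations):
    """Deduplicate entities and separate resolvable relations from broken ones."""
    resolved = {}   # UUID -> entity dict (first mention wins)
    ref_to_id = {}  # local reference used by the extractor -> UUID
    for ent in entities:
        uid = stable_id(ent["type"], ent["label"])
        ref_to_id[ent["ref"]] = uid
        resolved.setdefault(uid, {"id": uid, "type": ent["type"], "label": ent["label"]})

    good, broken = [], []
    for rel in relations:
        src, dst = ref_to_id.get(rel["source"]), ref_to_id.get(rel["target"])
        if src is None or dst is None:
            broken.append(rel)  # endpoint was never extracted: report it, do not guess
        else:
            good.append({"source": src, "target": dst, "type": rel["type"]})
    return resolved, good, broken


entities = [
    {"ref": "p1", "type": "E21", "label": "Albert Einstein"},
    {"ref": "p2", "type": "E21", "label": "albert  einstein"},  # duplicate mention
    {"ref": "pl1", "type": "E53", "label": "Ulm"},
]
relations = [
    {"source": "p2", "target": "pl1", "type": "born_in"},
    {"source": "p1", "target": "missing", "type": "won"},  # broken link
]
resolved, good, broken = resolve(entities, relations)
print(len(resolved), "entities,", len(good), "relations,", len(broken), "broken links")
```

Because the identifier depends only on the normalized type and label, re-running the sketch on the same text yields the same UUIDs, which is what makes downstream MERGE-style writes repeatable.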

Pydantic Models

Complete CIDOC CRM v7.1.3 coverage (99 E-classes, 322 P-properties)
Flexible UUID handling with automatic string-to-UUID conversion
Canonical JSON schema with stable IDs and explicit cross-references
Auto-generated from curated YAML specifications
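As a rough illustration of what automatic string-to-UUID conversion can look like, the sketch below uses a plain dataclass instead of the package's Pydantic models. The `coerce_id` helper, the `ID_NAMESPACE` constant, and the choice of UUID5 for non-UUID strings are assumptions, not the library's documented behavior.

```python
import uuid
from dataclasses import dataclass, field

# Hypothetical namespace for IDs that are not already UUIDs.
ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def coerce_id(value) -> uuid.UUID:
    """Return value as a UUID: pass UUIDs through, parse UUID strings, hash the rest."""
    if isinstance(value, uuid.UUID):
        return value
    try:
        return uuid.UUID(str(value))
    except ValueError:
        # Not a UUID literal (e.g. "obj_001"): derive a deterministic UUID5 instead.
        return uuid.uuid5(ID_NAMESPACE, str(value))


@dataclass
class Entity:
    id: uuid.UUID
    label: str
    type: list = field(default_factory=list)

    def __post_init__(self):
        # Normalize whatever the caller passed into a real UUID.
        self.id = coerce_id(self.id)


vase = Entity(id="obj_001", label="Ancient Greek Vase", type=["E55:Vessel"])
same = Entity(id="obj_001", label="Ancient Greek Vase")
print(vase.id == same.id)
```

Deriving the UUID deterministically means the same human-readable ID always maps to the same node, so cross-references stay valid across runs.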

Class Naming Convention

Official CIDOC CRM: E1, E22, E96 (class codes)
Python Classes: EE1_CRMEntity, EE22_HumanMadeObject, EE96_Purchase
Pattern: E{code}_{label_without_spaces}, where {code} is the full class code including its leading E (so E22 becomes EE22_HumanMadeObject)
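The convention can be checked against the three examples above with a short helper. `python_class_name` is a hypothetical function written for this illustration, and dropping hyphens as well as spaces is an assumption inferred from `EE22_HumanMadeObject`.

```python
import re


def python_class_name(code: str, label: str) -> str:
    """Build a class name such as EE22_HumanMadeObject from a CRM code and label."""
    # Assumption: spaces, hyphens, and other non-alphanumerics are dropped from the label.
    compact = re.sub(r"[^0-9A-Za-z]", "", label)
    return f"E{code}_{compact}"


for code, label in [("E1", "CRM Entity"), ("E22", "Human-Made Object"), ("E96", "Purchase")]:
    print(code, "->", python_class_name(code, label))
```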

Markdown Renderers

Entity Cards: Concise summaries optimized for LLM prompts
Detailed Narratives: Rich descriptions with full context
Tabular Summaries: Structured data presentation
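To give a feel for what an entity card might contain, here is a minimal sketch of a card renderer over a plain dict. The layout, the `render_card` name, and the field names are assumptions; the output of the package's `to_markdown` may look different.

```python
def render_card(entity: dict) -> str:
    """Render a compact Markdown card: heading, ID, then optional types and notes."""
    lines = [f"### {entity['label']} ({entity['class_code']})", ""]
    lines.append(f"- **ID:** `{entity['id']}`")
    if entity.get("type"):
        lines.append(f"- **Types:** {', '.join(entity['type'])}")
    if entity.get("notes"):
        lines.append(f"- **Notes:** {entity['notes']}")
    return "\n".join(lines)


card = render_card({
    "id": "obj_001",
    "class_code": "E22",
    "label": "Ancient Greek Vase",
    "type": ["E55:Vessel", "E55:Ceramic"],
})
print(card)
```

Skipping empty fields keeps the card short, which matters when many cards are packed into a single LLM prompt.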

NetworkX Integration

Direct conversion from CRM entities to NetworkX graphs
Built-in social network analysis (centrality, communities)
Temporal network analysis for historical data
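The kind of centrality analysis mentioned above can be sketched without any dependencies. The snippet computes degree centrality (neighbor count divided by n - 1) over a hand-written edge list; NetworkX offers a ready-made `degree_centrality` for real graphs. The node names and edges here are illustrative and do not come from the package.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality: each node's neighbor count divided by (n - 1)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical entity graph: a person linked to the events and places around them.
edges = [
    ("Albert Einstein", "Birth"),
    ("Birth", "Ulm"),
    ("Albert Einstein", "Ulm"),
    ("Albert Einstein", "Nobel Prize 1921"),
]
scores = degree_centrality(edges)
print(max(scores, key=scores.get))
```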

Output Formats

Markdown (4 styles): entity cards, detailed, tabular, narrative
Cypher: idempotent MERGE/UNWIND scripts for Neo4j/Memgraph
NetworkX: graph objects for programmatic analysis
GraphForge (optional): pip install "infoextract-cidoc[graphforge]"
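The idempotent MERGE/UNWIND pattern mentioned for Cypher can be illustrated as follows. This is a sketch of the general technique, not the output of `generate_cypher_script`: the `merge_nodes_cypher` helper, the `:CRMEntity` label, and the property names are assumptions.

```python
import json


def merge_nodes_cypher(entities):
    """Return (query, params) that upserts nodes keyed on id via UNWIND + MERGE."""
    rows = [
        {"id": str(e["id"]), "props": {"label": e["label"], "class_code": e["class_code"]}}
        for e in entities
    ]
    # Assumption: a single generic :CRMEntity label. MERGE on id keeps re-runs idempotent.
    query = (
        "UNWIND $rows AS row\n"
        "MERGE (n:CRMEntity {id: row.id})\n"
        "SET n += row.props"
    )
    return query, {"rows": rows}


query, params = merge_nodes_cypher([
    {"id": "obj_001", "class_code": "E22", "label": "Ancient Greek Vase"},
])
print(query)
print(json.dumps(params, indent=2))
```

Passing the rows as a parameter rather than interpolating values into the query text avoids quoting problems, and running the same script twice matches the existing nodes instead of creating duplicates.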

Validation Framework

Cardinality enforcement (configurable from warnings to strict)
Type alignment validation
Extensible validation profiles
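A configurable cardinality check, ranging from warnings to strict failures, might be structured along these lines. The `Severity` enum, `CardinalityError`, and `check_cardinality` are hypothetical names for this sketch and are not taken from the package's validators module.

```python
import warnings
from enum import Enum


class Severity(Enum):
    WARN = "warn"
    STRICT = "strict"


class CardinalityError(ValueError):
    """Raised in strict mode when a property has too few or too many values."""


def check_cardinality(values, *, prop, min_count=0, max_count=None, severity=Severity.WARN):
    """Return True if len(values) is within bounds; otherwise warn or raise."""
    count = len(values)
    too_few = count < min_count
    too_many = max_count is not None and count > max_count
    if not (too_few or too_many):
        return True
    upper = "n" if max_count is None else max_count
    message = f"{prop}: expected {min_count}..{upper} values, got {count}"
    if severity is Severity.STRICT:
        raise CardinalityError(message)
    warnings.warn(message, stacklevel=2)
    return False


# Hypothetical rule: at most one value allowed for this property.
print(check_cardinality(["Ulm"], prop="example_property", max_count=1))
```

Keeping the severity as a parameter lets the same rule set run leniently during exploratory extraction and strictly in a publishing pipeline.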

Complete Workflow

from infoextract_cidoc.extraction import LangStructExtractor, resolve_extraction, map_to_crm_entities
from infoextract_cidoc.io.to_networkx import to_networkx_graph
from infoextract_cidoc.io.to_cypher import generate_cypher_script
from infoextract_cidoc.visualization import plot_network_graph

# 1. Extract entities via LangStruct
#    (await needs an async context, such as a notebook cell or the `python -m asyncio` REPL)
extractor = LangStructExtractor()
lite_result = await extractor.extract_async("""
Albert Einstein was born on March 14, 1879, in Ulm, Germany.
He developed the theory of relativity and won the Nobel Prize in 1921.
""")

# 2. Resolve and map to CRM
extraction_result = resolve_extraction(lite_result)
crm_entities, crm_relations = map_to_crm_entities(extraction_result)

# 3. Serialize as canonical JSON
json_data = [entity.model_dump(mode='json') for entity in crm_entities]

# 4. Convert to NetworkX graph for social network analysis
graph = to_networkx_graph(crm_entities)

# 5. Visualize the network
plot_network_graph(graph, title="Einstein's Life Network")

# 6. Export to Cypher for graph database persistence
cypher_script = generate_cypher_script(crm_entities)

Project Structure

src/infoextract_cidoc/
├── extraction/           # AI extraction pipeline
│   ├── lite_schema.py   # LangStruct output schema
│   ├── resolution.py    # Entity/relationship resolution
│   ├── crm_mapper.py    # CRM mapping layer
│   └── langstruct_extractor.py  # LangStructExtractor
├── models/               # Pydantic CRM models
│   ├── base.py          # CRMEntity, CRMRelation
│   └── generated/       # Auto-generated E-classes (99)
├── io/                   # Output modules
│   ├── to_markdown.py   # Markdown renderers
│   ├── to_cypher.py     # Cypher emitters
│   ├── to_networkx/     # NetworkX conversion
│   └── to_graphforge.py # GraphForge (optional)
├── validators/           # Validation framework
├── visualization/        # matplotlib/plotly plots
├── codegen/              # YAML -> Pydantic generation
└── tests/                # Test suite (77 tests)

Testing

make test           # Run all tests
make test-unit      # Unit tests only
make coverage       # With coverage report
make pre-push       # Full CI: lint + type-check + security + coverage

Documentation

Quickstart - Getting started guide
Contributing - Development workflow
Changelog - Version history
HOWTOs - Comprehensive modeling guide
CIDOC CRM Standard - Official specification

Project Status

Phase 1: Complete - Core CIDOC CRM implementation
Phase 2: Complete - LangStruct extraction pipeline, validation, full CRM coverage
Phase 3: Planned - Profile packs and additional analysis tools

Current Coverage: 99 E-classes, 322 P-properties (complete CRM 7.1.3)
Test Status: 77 tests passing (100% success rate)
CI/CD: GitHub Actions with uv, ruff, mypy, bandit, codecov

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

CIDOC CRM Working Group for the foundational ontology
Pydantic team for the excellent validation framework
Neo4j community for Cypher language inspiration

Made with care for the cultural heritage and information extraction community

Release history

0.1.7: Mar 1, 2026
0.1.6: Mar 1, 2026
0.1.5: Mar 1, 2026 (this version)
0.1.4: Feb 27, 2026
0.1.3: Feb 26, 2026
0.1.2: Feb 26, 2026
0.1.1: Feb 26, 2026
0.1.0: Feb 23, 2026

Download files

Source Distribution

infoextract_cidoc-0.1.5.tar.gz (6.5 MB)

Uploaded Mar 1, 2026 Source

Built Distribution

infoextract_cidoc-0.1.5-py3-none-any.whl (6.3 kB)

Uploaded Mar 1, 2026 Python 3

File details

Details for the file infoextract_cidoc-0.1.5.tar.gz.

File metadata

Download URL: infoextract_cidoc-0.1.5.tar.gz
Upload date: Mar 1, 2026
Size: 6.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.7 (uv publish, Ubuntu 24.04, CI)

File hashes

Hashes for infoextract_cidoc-0.1.5.tar.gz
Algorithm	Hash digest
SHA256	`7b6677bc4e494aa6d71fc54e21bcf8b571a862358867bb1979769f47a4a93318`
MD5	`beb2be82d5b1e532c40259a91e5a2fba`
BLAKE2b-256	`d466024756217c772c26eacda9e2b2bb20bf43f5c5e5597b3e96c2c91715f6ea`

File details

Details for the file infoextract_cidoc-0.1.5-py3-none-any.whl.

File metadata

Download URL: infoextract_cidoc-0.1.5-py3-none-any.whl
Upload date: Mar 1, 2026
Size: 6.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.10.7 (uv publish, Ubuntu 24.04, CI)

File hashes

Hashes for infoextract_cidoc-0.1.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0aa7cdd46ed3ec5e1987dadea29a73fe3082ae523e1c7530e72fd18e35cff8ab`
MD5	`c482f307da3537019279fd175a83c51e`
BLAKE2b-256	`764808d7004a4ccbfc21b66a39537b24e3b9cac09a5337cfcec7476a7f52d380`
