Open-domain text-to-graph extractor with entities, relations, schema inference, and Neo4j export.
rapidGraph
rapidGraph is a local-first, open-domain text-to-graph extractor for arbitrary text. It reads inline text or one or more .txt files and produces a structured graph-oriented JSON payload with:
- entities
- relations
- potential_schema
- expanded_schema
- documents
- chunks
- relation_support
- meta
The package is designed for:
- entity and relation extraction across general, technical, scientific, and mixed-domain text
- CPU-friendly local execution
- provenance-aware graph construction
- future GraphRAG / RAG workflows
- optional Neo4j ingestion
- optional Neo4j vector GraphRAG question answering with Ollama
The public package name is rapidGraph, the import package is rapidgraph, and the installed CLI command is rapidgraph.
Table of Contents
- What rapidGraph Does
- Key Capabilities
- Installation
- Quick Start
- How the Pipeline Works
- Execution Modes
- Input Model
- Output Model
- CLI Reference
- Recommended Flag Combinations
- Neo4j Export Model
- GraphRAG Ask Mode
- Python Library Usage
- Performance and Practical Notes
- Troubleshooting
- Development
- Publishing
- License
What rapidGraph Does
At a high level, rapidGraph takes arbitrary text and turns it into a graph-friendly representation.
The pipeline:
- normalizes input text
- splits text into chunked spans
- extracts entity candidates
- extracts relation candidates
- canonicalizes duplicate or near-duplicate mentions
- links relation endpoints to canonical entities
- infers schema patterns from the accepted graph edges
- stores document and chunk provenance so every entity mention and relation can be traced back to source text
The extractor is open-domain and best-effort. It does not rely on a fixed business-only ontology. If typing confidence is weak, it keeps entities as Unknown rather than discarding them.
Key Capabilities
- Open-domain entity extraction
- Open-domain relation extraction
- Schema inference from extracted graph edges
- Provenance-aware output using documents, chunks, and relation_support
- Multi-file corpus ingestion in one run
- Two entity canonicalization scopes: document, corpus
- Three execution modes: fast, balanced, quality
- Optional embedding-assisted entity merging and relation endpoint linking
- Optional Neo4j export
- Optional GraphRAG query layer over Neo4j vector indexes and Ollama
- Backward-compatible potential_schema plus richer expanded_schema
Installation
Install from PyPI
pip install rapidGraph
Install with optional extras
Neo4j support:
pip install "rapidGraph[neo4j]"
Embedding-assisted linking:
pip install "rapidGraph[embeddings]"
GraphRAG query support:
pip install "rapidGraph[graphrag]"
Development tooling:
pip install "rapidGraph[dev]"
Everything:
pip install "rapidGraph[neo4j,embeddings,graphrag,dev]"
Install from source
pip install .
Or with extras:
pip install ".[neo4j,embeddings,graphrag,dev]"
Quick Start
Show CLI help:
rapidgraph --help
Extract from inline text:
rapidgraph --text "Google is based in California." --pretty
Extract from a file:
rapidgraph --input input.txt --pretty
Extract from multiple files:
rapidgraph --input input.txt input2.txt --pretty
Write JSON to a file:
rapidgraph --input input.txt --output graph.json --pretty
Export to Neo4j with chunk embeddings and a vector index for GraphRAG:
rapidgraph \
--input input.txt \
--neo4j-uri neo4j://127.0.0.1:7687 \
--neo4j-user neo4j \
--neo4j-password 12345678 \
--neo4j-embed-chunks \
--neo4j-create-vector-index
Ask the graph using Ollama:
rapidgraph ask \
--question "What does the text say about attention?" \
--neo4j-uri neo4j://127.0.0.1:7687 \
--neo4j-user neo4j \
--neo4j-password 12345678 \
--ollama-model llama3.2 \
--pretty
When running from a source checkout, the compatibility shim in the repository root also works:
python extract_graph.py --input input.txt --pretty
How the Pipeline Works
1. Text normalization
The input is normalized for whitespace and line ending consistency before extraction begins.
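As a rough illustration of what whitespace and line-ending normalization can look like, here is a minimal standalone sketch. The function name normalize_text and the exact rules are assumptions made for this example; they are not rapidGraph's internal implementation.

```python
import re


def normalize_text(text: str) -> str:
    """Illustrative normalization; not rapidGraph's actual code."""
    # Unify Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly next to a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line so paragraph boundaries survive.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "Google  is based\tin California.\r\n\r\n\r\nSundar Pichai leads Google.  "
print(repr(normalize_text(sample)))
```

Keeping a single blank line between paragraphs matters because the default chunking strategy relies on paragraph boundaries.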
2. Chunking
The extractor splits text into chunks before model inference. Chunking exists because relation and entity models work better on bounded spans than on arbitrarily long documents.
Two chunking strategies are available:
paragraph
- default
- respects paragraph and block boundaries first
- better for preserving local structure
sentence
- simpler sentence packing
- useful for experimentation or tighter chunk control
Optional overlap preserves context across chunk boundaries.
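To make the idea of sentence packing with overlap concrete, the sketch below packs sentences into chunks under a character budget and repeats the last sentence of each chunk at the start of the next. The function chunk_sentences and its naive sentence splitter are hypothetical; the real chunker may split and count differently.

```python
import re


def chunk_sentences(text: str, max_chars: int = 600, overlap: int = 1) -> list[str]:
    """Pack sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of a finished chunk are repeated at the
    start of the next chunk. Illustrative only.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk forward as context.
            current = current[-overlap:] if overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "A uses B. B improves C. C depends on D."
for chunk in chunk_sentences(text, max_chars=25, overlap=1):
    print(chunk)
```

With a budget of 25 characters, the middle sentence appears in both chunks, so a relation that spans the boundary still has its context in one place.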
3. Entity extraction
Entities are extracted primarily with GLiNER. Heuristic extraction acts as a fallback and supplements the model output where useful.
4. Relation extraction
Relations come from a combination of:
- heuristic relation extraction
- context-based relation patterns
- optional REBEL relation extraction
The amount of REBEL usage depends on --mode.
5. Canonicalization
Mentions are merged into canonical entities using:
- normalized string matching
- fuzzy matching
- optional embedding-assisted rescue for borderline cases
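The first two merge strategies can be sketched with the standard library alone. This is a toy version under stated assumptions: the helper names, the normalization rules, and the 0.8 fuzzy threshold are all hypothetical, and difflib stands in for whatever fuzzy matcher the package actually uses.

```python
from difflib import SequenceMatcher


def normalize(mention: str) -> str:
    # Lowercase, drop periods, and collapse whitespace.
    return " ".join(mention.lower().replace(".", "").split())


def canonicalize(mentions: list[str], fuzzy_threshold: float = 0.8) -> dict[str, list[str]]:
    """Group mentions under the first surface form seen for each entity."""
    groups: dict[str, list[str]] = {}
    for mention in mentions:
        key = normalize(mention)
        match = None
        for canonical in groups:
            canon_key = normalize(canonical)
            # Normalized string match first, then fuzzy match.
            if key == canon_key or SequenceMatcher(None, key, canon_key).ratio() >= fuzzy_threshold:
                match = canonical
                break
        if match is None:
            groups[mention] = [mention]
        else:
            groups[match].append(mention)
    return groups


groups = canonicalize(["Google", "google", "Googel", "California"])
print(groups)
```

A lower fuzzy threshold merges more aggressively and raises the risk of joining distinct entities, which is one reason a separate embedding-based rescue step for borderline cases can be useful.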
6. Relation linking
Relation endpoints are linked back to canonical entity IDs using:
- exact and local mention-aware matching first
- fuzzy matching second
- optional embedding-assisted rescue last
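The ordering above is a cascade: cheaper, stricter checks run first and later stages only see what is still unresolved. A minimal sketch of the first two stages follows. The function link_endpoint, the stage labels, and the threshold are assumptions for illustration, and the embedding stage is left out.

```python
from difflib import SequenceMatcher


def link_endpoint(surface: str, entities: dict[str, str], fuzzy_threshold: float = 0.8):
    """Link a relation endpoint to an entity ID.

    `entities` maps entity ID to canonical text. Returns (entity_id, stage).
    Illustrative only; the embedding-assisted stage is omitted.
    """
    wanted = surface.strip().lower()
    # Stage 1: exact match on the lowercased surface form.
    for entity_id, canonical in entities.items():
        if canonical.lower() == wanted:
            return entity_id, "exact"
    # Stage 2: best fuzzy match, accepted only above the threshold.
    best_id, best_score = None, 0.0
    for entity_id, canonical in entities.items():
        score = SequenceMatcher(None, wanted, canonical.lower()).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_score >= fuzzy_threshold:
        return best_id, "fuzzy"
    return None, "unresolved"


entities = {"E1": "Google", "E2": "California"}
print(link_endpoint("google", entities))
print(link_endpoint("Californa", entities))
print(link_endpoint("attention", entities))
```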
7. Schema generation
Two schema views are produced:
potential_schema
- strict compatibility view
- grouped by (source_type, relation, target_type)
expanded_schema
- richer view using more refined type groupings
- keeps more semantic detail
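Grouping by (source_type, relation, target_type) amounts to counting typed edge patterns. The sketch below shows that aggregation on hand-written data. The count field and the function infer_schema are assumptions for illustration and may not match the actual potential_schema row layout.

```python
from collections import Counter

# Hand-written toy data: entity ID -> type, and (source, relation, target) edges.
entity_types = {"E1": "Organization", "E2": "Location", "E3": "Person", "E4": "Organization"}
edges = [
    ("E1", "IS_BASED_IN", "E2"),
    ("E3", "LEADS", "E1"),
    ("E4", "IS_BASED_IN", "E2"),
]


def infer_schema(edges, entity_types):
    """Count edges per (source_type, relation, target_type) pattern."""
    counts = Counter(
        (entity_types.get(s, "Unknown"), rel, entity_types.get(t, "Unknown"))
        for s, rel, t in edges
    )
    return [
        {"source_type": s, "relation": rel, "target_type": t, "count": n}
        for (s, rel, t), n in counts.most_common()
    ]


schema = infer_schema(edges, entity_types)
for row in schema:
    print(row)
```

Entities without a known type fall back to Unknown here, mirroring how weakly typed entities are kept rather than discarded.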
8. Provenance capture
Each mention and accepted relation can be traced to:
- a document
- one or more chunks
- representative evidence text
This provenance is what makes the output usable later for retrieval or graph-backed answer generation.
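As a sketch of how that tracing might look on the JSON output, the snippet below follows a relation's chunk_ids back to chunk text and the source document. The field names come from the Output Model section; the payload values and the helper trace_relation are made up for the example.

```python
# Hand-written payload fragment using field names from the Output Model section.
payload = {
    "documents": [{"id": "D1", "source": "one.txt"}],
    "chunks": [{"id": "D1:C0", "document_id": "D1", "text": "Google is based in California."}],
    "relations": [
        {
            "source_id": "E1",
            "target_id": "E2",
            "relation": "IS_BASED_IN",
            "chunk_ids": ["D1:C0"],
            "document_ids": ["D1"],
        }
    ],
}


def trace_relation(relation: dict, payload: dict) -> list[tuple[str, str]]:
    """Return (document source, chunk text) pairs that support a relation."""
    chunks = {c["id"]: c for c in payload["chunks"]}
    documents = {d["id"]: d for d in payload["documents"]}
    return [
        # Chunk text may be absent if it was omitted at extraction time.
        (documents[chunks[cid]["document_id"]]["source"], chunks[cid].get("text", ""))
        for cid in relation["chunk_ids"]
        if cid in chunks
    ]


support = trace_relation(payload["relations"][0], payload)
print(support)
```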
Execution Modes
rapidGraph supports three runtime modes.
fast
Best for:
- CPU-only quick passes
- rapid iteration
- rough graph drafts
Behavior:
- uses GLiNER plus heuristics
- does not run REBEL
- lowest startup cost
- lowest relation recall of the three modes
Example:
rapidgraph --input input.txt --mode fast --pretty
balanced
This is the default mode.
Best for:
- most local CPU runs
- practical relation quality without paying the full REBEL cost
Behavior:
- runs heuristic relations everywhere
- runs REBEL only on shortlisted high-value spans
- usually the best speed/quality tradeoff
Example:
rapidgraph --input input.txt --mode balanced --pretty
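The shortlisting idea in balanced mode can be pictured as "score every chunk cheaply, then send only the top few to the expensive model". The scoring rule below (number of entity mentions per chunk) is purely hypothetical and is not documented as rapidGraph's actual criterion; only the cap corresponds to the --max-model-spans flag.

```python
def shortlist_spans(chunk_ids: list[str], entity_counts: dict[str, int], max_model_spans: int = 4) -> list[str]:
    """Pick the chunks that would be sent to the expensive relation model.

    Hypothetical scoring: a chunk needs at least two entity mentions to
    hold a relation, and chunks with more mentions rank higher.
    """
    candidates = [cid for cid in chunk_ids if entity_counts.get(cid, 0) >= 2]
    ranked = sorted(candidates, key=lambda cid: entity_counts[cid], reverse=True)
    return ranked[:max_model_spans]


chunk_ids = ["D1:C0", "D1:C1", "D1:C2", "D1:C3"]
entity_counts = {"D1:C0": 3, "D1:C1": 1, "D1:C2": 5, "D1:C3": 2}
print(shortlist_spans(chunk_ids, entity_counts, max_model_spans=2))
```

Chunks outside the shortlist would still be covered by heuristic relation extraction, which is what keeps this mode cheaper than quality.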
quality
Best for:
- slower offline analysis
- smaller corpora
- maximum relation recall
Behavior:
- runs REBEL across all chunks
- highest model cost
- typically the slowest mode
Example:
rapidgraph --input input.txt --mode quality --pretty
Input Model
The CLI accepts exactly one of:
- --text "..." for inline text
- --input file1.txt [file2.txt ...] for file input
--text and --input are mutually exclusive.
Multi-file ingestion produces a single combined JSON result with multiple documents and chunks.
Output Model
The extractor returns a single JSON object.
entities
Each entity contains:
- id
- text
- canonical
- type
- confidence
- mentions
Each mention contains:
- text
- start
- end
- chunk_index
- document_id
- chunk_id
relations
Each relation contains:
- source_id
- target_id
- relation
- confidence
- evidence
- chunk_ids
- document_ids
potential_schema
Strict schema aggregation. This preserves backward compatibility and groups edges by:
- source_type
- relation
- target_type
expanded_schema
Richer schema aggregation that retains more type detail and gives a broader schema view than potential_schema.
documents
One row per input document:
- id
- source
- title
- text_hash
- char_count
chunks
One row per extraction chunk:
- id
- document_id
- index
- text
- start
- end
- block_index
- overlap_sentences
relation_support
One row per final accepted relation edge with merged provenance:
- source_id
- relation
- target_id
- chunk_ids
- document_ids
- evidence
meta
Contains execution metadata such as:
- model names
- thresholds
- chunk count
- elapsed time
- mode
- relation backend strategy
- REBEL usage counts
- embedding usage counts
- warning list
- fallback indicator
Example output shape
{
  "entities": [
    {
      "id": "E1",
      "text": "Google",
      "canonical": "Google",
      "type": "Organization",
      "confidence": 0.91,
      "mentions": [
        {
          "text": "Google",
          "start": 0,
          "end": 6,
          "chunk_index": 0,
          "document_id": "D1",
          "chunk_id": "D1:C0"
        }
      ]
    }
  ],
  "relations": [
    {
      "source_id": "E1",
      "target_id": "E2",
      "relation": "IS_BASED_IN",
      "confidence": 0.78,
      "evidence": "Google is based in California.",
      "chunk_ids": ["D1:C0"],
      "document_ids": ["D1"]
    }
  ],
  "potential_schema": [],
  "expanded_schema": [],
  "documents": [],
  "chunks": [],
  "relation_support": [],
  "meta": {}
}
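One way to consume a payload of this shape is to resolve the entity IDs in each relation back to readable names. The sketch below parses an inline JSON string; in practice the same function could be applied to a file written with --output. The helper to_triples and the second entity (E2) are assumptions added for the example.

```python
import json

# Inline stand-in for a file produced with --output graph.json.
raw = """
{
  "entities": [
    {"id": "E1", "canonical": "Google", "type": "Organization"},
    {"id": "E2", "canonical": "California", "type": "Location"}
  ],
  "relations": [
    {"source_id": "E1", "target_id": "E2", "relation": "IS_BASED_IN", "confidence": 0.78}
  ]
}
"""


def to_triples(payload: dict) -> list[tuple[str, str, str]]:
    """Turn ID-based relations into (source, relation, target) name triples."""
    names = {entity["id"]: entity["canonical"] for entity in payload["entities"]}
    return [
        (names[rel["source_id"]], rel["relation"], names[rel["target_id"]])
        for rel in payload["relations"]
    ]


triples = to_triples(json.loads(raw))
print(triples)
```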
CLI Reference
Input and output
--text TEXT
Inline text to process.
Example:
rapidgraph --text "Transformer uses attention." --pretty
--input INPUT [INPUT ...]
One or more UTF-8 text files.
Examples:
rapidgraph --input input.txt
rapidgraph --input input.txt input2.txt
--output OUTPUT
Write JSON to a file instead of stdout.
Example:
rapidgraph --input input.txt --output graph.json --pretty
--pretty
Pretty-print JSON output.
Thresholds and chunking
--entity-threshold
Default: 0.35
Minimum confidence for keeping entity candidates.
--relation-threshold
Default: 0.2
Minimum confidence for keeping relation candidates.
--max-chars
Default: 600
Approximate chunk size budget.
Higher values:
- preserve more context
- may improve some relations
- increase compute cost
--chunk-mode {paragraph,sentence}
Default: paragraph
Controls chunk construction.
--chunk-overlap
Default: 1
Number of overlapping sentences preserved between neighboring chunks.
Example:
rapidgraph --input input.txt --chunk-overlap 2
Runtime mode and relation strategy
--mode {fast,balanced,quality}
Default: balanced
Controls the speed/quality tradeoff.
--max-model-spans
Default: 4
Balanced-mode only. Caps how many shortlisted spans go through REBEL.
Example:
rapidgraph --input input.txt --mode balanced --max-model-spans 6
--disable-rebel
Force heuristic-only relation extraction regardless of mode.
Example:
rapidgraph --input input.txt --mode quality --disable-rebel
Entity canonicalization scope
--entity-scope {document,corpus}
Default: document
Controls whether compatible entities can merge across files.
Use document when:
- files are independent
- names may be ambiguous across documents
- you want safer graph boundaries
Use corpus when:
- the files describe the same topic or domain
- you want one merged entity layer across the corpus
- you are building a shared graph for Neo4j or GraphRAG
Examples:
rapidgraph --input input.txt input2.txt --entity-scope document
rapidgraph --input input.txt input2.txt --entity-scope corpus
Embedding-assisted linking
These flags are optional. They are not enabled by default.
--embedding-linking
Enable embedding-assisted rescue for ambiguous entity merges and unresolved relation endpoints.
--embedding-model
Default:
sentence-transformers/all-MiniLM-L6-v2
--embedding-threshold
Default: 0.84
Cosine similarity threshold for accepting embedding-assisted merge or link candidates.
--embedding-cache-dir
Default:
.cache/extract_graph_embeddings
Embedding vectors are cached locally in a SQLite database.
--embedding-max-candidates
Default: 8
Maximum number of candidates considered in embedding-assisted linking for an unresolved mention.
Example:
rapidgraph \
--input input.txt input2.txt \
--entity-scope corpus \
--embedding-linking \
--embedding-threshold 0.84 \
--embedding-max-candidates 8 \
--pretty
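To show what a cosine similarity threshold such as --embedding-threshold means in practice, here is a small pure-Python sketch. The toy two-dimensional vectors and the helper accept_link are made up; real embeddings from a sentence-transformers model have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def accept_link(mention_vec: list[float], candidate_vec: list[float], threshold: float = 0.84) -> bool:
    """Accept a merge or link candidate only at or above the threshold."""
    return cosine_similarity(mention_vec, candidate_vec) >= threshold


print(accept_link([1.0, 0.5], [1.0, 0.4]))  # nearly parallel vectors
print(accept_link([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Raising the threshold makes the rescue step more conservative; lowering it accepts more borderline candidates.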
Provenance controls
--include-chunk-text
Default: enabled
Include chunk text inside the chunks array.
--no-include-chunk-text
Omit chunk text while keeping chunk metadata.
--omit-provenance-text
Alias for omitting chunk text while preserving chunk IDs and provenance structure.
Examples:
rapidgraph --input input.txt --no-include-chunk-text
rapidgraph --input input.txt --omit-provenance-text
Neo4j export
These flags are optional. Without them, the CLI only prints or writes JSON.
--neo4j-uri
Neo4j URI, for example:
neo4j://127.0.0.1:7687
--neo4j-user
Neo4j username.
--neo4j-password
Neo4j password.
--neo4j-database
Default: neo4j
Neo4j database name.
--neo4j-clean-document
Deletes matching document subgraphs before re-ingesting them.
Useful when rerunning the same files and you do not want duplicate document/chunk subgraphs.
--neo4j-embed-chunks
Generate embeddings for exported Chunk nodes and store them in Neo4j.
--neo4j-create-vector-index
Create a Neo4j vector index for Chunk embeddings. This is required for rapidgraph ask.
--neo4j-vector-index-name
Default: rapidgraph_chunk_embedding
Name of the Neo4j vector index used for chunk retrieval.
--neo4j-embedding-property
Default: embedding
Property on Chunk nodes used to store vector embeddings.
--chunk-embedding-model
Default:
sentence-transformers/all-MiniLM-L6-v2
Sentence-transformers model used for chunk embeddings.
Example:
rapidgraph \
--input input.txt input2.txt \
--mode quality \
--entity-scope corpus \
--neo4j-uri neo4j://127.0.0.1:7687 \
--neo4j-user neo4j \
--neo4j-password 12345678 \
--neo4j-database neo4j \
--neo4j-clean-document \
--neo4j-embed-chunks \
--neo4j-create-vector-index
Logging
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Default: WARNING
Controls CLI log verbosity.
Example:
rapidgraph --input input.txt --log-level DEBUG
Recommended Flag Combinations
Fastest CPU pass
rapidgraph --input input.txt --mode fast --pretty
Best default for most users
rapidgraph --input input.txt --mode balanced --pretty
Higher-recall single document run
rapidgraph --input input.txt --mode quality --chunk-overlap 2 --pretty
Multi-file corpus merge
rapidgraph \
--input input.txt input2.txt \
--mode balanced \
--entity-scope corpus \
--pretty
Multi-file corpus with stronger ambiguous-link rescue
rapidgraph \
--input input.txt input2.txt \
--mode balanced \
--entity-scope corpus \
--embedding-linking \
--pretty
Smaller provenance payload
rapidgraph \
--input input.txt \
--omit-provenance-text \
--pretty
Neo4j ingestion with document replacement
rapidgraph \
--input input.txt input2.txt \
--mode quality \
--entity-scope corpus \
--neo4j-uri neo4j://127.0.0.1:7687 \
--neo4j-user neo4j \
--neo4j-password 12345678 \
--neo4j-database neo4j \
--neo4j-clean-document \
--neo4j-embed-chunks \
--neo4j-create-vector-index
Neo4j Export Model
When Neo4j export is enabled, the graph currently uses:
Node labels:
- Document
- Chunk
- Entity
Relationship types:
- HAS_CHUNK
- MENTIONS
- RELATES_TO
Important detail:
- the semantic edge label, such as IS_BASED_IN, USES, or DERIVED_FROM, is stored as a property on RELATES_TO
- this is why Neo4j Browser may show many RELATES_TO relationships while the actual semantic relation name is visible in r.relation
Example query:
MATCH (s:Entity)-[r:RELATES_TO]->(t:Entity)
RETURN s.text, r.relation, t.text, r.evidence
ORDER BY r.relation
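Because the semantic name lives in r.relation, filtering by relation is a property filter rather than a relationship-type match. The sketch below runs such a query with the official neo4j Python driver. The function fetch_relations and the column aliases are assumptions for illustration, and it needs a reachable Neo4j instance plus the neo4j extra to actually return rows.

```python
QUERY = """
MATCH (s:Entity)-[r:RELATES_TO]->(t:Entity)
WHERE r.relation = $relation
RETURN s.text AS source, r.relation AS relation, t.text AS target, r.evidence AS evidence
"""


def fetch_relations(uri: str, user: str, password: str, relation: str, database: str = "neo4j") -> list[dict]:
    """Return all RELATES_TO edges whose relation property equals `relation`."""
    # Imported lazily because the driver is an optional dependency.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session(database=database) as session:
            result = session.run(QUERY, relation=relation)
            return [record.data() for record in result]


# Example call (requires a running Neo4j instance):
# rows = fetch_relations("neo4j://127.0.0.1:7687", "neo4j", "<password>", "IS_BASED_IN")
```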
GraphRAG Ask Mode
rapidgraph ask lets users ask questions against a Neo4j graph that was exported with chunk embeddings.
Required setup:
pip install "rapidGraph[graphrag]"
First export data with embeddings and a vector index:
rapidgraph \
--input input.txt \
--mode balanced \
--neo4j-uri neo4j://127.0.0.1:7687 \
--neo4j-user neo4j \
--neo4j-password 12345678 \
--neo4j-database neo4j \
--neo4j-clean-document \
--neo4j-embed-chunks \
--neo4j-create-vector-index
Then ask a question:
rapidgraph ask \
--question "What are the main relations in this document?" \
--neo4j-uri neo4j://127.0.0.1:7687 \
--neo4j-user neo4j \
--neo4j-password 12345678 \
--neo4j-database neo4j \
--ollama-host http://127.0.0.1:11434 \
--ollama-model llama3.2 \
--top-k 5 \
--graph-depth 1 \
--max-facts 20 \
--pretty
Ask mode retrieval flow:
- embed the question
- query the Neo4j vector index over Chunk.embedding
- expand from retrieved chunks to mentioned entities
- collect nearby RELATES_TO facts
- build a compact context packet
- ask Ollama to answer using only that context
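The retrieval steps above can be mimicked in memory to see how the pieces fit together. This is a toy sketch under loud assumptions: the vectors are hand-written rather than produced by an embedding model, the data structures and the function build_context are hypothetical, Neo4j is replaced by plain lists, and the final LLM call is omitted.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_context(question_vec, chunks, mentions, facts, top_k=5, max_facts=20):
    """Toy version of the retrieval flow, stopping before the LLM call."""
    # Rank chunks by similarity to the question embedding and keep top_k.
    ranked = sorted(chunks, key=lambda c: cosine(question_vec, c["embedding"]), reverse=True)[:top_k]
    # Expand from retrieved chunks to the entities they mention.
    entities = {name for c in ranked for name in mentions.get(c["id"], [])}
    # Collect facts that touch any of those entities, capped at max_facts.
    nearby = [f for f in facts if f[0] in entities or f[2] in entities][:max_facts]
    # Compact context packet that would be handed to the LLM.
    return {"sources": [c["text"] for c in ranked], "facts": nearby}


chunks = [
    {"id": "D1:C0", "text": "Transformer uses attention.", "embedding": [1.0, 0.0]},
    {"id": "D1:C1", "text": "Google is based in California.", "embedding": [0.0, 1.0]},
]
mentions = {"D1:C0": ["Transformer", "attention"], "D1:C1": ["Google", "California"]}
facts = [("Transformer", "USES", "attention"), ("Google", "IS_BASED_IN", "California")]

context = build_context([0.9, 0.1], chunks, mentions, facts, top_k=1)
print(context)
```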
Ask mode output contains:
- answer
- sources
- facts
- meta
Python Library Usage
Basic usage
from rapidgraph import DocumentInput, build_default_extractor

extractor = build_default_extractor(mode="balanced")
result = extractor.extract_documents(
    [
        DocumentInput(
            text="Google is based in California.",
            source="one.txt",
            title="one.txt",
        ),
        DocumentInput(
            text="Sundar Pichai leads Google.",
            source="two.txt",
            title="two.txt",
        ),
    ],
    entity_scope="corpus",
)
print(result.model_dump())
Single-document usage
from rapidgraph import build_default_extractor

extractor = build_default_extractor(mode="fast")
result = extractor.extract(
    "Transformer uses multi-head attention.",
    include_chunk_text=True,
)
print(result.model_dump_json(indent=2))
Configurable extractor construction
from rapidgraph import build_default_extractor

extractor = build_default_extractor(
    max_chars=800,
    chunk_mode="paragraph",
    chunk_overlap=2,
    mode="balanced",
    max_model_spans=6,
    embedding_linking=True,
)
Python library usage with Neo4j export
This example extracts a graph in Python and then writes it directly to Neo4j.
from rapidgraph import build_default_extractor, export_graph_to_neo4j

extractor = build_default_extractor(
    mode="balanced",
    chunk_mode="paragraph",
    chunk_overlap=1,
)
result = extractor.extract(
    """
Google is based in California.
Sundar Pichai leads Google.
""",
    entity_scope="document",
    include_chunk_text=True,
)
export_graph_to_neo4j(
    result,
    uri="neo4j://127.0.0.1:7687",
    user="neo4j",
    password="12345678",
    database="neo4j",
    clean_document=True,
)
If you want to use the Neo4j helper, install the extra first:
pip install "rapidGraph[neo4j]"
Python library usage with GraphRAG ask
from rapidgraph import GraphRAGClient, Neo4jVectorRetriever, OllamaLLM

retriever = Neo4jVectorRetriever(
    uri="neo4j://127.0.0.1:7687",
    user="neo4j",
    password="12345678",
    database="neo4j",
)
llm = OllamaLLM(
    model="llama3.2",
    host="http://127.0.0.1:11434",
)
client = GraphRAGClient(retriever=retriever, llm=llm)
answer = client.ask(
    "What does the graph say about attention?",
    top_k=5,
    graph_depth=1,
    max_facts=20,
)
print(answer.model_dump_json(indent=2))
Performance and Practical Notes
CPU expectations
- fast is the cheapest mode
- balanced is usually the best practical CPU choice
- quality may be significantly slower because it runs REBEL on every chunk
First-run cost
The first run may be slower because model weights may need to be loaded or downloaded.
Hugging Face access
Some backends download models from the Hugging Face Hub if they are not already present locally.
Optional environment variables:
- HF_TOKEN: useful for higher rate limits
- HF_HUB_OFFLINE=1: useful if models are already cached locally and you want fully offline behavior
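When using the library from Python rather than the CLI, these variables can also be set in-process. A minimal sketch, assuming the variables are set before any Hugging Face library is imported, since such libraries generally read them at import time:

```python
import os

# Set offline mode before importing anything that pulls in Hugging Face libraries.
# setdefault keeps a value that was already exported in the shell.
os.environ.setdefault("HF_HUB_OFFLINE", "1")

# HF_TOKEN is read from the environment if present; avoid hard-coding it in source.
has_token = bool(os.environ.get("HF_TOKEN"))
print("offline:", os.environ["HF_HUB_OFFLINE"], "token set:", has_token)

# Only after this point would one import rapidgraph and build an extractor.
```

Offline mode only works if the required model weights are already in the local cache; otherwise model loading is expected to fail rather than download.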
Why chunking matters
Chunking directly affects:
- relation recall
- context preservation
- runtime
- schema richness
Too-small chunks can lose relation context. Too-large chunks can increase noise and runtime. paragraph mode with a small overlap is a good default.
Troubleshooting
rapidgraph command not found
Make sure the package is installed in the active environment:
pip install rapidGraph
Slow first run
Expected if models are being downloaded or loaded for the first time.
Hugging Face warnings
Warnings about unauthenticated requests are not fatal. Set HF_TOKEN if you want authenticated Hub access.
Real PyPI vs TestPyPI
To install from TestPyPI:
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple rapidGraph
For the public release:
pip install rapidGraph
Neo4j shows only RELATES_TO
That is expected. The semantic relationship name is stored in the relation property, not as a separate Neo4j relationship type.
Why expanded_schema can be larger than potential_schema
potential_schema is intentionally strict and compatibility-focused. expanded_schema preserves finer type detail and therefore often contains more rows.
Development
Install development dependencies:
pip install ".[dev]"
Run the main test suite:
pytest -q tests/test_extract_graph.py
Build the package:
python -m build
Validate package metadata:
python -m twine check dist/*
Publishing
TestPyPI
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple rapidGraph
Real PyPI
pip install rapidGraph
This repository includes GitHub Actions workflows for:
- TestPyPI publishing
- real PyPI publishing
License
MIT