Open-domain text-to-graph extractor with entities, relations, schema inference, and Neo4j export.
rapidGraph
rapidGraph is a local-first, open-domain text-to-graph extractor for arbitrary text. It turns raw text files or inline text into structured JSON containing:
- entities
- relations
- potential_schema
- expanded_schema
- provenance-aware documents, chunks, and relation_support
It is designed for:
- general entity and relation extraction across business, technical, scientific, and mixed-topic text
- CPU-friendly local runs with selectable quality modes
- provenance-aware graph building for future RAG or GraphRAG pipelines
- optional direct Neo4j ingestion
The public distribution name is rapidGraph, the Python import package is rapidgraph, and the installed CLI command is rapidgraph.
What It Does
At a high level, rapidGraph:
- normalizes raw text
- splits it into chunked spans
- extracts entity candidates
- extracts relation candidates
- canonicalizes duplicate or near-duplicate entity mentions
- links relation endpoints back to canonical entities
- infers schema patterns from the final graph
- preserves chunk/document provenance for downstream graph and retrieval use
Extraction is open-domain and best-effort. rapidGraph does not enforce a fixed ontology, and it keeps the type Unknown when typing confidence is weak.
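As a rough illustration of the canonicalization step and the Unknown fallback described above, the sketch below groups mentions by a normalized key and keeps a type only when its confidence clears a threshold. It is not rapidGraph's implementation: normalize_key, canonicalize, and type_threshold are hypothetical names, and the real extractor may merge near-duplicates with richer rules.

```python
from collections import defaultdict


def normalize_key(text):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = text.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def canonicalize(mentions, type_threshold=0.5):
    """Merge mentions that share a normalized key (hypothetical helper).

    Each mention is a dict with "text", "type", and "confidence".
    """
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize_key(mention["text"])].append(mention)

    entities = []
    for group in groups.values():
        best = max(group, key=lambda m: m["confidence"])
        # Assumption: weakly typed entities fall back to "Unknown".
        entity_type = best["type"] if best["confidence"] >= type_threshold else "Unknown"
        entities.append(
            {
                "canonical": best["text"],
                "type": entity_type,
                "mentions": [m["text"] for m in group],
            }
        )
    return entities


mentions = [
    {"text": "Google", "type": "Organization", "confidence": 0.9},
    {"text": "google", "type": "Organization", "confidence": 0.6},
    {"text": "Acme Corp.", "type": "Person", "confidence": 0.2},
]
for entity in canonicalize(mentions):
    print(entity["canonical"], entity["type"], entity["mentions"])
```

The two spellings of Google collapse into one entity, while the low-confidence mention keeps its text but loses its guessed type.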
Core Features
- Open-domain entity extraction
- Open-domain relation extraction
- Schema inference from observed graph edges
- Provenance-aware output with documents, chunks, and relation support records
- Multi-file corpus ingestion in one run
- Two canonicalization scopes:
  - document: keep each file independent
  - corpus: merge compatible entities across files
- Three CPU-aware execution modes:
  - fast
  - balanced
  - quality
- Optional embedding-assisted canonicalization and linking
- Optional Neo4j export
Install
Install from source:
pip install .
Install with optional extras:
pip install ".[neo4j]"
pip install ".[embeddings]"
pip install ".[dev]"
pip install ".[neo4j,embeddings,dev]"
After publishing to PyPI, users will be able to install with:
pip install rapidGraph
PyPI extras will work the same way:
pip install "rapidGraph[neo4j]"
pip install "rapidGraph[embeddings]"
pip install "rapidGraph[dev]"
CLI Quick Start
Show help:
rapidgraph --help
Process inline text:
rapidgraph --text "Google is based in California." --pretty
Process one file:
rapidgraph --input input.txt --pretty
Process multiple files:
rapidgraph --input input.txt input2.txt --pretty
Write output to JSON:
rapidgraph --input input.txt --output graph.json --pretty
The repo-root compatibility command still works:
python extract_graph.py --input input.txt --pretty
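If you prefer to drive the CLI from a script, one option is to assemble the argument list in Python and hand it to subprocess. The sketch below uses only flags documented in this README; build_rapidgraph_command and run_rapidgraph are hypothetical helper names, and running the second function assumes the rapidgraph command is installed and on PATH.

```python
import subprocess


def build_rapidgraph_command(inputs, output=None, mode=None, pretty=True):
    """Assemble an argv list for the rapidgraph CLI (hypothetical helper)."""
    cmd = ["rapidgraph", "--input", *inputs]
    if mode is not None:
        cmd += ["--mode", mode]
    if output is not None:
        cmd += ["--output", output]
    if pretty:
        cmd.append("--pretty")
    return cmd


def run_rapidgraph(inputs, **options):
    """Run the CLI and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit code.
    """
    completed = subprocess.run(
        build_rapidgraph_command(inputs, **options),
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


print(build_rapidgraph_command(["input.txt", "input2.txt"], output="graph.json"))
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.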
Execution Modes
rapidGraph supports three relation extraction modes.
fast
Best for:
- CPU-only quick passes
- bulk experiments
- basic graph drafts
Behavior:
- uses GLiNER and heuristics
- does not run REBEL
- fastest startup and lowest CPU cost
balanced
This is the default mode.
Best for:
- normal CPU usage
- better relation quality without full model cost
Behavior:
- runs heuristics everywhere
- runs REBEL only on shortlisted high-value spans
- usually the best tradeoff
quality
Best for:
- maximum relation recall
- slower offline analysis
- smaller corpora where quality matters more than throughput
Behavior:
- runs REBEL across all chunks
- highest model cost
Input Model
The CLI accepts either:
- --text "..." for inline text
- --input file1.txt [file2.txt ...] for one or more text files
--text and --input are mutually exclusive.
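For readers wiring up a similar interface, mutual exclusion of this kind can be expressed with argparse. This is a minimal sketch and not rapidGraph's actual parser: the program name rapidgraph-sketch is made up, and treating one of the two flags as required is an assumption.

```python
import argparse


def build_parser():
    """Parser with two mutually exclusive input sources (illustrative only)."""
    parser = argparse.ArgumentParser(prog="rapidgraph-sketch")
    # Assumption: exactly one of --text / --input must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--text", help="inline text")
    source.add_argument("--input", nargs="+", help="one or more UTF-8 text files")
    return parser


args = build_parser().parse_args(["--input", "input.txt", "input2.txt"])
print(args.input, args.text)
```

With this setup, argparse itself rejects a command line that supplies both flags and exits with a usage error.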
Output Model
The extractor returns one combined JSON object with these top-level fields.
entities
Each entity includes:
- id
- text
- canonical
- type
- confidence
- mentions
Each mention includes:
- text
- start
- end
- chunk_index
- document_id
- chunk_id
relations
Each relation includes:
- source_id
- target_id
- relation
- confidence
- evidence
- chunk_ids
- document_ids
potential_schema
Strict schema aggregation using:
(source_type, relation, target_type)
This is the backward-compatible schema view.
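To make the (source_type, relation, target_type) aggregation concrete, here is a small sketch that counts schema triples from entity and relation records shaped like the fields listed above. aggregate_schema is a hypothetical name, the sample IDs and relation labels are invented, and the real aggregation may track more than a count.

```python
from collections import Counter


def aggregate_schema(entities, relations):
    """Count (source_type, relation, target_type) triples (illustrative only)."""
    type_by_id = {entity["id"]: entity["type"] for entity in entities}
    return Counter(
        (
            type_by_id.get(rel["source_id"], "Unknown"),
            rel["relation"],
            type_by_id.get(rel["target_id"], "Unknown"),
        )
        for rel in relations
    )


entities = [
    {"id": "e1", "type": "Organization"},
    {"id": "e2", "type": "Location"},
    {"id": "e3", "type": "Organization"},
]
relations = [
    {"source_id": "e1", "target_id": "e2", "relation": "based_in"},
    {"source_id": "e3", "target_id": "e2", "relation": "based_in"},
    {"source_id": "e1", "target_id": "e3", "relation": "acquired"},
]
for triple, count in aggregate_schema(entities, relations).most_common():
    print(triple, count)
```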
expanded_schema
Richer schema aggregation using finer-grained normalized types and more examples.
documents
One document row per input source:
- id
- source
- title
- text_hash
- char_count
chunks
Each chunk includes:
- id
- document_id
- index
- text (unless omitted)
- start
- end
- block_index
- overlap_sentences
relation_support
One row per final relation edge with merged provenance:
- source_id
- relation
- target_id
- chunk_ids
- document_ids
- evidence
meta
Includes model names, thresholds, chunk counts, mode, embedding stats, relation backend stats, warnings, and processing time.
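A downstream consumer mostly needs to join relations back to entities by ID. The sketch below does that for a tiny hand-written payload that uses the field names documented above; the IDs, relation labels, and the readable_triples helper are illustrative assumptions, and a real output file contains more fields than shown here.

```python
import json

SAMPLE = """
{
  "entities": [
    {"id": "e1", "canonical": "Google", "type": "Organization"},
    {"id": "e2", "canonical": "California", "type": "Location"}
  ],
  "relations": [
    {"source_id": "e1", "target_id": "e2", "relation": "based_in", "confidence": 0.8},
    {"source_id": "e2", "target_id": "e1", "relation": "mentions", "confidence": 0.1}
  ]
}
"""


def readable_triples(graph, min_confidence=0.0):
    """Resolve relation endpoints to canonical names (hypothetical helper)."""
    names = {entity["id"]: entity["canonical"] for entity in graph["entities"]}
    return [
        # Fall back to the raw ID if an endpoint is missing from "entities".
        (names.get(rel["source_id"], rel["source_id"]),
         rel["relation"],
         names.get(rel["target_id"], rel["target_id"]))
        for rel in graph["relations"]
        if rel["confidence"] >= min_confidence
    ]


graph = json.loads(SAMPLE)
for triple in readable_triples(graph, min_confidence=0.3):
    print(triple)
```

To read a file written with --output, replace json.loads(SAMPLE) with json.load on the opened file.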
Flag Reference
Input and Output Flags
--text TEXT
Inline text input.
Example:
rapidgraph --text "Transformer uses self-attention." --pretty
--input INPUT [INPUT ...]
One or more UTF-8 text files.
Examples:
rapidgraph --input input.txt
rapidgraph --input input.txt input2.txt
--output OUTPUT
Write JSON to a file instead of stdout.
Example:
rapidgraph --input input.txt --output graph.json --pretty
--pretty
Pretty-print JSON output.
Quality and Runtime Flags
--mode {fast,balanced,quality}
Controls the CPU and quality tradeoff.
Examples:
rapidgraph --input input.txt --mode fast
rapidgraph --input input.txt --mode balanced
rapidgraph --input input.txt --mode quality
--disable-rebel
Forces heuristic-only relation extraction even if the mode would otherwise use REBEL.
Example:
rapidgraph --input input.txt --mode quality --disable-rebel
--max-model-spans MAX_MODEL_SPANS
Only has a meaningful effect in balanced mode. Caps the number of shortlisted spans sent to REBEL.
Example:
rapidgraph --input input.txt --mode balanced --max-model-spans 6
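The interaction between --mode, --disable-rebel, and --max-model-spans can be summarized as a small selection function. This is a sketch of the behavior described in this README, not the actual shortlisting code: spans_for_model is a hypothetical name, and the per-chunk scores stand in for whatever "high-value" ranking the extractor uses internally.

```python
def spans_for_model(mode, scored_spans, max_model_spans, disable_rebel=False):
    """Return the chunk indices that would be sent to the relation model.

    scored_spans is a list of (chunk_index, score) pairs; the score is an
    assumed stand-in for the extractor's own shortlisting signal.
    """
    if disable_rebel or mode == "fast":
        return []  # heuristics only
    if mode == "quality":
        return [index for index, _ in scored_spans]  # every chunk
    if mode == "balanced":
        ranked = sorted(scored_spans, key=lambda item: item[1], reverse=True)
        return sorted(index for index, _ in ranked[:max_model_spans])
    raise ValueError(f"unknown mode: {mode!r}")


spans = [(0, 0.2), (1, 0.9), (2, 0.5), (3, 0.7)]
for mode in ("fast", "balanced", "quality"):
    print(mode, spans_for_model(mode, spans, max_model_spans=2))
```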
Extraction Threshold Flags
--entity-threshold ENTITY_THRESHOLD
Minimum confidence used to keep entity candidates.
Example:
rapidgraph --input input.txt --entity-threshold 0.45
--relation-threshold RELATION_THRESHOLD
Minimum confidence used to keep relations.
Example:
rapidgraph --input input.txt --relation-threshold 0.3
--max-chars MAX_CHARS
Chunk size budget. Larger values preserve more context but cost more runtime.
Example:
rapidgraph --input input.txt --max-chars 1400
Chunking Flags
--chunk-mode {paragraph,sentence}
Controls chunk construction.
- paragraph: structure-aware paragraph-first chunking
- sentence: simpler sentence packing
Examples:
rapidgraph --input input.txt --chunk-mode paragraph
rapidgraph --input input.txt --chunk-mode sentence
--chunk-overlap CHUNK_OVERLAP
Sentence overlap between neighboring chunks. Higher values preserve context across chunk boundaries but increase redundancy.
Example:
rapidgraph --input input.txt --chunk-overlap 2
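To see how a character budget and a sentence overlap interact, consider this simplified sentence packer. It is only a sketch of the general technique: split_sentences and pack_sentences are hypothetical names, the regex splitter is naive, and rapidGraph's own chunker (especially paragraph mode) is likely more careful.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, max_chars, overlap=0):
    """Greedily pack sentences into chunks of roughly max_chars characters.

    The last `overlap` sentences of a finished chunk are repeated at the
    start of the next one, so a chunk can exceed the budget slightly.
    """
    chunks = []
    current = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = split_sentences("A one. B two. C three. D four.")
print(pack_sentences(sentences, max_chars=14))
print(pack_sentences(sentences, max_chars=14, overlap=1))
```

With overlap set to 1, each chunk after the first begins with the final sentence of its predecessor, which is the redundancy the flag description warns about.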
Multi-File and Canonicalization Flags
--entity-scope {document,corpus}
Controls how entities are canonicalized across multiple files.
- document: identical entities in different files stay separate
- corpus: compatible entities can merge across files
Examples:
rapidgraph --input input.txt input2.txt --entity-scope document
rapidgraph --input input.txt input2.txt --entity-scope corpus
Use document when:
- document-local provenance matters most
- names are ambiguous across files
- you want a safer default
Use corpus when:
- the files are about a shared topic
- you want a consolidated graph across the corpus
- you plan to export one merged graph to Neo4j
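One way to picture the two scopes is as a choice of merge key: document scope includes the document ID in the key, corpus scope does not. The sketch below assumes exact matching on a normalized name, which is stricter than the "compatible entities" rule described above; canonical_key and count_entities are hypothetical names.

```python
def canonical_key(name, document_id, entity_scope):
    """Build the key under which mentions are merged (illustrative only)."""
    normalized = " ".join(name.lower().split())
    if entity_scope == "document":
        return (document_id, normalized)
    if entity_scope == "corpus":
        return (normalized,)
    raise ValueError(f"unknown entity scope: {entity_scope!r}")


def count_entities(mentions, entity_scope):
    """Number of distinct entities left after merging by key."""
    keys = {canonical_key(m["text"], m["document_id"], entity_scope) for m in mentions}
    return len(keys)


mentions = [
    {"text": "Google", "document_id": "doc1"},
    {"text": "Google", "document_id": "doc2"},
    {"text": "Sundar Pichai", "document_id": "doc2"},
]
print(count_entities(mentions, "document"))
print(count_entities(mentions, "corpus"))
```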
Provenance Flags
--include-chunk-text
Include full chunk text in the chunks array. This is the default.
--no-include-chunk-text
Keep chunk records but omit chunk text.
--omit-provenance-text
Alias for omitting chunk text while preserving chunk IDs and metadata.
Examples:
rapidgraph --input input.txt --no-include-chunk-text
rapidgraph --input input.txt --omit-provenance-text
Embedding-Assisted Linking Flags
These are opt-in. They are not enabled by default.
--embedding-linking
Enable embedding-assisted rescue for ambiguous entity merges and unresolved relation endpoints.
--embedding-model EMBEDDING_MODEL
Sentence embedding model to use. Default:
sentence-transformers/all-MiniLM-L6-v2
--embedding-threshold EMBEDDING_THRESHOLD
Cosine similarity threshold for accepting embedding-based merges or links.
--embedding-cache-dir EMBEDDING_CACHE_DIR
Local cache directory for embedding vectors.
--embedding-max-candidates EMBEDDING_MAX_CANDIDATES
Caps the candidate pool used during embedding-assisted linking.
Examples:
rapidgraph \
  --input input.txt \
  --embedding-linking \
  --embedding-threshold 0.84 \
  --embedding-cache-dir .cache/extract_graph_embeddings

rapidgraph \
  --input input.txt input2.txt \
  --entity-scope corpus \
  --embedding-linking \
  --embedding-max-candidates 8
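The role of --embedding-threshold and --embedding-max-candidates can be illustrated with plain cosine similarity over toy vectors. This sketch makes several assumptions: best_link is a hypothetical name, the candidate pool is capped simply by list order, and real vectors would come from the configured sentence embedding model instead of being written by hand.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_link(query_vec, candidates, threshold, max_candidates):
    """Pick the most similar candidate ID, or None if none reaches threshold.

    candidates is a list of (entity_id, vector) pairs.
    """
    scored = [
        (cosine_similarity(query_vec, vec), entity_id)
        for entity_id, vec in candidates[:max_candidates]
    ]
    if not scored:
        return None
    score, entity_id = max(scored, key=lambda item: item[0])
    return entity_id if score >= threshold else None


candidates = [("e1", [1.0, 0.1]), ("e2", [0.0, 1.0])]
print(best_link([1.0, 0.0], candidates, threshold=0.84, max_candidates=8))
```

A higher threshold makes merges rarer but safer, and a smaller candidate cap trades recall for speed.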
Neo4j Flags
These flags are optional. If omitted, the extractor only emits JSON.
--neo4j-uri NEO4J_URI
Neo4j URI such as:
neo4j://127.0.0.1:7687
--neo4j-user NEO4J_USER
Neo4j username.
--neo4j-password NEO4J_PASSWORD
Neo4j password.
--neo4j-database NEO4J_DATABASE
Target Neo4j database name.
--neo4j-clean-document
Delete matching document subgraphs before re-ingesting them. Useful when rerunning the same document set.
Example:
rapidgraph \
  --input input.txt input2.txt \
  --mode quality \
  --entity-scope corpus \
  --neo4j-uri neo4j://127.0.0.1:7687 \
  --neo4j-user neo4j \
  --neo4j-password 12345678 \
  --neo4j-database neo4j \
  --neo4j-clean-document
Logging Flag
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Controls CLI log verbosity.
Example:
rapidgraph --input input.txt --log-level DEBUG
Recommended Flag Combinations
Quick CPU pass
rapidgraph --input input.txt --mode fast --pretty
Best default for most users
rapidgraph --input input.txt --mode balanced --pretty
Higher recall on one document
rapidgraph --input input.txt --mode quality --chunk-overlap 2 --pretty
Multi-file corpus graph
rapidgraph \
  --input input.txt input2.txt \
  --mode balanced \
  --entity-scope corpus \
  --pretty
Multi-file corpus with stronger cross-file merging
rapidgraph \
  --input input.txt input2.txt \
  --mode balanced \
  --entity-scope corpus \
  --embedding-linking \
  --pretty
Lean provenance payload
rapidgraph \
  --input input.txt \
  --omit-provenance-text \
  --pretty
Neo4j export with replacement of existing document graph
rapidgraph \
  --input input.txt input2.txt \
  --mode quality \
  --entity-scope corpus \
  --neo4j-uri neo4j://127.0.0.1:7687 \
  --neo4j-user neo4j \
  --neo4j-password 12345678 \
  --neo4j-database neo4j \
  --neo4j-clean-document
Python Library Usage
Basic usage:
from rapidgraph import DocumentInput, build_default_extractor

extractor = build_default_extractor(mode="balanced")
result = extractor.extract_documents(
    [
        DocumentInput(
            text="Google is based in California.",
            source="one.txt",
            title="one.txt",
        ),
        DocumentInput(
            text="Google hired Sundar Pichai.",
            source="two.txt",
            title="two.txt",
        ),
    ],
    entity_scope="corpus",
)
print(result.model_dump())
Neo4j Graph Shape
When Neo4j export is enabled, the graph is designed to remain compatible with future GraphRAG workflows.
Current node labels:
- Document
- Chunk
- Entity
Current relationship types:
- HAS_CHUNK
- MENTIONS
- RELATES_TO
The semantic relation name is stored as a property on RELATES_TO, which is why Neo4j Browser shows one relationship type while preserving relation semantics in properties.
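Because the semantic name lives in a property, queries and writes filter on that property instead of on the relationship type. The sketch below builds a parameterized Cypher statement in that style. The property names id, relation, and confidence are assumptions (this README does not list the exported properties), and relates_to_statement is a hypothetical helper, not part of rapidGraph.

```python
def relates_to_statement(source_id, target_id, relation, confidence):
    """Return a (query, params) pair for one RELATES_TO edge (illustrative).

    Assumed property names: Entity.id, RELATES_TO.relation,
    RELATES_TO.confidence.
    """
    query = (
        "MATCH (s:Entity {id: $source_id}) "
        "MATCH (t:Entity {id: $target_id}) "
        "MERGE (s)-[r:RELATES_TO {relation: $relation}]->(t) "
        "SET r.confidence = $confidence"
    )
    params = {
        "source_id": source_id,
        "target_id": target_id,
        "relation": relation,
        "confidence": confidence,
    }
    return query, params


query, params = relates_to_statement("e1", "e2", "based_in", 0.8)
print(query)
print(params)
```

Keeping the relation name in the parameters, not in the query text, means one statement shape serves every relation label the extractor produces.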
Packaging
Build distributions:
python -m build
Validate package metadata:
python -m twine check dist/*
Install from a built wheel:
pip install dist/rapidgraph-0.1.0-py3-none-any.whl
Publishing to PyPI
Create a PyPI account, generate an API token, then upload:
python -m twine upload dist/*
If the rapidGraph name is accepted on PyPI, users will be able to install with:
pip install rapidGraph
Development
Install dev dependencies:
pip install ".[dev]"
Run tests:
pytest -q tests/test_extract_graph.py
Build the package:
python -m build
License
MIT
File details
Details for the file rapidgraph-0.1.0.tar.gz.
File metadata
- Download URL: rapidgraph-0.1.0.tar.gz
- Upload date:
- Size: 37.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad11ba1606ee59dd7fb251a14e6d70f88bb1b28a95f403867c36f1f241ebd393 |
| MD5 | d3ee730a94084cae3769882d77015fa5 |
| BLAKE2b-256 | 140ef3f6db7fc31a9d7cd38ef2f1380ce3318d1426ff11f694c87b3d76543b9a |
Provenance
The following attestation bundles were made for rapidgraph-0.1.0.tar.gz:
Publisher: publish.yml on Chillthrower/rapidGraph
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rapidgraph-0.1.0.tar.gz
- Subject digest: ad11ba1606ee59dd7fb251a14e6d70f88bb1b28a95f403867c36f1f241ebd393
- Sigstore transparency entry: 1382926479
- Sigstore integration time:
- Permalink: Chillthrower/rapidGraph@6e4756195587f18821bc92574545b338ff298f8a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Chillthrower
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6e4756195587f18821bc92574545b338ff298f8a
- Trigger Event: release
File details
Details for the file rapidgraph-0.1.0-py3-none-any.whl.
File metadata
- Download URL: rapidgraph-0.1.0-py3-none-any.whl
- Upload date:
- Size: 27.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 03e12128c5bb1e485c6773621ef3b231d0cb97861810b63d74d56b2839e5dcd9 |
| MD5 | 4823473f221f5ca2e7205c4d8ce36c58 |
| BLAKE2b-256 | d8b932734779e17f37392c6625cbcb6624b6cc94edf9e6964816185e6d96ae7a |
Provenance
The following attestation bundles were made for rapidgraph-0.1.0-py3-none-any.whl:
Publisher: publish.yml on Chillthrower/rapidGraph
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: rapidgraph-0.1.0-py3-none-any.whl
- Subject digest: 03e12128c5bb1e485c6773621ef3b231d0cb97861810b63d74d56b2839e5dcd9
- Sigstore transparency entry: 1382926516
- Sigstore integration time:
- Permalink: Chillthrower/rapidGraph@6e4756195587f18821bc92574545b338ff298f8a
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Chillthrower
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6e4756195587f18821bc92574545b338ff298f8a
- Trigger Event: release