Context-Fabric
A graph-based corpus engine for annotated text with efficient traversal and search.
Overview
Context-Fabric models an annotated text corpus as a graph of nodes, edges, and features. It supports efficient navigation, feature lookup, and pattern-based search across large textual datasets.
Forked from Dirk Roorda's Text-Fabric.
Installation
pip install context-fabric
Quick Start
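from cfabric.core import Fabric

# Load a dataset and the features to work with
CF = Fabric(locations='path/to/data')
api = CF.load('feature1 feature2')

# Walk all nodes in canonical order and print a feature value for each
for node in api.N.walk():
    print(api.F.feature1.v(node))

# Use locality on a node (here, the last node visited by the loop above)
embedders = api.L.u(node)  # nodes that embed `node`
embedded = api.L.d(node)   # nodes embedded in `node`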
Core API
- N (Nodes) - Walk through nodes in canonical order
- F (Features) - Access node feature values
- E (Edges) - Access edge feature values
- L (Locality) - Navigate between related nodes
- T (Text) - Retrieve text representations
- S (Search) - Search using templates
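To make the N, F, and L entries more concrete, here is a minimal toy sketch of the kind of model they operate on. It is not Context-Fabric code: the `ToyCorpus` class, its attributes, and the ordering rule used in `walk` are all assumptions made for illustration, and the real engine may define canonical order and locality differently.

```python
class ToyCorpus:
    """Toy stand-in for the N / F / L ideas (hypothetical, not the real API)."""

    def __init__(self, words, spans):
        # Word nodes are numbered 1..len(words); each covers one position.
        self.text = {i + 1: w for i, w in enumerate(words)}
        self.otype = {n: "word" for n in self.text}
        self.cover = {n: (n, n) for n in self.text}
        # Larger units get the next free node numbers and cover a word range.
        for otype, first, last in spans:
            node = len(self.cover) + 1
            self.otype[node] = otype
            self.cover[node] = (first, last)

    def walk(self):
        # Assumed ordering: by first position, larger units before smaller ones.
        return sorted(self.cover, key=lambda n: (self.cover[n][0], -self.cover[n][1]))

    def up(self, node):
        # Nodes whose range contains the range of `node` (its embedders).
        a, b = self.cover[node]
        return [m for m in self.walk()
                if m != node and self.cover[m][0] <= a and b <= self.cover[m][1]]

    def down(self, node):
        # Nodes whose range lies inside the range of `node` (what it embeds).
        a, b = self.cover[node]
        return [m for m in self.walk()
                if m != node and a <= self.cover[m][0] and self.cover[m][1] <= b]


corpus = ToyCorpus(
    ["In", "the", "beginning", "God", "created"],
    [("phrase", 1, 3), ("phrase", 4, 5), ("clause", 1, 5)],
)

if __name__ == "__main__":
    for node in corpus.walk():
        print(node, corpus.otype[node], corpus.text.get(node, ""))
    print(corpus.up(2), corpus.down(6))
```

In this toy, a feature is just a dictionary from node number to value (`otype`, `text`), and locality is computed from the word ranges that each node covers.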
Performance
Context-Fabric stores compiled data as memory-mapped numpy arrays. Compared with Text-Fabric's pickle-based caching, this loads faster and uses less memory.
Benchmarks (BHSA Hebrew Bible corpus — 1.4M nodes, 109 features)
| Metric | Text-Fabric | Context-Fabric | Change |
|---|---|---|---|
| Load Time | 7.9s | 0.7s | 11x faster |
| Memory Usage | 6.3 GB | 305 MB | 95% reduction |
| Compile Time | 8s | 91s | 11x slower (one-time cost) |
| Cache Size | 138 MB | 859 MB | 6x larger |
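The ratios in the table can be reproduced from the raw figures. The short sketch below does that arithmetic and also estimates how many loads it takes for the slower compile step to pay for itself. The variable names are made up for this example, 1 GB is taken as 1000 MB, and the break-even estimate assumes a single compile and counts load time only.

```python
import math

# Figures copied from the benchmark table (6.3 GB taken as 6300 MB, an assumption).
TEXT_FABRIC = {"load_s": 7.9, "memory_mb": 6300, "compile_s": 8, "cache_mb": 138}
CONTEXT_FABRIC = {"load_s": 0.7, "memory_mb": 305, "compile_s": 91, "cache_mb": 859}


def ratio(a, b):
    """How many times larger a is than b."""
    return a / b


def reduction_pct(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100 * (1 - after / before)


load_speedup = ratio(TEXT_FABRIC["load_s"], CONTEXT_FABRIC["load_s"])
memory_saving = reduction_pct(TEXT_FABRIC["memory_mb"], CONTEXT_FABRIC["memory_mb"])
cache_growth = ratio(CONTEXT_FABRIC["cache_mb"], TEXT_FABRIC["cache_mb"])

# Extra one-time compile cost versus time saved on every load.
extra_compile_s = CONTEXT_FABRIC["compile_s"] - TEXT_FABRIC["compile_s"]
saved_per_load_s = TEXT_FABRIC["load_s"] - CONTEXT_FABRIC["load_s"]
break_even_loads = math.ceil(extra_compile_s / saved_per_load_s)

if __name__ == "__main__":
    print(f"load speedup:  {load_speedup:.1f}x")
    print(f"memory saving: {memory_saving:.1f}%")
    print(f"cache growth:  {cache_growth:.1f}x")
    print(f"break-even after {break_even_loads} loads")
```

With these figures, the extra 83 seconds of compile time is recovered after about 12 loads.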
The key insight: compilation happens once, but loading happens every session. Context-Fabric trades a higher one-time compile cost and a larger cache for faster loading and lower memory use in every session:
- Memory-mapped arrays: Data stays on disk and is accessed on demand
- Efficient sparse iteration: Uses numpy vectorized operations instead of Python loops
- Cached materialization: Dictionary views computed once per session
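The first and last points above can be sketched with the standard library alone. This is a stand-in for the idea, not Context-Fabric's implementation: the project uses numpy memory-mapped arrays, while this sketch uses `mmap` and `array`, and the file layout, the `compile_feature` function, and the `MappedFeature` class are hypothetical.

```python
import mmap
import os
import tempfile
from array import array
from functools import cached_property

ITEM = "i"  # native C int; the on-disk format is an assumption of this sketch


def compile_feature(path, values):
    """One-time 'compile' step: store the values as a flat binary file."""
    with open(path, "wb") as fh:
        array(ITEM, values).tofile(fh)


class MappedFeature:
    """Read-only feature backed by a memory-mapped file (hypothetical class)."""

    def __init__(self, path):
        self._fh = open(path, "rb")
        # Mapping the file does not read it all; pages are loaded when touched.
        self._mm = mmap.mmap(self._fh.fileno(), 0, access=mmap.ACCESS_READ)
        self._raw = memoryview(self._mm)
        self._view = self._raw.cast(ITEM)

    def __len__(self):
        return len(self._view)

    def v(self, node):
        # Nodes are numbered from 1, so node n lives at index n - 1.
        return self._view[node - 1]

    @cached_property
    def as_dict(self):
        # Dictionary view built on first access, then reused by this object.
        return {n: val for n, val in enumerate(self._view, start=1)}

    def close(self):
        # Release the views before closing the mapping they export.
        self._view.release()
        self._raw.release()
        self._mm.close()
        self._fh.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "feature1.bin")
        compile_feature(path, [10, 20, 30, 40])
        feat = MappedFeature(path)
        print(feat.v(3), len(feat))
        print(feat.as_dict is feat.as_dict)
        feat.close()
```

Looking up a single value touches only the page that holds it, while `as_dict` pays the full materialization cost once and then serves the cached result.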
Parallel Worker Scaling
Memory-mapped arrays enable efficient parallel processing. Multiple workers share the same mmap'd data instead of each loading a full copy into RAM.
Spawn mode (cold start — each worker loads independently):
| Metric | Text-Fabric | Context-Fabric | Savings |
|---|---|---|---|
| Total (4 workers) | 7.7 GB | 1.3 GB | 84% less |
| Per worker | 1.9 GB | 315 MB | 6x less |
Fork mode (API scenario — pre-load then fork workers):
| Metric | Text-Fabric | Context-Fabric | Savings |
|---|---|---|---|
| Total (4 workers) | 6.3 GB | 398 MB | 94% less |
| Per worker | 1.6 GB | 99 MB | 16x less |
Memory was measured as total resident set size (RSS) after loading from cache on the BHSA corpus.
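The fork scenario can be illustrated with a small standard-library sketch: the parent maps a file once, then forks workers that read from the inherited mapping instead of loading their own copy. This is not taken from Context-Fabric or its benchmark script. The function names and file layout are hypothetical, and the `fork` start method is only available on POSIX systems.

```python
import mmap
import multiprocessing as mp
import os
import tempfile
from array import array

ITEM = "i"  # native C int; the on-disk format is an assumption of this sketch


def write_ints(path, values):
    """Store the values as a flat binary file."""
    with open(path, "wb") as fh:
        array(ITEM, values).tofile(fh)


def _sum_slice(mm, start, stop, conn):
    # Runs in a forked child: `mm` is the parent's mapping, inherited as-is.
    view = memoryview(mm).cast(ITEM)
    conn.send(sum(view[start:stop]))
    conn.close()


def parallel_sum(path, n_workers=4):
    """Map the file once in the parent, then let forked workers sum slices."""
    ctx = mp.get_context("fork")  # POSIX only
    with open(path, "rb") as fh:
        mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    n_items = len(mm) // array(ITEM).itemsize
    bounds = [n_items * k // n_workers for k in range(n_workers + 1)]

    jobs = []
    for k in range(n_workers):
        recv_end, send_end = ctx.Pipe(duplex=False)
        proc = ctx.Process(
            target=_sum_slice, args=(mm, bounds[k], bounds[k + 1], send_end)
        )
        proc.start()
        send_end.close()  # the child keeps its own copy of the sending end
        jobs.append((proc, recv_end))

    total = 0
    for proc, recv_end in jobs:
        total += recv_end.recv()
        proc.join()
    mm.close()
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "values.bin")
        write_ints(path, range(1000))
        print(parallel_sum(path, n_workers=4))
```

Each worker reads only its own slice, and the pages it touches come from the same file-backed mapping that the parent created.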
Run the benchmark yourself:
python benchmarks/compare_performance.py --source path/to/tf/data --workers 4
Testing
See TESTING.md for how to run tests.
Authors
- Cody Kingham
- Dirk Roorda
License
MIT License - see LICENSE for details.