BlackMagic
Cross-lingual sparse retrieval + reasoning + GA imagination over typed
concept anchors. With the multilingual SPLADE fine-tune
(cp500/opensearch-neural-sparse-en-jp-ko),
an English-authored schema retrieves across EN / JA / KO.
Install
pip install blackmagic-retrieval
The multilingual SPLADE model (~670MB) is downloaded from HuggingFace on
first use and cached by the transformers library. For development:
git clone <repo>
pip install -e '.[dev]'
PYTHONPATH=src pytest tests/
Quickstart
from blackmagic import BlackMagic, BlackMagicConfig

bm = BlackMagic(BlackMagicConfig(
    schema_path="examples/automotive_schema.json",
    db_path=":memory:",
))

bm.ingest([
    {"text": "Toyota announced a $13.6B investment in battery production.",
     "id": "d1", "timestamp": "2026-03-01"},
    {"text": "Honda launches new EV in partnership with CATL.",
     "id": "d2", "timestamp": "2026-03-15"},
])

# Sparse retrieval with persona valence
result = bm.search("automakers investing in batteries", persona="investor")
for inf in result.infons[:5]:
    print(inf.subject, inf.predicate, inf.object, inf.confidence)

# Dempster-Shafer claim verification
v = bm.verify_claim("Toyota is aggressively investing in batteries.")
print(v.label, v.belief_supports, v.belief_refutes)

# MCTS multi-hop reasoning
m = bm.reason("Does the industry face supply risks?")
print(m.verdict, m.chains_discovered)

# GA imagination — MCTS-shaped output with dual verdicts
im = bm.imagine("What OEM–supplier partnerships might emerge?")
print(im.verdict, im.mcts_verdict)
for inf in im.imagined_infons[:5]:
    print(inf.subject, inf.predicate, inf.object,
          "fitness=", inf.fitness,
          "parents=", inf.parent_infon_ids)
Features
- Sparse retrieval via splade-tiny → typed anchor projection
- Persona valence — investor / engineer / executive / regulator / analyst
- Contrary views — invert the evidential lens at query time
- Temporal graph — NEXT edges link facts across time per shared anchor
- Constraint aggregation — cross-document infon fusion
- Dempster-Shafer claim verification (see the sketch after this list)
- Graph MCTS for multi-hop reasoning
- GA imagination (new) — query-scoped genetic algorithm that proposes
  plausible counterfactual infons scored by grammar × logic × health,
  with output isomorphic to MCTSResult
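For intuition, here is a minimal sketch of Dempster's rule of combination over a two-hypothesis frame {supports, refutes}; the function and mass layout are illustrative, not BlackMagic's internal API:

# Minimal sketch of Dempster's rule for two evidence sources.
# Keys: "S" = supports, "R" = refutes, "U" = uncommitted mass (S ∪ R).
def combine(m1: dict, m2: dict) -> dict:
    k = m1["S"] * m2["R"] + m1["R"] * m2["S"]  # mass assigned to conflict
    norm = 1.0 - k                             # Dempster normalization
    return {
        "S": (m1["S"] * m2["S"] + m1["S"] * m2["U"] + m1["U"] * m2["S"]) / norm,
        "R": (m1["R"] * m2["R"] + m1["R"] * m2["U"] + m1["U"] * m2["R"]) / norm,
        "U": (m1["U"] * m2["U"]) / norm,
    }

doc1 = {"S": 0.6, "R": 0.1, "U": 0.3}  # strong supporting evidence
doc2 = {"S": 0.4, "R": 0.2, "U": 0.4}  # weaker, mixed evidence
print(combine(doc1, doc2))  # -> S ≈ 0.71, R ≈ 0.14, U ≈ 0.14

A verdict like verify_claim's belief_supports / belief_refutes corresponds to the fused "S" / "R" masses under this kind of combination.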
When to use cognition vs BlackMagic
|  | cognition | BlackMagic |
|---|---|---|
| Languages | EN + JA/KO/ZH/... | EN default, EN/JA/KO via multilingual flag |
| Encoder | splade-tiny or multilingual XLM-R | splade-tiny bundled; any HF SPLADE via config |
| Structural analysis (Kano, Kan, etc.) | Yes | No |
| Category theory extensions | Yes | No |
| Cloud backend (DynamoDB, Lambda) | Yes | No |
| MCP / agent tooling | Yes | No |
| GA imagination | No | Yes |
| Line count | ~5,700 | ~3,900 |
Multilingual (EN / JA / KO)
Set multilingual=True to use the fine-tuned
cp500/opensearch-neural-sparse-en-jp-ko
model. A Japanese or Korean sentence activates the same English anchor
positions as its English parallel — you write your schema once, in English,
and ingestion / search / verify / imagine all work across all three languages.
cfg = BlackMagicConfig(
    schema_path="examples/automotive_schema.json",
    model_name="cp500/opensearch-neural-sparse-en-jp-ko",
    multilingual=True,           # self-encode anchors + exclusivity filter
    activation_threshold=0.25,   # looser than splade-tiny's 0.3
    min_confidence=0.15,
)
bm = BlackMagic(cfg)

bm.ingest([
    {"id": "en1", "text": "Chevron announced a $15B investment in the Permian Basin.",
     "timestamp": "2026-04-01"},
    {"id": "ja1", "text": "シェブロンはパーミアン盆地で150億ドルの投資を発表した。",
     "timestamp": "2026-04-01"},
    {"id": "ko1", "text": "셰브런은 퍼미안 분지에 150억 달러 규모의 투자를 발표했다.",
     "timestamp": "2026-04-01"},
])
# All three docs produce infons; a JA query can retrieve KO evidence and vice versa.
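As a usage sketch, a Japanese query over this corpus (the query string is illustrative, not part of the shipped examples):

# JA query; matching infons can come from the EN and KO documents above
result = bm.search("シェブロンの大規模投資")  # "Chevron's large-scale investment"
for inf in result.infons[:5]:
    print(inf.subject, inf.predicate, inf.object, inf.confidence)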
How it works. The multilingual model doesn't fire on literal English token IDs; it expands every anchor string (including JA/KO parallels) into the same multilingual subword soup (Latin + CJK + Cyrillic + Arabic). On init, BlackMagic self-encodes each anchor's surface forms through SPLADE, keeps its top-K expansion positions, and subtracts positions that also activate for another same-type anchor (crosstalk filter).
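A minimal sketch of that init-time step, with hypothetical names (ANCHORS, TOP_K, encode) standing in for BlackMagic's internals; encode(text) is assumed to return a |vocab|-sized SPLADE activation vector:

import numpy as np

# Hypothetical: anchor name -> (anchor type, surface forms incl. JA/KO parallels)
ANCHORS = {
    "investment":  ("event", ["investment", "投資", "투자"]),
    "partnership": ("event", ["partnership", "提携", "제휴"]),
}
TOP_K = 64  # expansion positions kept per anchor

def expansion_positions(surfaces, encode, top_k=TOP_K):
    """Union of surface-form activations, truncated to the top-K positions."""
    agg = sum(encode(s) for s in surfaces)  # elementwise sum of activation vectors
    return {int(i) for i in np.argsort(agg)[-top_k:]}

def build_anchor_index(anchors, encode):
    """Self-encode each anchor, then subtract positions that also activate
    for another anchor of the same type (the crosstalk filter)."""
    raw = {name: expansion_positions(forms, encode)
           for name, (_, forms) in anchors.items()}
    filtered = {}
    for name, (atype, _) in anchors.items():
        crosstalk = set()
        for other, (otype, _) in anchors.items():
            if other != name and otype == atype:
                crosstalk |= raw[other]
        filtered[name] = raw[name] - crosstalk
    return filtered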
Benchmark. On 200 held-out concepts × 9 language pairs (1,800 query×passage
pairs) from cp500/multilingual-concept-training-kit:
| Pair | MRR@10 | Recall@10 |
|---|---|---|
| en→en | 1.000 | 1.000 |
| en→ja | 0.995 | 1.000 |
| ja→en | 0.998 | 1.000 |
| ko→en | 0.995 | 1.000 |
| ko→ko | 0.998 | 1.000 |
| OVERALL | 0.996 | 1.000 |
EN-vocab ratio on top-50 dims: en 0.57, ja 0.55, ko 0.55 — JA/KO queries project into English-like vocab positions at ~97% of English's rate.
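For reference, MRR@10 as reported above uses the standard definition (this helper is illustrative, not from the repo):

def mrr_at_10(ranked_doc_ids, relevant_id):
    """Reciprocal rank of the first relevant hit in the top 10, else 0."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# The table's MRR@10 is the mean of this value over all 1,800 query×passage pairs.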
Testing
# splade-tiny only (fast, CI-friendly)
PYTHONPATH=src pytest tests/
# including multilingual integration (downloads the ~670MB model)
PYTHONPATH=src RUN_ML_TESTS=1 pytest tests/
# kit-scale retrieval benchmark
PYTHONPATH=src python examples/benchmark_multilingual.py
License
Apache 2.0.