
BlackMagic

Cross-lingual sparse retrieval + reasoning + GA imagination over typed concept anchors. Default encoder is a multilingual SPLADE fine-tune (cp500/opensearch-neural-sparse-en-jp-ko) so an English-authored schema retrieves across EN / JA / KO.

Install

pip install blackmagic-retrieval

The multilingual SPLADE model (~670MB) is downloaded from HuggingFace on first use and cached by the transformers library. For development:

git clone <repo>
pip install -e '.[dev]'
PYTHONPATH=src pytest tests/

Quickstart

from blackmagic import BlackMagic, BlackMagicConfig

bm = BlackMagic(BlackMagicConfig(
    schema_path="examples/automotive_schema.json",
    db_path=":memory:",
))

bm.ingest([
    {"text": "Toyota announced a $13.6B investment in battery production.",
     "id": "d1", "timestamp": "2026-03-01"},
    {"text": "Honda launches new EV in partnership with CATL.",
     "id": "d2", "timestamp": "2026-03-15"},
])

# Sparse retrieval with persona valence
result = bm.search("automakers investing in batteries", persona="investor")
for inf in result.infons[:5]:
    print(inf.subject, inf.predicate, inf.object, inf.confidence)

# Dempster-Shafer claim verification
v = bm.verify_claim("Toyota is aggressively investing in batteries.")
print(v.label, v.belief_supports, v.belief_refutes)

# MCTS multi-hop reasoning
m = bm.reason("Does the industry face supply risks?")
print(m.verdict, m.chains_discovered)

# GA imagination — MCTS-shaped output with dual verdicts
im = bm.imagine("What OEM–supplier partnerships might emerge?")
print(im.verdict, im.mcts_verdict)
for inf in im.imagined_infons[:5]:
    print(inf.subject, inf.predicate, inf.object,
          "fitness=", inf.fitness,
          "parents=", inf.parent_infon_ids)
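The GA step behind `imagine` can be pictured with a minimal sketch: candidate infons carry a multiplicative grammar × logic × health fitness, the fittest parents are selected, and a child infon inherits fields from each parent (hence `parent_infon_ids` above). All names, scores, and the crossover rule here are illustrative, not the library's internals.

```python
# Toy GA step: multiplicative fitness, top-2 selection, one-point crossover.
# Everything below is a hedged illustration, not BlackMagic's actual API.

def fitness(scores):
    g, l, h = scores
    return g * l * h  # any near-zero component vetoes the candidate

def crossover(a, b):
    # Child takes subject/predicate from one parent, object from the other.
    return {"subject": a["subject"], "predicate": a["predicate"],
            "object": b["object"], "parents": [a["id"], b["id"]]}

population = [
    {"id": "i1", "subject": "Toyota", "predicate": "partners_with",
     "object": "Panasonic", "scores": (0.9, 0.8, 0.9)},
    {"id": "i2", "subject": "Honda", "predicate": "partners_with",
     "object": "CATL", "scores": (0.7, 0.9, 0.8)},
    {"id": "i3", "subject": "Ford", "predicate": "acquires",
     "object": "a startup", "scores": (0.3, 0.4, 0.5)},
]
parents = sorted(population, key=lambda i: fitness(i["scores"]), reverse=True)[:2]
child = crossover(parents[0], parents[1])
```

The multiplicative form matters: a candidate that is fluent but logically inconsistent (or unhealthy for the graph) scores near zero regardless of its other components.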

Features

  • Sparse retrieval via splade-tiny → typed anchor projection
  • Persona valence — investor / engineer / executive / regulator / analyst
  • Contrary views — invert the evidential lens at query time
  • Temporal graph — NEXT edges link facts across time per shared anchor
  • Constraint aggregation — cross-document infon fusion
  • Dempster-Shafer claim verification
  • Graph MCTS for multi-hop reasoning
  • GA imagination (new) — query-scoped genetic algorithm that proposes plausible counterfactual infons scored by grammar × logic × health, with output isomorphic to MCTSResult
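To make the Dempster-Shafer feature concrete, here is a minimal sketch of Dempster's rule over the frame {supports, refutes}, with mass also assignable to "uncertain" (the whole frame). The mass values and function names are illustrative, not BlackMagic's internal representation.

```python
# Hedged sketch: fusing two documents' evidence masses for a claim via
# Dempster's rule. Keys: "S" supports, "R" refutes, "T" uncertain (Theta).

def combine(m1, m2):
    # Conflict mass: one source supports while the other refutes.
    k = m1["S"] * m2["R"] + m1["R"] * m2["S"]
    if k >= 1.0:
        raise ValueError("total conflict: sources fully contradict")
    norm = 1.0 - k
    s = (m1["S"] * m2["S"] + m1["S"] * m2["T"] + m1["T"] * m2["S"]) / norm
    r = (m1["R"] * m2["R"] + m1["R"] * m2["T"] + m1["T"] * m2["R"]) / norm
    return {"S": s, "R": r, "T": m1["T"] * m2["T"] / norm}

doc1 = {"S": 0.6, "R": 0.1, "T": 0.3}  # one document leans supporting
doc2 = {"S": 0.5, "R": 0.2, "T": 0.3}  # a second, weaker corroboration
fused = combine(doc1, doc2)
```

Two weakly supporting sources fuse into a belief stronger than either alone, while uncertainty shrinks, which is why `verify_claim` reports separate `belief_supports` and `belief_refutes` rather than a single score.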

When to use cognition vs BlackMagic

                                        cognition                       BlackMagic
Languages                               EN + JA/KO/ZH/...               EN default; EN/JA/KO via multilingual flag
Encoder                                 splade-tiny or multilingual     splade-tiny bundled; any HF SPLADE via config
                                        XLM-R
Structural analysis (Kano, Kan, etc.)   Yes                             No
Category theory extensions              Yes                             No
Cloud backend (DynamoDB, Lambda)        Yes                             No
MCP / agent tooling                     Yes                             No
GA imagination                          No                              Yes
Line count                              ~5,700                          ~3,900

Multilingual (EN / JA / KO)

Set multilingual=True to use the fine-tuned cp500/opensearch-neural-sparse-en-jp-ko model. A Japanese or Korean sentence activates the same English anchor positions as its English parallel — you write your schema once, in English, and ingestion / search / verify / imagine all work across all three languages.

cfg = BlackMagicConfig(
    schema_path="examples/automotive_schema.json",
    model_name="cp500/opensearch-neural-sparse-en-jp-ko",
    multilingual=True,         # self-encode anchors + exclusivity filter
    activation_threshold=0.25, # looser than splade-tiny's 0.3
    min_confidence=0.15,
)
bm = BlackMagic(cfg)

bm.ingest([
    {"id": "en1", "text": "Chevron announced a $15B investment in the Permian Basin.",
     "timestamp": "2026-04-01"},
    {"id": "ja1", "text": "シェブロンはパーミアン盆地で150億ドルの投資を発表した。",
     "timestamp": "2026-04-01"},
    {"id": "ko1", "text": "셰브런은 퍼미안 분지에 150억 달러 규모의 투자를 발표했다.",
     "timestamp": "2026-04-01"},
])
# All three docs produce infons; a JA query can retrieve KO evidence and vice versa.

How it works. The multilingual model doesn't fire on literal English token IDs; it expands every anchor string (including JA/KO parallels) into the same multilingual subword soup (Latin + CJK + Cyrillic + Arabic). On init, BlackMagic self-encodes each anchor's surface forms through SPLADE, keeps its top-K expansion positions, and subtracts positions that also activate for another same-type anchor (crosstalk filter).

Benchmark. On 200 held-out concepts × 9 language pairs (1,800 query×passage pairs) from cp500/multilingual-concept-training-kit:

          MRR@10    Recall@10
en→en     1.000     1.000
en→ja     0.995     1.000
ja→en     0.998     1.000
ko→en     0.995     1.000
ko→ko     0.998     1.000
OVERALL   0.996     1.000

EN-vocab ratio on top-50 dims: en 0.57, ja 0.55, ko 0.55 — JA/KO queries project into English-like vocab positions at ~97% of English's rate.
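The EN-vocab ratio can be measured roughly as the share of a query's top-activated subword tokens that are Latin-script. The token list below is illustrative; a real measurement would decode the model's top-50 dimension IDs through its tokenizer.

```python
# Hedged sketch: fraction of top-activated subword tokens in Latin script
# (code points below U+0250, i.e. basic Latin plus Latin extensions).

def en_vocab_ratio(top_tokens):
    latin = [t for t in top_tokens if all(ord(c) < 0x250 for c in t)]
    return len(latin) / len(top_tokens)

# Illustrative top tokens for a Japanese query about battery investment:
ja_query_top = ["invest", "toyota", "電池", "battery", "announce"]
ratio = en_vocab_ratio(ja_query_top)
```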

Testing

# splade-tiny only (fast, CI-friendly)
PYTHONPATH=src pytest tests/

# including multilingual integration (downloads ~670MB model)
PYTHONPATH=src RUN_ML_TESTS=1 pytest tests/

# kit-scale retrieval benchmark
PYTHONPATH=src python examples/benchmark_multilingual.py

License

Apache 2.0.
