MARE
Modality-Aware Retrieval Engine inspired by IRPAPERS-style multimodal retrieval.
MARE is an open-source Python library for evidence-first document retrieval.
It is inspired by the direction highlighted in the IRPAPERS paper, which shows that page-image retrieval and text retrieval have complementary failure modes on scientific documents. Instead of flattening everything into one retrieval path, MARE treats routing, retrieval, fusion, and observability as separate system concerns.
What this repo is
- A lightweight Python package that sits between a query and modality-specific indexes
- A baseline router that decides whether a query should hit text, image, layout, or a hybrid path
- A late-fusion layer that combines modality-specific scores
- An explainable debug surface that tells you why a modality was selected
What this repo is not
- Not a chatbot wrapper
- Not a full PDF parsing stack yet
- Not a claim that heuristic routing is state of the art
Why now
IRPAPERS asks a useful systems question: when should we retrieve over OCR text, page images, layout structure, or some combination? The paper reports that text-based and image-based retrieval each solve queries the other misses, and that fusion improves retrieval quality over either modality alone.
This repo turns that observation into an MVP developer layer.
Paper: https://arxiv.org/pdf/2602.17687
Architecture
query
-> router
-> modality-specific retrievers
-> text index
-> image index
-> layout index
-> fusion
-> explainable results
Current implementation choices:
- Router: keyword heuristic baseline
- Text retrieval: token-overlap cosine baseline
- Image retrieval: caption and visual-tag overlap baseline
- Layout retrieval: layout-hint overlap baseline
- Fusion: weighted late fusion
The point of v0.1 is not raw benchmark quality. It is to package the control plane cleanly enough that stronger models can drop in later.
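To make the control plane concrete, here is a minimal sketch of the two baseline pieces above: a keyword-heuristic router and weighted late fusion. This is an illustrative example, not MARE's actual code; `route_query`, `fuse`, the hint keywords, and the weights are all invented for the sketch.

```python
# Hypothetical sketch of keyword routing plus weighted late fusion.
# Names, keyword tables, and weights are illustrative, not MARE's real code.

ROUTING_HINTS = {
    "image": {"diagram", "figure", "picture", "show"},
    "layout": {"table", "column", "section"},
}

def route_query(query: str) -> list[str]:
    """Select modalities whose hint keywords overlap the query tokens."""
    tokens = set(query.lower().split())
    selected = [m for m, hints in ROUTING_HINTS.items() if tokens & hints]
    return selected or ["text"]  # fall back to the text path

def fuse(scores_by_modality: dict[str, dict[str, float]],
         weights: dict[str, float]) -> dict[str, float]:
    """Weighted late fusion: sum weighted per-modality scores per doc id."""
    fused: dict[str, float] = {}
    for modality, scores in scores_by_modality.items():
        w = weights.get(modality, 1.0)
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return fused

modalities = route_query("show me the architecture diagram of transformer")
# "show" and "diagram" hit the image hints, so modalities == ["image"]

fused = fuse(
    {"image": {"p4": 0.6, "p2": 0.1}, "text": {"p4": 0.2, "p7": 0.5}},
    weights={"image": 1.0, "text": 0.5},
)
# p4 scores 0.6 * 1.0 + 0.2 * 0.5 = 0.7 and ranks first
```

The design point is that routing and fusion stay separable: a learned router or embedding-based retrievers can replace these functions without touching the rest of the pipeline.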
Repo layout
src/mare/
engine.py
router.py
fusion.py
types.py
retrievers/
examples/
tests/
Quickstart
Clone and install:
git clone https://github.com/SaiSandeepKantareddy/MARE.git
cd MARE
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Or install directly from GitHub:
pip install "git+https://github.com/SaiSandeepKantareddy/MARE.git"
The intended package install after PyPI release is:
pip install mare-retrieval
Then use it as a library:
from mare import MAREApp
app = MAREApp.from_pdf("manual.pdf", reuse=True)
best = app.best_match("partially reinstall the set screws if they fall out")
print(best.page)
print(best.snippet)
print(best.page_image_path)
Or try the sample corpus from the CLI:
mare-demo --query "show me the architecture diagram of transformer"
Or without installing the package yet:
PYTHONPATH=src python3 -m mare.demo --query "show me the architecture diagram of transformer"
Simplest way to use it
Use one command:
python3 ask.py "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf" "partially reinstall the set screws if they fall out"
That will:
- ingest the PDF if needed
- retrieve the best matching page
- print the page number
- print the exact snippet
- print the rendered page image path
If you want to reuse a previously generated corpus:
python3 ask.py --reuse "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf" "partially reinstall the set screws if they fall out"
If the PDF filename is awkward, rename it first:
mv ./*.pdf ./manual.pdf
PYTHONPATH=src python3 ask.py ./manual.pdf "partially reinstall the set screws if they fall out"
Public Python API
The package is meant to be importable, not just runnable from scripts.
from mare import MAREApp, load_corpus, load_pdf
Create an app from a PDF:
app = load_pdf("manual.pdf", reuse=True)
hit = app.best_match("what does MagSafe 3 refer to")
Create an app from an existing JSON corpus:
app = load_corpus("generated/manual.json")
results = app.retrieve("show me the comparison table", top_k=3)
Core methods:
- MAREApp.from_pdf(...)
- MAREApp.from_corpus(...)
- MAREApp.from_documents(...)
- app.explain(query)
- app.retrieve(query)
- app.best_match(query)
Packaging and release
MARE is now structured as a regular Python package with:
- pyproject.toml metadata
- a legacy-friendly setup.py
- console entry points
- a PyPI publishing workflow
Release notes and PyPI steps live in PUBLISHING.md.
Visual demo
If you want to show this to users visually, run the Streamlit demo:
pip install -e ".[ui]"
PYTHONPATH=src streamlit run src/mare/streamlit_app.py
The demo lets a user:
- upload a PDF
- ask a question
- see the best matching page
- read the exact evidence snippet
- view the rendered page image
The technical retrieval plan is hidden under a "Debug details" expander so the default experience stays user-facing.
Ingest a real PDF
You can convert a PDF into a page-level JSON corpus and then run retrieval on it.
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
mare-ingest "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf"
mare-demo --corpus "generated/MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.json" --query "what does MagSafe 3 refer to"
Without installing the package first:
PYTHONPATH=src python3 -m mare.ingest "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf"
PYTHONPATH=src python3 -m mare.demo --corpus "generated/MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.json" --query "what does MagSafe 3 refer to"
What the ingest step does right now:
- reads each PDF page with pypdf
- renders each PDF page to generated/<pdf-name>/page-N.png
- extracts page text
- creates one retrieval document per page
- adds lightweight layout hints when terms like Table or Figure appear
- writes a JSON corpus that the retriever can search immediately
This is still a simple baseline. OCR, figure extraction, and true layout modeling are the next steps.
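For illustration, a single page-level corpus document might look like the dict below. The field names are an assumption inferred from the example output later in this README (doc_id, page, and so on), not the exact on-disk schema; check a file under generated/ for the real shape.

```python
import json

# Hypothetical shape of one page-level retrieval document.
# Field names are illustrative; inspect the generated JSON for the real schema.
page_doc = {
    "doc_id": "manual-p4",
    "page": 4,
    "text": "Partially reinstall the set screws if they fall out.",
    "image_path": "generated/manual/page-4.png",
    "layout_hints": ["figure"],  # added when terms like Table/Figure appear
}

# A corpus is then just a serializable collection of such documents.
corpus = {"documents": [page_doc]}
serialized = json.dumps(corpus)
```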
What you get back
The retriever now returns:
- the matching page number
- why that page matched
- a short exact snippet from the page text
- the rendered page image path
That makes it easier to validate whether retrieval found the right instruction and to jump to the exact page image.
Example output:
{
"query": "show me the architecture diagram of transformer",
"intent": "visual_lookup",
"selected_modalities": ["image"],
"discarded_modalities": ["text", "layout"],
"confidence": 0.8,
"rationale": "Detected modality cues in query tokens. Selected image based on keyword overlap with routing hints.",
"results": [
{
"doc_id": "paper-transformer-p4",
"title": "Attention Is All You Need",
"page": 4,
"score": 0.6,
"reason": "image:Matched visual cues: architecture, diagram, transformer"
}
]
}
Why the explainability matters
The debug surface is a core feature, not an afterthought. For production retrieval systems, we need to answer:
- Which modality did the router choose?
- Which modalities were skipped?
- Why did a page rank highly?
- What tradeoff did fusion make?
That is the wedge for MARE: make multimodal retrieval inspectable before trying to make it magical.
Local sample data
examples/sample_corpus.json contains a tiny IR-paper-style corpus so the routing and fusion path is runnable out of the box.
There is also a local PDF in this workspace:
MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf
That file can now be ingested into a JSON page corpus with mare-ingest.
Roadmap
v0.1
- text + image + layout routing
- weighted late fusion
- explainable retrieval output
- tests and runnable demo
v0.2
- pluggable embedding backends
- PDF page ingestion
- OCR and caption extraction adapters
- score normalization per modality
v0.3
- learned router
- benchmark harness for IRPAPERS-style evaluation
- cost-aware routing budgets
- reranking and cross-modal evidence aggregation
Suggested next open-source moves
- Add adapters for FAISS, Qdrant, and Weaviate
- Add page extraction from PDFs
- Add a benchmark runner that computes Recall@k per modality
- Add a small web debug UI for route inspection
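The Recall@k runner suggested above could start from something this small. This is a hedged sketch: `recall_at_k`, the rankings, and the relevance judgments are invented for the example, and a real runner would load ranked results per modality from MARE's output.

```python
def recall_at_k(ranked_doc_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k ranked list."""
    if not relevant:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & relevant)
    return hits / len(relevant)

# Per-modality rankings for one query, plus its relevance judgments.
rankings = {
    "text":  ["p7", "p4", "p1"],
    "image": ["p4", "p2", "p7"],
}
relevant = {"p4"}

per_modality = {m: recall_at_k(r, relevant, k=1) for m, r in rankings.items()}
# per_modality -> {"text": 0.0, "image": 1.0}
```

Comparing such per-modality numbers against the fused ranking is exactly the IRPAPERS-style question this repo wants a harness for: which queries does each modality solve that the other misses?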
License
MIT