MARE
Modality-Aware Retrieval Engine inspired by IRPAPERS-style multimodal retrieval.
MARE is an open-source Python library for evidence-first document retrieval.
It is inspired by the direction highlighted in the IRPAPERS paper, which shows that page-image retrieval and text retrieval have complementary failure modes on scientific documents. Instead of flattening everything into one retrieval path, MARE treats routing, retrieval, fusion, and observability as separate system concerns.
What this repo is
- A lightweight Python package that sits between a query and modality-specific indexes
- A baseline router that decides whether a query should hit text, image, layout, or a hybrid path
- A late-fusion layer that combines modality-specific scores
- An explainable debug surface that tells you why a modality was selected
What this repo is not
- Not a chatbot wrapper
- Not a full PDF parsing stack yet
- Not a claim that heuristic routing is state of the art
Why now
IRPAPERS asks a useful systems question: when should we retrieve over OCR text, page images, layout structure, or some combination? The paper reports that text-based and image-based retrieval each solve queries the other misses, and that fusion improves retrieval quality over either modality alone.
This repo turns that observation into an MVP developer layer.
Paper: https://arxiv.org/pdf/2602.17687
Architecture
query
-> router
-> modality-specific retrievers
-> text index
-> image index
-> layout index
-> fusion
-> explainable results
Current implementation choices:
- Router: keyword heuristic baseline
- Text retrieval: token-overlap cosine baseline
- Image retrieval: caption and visual-tag overlap baseline
- Layout retrieval: layout-hint overlap baseline
- Fusion: weighted late fusion
The point of v0.1 is not raw benchmark quality. It is to package the control plane cleanly enough that stronger models can drop in later.
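To make the control plane concrete, here is a minimal sketch of the two baseline pieces above: a keyword-heuristic router and weighted late fusion. This is an illustrative example, not MARE's actual code; `route_query`, `fuse`, the hint keywords, and the weights are all invented for the sketch.

```python
# Hypothetical sketch of keyword routing plus weighted late fusion.
# Names, keyword tables, and weights are illustrative, not MARE's real code.

ROUTING_HINTS = {
    "image": {"diagram", "figure", "picture", "show"},
    "layout": {"table", "column", "section"},
}

def route_query(query: str) -> list[str]:
    """Select modalities whose hint keywords overlap the query tokens."""
    tokens = set(query.lower().split())
    selected = [m for m, hints in ROUTING_HINTS.items() if tokens & hints]
    return selected or ["text"]  # fall back to the text path

def fuse(scores_by_modality: dict[str, dict[str, float]],
         weights: dict[str, float]) -> dict[str, float]:
    """Weighted late fusion: sum weighted per-modality scores per doc id."""
    fused: dict[str, float] = {}
    for modality, scores in scores_by_modality.items():
        w = weights.get(modality, 1.0)
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return fused

modalities = route_query("show me the architecture diagram of transformer")
# "show" and "diagram" hit the image hints, so modalities == ["image"]

fused = fuse(
    {"image": {"p4": 0.6, "p2": 0.1}, "text": {"p4": 0.2, "p7": 0.5}},
    weights={"image": 1.0, "text": 0.5},
)
# p4 scores 0.6 * 1.0 + 0.2 * 0.5 = 0.7 and ranks first
```

The design point is that routing and fusion stay separable: a learned router or embedding-based retrievers can replace these functions without touching the rest of the pipeline.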
Repo layout
src/mare/
engine.py
router.py
fusion.py
types.py
retrievers/
examples/
tests/
Quickstart
Clone and install:
git clone https://github.com/SaiSandeepKantareddy/MARE.git
cd MARE
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
Or install directly from GitHub:
pip install "git+https://github.com/SaiSandeepKantareddy/MARE.git"
The intended package install after PyPI release is:
pip install mare-retrieval
Then use it as a library:
from mare import MAREApp
app = MAREApp.from_pdf("manual.pdf", reuse=True)
best = app.best_match("partially reinstall the set screws if they fall out")
print(best.page)
print(best.snippet)
print(best.page_image_path)
Or try the sample corpus from the CLI:
mare-demo --query "show me the architecture diagram of transformer"
Or without installing the package yet:
PYTHONPATH=src python3 -m mare.demo --query "show me the architecture diagram of transformer"
Simplest way to use it
Use one command:
python3 ask.py "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf" "partially reinstall the set screws if they fall out"
That will:
- ingest the PDF if needed
- retrieve the best matching page
- print the page number
- print the exact snippet
- print the rendered page image path
If you want to reuse a previously generated corpus:
python3 ask.py --reuse "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf" "partially reinstall the set screws if they fall out"
If the PDF filename is awkward, rename it first:
mv ./*.pdf ./manual.pdf
PYTHONPATH=src python3 ask.py ./manual.pdf "partially reinstall the set screws if they fall out"
Public Python API
The package is meant to be importable, not just runnable from scripts.
from mare import MAREApp, load_corpus, load_pdf
Create an app from a PDF:
app = load_pdf("manual.pdf", reuse=True)
hit = app.best_match("what does MagSafe 3 refer to")
Create an app from an existing JSON corpus:
app = load_corpus("generated/manual.json")
results = app.retrieve("show me the comparison table", top_k=3)
Core methods:
- MAREApp.from_pdf(...)
- MAREApp.from_corpus(...)
- MAREApp.from_documents(...)
- app.explain(query)
- app.retrieve(query)
- app.best_match(query)
Packaging and release
MARE is now structured as a regular Python package with:
- pyproject.toml metadata
- a legacy-friendly setup.py
- console entry points
- a PyPI publishing workflow
Release notes and PyPI steps live in PUBLISHING.md.
Visual demo
If you want to show this to users visually, run the Streamlit demo:
pip install -e ".[ui]"
PYTHONPATH=src streamlit run src/mare/streamlit_app.py
The demo lets a user:
- upload a PDF
- ask a question
- see the best matching page
- read the exact evidence snippet
- view the rendered page image
The technical retrieval plan is hidden under a "Debug details" expander so the default experience stays user-facing.
Ingest a real PDF
You can convert a PDF into a page-level JSON corpus and then run retrieval on it.
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
mare-ingest "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf"
mare-demo --corpus "generated/MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.json" --query "what does MagSafe 3 refer to"
Without installing the package first:
PYTHONPATH=src python3 -m mare.ingest "MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf"
PYTHONPATH=src python3 -m mare.demo --corpus "generated/MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.json" --query "what does MagSafe 3 refer to"
What the ingest step does right now:
- reads each PDF page with pypdf
- renders each PDF page to generated/<pdf-name>/page-N.png
- extracts page text
- creates one retrieval document per page
- adds lightweight layout hints when terms like Table or Figure appear
- writes a JSON corpus that the retriever can search immediately
This is still a simple baseline. OCR, figure extraction, and true layout modeling are the next steps.
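For illustration, a single page-level corpus document might look like the dict below. The field names are an assumption inferred from the example output later in this README (doc_id, page, and so on), not the exact on-disk schema; check a file under generated/ for the real shape.

```python
import json

# Hypothetical shape of one page-level retrieval document.
# Field names are illustrative; inspect the generated JSON for the real schema.
page_doc = {
    "doc_id": "manual-p4",
    "page": 4,
    "text": "Partially reinstall the set screws if they fall out.",
    "image_path": "generated/manual/page-4.png",
    "layout_hints": ["figure"],  # added when terms like Table/Figure appear
}

# A corpus is then just a serializable collection of such documents.
corpus = {"documents": [page_doc]}
serialized = json.dumps(corpus)
```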
What you get back
The retriever now returns:
- the matching page number
- why that page matched
- a short exact snippet from the page text
- the rendered page image path
That makes it easier to validate whether retrieval found the right instruction and to jump to the exact page image.
Example output:
{
"query": "show me the architecture diagram of transformer",
"intent": "visual_lookup",
"selected_modalities": ["image"],
"discarded_modalities": ["text", "layout"],
"confidence": 0.8,
"rationale": "Detected modality cues in query tokens. Selected image based on keyword overlap with routing hints.",
"results": [
{
"doc_id": "paper-transformer-p4",
"title": "Attention Is All You Need",
"page": 4,
"score": 0.6,
"reason": "image:Matched visual cues: architecture, diagram, transformer"
}
]
}
Why the explainability matters
The debug surface is a core feature, not an afterthought. For production retrieval systems, we need to answer:
- Which modality did the router choose?
- Which modalities were skipped?
- Why did a page rank highly?
- What tradeoff did fusion make?
That is the wedge for MARE: make multimodal retrieval inspectable before trying to make it magical.
Local sample data
examples/sample_corpus.json contains a tiny IR-paper-style corpus so the routing and fusion path is runnable out of the box.
There is also a local PDF in this workspace:
MacBook Pro (14-inch, M5 Pro or M5 Max) MagSafe 3 Board - Apple Support.pdf
That file can now be ingested into a JSON page corpus with mare-ingest.
Roadmap
v0.1
- text + image + layout routing
- weighted late fusion
- explainable retrieval output
- tests and runnable demo
v0.2
- pluggable embedding backends
- PDF page ingestion
- OCR and caption extraction adapters
- score normalization per modality
v0.3
- learned router
- benchmark harness for IRPAPERS-style evaluation
- cost-aware routing budgets
- reranking and cross-modal evidence aggregation
Suggested next open-source moves
- Add adapters for FAISS, Qdrant, and Weaviate
- Add page extraction from PDFs
- Add a benchmark runner that computes Recall@k per modality
- Add a small web debug UI for route inspection
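The Recall@k runner suggested above could start from something this small. This is a hedged sketch: `recall_at_k`, the rankings, and the relevance judgments are invented for the example, and a real runner would load ranked results per modality from MARE's output.

```python
def recall_at_k(ranked_doc_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k ranked list."""
    if not relevant:
        return 0.0
    hits = len(set(ranked_doc_ids[:k]) & relevant)
    return hits / len(relevant)

# Per-modality rankings for one query, plus its relevance judgments.
rankings = {
    "text":  ["p7", "p4", "p1"],
    "image": ["p4", "p2", "p7"],
}
relevant = {"p4"}

per_modality = {m: recall_at_k(r, relevant, k=1) for m, r in rankings.items()}
# per_modality -> {"text": 0.0, "image": 1.0}
```

Comparing such per-modality numbers against the fused ranking is exactly the IRPAPERS-style question this repo wants a harness for: which queries does each modality solve that the other misses?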
License
MIT