docuverse

State-of-the-art Retrieval/Search engine models, including ElasticSearch, ChromaDB, Milvus, and PrimeQA


Repository for (almost) all your document search needs.

Part of the Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development.

DocUVerse is a public open-source repository that lets researchers and developers quickly experiment with various search engines (such as ElasticSearch, ChromaDB, Milvus, FAISS, and LanceDB) in both direct-search and reranking scenarios. With DocUVerse, a researcher can replicate the experiments described in recently published NLP conference papers, download pre-trained models from an online repository, and run them on custom data. DocUVerse is built on top of the Transformers and sentence-transformers toolkits and uses datasets and models that are directly downloadable from the Hugging Face Hub.

✔️ Getting Started

Install

The curated quickstart extra pulls in everything the README example needs (Milvus-Lite + sentence-transformers); no external service required.

pip install -e .[quickstart]

Other extras: elastic, chromadb, faiss, lancedb, extra (pyizumo, huggingface CLI), dev (pytest, ruff, mypy). Combine them as needed: pip install -e .[quickstart,elastic,dev].

Run the quickstart — three equivalent surfaces

The repo ships a 10-passage / 5-query toy corpus under examples/quickstart/. All three of the following produce the same ranked results.

Python — preset with overrides (the headline API):

from docuverse import SearchEngine

engine = SearchEngine.from_preset(
    "milvus-dense",
    model_name="ibm-granite/granite-embedding-small-english-r2",
    index_name="docuverse_quickstart",
    input_passages="examples/quickstart/passages.jsonl",
    input_queries="examples/quickstart/queries.jsonl",
    output_file="examples/quickstart/output.json",
)
engine.ingest(engine.read_data())
queries = engine.read_questions()
results = engine.search(queries)
print(engine.compute_score(queries, results))

Python — explicit YAML (when you want the config in version control):

from docuverse import SearchEngine

engine = SearchEngine(config_or_path="examples/quickstart/recipe.yaml")
engine.ingest(engine.read_data())
queries = engine.read_questions()
print(engine.compute_score(queries, engine.search(queries)))

CLI:

docuverse run --config examples/quickstart/recipe.yaml
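The quickstart reads its passages and queries from JSONL files. To try the same flow on your own corpus, the sketch below writes such files with the standard library only. It assumes each record needs just the id and text keys that this README names as required for in-memory records; the shipped quickstart files may carry more fields (for example relevance labels used for scoring), and the helper names write_jsonl and read_jsonl are hypothetical, not part of DocUVerse.

```python
import json
from pathlib import Path


def write_jsonl(path, records):
    """Write one JSON object per line; return the number of records written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical toy records; the id/text layout is an assumption (see above).
passages = [
    {"id": "p0", "text": "The Pacific Ocean is the largest ocean on Earth."},
    {"id": "p1", "text": "Mitochondria generate most of the cell's ATP."},
]
queries = [{"id": "q0", "text": "What is the largest ocean on Earth?"}]
```

After writing both lists with write_jsonl, point input_passages and input_queries at the resulting paths.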

In-memory documents and queries (ChromaDB)

When you don't want to write JSONL files at all — e.g. you already have a list of strings in Python — wrap each string in a small dict and hand it straight to read_data / read_questions. The example below uses the chromadb preset, which runs a local persistent ChromaDB instance on disk (no external service). Install with pip install -e .[chromadb].

from docuverse import SearchEngine

documents = [
    "Photosynthesis is the biological process by which green plants convert light into chemical energy stored in glucose.",
    "Mitochondria are membrane-bound organelles that generate most of the cell's ATP, the main currency of cellular energy.",
    "DNA is a double-helix polymer that carries the genetic instructions for the development and function of all known organisms.",
    "Newton's three laws of motion describe the relationship between a body and the forces acting on it; he formulated them in 1687.",
    "World War II was a global conflict from 1939 to 1945 between the Allies and the Axis powers.",
    "The Pacific Ocean is the largest and deepest of Earth's five oceanic divisions.",
]

queries = [
    "How do plants make energy from sunlight?",
    "What organelle produces ATP in cells?",
    "Who discovered the laws of motion?",
    "What is the largest ocean on Earth?",
]

engine = SearchEngine.from_preset(
    "chromadb",
    index_name="docuverse_readme_demo",
    top_k=3,
)

passages = [{"id": f"d{i}", "text": text} for i, text in enumerate(documents)]
engine.ingest(engine.read_data(file=passages))

question_records = [{"id": f"q{i}", "text": q} for i, q in enumerate(queries)]
search_queries = engine.read_questions(file=question_records)
results = engine.search(search_queries)

for query, result in zip(search_queries, results):
    top = result[0]
    print(f"Q: {query.text}\n  → {top.text[:80]}...\n")

The pattern generalises: any read_data / read_questions call accepts a list of dicts in place of a file path, so you can pull documents from a database, an API, or a generator without staging them on disk first. Required keys per record are id and text; title and any other fields are preserved into the index.
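As an illustration of the database case, the sketch below pulls rows from an in-memory SQLite table and shapes them into records with the required id and text keys. The table, its column names, and the rows_to_records helper are hypothetical stand-ins for whatever store you actually use; only the final, commented-out call comes from the example above.

```python
import sqlite3


def rows_to_records(rows, id_prefix="d"):
    """Turn (key, text, title) rows into dicts with the required id/text keys."""
    records = []
    for key, text, title in rows:
        if not text or not text.strip():
            continue  # skip empty bodies rather than indexing blank passages
        record = {"id": f"{id_prefix}{key}", "text": text.strip()}
        if title:
            record["title"] = title  # optional field, kept when present
        records.append(record)
    return records


# Hypothetical in-memory table standing in for a real document store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, body TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO docs (doc_id, body, title) VALUES (?, ?, ?)",
    [
        (1, "DNA is a double-helix polymer.", "DNA"),
        (2, "   ", None),
        (3, "The Pacific Ocean is the largest ocean.", None),
    ],
)
rows = conn.execute("SELECT doc_id, body, title FROM docs ORDER BY doc_id").fetchall()
conn.close()

passages = rows_to_records(rows)
print(passages)

# With DocUVerse installed, the list goes to the engine as in the example above:
# engine.ingest(engine.read_data(file=passages))
```

Filtering blank rows before ingestion is a design choice of this sketch, not a documented DocUVerse requirement.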

Discover presets

docuverse presets list --with-engine    # name + db_engine
docuverse presets show milvus-dense     # parsed config
docuverse presets dump milvus-dense > my-recipe.yaml   # copy-and-edit

In Python: SearchEngine.list_presets().

Configuration

DocUVerse looks for config files under ./config/<rel_path> (with a ./config/<basename> legacy fallback that emits one DeprecationWarning), plus operator-level $DOCUVERSE_HOME/... and per-user ~/.docuverse/... overrides. See config/README.md for the full six-tier resolver and the categorized layout (servers/, engines/, recipes/, data_formats/).
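To make the lookup idea concrete, here is a minimal sketch of a tiered resolver with a legacy basename fallback. It is not DocUVerse's actual resolver: the function name resolve_config is hypothetical, and the search roots are passed in by the caller so the sketch does not guess how $DOCUVERSE_HOME or ~/.docuverse are laid out or in what order the six tiers apply (config/README.md is the authority on both).

```python
import tempfile
import warnings
from pathlib import Path


def resolve_config(rel_path, roots):
    """Return the first existing config file for rel_path, or None.

    roots is an ordered list of directories, highest priority first.
    The order is the caller's choice here, not DocUVerse's documented one.
    """
    rel = Path(rel_path)
    for root in roots:
        candidate = Path(root) / rel
        if candidate.is_file():
            return candidate
    # Legacy fallback: look for the bare file name directly under each root.
    for root in roots:
        legacy = Path(root) / rel.name
        if legacy.is_file():
            warnings.warn(
                f"{legacy} uses the legacy flat layout; move it to {Path(root) / rel}",
                DeprecationWarning,
                stacklevel=2,
            )
            return legacy
    return None


# Demo against a throwaway directory; file names and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "config"
    (project / "recipes").mkdir(parents=True)
    (project / "recipes" / "demo.yaml").write_text("top_k: 3\n")
    (project / "old.yaml").write_text("top_k: 5\n")

    found = resolve_config("recipes/demo.yaml", [project])
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        legacy = resolve_config("recipes/old.yaml", [project])

    found_name = found.name
    legacy_name = legacy.name
    warning_categories = [w.category for w in caught]
```

In the demo, recipes/demo.yaml resolves through the categorized layout, while recipes/old.yaml is only found as a flat file and therefore triggers the DeprecationWarning.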

🔭 Learn more

Section	Description
📒 Documentation	DocUVerse API documentation and tutorials
📓 Tutorials: Jupyter Notebooks	Notebooks to get started with retrieval and reranking
🤗 Model sharing and uploading	Upload and share your fine-tuned models with the community
✅ Pull Request	DocUVerse Pull Request template
📄 Generate Documentation	How the documentation is built

❤️ DocUVerse collaborators include: Sara Rosenthal, Parul Awasthy, Scott McCarley, Jatin Ganhotra, and Radu Florian.


Release history

Version	Release date
0.1.1	Jun 8, 2026
0.1.0	Jun 8, 2026 (this version)
0.0.13	Dec 10, 2025
0.0.8	Nov 22, 2024
0.0.1	Jun 19, 2024

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

docuverse-0.1.0-py3-none-any.whl (541.3 kB), uploaded Jun 8, 2026, Python 3

File details

Details for the file docuverse-0.1.0-py3-none-any.whl.

File metadata

Download URL: docuverse-0.1.0-py3-none-any.whl
Upload date: Jun 8, 2026
Size: 541.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.5

File hashes

Hashes for docuverse-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b92a961b68cf7858f73db55808d9bd4458b0bce336d7205e2f19dbdb48101596`
MD5	`d1b9bd87084ce454bd25a59142751007`
BLAKE2b-256	`6b56ddda8d73e508299bc4c253053fd097a934931cb228ac347daebc2ee94274`
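A downloaded wheel can be checked against the SHA256 digest listed above before installing it. The sketch below uses only hashlib from the standard library; the helper names sha256_of and verify_wheel are hypothetical, and the local file path is whatever your download produced.

```python
import hashlib

# SHA256 digest listed above for docuverse-0.1.0-py3-none-any.whl.
EXPECTED_SHA256 = "b92a961b68cf7858f73db55808d9bd4458b0bce336d7205e2f19dbdb48101596"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected=EXPECTED_SHA256):
    """True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.lower()
```

Calling verify_wheel("docuverse-0.1.0-py3-none-any.whl") on the downloaded file should return True for an intact copy; pip's hash-checking mode offers the same guarantee at install time.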

