Quackling
Quackling enables document-native generative AI applications, such as RAG, based on Docling.
Features
- 🧠 Enables rich gen AI applications by providing capabilities on native document level — not just plain text / Markdown!
- ⚡️ Leverages Docling's conversion quality and speed.
- ⚙️ Integrates with standard LLM application frameworks, such as LlamaIndex, for building powerful applications like RAG.
Installation
To use Quackling, simply install it from your package manager, e.g. pip:

```sh
pip install quackling
```
Usage
Quackling offers core capabilities (`quackling.core`), as well as framework integration components, e.g. for LlamaIndex (`quackling.llama_index`). Below you find examples of both.
Basic RAG
Below you find a basic RAG pipeline using LlamaIndex.
> [!NOTE]
> To use the example as is, first run `pip install llama-index-embeddings-huggingface llama-index-llms-huggingface-api` in addition to `quackling`, to install the model integrations. Otherwise, you can set `EMBED_MODEL` & `LLM` as desired, e.g. using local models.
```python
import os

from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

from quackling.llama_index.node_parsers.hier_node_parser import HierarchicalNodeParser
from quackling.llama_index.readers.docling_reader import DoclingReader

DOCS = ["https://arxiv.org/pdf/2311.18481"]
QUERY = "What is DocQA?"

EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
LLM = HuggingFaceInferenceAPI(
    token=os.getenv("HF_TOKEN"),
    model_name="mistralai/Mixtral-8x7B-Instruct-v0.1",
)

index = VectorStoreIndex.from_documents(
    documents=DoclingReader(parse_type=DoclingReader.ParseType.JSON).load_data(DOCS),
    embed_model=EMBED_MODEL,
    transformations=[HierarchicalNodeParser()],
)
query_engine = index.as_query_engine(llm=LLM)
response = query_engine.query(QUERY)
# > DocQA is a question-answering conversational assistant [...]
```
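To run fully locally instead of via the HF Inference API, the `EMBED_MODEL` & `LLM` globals can be swapped out. A minimal sketch, assuming `llama-index-llms-ollama` is additionally installed and a local Ollama server is running with a `mistral` model pulled (the model choice here is illustrative):

```python
# Sketch of a local-model setup: requires `pip install llama-index-llms-ollama`
# and a running Ollama server with the `mistral` model pulled.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

EMBED_MODEL = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")  # runs locally
LLM = Ollama(model="mistral", request_timeout=120.0)  # served by local Ollama
```

The rest of the pipeline stays unchanged.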
Chunking
You can also use Quackling with any pipeline, i.e. independently of frameworks like LlamaIndex. For instance, to split a document into chunks based on document structure, returning pointers to the Docling document's nodes:
```python
from docling.document_converter import DocumentConverter

from quackling.core.chunkers.hierarchical_chunker import HierarchicalChunker

doc = DocumentConverter().convert_single("https://arxiv.org/pdf/2206.01062")
chunks = list(HierarchicalChunker().chunk(doc))
# > [
# >     ChunkWithMetadata(
# >         path='$.main-text[0]',
# >         text='DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis',
# >         page=1,
# >         bbox=[107.59, 672.38, 505.18, 709.08]
# >     ),
# >     [...]
# > ]
```
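The idea behind this structure-aware chunking can be illustrated with a self-contained sketch (no Docling required). The toy document dict and the `walk` helper below are illustrative only, not Quackling's actual API; they just show how each chunk keeps a JSONPath-style pointer back into the source document:

```python
# Minimal sketch of hierarchical chunking: walk a document tree and emit
# one chunk per text element, each carrying a pointer into the source.
# The dict layout and `walk` helper are hypothetical, for illustration.
doc = {
    "main-text": [
        {"type": "title", "text": "DocLayNet: A Large Human-Annotated Dataset"},
        {"type": "paragraph", "text": "Document layout analysis is ..."},
        {"type": "paragraph", "text": "We introduce a new dataset ..."},
    ]
}

def walk(doc):
    for i, item in enumerate(doc["main-text"]):
        # The pointer lets downstream consumers recover metadata
        # (page, bounding box) from the original document node.
        yield {"path": f"$.main-text[{i}]", "text": item["text"]}

chunks = list(walk(doc))
print(chunks[0]["path"])  # $.main-text[0]
```

Because each chunk points back to its source node, retrieval results can later be grounded in the original layout, as the real `ChunkWithMetadata` output above shows with its `page` and `bbox` fields.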
More examples
Check out the examples, which showcase different variants of RAG, including vector ingestion & retrieval:
- [LlamaIndex] Milvus dense-embedding RAG
- [LlamaIndex] Milvus hybrid RAG, combining dense & sparse embeddings
- [LlamaIndex] Milvus RAG, also fetching native document metadata for search results
- [LlamaIndex] Local node transformations (e.g. embeddings)
- ...
Contributing
Please read Contributing to Quackling for details.
References
If you use Quackling in your projects, please consider citing the following:
```bibtex
@software{Docling,
  author = {Deep Search Team},
  month = {7},
  title = {{Docling}},
  url = {https://github.com/DS4SD/docling},
  version = {main},
  year = {2024}
}
```
License
The Quackling codebase is released under the MIT license. For individual component usage, please refer to the component licenses found in the original packages.