Framework for building LLM-based apps in Boehringer Ingelheim
tangbao
This is a Python package that helps you create LLM-based Streamlit apps at Boehringer.
"tangbao" is Chinese for "soup dumpling", a type of dim sum popular in Shanghai. It is made by filling a dumpling with meat and a mixture of ice and lard. When the dumpling steams, the frozen part melts and imparts the "tang" to the "bao". The word "bao" is also the name for a package or library in coding.
Contact Steven Brooks or Pietro Mascheroni for feedback/support.
For Users
Installation
Everything below assumes you have already created a Python venv. If you haven't, run
python -m venv .venv
source .venv/bin/activate
To install the package, run pip install tangbao
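To confirm that the install landed in the active virtual environment, a quick check along these lines may help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical and not part of tangbao.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("tangbao")
    if found is None:
        print("tangbao is not installed in this environment")
    else:
        print(f"tangbao {found} is installed")
```

If the check reports that the package is missing, the venv is probably not activated in the current shell.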
Other dependencies
You may need to install other dependencies depending on what kinds of documents you want to parse.
See the unstructured documentation here: https://pypi.org/project/unstructured/
In this package, we only install the minimal dependencies necessary.
Configuration
This project requires certain environment variables to be set. These variables are used for connecting to external APIs and services.
- Create a .env file in the root directory of the project.
- Add the following content to the .env file, replacing placeholder values with your actual credentials:
Apollo
Apollo is the focus of the remainder of this guide.
APOLLO_CLIENT_ID=your_client_id
APOLLO_CLIENT_SECRET=your_client_secret
INDEX_NAME="" # Set this to your index name, see below for more details
Azure
To use the Azure endpoints, you'll need the following in your env:
AZURE_BASE_URL=https://azure.example.com
AZURE_API_KEY=your_azure_api_key
AZURE_DEPLOYMENT_VERSION=v1
AZURE_DEPLOYMENT_NAME=model_name
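Before running the later steps, it can be useful to check that the Apollo variables are actually visible to Python. The sketch below is one way to do that. It assumes the variables have already been loaded into the process environment (how the .env file gets loaded is not covered here), and the helper name `missing_env_vars` is hypothetical, not part of tangbao.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the Apollo section above.
APOLLO_VARS = ("APOLLO_CLIENT_ID", "APOLLO_CLIENT_SECRET", "INDEX_NAME")


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars(APOLLO_VARS)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All Apollo variables are set")
```

The same helper can be pointed at the Azure variable names if you use the Azure endpoints instead.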
RAG Workflow
Step 1: Parse Documents
Note: This guide and all following guides assume you've set up your environment properly. See above for instructions.
Before we can build the RAG, we need to parse the documents. This package provides functions to make that easier.
NOTE: PDF images will not be parsed.
We provide a basic chunking strategy, i.e., unstructured chunking. This means that meta-information such as the chapter or section level is lost when the documents are chunked.
Two parameters control the chunking structure:
- CHUNK_SIZE: controls the maximum number of characters in one text chunk
- CHUNK_OVERLAP: controls the number of characters shared between consecutive chunks
The chunk size controls the granularity at which the text is divided: small chunks provide very specific, almost keyword-based, matches to the query. Larger chunks capture more context and subtler meaning in the text.
To start with, we suggest CHUNK_SIZE = 2000 and CHUNK_OVERLAP = 500. In our experiments, these values have been a good default for many situations.
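To build intuition for how the two parameters interact, here is a minimal sketch of character-based chunking with overlap. It is an illustration only and not how `parse_docs.process_documents` is implemented; the function name `chunk_text` is hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "x" * 5000
    pieces = chunk_text(sample, chunk_size=2000, chunk_overlap=500)
    print([len(piece) for piece in pieces])  # prints [2000, 2000, 2000]
```

With the suggested defaults, each new chunk starts 1500 characters after the previous one, so a sentence that straddles a chunk boundary is likely to appear whole in at least one chunk.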
The following is a simple example of how to set up a parsing strategy. Please follow these steps:
- Store PDF documents in a folder named ./documents
- Create a script like the following:
from tangbao import parse_docs
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 500
filenames_df = parse_docs.get_filenames("./documents")
processed_docs = parse_docs.process_documents(filenames_df, CHUNK_SIZE, CHUNK_OVERLAP)
processed_docs["Metadata"] = processed_docs["Metadata"].apply(parse_docs.parse_metadata)
# Save file for the next step
processed_docs.to_parquet(f'my_docs_cs_{CHUNK_SIZE}_co_{CHUNK_OVERLAP}.parquet')
Step 2: Index the RAG Database
After we've parsed the documents in Step 1, we can index the RAG's vector database with the document chunks and metadata.
- Make sure to use the same CHUNK_SIZE and CHUNK_OVERLAP values from the previous step.
- Make sure you have the .parquet file in your working directory.
- The INDEX_NAME can have underscores, dashes, numbers and lower-case characters only.
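The character rule for INDEX_NAME can be checked before indexing starts. The following is a minimal sketch, assuming the rule is exactly as stated above (lower-case letters, digits, underscores and dashes); any further constraints the service may impose, such as a length limit, are not covered. The helper name `is_valid_index_name` is hypothetical.

```python
import re

# Allowed characters as stated above: lower-case letters, digits, underscores, dashes.
INDEX_NAME_PATTERN = re.compile(r"[a-z0-9_-]+")


def is_valid_index_name(name: str) -> bool:
    """Return True if name is non-empty and uses only the allowed characters."""
    return INDEX_NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("my_index-01", "MyIndex", "has space"):
        print(candidate, is_valid_index_name(candidate))
```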
Note: It's very important to keep your index name secret so that others cannot overwrite it with their documents. Consider storing it in an environment variable, as you would an API key. For further assurance that no one overwrites your index, you can generate a unique index name, e.g., with
import uuid
from tangbao.apollo import Apollo
unique_id = str(uuid.uuid4()) # can only include lower case alpha-numeric, underscores, and dashes
apollo = Apollo()
iam = apollo.iam()
INDEX_NAME=f'{iam["id"]}_{unique_id}'
Remember to record this index name in your .env file for later use. If you store it as INDEX_NAME, you can read it with, e.g., os.getenv("INDEX_NAME").
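Recording the generated name can also be scripted. The sketch below writes a KEY=value line to a .env file, replacing an existing entry for the same key (such as the empty INDEX_NAME placeholder from the Configuration section). The helper name `set_env_var` is hypothetical and not part of tangbao, and it assumes a simple one-variable-per-line .env layout.

```python
from pathlib import Path


def set_env_var(env_path: Path, key: str, value: str) -> None:
    """Set KEY=value in a .env file, replacing an existing entry for KEY if present."""
    lines = env_path.read_text(encoding="utf-8").splitlines() if env_path.exists() else []
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key}="):
            lines[i] = new_line  # overwrite the old entry, including any trailing comment
            replaced = True
    if not replaced:
        lines.append(new_line)
    env_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `set_env_var(Path(".env"), "INDEX_NAME", INDEX_NAME)` would store the name generated above.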
- Index the RAG DB. This can be done following a similar script:
from tangbao.apollo import Apollo
from tangbao.parse_docs import separate_for_indexing
from tangbao import config
import pandas as pd
import os

# use the same values from Step 1
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 500
PARQUET_FILE = f'my_docs_cs_{CHUNK_SIZE}_co_{CHUNK_OVERLAP}.parquet'
INDEX_NAME = os.getenv("INDEX_NAME") # can only include lower case alpha-numeric, underscores, and dashes
EMBEDDING_MODEL = "openai-text-embedding-3-large" # you can see other embedding models with apollo.get_model_info()

# create the Apollo client used for all calls below
apollo = Apollo()

processed_docs = pd.read_parquet(PARQUET_FILE, engine='pyarrow')
texts, ids, metadatas = separate_for_indexing(processed_docs)

# this can take a long time to run, depending on how many documents you have
apollo.index_multi_threaded(
    texts=texts,
    ids=ids,
    metadatas=metadatas,
    index_name=INDEX_NAME,
    embedding_model=EMBEDDING_MODEL,
    max_workers=8
)

# if there are any failures in indexing doc chunks, they will be written to a log file.
# You can resubmit those chunks using this method:
if os.path.exists(config.LOG_FILE):
    apollo.resubmit_failed_chunks(
        log_file=config.LOG_FILE,
        texts=texts,
        ids=ids,
        metadatas=metadatas,
        index_name=INDEX_NAME,
        embedding_model=EMBEDDING_MODEL,
        max_workers=8
    )
After indexing completes, you can query the index with a test question. Continuing from the script above (this reuses apollo, INDEX_NAME and EMBEDDING_MODEL):
apollo.query_index(
    user_query="YOUR QUERY HERE",
    num_chunks=5,
    index_name=INDEX_NAME,
    embedding_model=EMBEDDING_MODEL
)
Step 3: Build a Streamlit App
Now that we have indexed our documents in the RAG database, we can build a Streamlit app to let users 'chat' with the document store.
To create the app, follow these steps:
- Make sure you have the INDEX_NAME from the previous step
- Create a file called app.py and use the following template. Adapt the custom prompt to your use case: the prompt determines how the generation phase of the RAG behaves, so it is worth investing some time in prompt engineering to get the best answers from the LLM.
import streamlit as st
import pandas as pd
from tangbao import utils
from tangbao.apollo import Apollo
import os

INDEX_NAME = os.getenv("INDEX_NAME")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL") # make sure it's the same one you used for indexing above!

st.title("Chat with Docs")

# Define Session State
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "How may I help you?"}]
if "used_tokens" not in st.session_state:
    st.session_state.used_tokens = 0

# Display chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

apollo = Apollo()

@st.cache_data
def cached_model_info(_apollo_client):
    return _apollo_client.get_model_info()

model_info = cached_model_info(apollo)
chat_models = [model["model_name"] for model in model_info if model["model_info"]["mode"] == "chat"]

# Define Sidebar
with st.sidebar:
    selected_model = st.selectbox("Select LLM:", chat_models)
    CONTEXT_WINDOW = [model['model_info']['max_input_tokens'] for model in model_info if model['model_name'] == selected_model][0]
    token_display = st.empty()
    with token_display.container():
        st.progress(st.session_state.used_tokens/CONTEXT_WINDOW, text = f"Context window used ({st.session_state.used_tokens} out of {CONTEXT_WINDOW})")
    temperature = st.slider("Select model creativity (temperature)", min_value=0.0, max_value=1.0, value = 0.0)
    chunk_num = st.slider("Select number of chunks", min_value=1, max_value=8, value=4)

# User Input
if user_query := st.chat_input("Ask a question"):
    st.session_state.messages.append({"role": "user", "content": user_query})
    with st.chat_message("user"):
        st.markdown(user_query)

if st.session_state.messages[-1]["role"] != "assistant":
    # RAG Output
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            context = apollo.query_index(user_query, chunk_num, INDEX_NAME)
            #### ADAPT THE FOLLOWING PROMPT TO YOUR SPECIFIC NEEDS ####
            prompt = f"""\
            Use the following CONTEXT delimited by triple backticks to answer the QUESTION at the end.
            If you don't know the answer, just say that you don't know.
            Use three to five sentences and keep the answer as concise as possible.
            You are also a language expert, and so can translate your responses to other languages upon request.
            CONTEXT: ```
            {context['docs']}
            ```
            QUESTION: ```
            {user_query}
            ```
            Helpful Answer:"""
            response_full = apollo.chat_completion(
                messages=[{'role': 'user', 'content': prompt}] +
                [{'role': m['role'], 'content': m['content']} for m in st.session_state.messages],
                model=selected_model,
                temperature=temperature,
                seed=42,
                is_stream=False
            )
            response = apollo.get_content(response_full)
            st.session_state.used_tokens = apollo.get_token_usage(response_full)
            st.write(response)
    with st.sidebar:
        with token_display.container():
            st.progress(st.session_state.used_tokens/CONTEXT_WINDOW, text = f"Context window used ({st.session_state.used_tokens} out of {CONTEXT_WINDOW})")
    sources, titles = utils.extract_source(context)
    st.header("Sources:")
    st.table(pd.DataFrame({"Documents referenced": titles}))
    st.markdown(sources, unsafe_allow_html=True)
    st.session_state.messages.append({"role": "assistant", "content": response})
Then run streamlit run app.py to see if it works!
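When iterating on the prompt, it may help to pull it out of app.py into a plain function that can be tried without Streamlit or Apollo. The sketch below mirrors the template prompt; the function name `build_prompt` is hypothetical, and `context_docs` stands in for `context['docs']`, whose exact type is not documented here, so it is simply formatted as a string.

```python
def build_prompt(context_docs: object, user_query: str) -> str:
    """Assemble the RAG prompt from the retrieved context and the user's question."""
    fence = "`" * 3  # the triple-backtick delimiter used in the template prompt
    lines = [
        "Use the following CONTEXT delimited by triple backticks to answer the QUESTION at the end.",
        "If you don't know the answer, just say that you don't know.",
        "Use three to five sentences and keep the answer as concise as possible.",
        "You are also a language expert, and so can translate your responses to other languages upon request.",
        f"CONTEXT: {fence}",
        f"{context_docs}",
        fence,
        f"QUESTION: {fence}",
        user_query,
        fence,
        "Helpful Answer:",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("Example context text.", "What does the document say?"))
```

Keeping the prompt in one function makes it easier to compare variants side by side before wiring the preferred one back into the app.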
For Developers
Testing
pip install -e .
The -e flag in pip install -e . installs the package in "editable" mode, which means:
- Changes you make to the source code will be reflected immediately without reinstalling
- The package will be available in your Python environment just like a normal installed package
- You can import it with import tangbao in your scripts
For unit testing, we'll use the pytest framework.
source .venv/bin/activate
python -m pytest tests/
Build
source .venv/bin/activate
pip install -r requirements.txt
pip install --upgrade build wheel bumpversion
bumpversion patch # or major or minor
rm -rf dist
python setup.py sdist bdist_wheel
Upload to PyPI
Requires a PyPI API token. Get one at https://pypi.org
Set the token in your environment as TWINE_PASSWORD, and set TWINE_USERNAME to __token__ (the username PyPI expects for token authentication).
source .venv/bin/activate
pip install --upgrade twine
twine upload --repository pypi dist/*