A library of community-driven data loaders for LLMs. Use with LlamaIndex and/or LangChain.

LlamaHub 🦙

Original creator: Jesse Zhang (GH: emptycrown, Twitter: @thejessezhang), who courteously donated the repo to LlamaIndex!

This is a simple library of all the data loaders / readers / tools / llama-packs / llama-datasets that have been created by the community. The goal is to make it extremely easy to connect large language models to a large variety of knowledge sources. These are general-purpose utilities that are meant to be used in LlamaIndex, LangChain and more!

Loaders and readers allow you to easily ingest data for search and retrieval by a large language model, while tools allow the models to both read and write to third party data services and sources. Ultimately, this allows you to create your own customized data agent to intelligently work with you and your data to unlock the full capability of next level large language models.

For a variety of examples of data agents, see the notebooks directory. You can find example Jupyter notebooks for creating data agents that can load and parse data from Google Docs, SQL databases, Notion, and Slack, manage your Google Calendar and Gmail inbox, or read and use OpenAPI specs.

For an easier way to browse the integrations available, check out the website here: https://llamahub.ai/.

Usage (Use llama-hub as PyPI package)

These general-purpose loaders are designed to load data into LlamaIndex; the loaded documents can then also be used in LangChain.

Installation

pip install llama-hub

LlamaIndex

from llama_index import VectorStoreIndex
from llama_hub.google_docs import GoogleDocsReader

gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query('Where did the author go to school?')

LlamaIndex Data Agent

from llama_index.agent import OpenAIAgent
import openai
openai.api_key = 'sk-api-key'

from llama_hub.tools.google_calendar import GoogleCalendarToolSpec
tool_spec = GoogleCalendarToolSpec()

agent = OpenAIAgent.from_tools(tool_spec.to_tool_list())
agent.chat('what is the first thing on my calendar today')
agent.chat("Please create an event for tomorrow at 4pm to review pull requests")

For a variety of examples of creating and using data agents, see the notebooks directory.

LangChain

Note: Make sure you change the description of the Tool to match your use case.

from llama_index import VectorStoreIndex
from llama_hub.google_docs import GoogleDocsReader
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# load documents
gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
langchain_documents = [d.to_langchain_format() for d in documents]

# initialize sample QA chain
llm = OpenAI(temperature=0)
qa_chain = load_qa_chain(llm)
question="<query here>"
answer = qa_chain.run(input_documents=langchain_documents, question=question)

Loader Usage (Use download_loader from LlamaIndex)

You can also use the loaders with download_loader from LlamaIndex in a single line of code.

For example, see the code snippets below using the Google Docs Loader.

from llama_index import VectorStoreIndex, download_loader

GoogleDocsReader = download_loader('GoogleDocsReader')

gdoc_ids = ['1wf-y2pd9C878Oh-FmLH7Q_BQkljdm6TQal-c1pUfrec']
loader = GoogleDocsReader()
documents = loader.load_data(document_ids=gdoc_ids)
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query('Where did the author go to school?')

Llama-Pack Usage

Llama-packs can be downloaded using the llamaindex-cli tool that comes with llama-index:

llamaindex-cli download-llamapack ZephyrQueryEnginePack --download-dir ./zephyr_pack

Or with the download_llama_pack function directly:

from llama_index.llama_pack import download_llama_pack

# download and install dependencies
LlavaCompletionPack = download_llama_pack(
  "LlavaCompletionPack", "./llava_pack"
)

Llama-Dataset Usage

The primary use of llama-dataset is for evaluating the performance of a RAG system. In particular, it serves as a new test set (in traditional machine learning speak) for one to build a RAG over, predict on, and subsequently perform evaluations comparing the predicted response versus the reference response. To perform the evaluation, the recommended usage pattern involves the application of the RagEvaluatorPack. We recommend reading the docs for the "Evaluation" module for more information.

from llama_index.llama_dataset import download_llama_dataset
from llama_index.llama_pack import download_llama_pack
from llama_index import VectorStoreIndex

# download and install dependencies for benchmark dataset
rag_dataset, documents = download_llama_dataset(
  "PaulGrahamEssayDataset", "./data"
)

# build basic RAG system
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

# evaluate using the RagEvaluatorPack
RagEvaluatorPack = download_llama_pack(
  "RagEvaluatorPack", "./rag_evaluator_pack"
)
rag_evaluator_pack = RagEvaluatorPack(
    rag_dataset=rag_dataset,
    query_engine=query_engine
)
benchmark_df = rag_evaluator_pack.run()  # async arun() supported as well

Llama-datasets can also be downloaded directly using llamaindex-cli, which comes installed with the llama-index python package:

llamaindex-cli download-llamadataset PaulGrahamEssayDataset --download-dir ./data

After downloading a dataset with llamaindex-cli, you can inspect the dataset and its source files (stored in a directory /source_files), then load them into Python:

from llama_index import SimpleDirectoryReader
from llama_index.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")
documents = SimpleDirectoryReader(
    input_dir="./data/source_files"
).load_data()

How to add a loader/tool/llama-pack

Adding a loader/tool/llama-pack simply requires forking this repo and making a Pull Request. The Llama Hub website will update automatically when a new llama-hub release is made. However, please keep in mind the following guidelines when making your PR.

Step 0: Set up virtual environment, install Poetry and dependencies

Create a new Python virtual environment. The command below creates an environment in .venv, and activates it:

python -m venv .venv
source .venv/bin/activate

If you are on Windows, use the following to activate your virtual environment:

.venv\scripts\activate

Install poetry:

pip install poetry

Install the required dependencies (this will also install llama_index):

poetry install

This will create an editable install of llama-hub in your venv.

Step 1: Create a new directory

For loaders, create a new directory in llama_hub; for tools, create a directory in llama_hub/tools; and for llama-packs, create a directory in llama_hub/llama_packs. It can be nested within another directory, but give it a unique name, because the name of the directory becomes the identifier for your loader (e.g. google_docs). Inside your new directory, create an __init__.py file specifying the module's public interface with __all__, a base.py file which will contain your loader implementation, and, if needed, a requirements.txt file to list the package dependencies of your loader. Those packages will automatically be installed when your loader is used, so no need to worry about that anymore!

If you'd like, you can create the new directory and files by running the following script in the llama_hub directory. Just remember to put your dependencies into a requirements.txt file.

./add_loader.sh [NAME_OF_NEW_DIRECTORY]
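The directory layout described in this step can also be produced with a few lines of Python. The following is only a minimal sketch and not a reimplementation of add_loader.sh: the helper name scaffold_loader and the example names my_source and MySourceReader are hypothetical, and base.py is written as an empty stub for you to fill in.

```python
import tempfile
from pathlib import Path


def scaffold_loader(root, dir_name, class_name, package="llama_hub"):
    """Create __init__.py, base.py and requirements.txt under root/dir_name.

    Hypothetical helper: `root` is assumed to be the llama_hub package
    directory (or llama_hub/tools, in which case pass package="llama_hub.tools").
    """
    loader_dir = Path(root) / dir_name
    # Fail loudly if the name is already taken, since it must be unique.
    loader_dir.mkdir(parents=True)

    # __init__.py declares the module's public interface with __all__.
    (loader_dir / "__init__.py").write_text(
        f"from {package}.{dir_name}.base import {class_name}\n\n"
        f'__all__ = ["{class_name}"]\n'
    )
    # base.py will hold the loader implementation; only a stub is written here.
    (loader_dir / "base.py").write_text(
        f'"""{class_name} implementation goes here."""\n'
    )
    # requirements.txt lists the loader's package dependencies (empty to start).
    (loader_dir / "requirements.txt").write_text("")
    return loader_dir


if __name__ == "__main__":
    # Try it in a throwaway directory instead of the real repo checkout.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_loader(tmp, "my_source", "MySourceReader")
        print(sorted(p.name for p in created.iterdir()))
```

Running the sketch against a temporary directory first is a cheap way to check the generated files before pointing it at your fork.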

Step 2: Write your README

Inside your new directory, create a README.md that mirrors that of the existing ones. It should have a summary of what your loader or tool does, its inputs, and how it is used in the context of LlamaIndex and LangChain.

Step 3: Add your loader to the library.json file

Finally, add your loader to the llama_hub/library.json file (or to the equivalent library.json under tools/ or llama_packs/) so that it may be used by others. As is exemplified by the current file, add the class name of your loader or tool, along with its ID, author, etc. This file is referenced by the Llama Hub website and the download function within LlamaIndex.
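If you prefer to script this step, the sketch below adds one entry to a library.json-style file. It assumes the file is a single JSON object keyed by class name whose values carry at least "id" and "author" fields, which is how this step describes it; check the existing entries for any further fields. The helper name register_loader and the example values are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def register_loader(library_path, class_name, loader_id, author):
    """Add one entry to a library.json-style file (assumed layout, see above)."""
    path = Path(library_path)
    library = json.loads(path.read_text())
    # Refuse to overwrite an existing entry by accident.
    if class_name in library:
        raise ValueError(f"{class_name} is already registered")
    library[class_name] = {"id": loader_id, "author": author}
    path.write_text(json.dumps(library, indent=2) + "\n")
    return library


if __name__ == "__main__":
    # Demonstrate on a throwaway file rather than the real library.json.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "library.json"
        demo.write_text("{}")
        print(register_loader(demo, "MySourceReader", "my_source", "your-github-handle"))
```

The duplicate check matters because the class name is the lookup key, so a second entry with the same name would silently replace the first.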

Step 4: Make a Pull Request!

Create a PR against the main branch. We typically review the PR within a day. To help expedite the process, it may be helpful to provide screenshots (either in the PR or in the README directly). Show your data loader or tool in action!

How to add a llama-dataset

Similar to the process of adding a tool / loader / llama-pack, adding a llama-dataset also requires forking this repo and making a Pull Request. However, for a llama-dataset, only its metadata is checked into this repo. The actual dataset and its source files are instead checked into another GitHub repo, the llama-datasets repository. You will need to fork and clone that repo in addition to forking and cloning this one.

When you clone the llama-datasets repository, make sure to set the environment variable GIT_LFS_SKIP_SMUDGE before calling the git clone command:

# for bash
GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh
GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https

# for Windows, it's done in two commands
set GIT_LFS_SKIP_SMUDGE=1
git clone git@github.com:<your-github-user-name>/llama-datasets.git  # for ssh

set GIT_LFS_SKIP_SMUDGE=1  
git clone https://github.com/<your-github-user-name>/llama-datasets.git  # for https
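The same setting can be applied from a Python script by passing the variable through the subprocess environment, which works the same way on bash and Windows. This is a minimal sketch; the helper name clone_without_lfs and its run flag are hypothetical conveniences, and the URL in the usage comment is the placeholder from the commands above.

```python
import os
import subprocess


def clone_without_lfs(repo_url, dest=None, run=True):
    """Clone repo_url with GIT_LFS_SKIP_SMUDGE=1 set for the git process.

    With this variable set, Git LFS skips downloading large-file content
    during checkout and leaves pointer files in place.
    """
    # Copy the current environment and add the variable for this call only.
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    cmd = ["git", "clone", repo_url]
    if dest is not None:
        cmd.append(str(dest))
    if run:
        subprocess.run(cmd, env=env, check=True)
    return cmd, env


# Example call (performs a real clone, so it is left commented out):
# clone_without_lfs("https://github.com/<your-github-user-name>/llama-datasets.git")
```

Passing run=False returns the command and environment without executing anything, which is handy for checking what would be run.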

The high-level steps for adding a llama-dataset are as follows:

  1. Create a LabelledRagDataset (the initial class of llama-dataset made available on llama-hub)
  2. Generate a baseline result with a RAG system of your own choosing on the LabelledRagDataset
  3. Prepare the dataset's metadata (card.json and README.md)
  4. Submit a Pull Request to this repo to check in the metadata
  5. Submit a Pull Request to the llama-datasets repository to check in the LabelledRagDataset and the source files

To assist with the submission process, we have prepared a submission template notebook that walks you through the above-listed steps. We highly recommend that you use this template notebook.

Running tests

python3.9 -m venv .venv
source .venv/bin/activate 
pip3 install -r test_requirements.txt

poetry run pytest tests 

Changelog

If you want to track the latest version updates / see which loaders are added to each release, take a look at our full changelog here!

FAQ

How do I test my loader before it's merged?

There is an argument called loader_hub_url in download_loader that defaults to the main branch of this repo. You can set it to your branch or fork to test your new loader.
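As a rough illustration, the sketch below builds a value for loader_hub_url that points at a fork and branch. It assumes the argument expects a raw.githubusercontent.com URL ending in the llama_hub directory, which you should confirm against the default in your installed llama-index version. The helper name fork_hub_url and the user, branch, and class names are hypothetical.

```python
def fork_hub_url(github_user, branch, repo="llama-hub"):
    """Build a raw-content URL for the llama_hub directory of a fork/branch.

    The URL layout is an assumption; compare it with the default value of
    loader_hub_url in your llama-index installation.
    """
    return f"https://raw.githubusercontent.com/{github_user}/{repo}/{branch}/llama_hub"


url = fork_hub_url("your-github-user-name", "my-new-loader")
print(url)

# With llama-index installed, the URL would be passed like this (not run here):
# from llama_index import download_loader
# MySourceReader = download_loader("MySourceReader", loader_hub_url=url)
```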

Should I create a PR against LlamaHub or the LlamaIndex repo directly?

If you have a data loader PR, by default let's try to create it against LlamaHub! We will make exceptions in certain cases (for instance, if we think the data loader should be core to the LlamaIndex repo).

For all other PRs relevant to LlamaIndex, let's create them directly against the LlamaIndex repo.

Other questions?

Feel free to hop into the community Discord or tag the official Twitter account!
