nano-GraphRAG

A simple, easy-to-hack GraphRAG implementation

😭 GraphRAG is good and powerful, but the official implementation is difficult/painful to read or hack.

😊 This project provides a smaller, faster, cleaner GraphRAG, while retaining the core functionality (see Benchmark and Issues).

🎁 Excluding tests and prompts, nano-graphrag is about 800 lines of code.

👌 Small yet scalable, asynchronous and fully typed

Install

Install from PyPI

pip install nano-graphrag

Install from source

# clone this repo first
cd nano-graphrag
pip install -e .

Quick Start

Tip: Please set your OpenAI API key in the environment: export OPENAI_API_KEY="sk-...". If you'd like to use another LLM, have a look at the LLM section.

Download a copy of A Christmas Carol by Charles Dickens:

curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt

Use the Python snippet below:

from nano_graphrag import GraphRAG, QueryParam

graph_func = GraphRAG(working_dir="./dickens")

with open("./book.txt") as f:
    graph_func.insert(f.read())

# Perform global graphrag search
print(graph_func.query("What are the top themes in this story?"))

# Perform local graphrag search (which I think is the better and more scalable option)
print(graph_func.query("What are the top themes in this story?", param=QueryParam(mode="local")))

Next time you initialize a GraphRAG from the same working_dir, it will reload all the contexts automatically.
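For example, a second run can skip the insert entirely and query straight away (the question here is just illustrative):

from nano_graphrag import GraphRAG, QueryParam

# the graph, community reports, and caches are reloaded from ./dickens
graph_func = GraphRAG(working_dir="./dickens")
print(graph_func.query("Who is Scrooge?", param=QueryParam(mode="local")))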

Incremental Insert

nano-graphrag supports incremental insert; no duplicated computation or data will be added:

with open("./book.txt") as f:
    book = f.read()
    half_len = len(book) // 2
    graph_func.insert(book[:half_len])
    graph_func.insert(book[half_len:])

nano-graphrag uses the MD5 hash of the content as the key, so there are no duplicated chunks.
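Conceptually, the key is just a hash of the chunk content, so inserting identical text twice resolves to the same key (a minimal sketch, not the library's exact helper):

from hashlib import md5

def chunk_key(content: str) -> str:
    # identical content -> identical key -> no duplicated chunk
    return md5(content.encode()).hexdigest()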

However, each time you insert, the graph communities will be re-computed and the community reports will be re-generated.

Async

For each method NAME(...), there is a corresponding async method aNAME(...):

await graph_func.ainsert(...)
await graph_func.aquery(...)
...
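For example, a fully asynchronous version of the quick start (a minimal sketch using the async methods listed above):

import asyncio
from nano_graphrag import GraphRAG, QueryParam

async def main():
    graph_func = GraphRAG(working_dir="./dickens")
    with open("./book.txt") as f:
        await graph_func.ainsert(f.read())
    print(await graph_func.aquery(
        "What are the top themes in this story?",
        param=QueryParam(mode="local"),
    ))

asyncio.run(main())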

Available Parameters

GraphRAG and QueryParam are Python dataclasses. Use help(GraphRAG) and help(QueryParam) to see all available parameters!
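Since they are dataclasses, you can also list the fields programmatically:

from dataclasses import fields
from nano_graphrag import QueryParam

# print every parameter name and its default value
# (the same works for GraphRAG)
for f in fields(QueryParam):
    print(f.name, "=", f.default)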

Advanced

Prompt

nano-graphrag uses prompts from the nano_graphrag.prompt.PROMPTS dict object. You can play with it and replace any prompt inside.

Some important prompts:

  • PROMPTS["entity_extraction"] is used to extract the entities and relations from a text chunk.
  • PROMPTS["community_report"] is used to organize and summarize the graph cluster's description.
  • PROMPTS["local_rag_response"] is the system prompt template of the local search generation.
  • PROMPTS["global_reduce_rag_response"] is the system prompt template of the global search generation.
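For example, to inspect and tweak a prompt in place (the appended instruction is hypothetical; keep the template's original placeholders intact so it still formats correctly):

from nano_graphrag.prompt import PROMPTS

# inspect the default entity-extraction prompt
print(PROMPTS["entity_extraction"])

# hypothetical domain hint appended to the template
PROMPTS["entity_extraction"] += "\nFocus on characters and places."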

Storage

You can replace all storage-related components with your own implementations; nano-graphrag mainly uses three kinds of storage:

  • base.BaseKVStorage for storing key-json pairs of data.
    • By default we use disk file storage as the backend.
    • GraphRAG(.., key_string_value_json_storage_cls=YOURS,...)
  • base.BaseVectorStorage for indexing embeddings.
    • By default we use milvus-lite as the backend.
    • GraphRAG(.., vector_db_storage_cls=YOURS,...)
  • base.BaseGraphStorage for storing knowledge graph.
    • By default we use networkx as the backend.
    • GraphRAG(.., graph_storage_cls=YOURS,...)

You can refer to nano_graphrag.base to see the detailed interface for each component.
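Putting it together, a constructor call swapping in custom backends might look like this (YourKVStorage, YourVectorStorage, and YourGraphStorage are hypothetical subclasses of the base classes above):

GraphRAG(
    working_dir="./dickens",
    key_string_value_json_storage_cls=YourKVStorage,  # BaseKVStorage subclass
    vector_db_storage_cls=YourVectorStorage,          # BaseVectorStorage subclass
    graph_storage_cls=YourGraphStorage,               # BaseGraphStorage subclass
)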

LLM

In nano-graphrag, we require two types of LLMs: a great one and a cheap one. The former is used to plan and respond; the latter is used to summarize. By default, the great one is gpt-4o and the cheap one is gpt-4o-mini.

You can implement your own LLM function (refer to _llm.gpt_4o_complete):

from nano_graphrag.base import BaseKVStorage

async def my_llm_complete(
    prompt, system_prompt=None, history_messages=[], **kwargs
) -> str:
    # pop the cache KV database if any
    hashing_kv: BaseKVStorage = kwargs.pop("hashing_kv", None)
    # the remaining kwargs are for the LLM call, e.g. `max_tokens=...`
    # build the message list from the system prompt, history, and user prompt
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.extend(history_messages)
    messages.append({"role": "user", "content": prompt})
    # YOUR LLM call
    response = await call_your_LLM(messages, **kwargs)
    return response

Replace the default one with:

# Adjust the max token size or the max async requests if needed
GraphRAG(best_model_func=my_llm_complete, best_model_max_token_size=..., best_model_max_async=...)
GraphRAG(cheap_model_func=my_llm_complete, cheap_model_max_token_size=..., cheap_model_max_async=...)

You can refer to an example that uses deepseek-chat as the LLM.

Embedding

You can replace the default embedding functions with any _utils.EmbeddingFunc instance.

For example, the default one is using OpenAI embedding API:

import numpy as np
from openai import AsyncOpenAI
# decorator that attaches dimension/token-size attributes to the function
from nano_graphrag._utils import wrap_embedding_func_with_attrs

@wrap_embedding_func_with_attrs(embedding_dim=1536, max_token_size=8192)
async def openai_embedding(texts: list[str]) -> np.ndarray:
    openai_async_client = AsyncOpenAI()
    response = await openai_async_client.embeddings.create(
        model="text-embedding-3-small", input=texts, encoding_format="float"
    )
    return np.array([dp.embedding for dp in response.data])

Replace the default embedding function with:

GraphRAG(embedding_func=your_embed_func, embedding_batch_num=..., embedding_func_max_async=...)
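For instance, a local sentence-transformers model could stand in for the OpenAI API (a hedged sketch; assumes pip install sentence-transformers, and the model name and dimensions are illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer
from nano_graphrag import GraphRAG
from nano_graphrag._utils import wrap_embedding_func_with_attrs

model = SentenceTransformer("all-MiniLM-L6-v2")

@wrap_embedding_func_with_attrs(embedding_dim=384, max_token_size=256)
async def local_embedding(texts: list[str]) -> np.ndarray:
    # encode synchronously; fine for a sketch
    return model.encode(texts, convert_to_numpy=True)

graph_func = GraphRAG(working_dir="./dickens", embedding_func=local_embedding)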

Benchmark

Issues

  • nano-graphrag doesn't implement the covariates feature of GraphRAG.
  • nano-graphrag implements global search differently from the original. The original uses a map-reduce-like style to fill all the communities into context, while nano-graphrag only uses the top-K important and central communities (use QueryParam.global_max_conside_community to control this; it defaults to 512 communities), as in the sketch below.
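A sketch of tightening the global search (the parameter name is taken verbatim from above; 100 is an arbitrary value):

from nano_graphrag import GraphRAG, QueryParam

graph_func = GraphRAG(working_dir="./dickens")
# consider only the 100 most important/central communities (default: 512)
print(graph_func.query(
    "What are the top themes in this story?",
    param=QueryParam(global_max_conside_community=100),
))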

TODO in Next Version

A filled checkbox means someone is working on it.

  • nano-graphrag's Data Source Id is local, meaning it always starts at 0 in any response, so you have to remap it into the current session. This makes it of little use right now.
  • nano-graphrag truncates the community's raw description if it exceeds the maximum context size when generating the community report, while GraphRAG uses iterative summarization over sub-communities to include everything.
  • Add real benchmark with GraphRAG
  • Add new components, see issue
