
Chunking utilities for GraphRAG


GraphRAG Chunking

This package contains a collection of text chunkers, a core config model, and a factory for acquiring instances.

Examples

Basic sentence chunking with nltk

The SentenceChunker class splits text at sentence boundaries. It takes input text and returns a list in which each element is a single sentence, so downstream steps can work one sentence at a time.

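# Assumed module path; adjust if your installed version exposes the class elsewhere.
from graphrag_chunking.sentence_chunker import SentenceChunker

chunker = SentenceChunker()
chunks = chunker.chunk("This is a test. Another sentence.")
print(chunks) # ["This is a test.", "Another sentence."]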
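
To make the idea of a sentence boundary concrete, here is a minimal standard-library sketch. It is not the package's implementation: `split_sentences` is a hypothetical helper, and the regex is a simplified stand-in for the nltk tokenizer named in the heading above.

```python
import re

# Illustrative stand-in only: treat ".", "!" or "?" followed by whitespace
# as a sentence boundary. This is an assumption for the sketch, not how
# SentenceChunker is implemented.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text at simple punctuation boundaries and drop empty pieces."""
    return [part for part in _SENTENCE_BOUNDARY.split(text.strip()) if part]


sentences = split_sentences("This is a test. Another sentence.")
print(sentences)  # ['This is a test.', 'Another sentence.']
```

A rule this simple also splits after abbreviations such as "Dr.", which is one reason to prefer a trained sentence tokenizer for real text.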

Token chunking

The TokenChunker splits text into fixed-size chunks based on token count rather than sentence boundaries. It uses a tokenizer to encode text into tokens, then creates chunks of a specified size with configurable overlap between chunks.

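import tiktoken
# Assumed module path; adjust if your installed version exposes the class elsewhere.
from graphrag_chunking.token_chunker import TokenChunker

tokenizer = tiktoken.get_encoding("o200k_base")
chunker = TokenChunker(size=3, overlap=0, encode=tokenizer.encode, decode=tokenizer.decode)
chunks = chunker.chunk("This is a random test fragment of some text")
print(chunks) # ["This is a", " random test fragment", " of some text"]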
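
The interaction between size and overlap can be sketched with a sliding window over the token list. This is an illustration under stated assumptions, not the package's code: `chunk_tokens` is a hypothetical function, and the toy `encode`/`decode` pair treats whitespace-separated words as tokens so the example runs without tiktoken. How the real TokenChunker handles a short final window is not specified here.

```python
from collections.abc import Callable


def chunk_tokens(
    text: str,
    size: int,
    overlap: int,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
) -> list[str]:
    """Slide a window of `size` tokens forward by `size - overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    tokens = encode(text)
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + size]))
        # Stop once the current window already reaches the last token.
        if start + size >= len(tokens):
            break
    return chunks


# Toy tokenizer (assumption for this sketch): one token per word.
def encode(text: str) -> list:
    return text.split()


def decode(tokens: list) -> str:
    return " ".join(tokens)


text = "This is a random test fragment of some text"
print(chunk_tokens(text, size=3, overlap=0, encode=encode, decode=decode))
# ['This is a', 'random test fragment', 'of some text']
print(chunk_tokens(text, size=3, overlap=1, encode=encode, decode=decode))
# ['This is a', 'a random test', 'test fragment of', 'of some text']
```

With an overlap of 1, the last token of each chunk is repeated as the first token of the next, which keeps some context across chunk boundaries at the cost of more chunks.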

Creating a chunker with the factory helper

The create_chunker factory function instantiates a chunker from a ChunkingConfig object that specifies the chunking strategy and its parameters. Because the strategy lives in the config rather than in the calling code, switching chunkers means changing configuration instead of editing every place a chunker is constructed.

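import tiktoken
# Assumed module paths; adjust if your installed version exposes these names elsewhere.
from graphrag_chunking.chunker_factory import create_chunker
from graphrag_chunking.chunking_config import ChunkingConfig

tokenizer = tiktoken.get_encoding("o200k_base")
config = ChunkingConfig(
    strategy="tokens",
    size=3,
    overlap=0,
)
# The encode/decode callables are passed alongside the config.
chunker = create_chunker(config, tokenizer.encode, tokenizer.decode)
...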
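
The configuration-driven pattern itself can be sketched with a small registry that maps a strategy name to a builder. Everything below is hypothetical and for illustration only: `SketchConfig`, `create_sketch_chunker`, and the "sentence" and "words" strategy names are assumptions, not the package's ChunkingConfig, create_chunker, or its strategy values.

```python
import re
from collections.abc import Callable
from dataclasses import dataclass

ChunkFn = Callable[[str], list[str]]


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a chunking config (not ChunkingConfig)."""

    strategy: str
    size: int = 3
    overlap: int = 0


def _build_sentence(config: SketchConfig) -> ChunkFn:
    # Simplified boundary rule, used only for this sketch.
    boundary = re.compile(r"(?<=[.!?])\s+")
    return lambda text: [s for s in boundary.split(text.strip()) if s]


def _build_words(config: SketchConfig) -> ChunkFn:
    step = config.size - config.overlap
    if config.size <= 0 or step <= 0:
        raise ValueError("need size > 0 and overlap < size")

    def chunk(text: str) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + config.size]) for i in range(0, len(words), step)]

    return chunk


# Registry: strategy name -> builder that turns a config into a chunk function.
_REGISTRY: dict[str, Callable[[SketchConfig], ChunkFn]] = {
    "sentence": _build_sentence,
    "words": _build_words,
}


def create_sketch_chunker(config: SketchConfig) -> ChunkFn:
    """Look up the builder for config.strategy and construct the chunker."""
    try:
        builder = _REGISTRY[config.strategy]
    except KeyError:
        raise ValueError(f"Unknown strategy: {config.strategy!r}") from None
    return builder(config)


chunk = create_sketch_chunker(SketchConfig(strategy="words", size=3))
print(chunk("This is a random test fragment of some text"))
# ['This is a', 'random test fragment', 'of some text']
```

Calling code depends only on the config and the factory, so a new strategy can be added by registering another builder without touching existing call sites.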
