Skip to main content

A package for chunking documents from various formats

Project description

Chunk Norris

Goal

This project aims at improving the method of chunking documents from various sources (HTML, PDFs, ...) An optimized chunking method might lead to smaller chunks, meaning :

  • Better relevancy of chunks (and thus easier identification of useful chunks through embedding cosine similarity)
  • Less errors because of chunks exceeding the API limit in terms of number of tokens
  • Less hallucinations of generation models because of superfluous information in the prompt
  • Reduced cost as the prompt would have reduced size

Installation

Using Pypi, just run the following command : pip install chunknorris

Chunkers

The package features multiple chunkers that can be used independently depending on the type of document needed.

All chunkers follow a similar logic :

  • Extract table of contents (= headers)
  • Build chunks using the text content of a part, and put the titles of the parts it belongs to on top

MarkdownChunkNorris

This chunker is meant to be used on markdown-formatted text.

Note: When calling the chunker, you need to specify the header style of your markdown text (ATX or Setext). By default it will consider "Setext" heading style.

Usage

from chunkers import MarkdownChunkNorris

text = """
# This is a header
This is a text
## This is another header
And another text
## With this final header
And this last text
"""
chunker = MarkdownChunkNorris()
header_style = "atx" # or "setext" depending on headers in your text
chunks = chunker(text, header_style=header_style)

HTMLChunkNorris

This chunker is meant to be used on html-formatted text. Behind the scene, it uses markdownify to transform the text to markdown with "setex"-style headers and uses MarkdownChunkNorris to process it.

Usage

from chunkers import HTMLChunkNorris

text = """
<h1>This is 1st level heading</h1>
<p>This is a test paragraph.</p>
<h2>This is 2nd level heading</h2>
<p>This is a test paragraph.</p>
<h2>This is another level heading</h2>
<p>This is another test paragraph.</p>
"""
hcn = HTMLChunkNorris()
chunks = hcn(text)

Advanced usage of chunkers

Additionally, the chunkers can take a number of argument allowing to modifiy its behavior:

from chunkers import MarkdownChunkNorris

mystring = "# header\nThis is a markdown string"

chunker = MarkdownChunkNorris() # or any other chunker
chunks = chunker(
    mystring,
    max_title_level_to_use="h3",
    max_chunk_word_length=200,
    link_placement="in_sentence",
    max_chunk_tokens=8191,
    chunk_tokens_exceeded_handling="split",
    min_chunk_wordcount=15,
    )

max_title_level_to_use (str): The maximum (included) level of headers take into account for chunking. For example, if "h3" is set, then "h4" and "h5" titles won't be used. Must be a string of type "hx" with x being the title level. Defaults to "h4".

max_chunk_word_length (int): The maximum size (soft limit, in words) a chunk can be. Chunk bigger that this size will be chunked using lower level headers, until no lower level headers are available. Defaults to 200.

link_placement (str): How the links should be handled. Defaults to in_sentence. Options :

  • "remove" : text is kept but links are removed
  • "end_of_chunk" : adds a paragraph at the end of the chunk containing all the links
  • "in_sentence" : the links is added between parenthesis inside the sentence

max_chunk_tokens (int): The hard maximum of number of token a chunk can be. Chunks bigger by this limit will be handler according to chunk_tokens_exceeded_handling. Defaults to 8191.

chunk_tokens_exceeded_handling (str): how the chunks bigger that than specified by max_chunk_tokens should be handled. Default to "raise_error". Options:

  • "raise_error": raises an error, indicated the chunk could not be split according to headers
  • "split": split the chunks arbitrarily sothat each chunk has a size lower than max_chunk_tokens

min_chunk_wordcount (int): Minimum number of words to consider keeping the chunks. Chunks with less words will be discarded. Defaults to 15.

PDFChunkNorris

#TODO:

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunknorris-0.0.1.tar.gz (12.8 kB view hashes)

Uploaded Source

Built Distribution

chunknorris-0.0.1-py3-none-any.whl (17.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page