Your ultimate toolkit for text chunking.

Chunkifyr 📜🔪

Chunkifyr is a flexible text chunking library that splits text into meaningful segments. Whether you're processing large documents, preparing data for NLP tasks, or simply need to break text into manageable pieces, Chunkifyr provides a range of customizable chunking strategies.

Features ✨

  • Language Model Chunking: Leverage the contextual understanding of language models to chunk text along semantic boundaries.
  • Syntactic Chunking: Break down text into syntactically meaningful segments, preserving grammatical structures.
  • spaCy-Enhanced Chunking: Integrate with spaCy for both semantic and syntactic chunking, taking advantage of spaCy's NLP pipeline.
  • Customizable Settings: Easily adjust chunk sizes, overlap percentages, and more to fit your specific needs.
  • Robust and Fast: Efficiently handles large texts, ensuring your workflow remains smooth and responsive.
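
To illustrate how chunk size and overlap percentage interact, here is a minimal sketch in plain Python. It is not Chunkifyr's implementation: the function name `chunk_with_overlap` and its parameters `chunk_size` and `overlap_pct` are assumptions made for this example only.

```python
def chunk_with_overlap(text, chunk_size=200, overlap_pct=0.1):
    """Split text into fixed-size character windows that share an overlap.

    Hypothetical helper for illustration; not part of Chunkifyr's API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_pct < 1:
        raise ValueError("overlap_pct must be in [0, 1)")

    overlap = int(chunk_size * overlap_pct)   # characters shared by neighbours
    step = max(1, chunk_size - overlap)       # how far each window advances

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):   # last window reached the end
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_with_overlap("abcdefghij", chunk_size=4, overlap_pct=0.5):
        print(piece)
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same text.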

Installation 🛠️

Install Chunkifyr via pip:

pip install chunkifyr

Note: Python 3.8+ is required.

Usage 🚀

Here’s a quick example using LMChunker to get you started:

from chunkifyr import LMChunker
from openai import OpenAI

# These credentials can point at a local OpenAI-compatible server instead (llama_cpp, llamafile, ollama).
client = OpenAI(api_key="YOUR_API_KEY", base_url="DEPLOYMENT_URL")

chunker = LMChunker(model="gpt-3.5-turbo-0125", client=client)
chunks = chunker.from_file('path_to_your_text_file.txt')

for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {chunk.text}")
    print(f"Chunk {i} description: {chunk.meta.description}")
    print()

This example demonstrates how to use LMChunker with an OpenAI client to break a text file into meaningful chunks. Each chunk also includes a description generated by the model, which can be used as metadata when embedding the chunk.
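
As one way the description might be carried along as metadata, the sketch below pairs each chunk's text with its description in a plain dictionary before embedding. The `Chunk` and `Meta` classes are hypothetical stand-ins that only mirror the `chunk.text` and `chunk.meta.description` attributes used above, and the record layout and the `to_records` helper are assumptions, not something Chunkifyr or any vector store prescribes.

```python
from dataclasses import dataclass


@dataclass
class Meta:
    """Hypothetical stand-in for the object behind chunk.meta."""
    description: str


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk returned by a chunker."""
    text: str
    meta: Meta


def to_records(chunks):
    """Pair each chunk's text with its description as metadata."""
    return [
        {
            "id": i,
            "text": chunk.text,
            "metadata": {"description": chunk.meta.description},
        }
        for i, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    sample = [
        Chunk("Install the package with pip.", Meta("Installation steps")),
        Chunk("Create a chunker and call from_file.", Meta("Basic usage")),
    ]
    for record in to_records(sample):
        print(record)
```

The same function should work on real chunks as long as they expose the two attributes shown in the example above.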

Available Chunkers

  • LMChunker: Utilizes pre-trained language models for contextual chunking.
  • SimpleSemanticChunker: Groups similar splits together for basic semantic chunking.
  • SimpleSyntacticChunker: Simple syntactic chunking with a configurable chunk size, overlap, and separator (very similar to LangChain's character splitter).
  • SemanticChunker: Groups text semantically using the Adjacent Sentence Clustering process with a configurable similarity threshold.
  • SyntacticChunker: Chunks text based on syntactic structures using customizable tokenization (supports hf_tokenizer and tiktoken). More are planned (regex-based, etc.).
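
The idea of clustering adjacent sentences against a similarity threshold can be sketched as follows. This is only an illustration of the general technique, not SemanticChunker's actual code: it swaps in a toy bag-of-words cosine similarity where a real pipeline would likely use sentence embeddings, and the names `cluster_adjacent` and `threshold` are assumptions.

```python
import math
import re
from collections import Counter


def _vector(sentence):
    """Toy bag-of-words vector (a stand-in for a real sentence embedding)."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_adjacent(sentences, threshold=0.2):
    """Merge neighbouring sentences; start a new chunk when similarity drops.

    Hypothetical helper for illustration; not part of Chunkifyr's API.
    """
    chunks, current = [], []
    for sentence in sentences:
        # Compare with the previous sentence only (adjacent clustering).
        if current and _cosine(_vector(current[-1]), _vector(sentence)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sentences = [
        "Cats chase mice.",
        "Cats also chase birds.",
        "Stock markets fell today.",
    ]
    for chunk in cluster_adjacent(sentences):
        print(chunk)
```

Raising the threshold produces more, smaller chunks; lowering it merges more sentences together.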

Contributing 🤝

Contributions are welcome! If you have ideas for improving Chunkifyr or encounter any issues, feel free to submit a pull request or open an issue.

License 📄

This project is licensed under the MIT License - see the LICENSE file for details.
