Document chunking and indexing library for vector stores

Chunkin

A Python library for document chunking and indexing into vector stores, built on LangChain.

Built on LangChain

Chunkin leverages LangChain for:

  • Document Loaders: Load PDF, DOCX, TXT, MD, CSV, XLSX, PPT formats
  • Text Splitters: 6 chunking strategies including semantic chunking
  • Vector Stores: 50+ vector store integrations (FAISS, Chroma, Pinecone, etc.)

Learn more about LangChain's document processing capabilities.

Modules

Module             Description
chunkin            Document chunking using LangChain text splitters
chunkin_indexer    Index chunks to 50+ vector stores via LangChain integrations
chunkin_processor  Unified end-to-end processing

Quick Start

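from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

# Chunk the file and index the chunks in the vector store.
processor.process_file("document.pdf")

# Retrieve the 3 results most similar to the query.
results = processor.search("your query", k=3)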
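
For readers new to the chunk, index, and search flow, the following dependency-free sketch illustrates the idea behind the Quick Start. It is not Chunkin's implementation: `chunk_words`, `embed`, `cosine`, and `search` are hypothetical helpers, the word-count "embedding" is a stand-in for a real embedding model, and chunk size is measured in words here purely for simplicity.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, words_per_chunk: int) -> list[str]:
    """Group consecutive words into chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(index: list[tuple[str, Counter]], query: str, k: int) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


text = "cats purr when happy dogs bark at strangers fish swim in schools"
# "Indexing" here is just storing each chunk next to its vector.
index = [(chunk, embed(chunk)) for chunk in chunk_words(text, words_per_chunk=4)]
top = search(index, "why do dogs bark", k=1)
```

In this sketch `k` caps the number of chunks returned, which is presumably also the role of `k=3` in the Quick Start call.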

Installation

# Core only
pip install chunkin

# With OpenAI + FAISS (recommended)
# Extras are quoted so shells such as zsh do not treat the brackets as a glob.
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores (Chroma, Milvus, LanceDB, etc.)
pip install "chunkin[local]"

# Specific cloud providers
pip install "chunkin[aws]"     # Amazon AWS
pip install "chunkin[azure]"   # Microsoft Azure
pip install "chunkin[gcp]"     # Google Cloud

# All vector stores
pip install "chunkin[all]"
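
After installing an extra, it can be handy to confirm which optional packages are importable in the current environment. The sketch below uses only the standard library. The README does not state which packages each extra pulls in, so the module names in `OPTIONAL_MODULES` are assumptions to adjust, and `installed_backends` is a hypothetical helper rather than part of Chunkin.

```python
from importlib.util import find_spec

# Assumed import names; the exact contents of each extra are not documented here.
OPTIONAL_MODULES = {
    "FAISS": "faiss",
    "Chroma": "chromadb",
    "LangChain OpenAI": "langchain_openai",
}


def installed_backends(modules: dict[str, str]) -> dict[str, bool]:
    """Report which top-level modules can be imported, without importing them."""
    return {label: find_spec(name) is not None for label, name in modules.items()}


if __name__ == "__main__":
    for label, available in installed_backends(OPTIONAL_MODULES).items():
        print(f"{label}: {'installed' if available else 'missing'}")
```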

Documentation

Supported Formats

Chunkin uses LangChain document loaders:

Format      Extensions
PDF         .pdf
Word        .docx, .doc
Text        .txt
Markdown    .md
CSV         .csv
Excel       .xlsx, .xls
PowerPoint  .pptx, .ppt
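
When processing a folder of mixed files, a caller may want to skip unsupported files up front. The following is a minimal sketch that copies the extension table above into a lookup. `FORMATS` and `detect_format` are hypothetical names for illustration and do not describe how Chunkin selects a loader internally.

```python
from pathlib import Path
from typing import Optional

# Extension-to-format mapping copied from the table above.
FORMATS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".txt": "Text",
    ".md": "Markdown",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is not listed."""
    return FORMATS.get(Path(path).suffix.lower())


paths = ["Report.PDF", "notes.md", "archive.zip"]
supported = [p for p in paths if detect_format(p) is not None]
```

Matching is case-insensitive on the extension, so `Report.PDF` is treated the same as `report.pdf`.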

Supported Vector Stores

Built on LangChain vector store integrations:

Local (No External Service)

FAISS, Chroma, Milvus, LanceDB, LambdaDB, Deep Lake, Annoy

Amazon AWS

OpenSearch, Valkey, DocumentDB

Microsoft Azure

Azure AI Search, Azure Cosmos DB, Azure Cosmos DB NoSQL

Google Cloud

Databricks Vector Search, Vertex AI Vector Search, BigQuery, AlloyDB

Other

Qdrant, Weaviate, Pinecone, MongoDB Atlas, PGVector, Astra DB, Elasticsearch, Oracle, Neo4j, SingleStore, Supabase, MyScale, Zilliz, Marqo, Vectara, Meilisearch, Typesense, and more...

See docs/indexer.md for full list.

Supported Chunking Strategies

Uses LangChain text splitters:

Strategy          LangChain Class                 Description
recursive         RecursiveCharacterTextSplitter  Recursively splits by paragraphs, sentences, words
character         CharacterTextSplitter           Simple character-based splitting
markdown          MarkdownTextSplitter            Markdown-aware splitting
markdown_headers  MarkdownHeaderTextSplitter      Split by markdown headers
html_headers      HTMLHeaderTextSplitter          Split by HTML header tags
semantic          SemanticChunker                 Embedding-based semantic splitting
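
To give a feel for the recursive strategy, here is a simplified pure-Python sketch of the idea: split on the coarsest separator first, and fall back to finer separators only for pieces that are still too long. It is not LangChain's `RecursiveCharacterTextSplitter`; the `recursive_split` function is a hypothetical illustration, it has no chunk overlap, and it measures size in characters.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Tries each separator in order (paragraphs, lines, words) and merges
    adjacent pieces greedily while they still fit in one chunk.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: flush what has been merged so far.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: retry this piece with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Intro paragraph.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "End."
)
pieces = recursive_split(sample, chunk_size=30)
```

Short paragraphs stay whole, while the long middle paragraph is broken at word boundaries, which is the behavior the "paragraphs, sentences, words" description points at.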

See docs/strategies.md for details.
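
As a rough illustration of header-based splitting, the sketch below groups Markdown lines under the most recent header. It is a simplified stand-in, not LangChain's `MarkdownHeaderTextSplitter`: `split_by_headers` is a hypothetical helper, it keeps only the nearest header instead of the full header hierarchy, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

# Matches ATX-style headers such as "# Title" or "### Details".
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Return one {'header', 'content'} dict per header-delimited section."""
    sections = []
    current = {"header": None, "content": []}
    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            # Close the previous section if it has a header or any text.
            if current["header"] is not None or current["content"]:
                sections.append(current)
            current = {"header": match.group(2).strip(), "content": []}
        else:
            current["content"].append(line)
    if current["header"] is not None or current["content"]:
        sections.append(current)
    return [
        {"header": s["header"], "content": "\n".join(s["content"]).strip()}
        for s in sections
    ]


doc = "# Title\nIntro text.\n## Setup\nRun the installer.\n"
sections = split_by_headers(doc)
```

Keeping the header next to each chunk preserves context that would otherwise be lost when a section is embedded on its own.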

Project Structure

chunkin/
├── chunkin/                 # Document chunking module
│   └── chunker.py          # DocumentChunker class
├── chunkin_indexer/         # Vector store indexing module
│   └── indexer.py          # DocIndexer class
├── chunkin_processor/       # Unified module
│   └── doc_processor.py    # DocProcessor class
├── docs/                    # MkDocs documentation
├── pyproject.toml           # Package configuration
└── README.md

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Build package
python -m build

# Serve docs locally
cd docs && pip install -r requirements.txt && mkdocs serve

License

MIT License
