Document chunking and indexing library for vector stores

Chunkin

A Python library for document chunking and indexing into vector stores, built on LangChain.

Built on LangChain

Chunkin leverages LangChain for:

  • Document Loaders: Load PDF, DOCX, TXT, MD, CSV, XLSX, and PPT files
  • Text Splitters: 6 chunking strategies including semantic chunking
  • Vector Stores: 50+ vector store integrations (FAISS, Chroma, Pinecone, etc.)

Learn more about LangChain's document processing capabilities.

Modules

Module             Description
chunkin            Document chunking using LangChain text splitters
chunkin_indexer    Index chunks to 50+ vector stores via LangChain integrations
chunkin_processor  Unified end-to-end processing

Quick Start

from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

processor.process_file("document.pdf")
results = processor.search("your query", k=3)
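
To give a feel for what a setting like chunk_size=500 controls, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not Chunkin's implementation: the function name chunk_text and the overlap parameter are hypothetical, and it assumes sizes are measured in characters.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only. Consecutive windows share
    `overlap` characters so that content cut at a boundary appears in both.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text, so no
        # trailing chunk is emitted that is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Smaller chunks tend to give more precise search hits but less context per hit; the overlap trades some duplicated storage for fewer ideas split across a boundary.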

Installation

# Core only
pip install chunkin

# With OpenAI + FAISS (recommended)
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores (Chroma, Milvus, LanceDB, etc.)
pip install "chunkin[local]"

# Specific cloud providers
pip install "chunkin[aws]"     # Amazon AWS
pip install "chunkin[azure]"   # Microsoft Azure
pip install "chunkin[gcp]"     # Google Cloud

# All vector stores
pip install "chunkin[all]"
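
Because extras are optional, it can help to check at runtime whether the modules behind an extra are importable before selecting a vector store. The sketch below uses only the standard library. The EXTRA_MODULES mapping is an assumption made for illustration (the actual contents of each extra are defined in pyproject.toml), and the helper names are hypothetical.

```python
from importlib.util import find_spec

# ASSUMPTION: illustrative mapping from extra name to top-level module names.
# The real dependency lists live in the package's pyproject.toml.
EXTRA_MODULES = {
    "core": ["faiss", "langchain_openai"],
    "semantic": ["langchain_experimental"],
}


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module is importable in this environment."""
    return find_spec(module_name) is not None


def missing_modules(extra: str) -> list[str]:
    """List the modules for an extra that are not importable right now."""
    return [name for name in EXTRA_MODULES.get(extra, []) if not is_installed(name)]
```

For example, a non-empty result from missing_modules("core") would suggest that the matching pip install command above has not been run in the active environment.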

Documentation

Supported Formats

Chunkin uses LangChain document loaders:

Format      Extensions
PDF         .pdf
Word        .docx, .doc
Text        .txt
Markdown    .md
CSV         .csv
Excel       .xlsx, .xls
PowerPoint  .pptx, .ppt
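
The table above can be turned into a small pre-flight check that rejects unsupported files before any loader runs. This is a sketch, not part of Chunkin: SUPPORTED_FORMATS and detect_format are hypothetical names, and the mapping simply mirrors the table.

```python
from pathlib import Path

# Mirrors the format table above; keys are lowercase file extensions.
SUPPORTED_FORMATS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".txt": "Text",
    ".md": "Markdown",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(path: str) -> str:
    """Return the format name for a path, based only on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return SUPPORTED_FORMATS[suffix]
```

Matching on the extension alone is cheap but does not validate file contents; a renamed file would still pass this check.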

Supported Vector Stores

Built on LangChain vector store integrations:

Local (No External Service)

FAISS, Chroma, Milvus, LanceDB, LambdaDB, Deep Lake, Annoy

Amazon AWS

OpenSearch, Valkey, DocumentDB

Microsoft Azure

Azure AI Search, Azure Cosmos DB, Azure Cosmos DB NoSQL

Google Cloud

Databricks Vector Search, Vertex AI Vector Search, BigQuery, AlloyDB

Other

Qdrant, Weaviate, Pinecone, MongoDB Atlas, PGVector, Astra DB, Elasticsearch, Oracle, Neo4j, SingleStore, Supabase, MyScale, Zilliz, Marqo, Vectara, Meilisearch, Typesense, and more...

See docs/indexer.md for the full list.
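
Whatever the backend, a vector store answers the same basic question: which stored vectors are closest to the query vector? The following pure-Python sketch shows a brute-force top-k search by cosine similarity. It is only a conceptual illustration with hypothetical function names; real stores use optimized indexes and may use a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    # Highest similarity first; only the score decides the order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]
```

In a real pipeline the vectors come from an embedding model applied to each chunk, and the returned indices are mapped back to the chunk text and metadata.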

Supported Chunking Strategies

Uses LangChain text splitters:

Strategy          LangChain Class                 Description
recursive         RecursiveCharacterTextSplitter  Recursively splits by paragraphs, sentences, words
character         CharacterTextSplitter           Simple character-based splitting
markdown          MarkdownTextSplitter            Markdown-aware splitting
markdown_headers  MarkdownHeaderTextSplitter      Split by markdown headers
html_headers      HTMLHeaderTextSplitter          Split by HTML header tags
semantic          SemanticChunker                 Embedding-based semantic splitting

See docs/strategies.md for details.
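
The idea behind the recursive strategy can be sketched in a few lines: try the coarsest separator first (paragraphs), and fall back to finer ones (sentences, then words) only for pieces that are still too long. The code below is a simplified illustration, not LangChain's RecursiveCharacterTextSplitter; the function name and the separator list are assumptions, and separators at chunk boundaries are dropped.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Simplified sketch: pieces are merged greedily while they fit, and any
    piece that is still too long is split again with the next separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size chars.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
            continue
        if current:
            chunks.append(current)  # flush what has been merged so far
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

With a chunk size of 20, a short two-paragraph text is first split at the blank line, then at sentence boundaries, and finally at spaces where a sentence alone exceeds the limit.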

Project Structure

chunkin/
├── chunkin/                 # Document chunking module
│   └── chunker.py          # DocumentChunker class
├── chunkin_indexer/         # Vector store indexing module
│   └── indexer.py          # DocIndexer class
├── chunkin_processor/       # Unified module
│   └── doc_processor.py    # DocProcessor class
├── docs/                    # MkDocs documentation
├── pyproject.toml           # Package configuration
└── README.md

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Build package
python -m build

# Serve docs locally
cd docs && pip install -r requirements.txt && mkdocs serve

License

MIT License
