Document chunking and indexing library for vector stores
Chunkin
A Python library for document chunking and indexing into vector stores, built on LangChain.
Built on LangChain
Chunkin leverages LangChain for:
- Document Loaders: Load PDF, DOCX, TXT, MD, CSV, XLSX, PPT formats
- Text Splitters: 6 chunking strategies including semantic chunking
- Vector Stores: 50+ vector store integrations (FAISS, Chroma, Pinecone, etc.)
Learn more about LangChain's document processing capabilities.
Modules
| Module | Description |
|---|---|
| chunkin | Document chunking using LangChain text splitters |
| chunkin_indexer | Index chunks to 50+ vector stores via LangChain integrations |
| chunkin_processor | Unified end-to-end processing |
Quick Start
from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings
processor = DocProcessor(
embeddings=OpenAIEmbeddings(),
vector_store_type="faiss",
chunk_size=500,
)
processor.process_file("document.pdf")
results = processor.search("your query", k=3)
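For intuition about what the quick start does, here is a toy, standard-library-only sketch of the chunk, embed, index, and search flow. It is not Chunkin's implementation: `ToyIndex`, `chunk_text`, `embed`, and `cosine` are hypothetical names, and the bag-of-words "embedding" only stands in for a real model such as `OpenAIEmbeddings`.

```python
import math
from collections import Counter


def chunk_text(text: str, chunk_size: int) -> list:
    """Split text into fixed-size character windows (toy stand-in for a text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real embeddings are dense float vectors."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical in-memory index: stores (chunk, vector) pairs."""

    def __init__(self):
        self.entries = []

    def add(self, chunks):
        for chunk in chunks:
            self.entries.append((chunk, embed(chunk)))

    def search(self, query: str, k: int = 3):
        """Return the k stored chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


index = ToyIndex()
index.add(["cats chase mice", "dogs chase cats", "stocks fell today"])
print(index.search("stocks today", k=1))
```

A real vector store replaces the linear scan in `search` with an approximate nearest-neighbour index, which is why the store type is configurable.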
Installation
# Core only
pip install chunkin
# With OpenAI + FAISS (recommended)
pip install "chunkin[core]"
# With semantic chunking
pip install "chunkin[semantic]"
# Local vector stores (Chroma, Milvus, LanceDB, etc.)
pip install "chunkin[local]"
# Specific cloud providers
pip install "chunkin[aws]" # Amazon AWS
pip install "chunkin[azure]" # Microsoft Azure
pip install "chunkin[gcp]" # Google Cloud
# All vector stores
pip install "chunkin[all]"
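After installing an extra, it can be handy to confirm that the optional backends are importable before building a pipeline. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and which import names each extra actually provides is an assumption (`faiss` and `chromadb` are the usual import names for FAISS and Chroma).

```python
from importlib.util import find_spec


def missing_modules(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


# Assumed import names; adjust to the backends you actually installed.
wanted = ["faiss", "chromadb"]
missing = missing_modules(wanted)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All optional backends found")
```

Only top-level names are checked here, because `find_spec` imports parent packages when given a dotted name.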
Documentation
Supported Formats
Chunkin uses LangChain document loaders:
| Format | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx, .doc |
| Text | .txt |
| Markdown | .md |
| CSV | .csv |
| Excel | .xlsx, .xls |
| PowerPoint | .pptx, .ppt |
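When batching files, it can help to filter out unsupported extensions up front. This is a minimal sketch that mirrors the format list above; `SUPPORTED_EXTENSIONS` and `detect_format` are hypothetical names, not part of Chunkin's API.

```python
from pathlib import Path
from typing import Optional

# Extension-to-format mapping taken from the supported formats list.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".txt": "Text",
    ".md": "Markdown",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is unsupported."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


for name in ["report.PDF", "notes.md", "image.png"]:
    print(name, "->", detect_format(name))
```

The lookup is case-insensitive, so `report.PDF` resolves the same way as `report.pdf`.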
Supported Vector Stores
Built on LangChain vector store integrations:
Local (No External Service)
FAISS, Chroma, Milvus, LanceDB, LambdaDB, Deep Lake, Annoy
Amazon AWS
OpenSearch, Valkey, DocumentDB
Microsoft Azure
Azure AI Search, Azure Cosmos DB, Azure Cosmos DB NoSQL
Google Cloud
Databricks Vector Search, Vertex AI Vector Search, BigQuery, AlloyDB
Other
Qdrant, Weaviate, Pinecone, MongoDB Atlas, PGVector, Astra DB, Elasticsearch, Oracle, Neo4j, SingleStore, Supabase, MyScale, Zilliz, Marqo, Vectara, Meilisearch, Typesense, and more...
See docs/indexer.md for full list.
Supported Chunking Strategies
Uses LangChain text splitters:
| Strategy | LangChain Class | Description |
|---|---|---|
| recursive | RecursiveCharacterTextSplitter | Recursively splits by paragraphs, sentences, words |
| character | CharacterTextSplitter | Simple character-based splitting |
| markdown | MarkdownTextSplitter | Markdown-aware splitting |
| markdown_headers | MarkdownHeaderTextSplitter | Split by markdown headers |
| html_headers | HTMLHeaderTextSplitter | Split by HTML header tags |
| semantic | SemanticChunker | Embedding-based semantic splitting |
See docs/strategies.md for details.
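To illustrate the idea behind the recursive strategy, here is a simplified, standard-library-only sketch. It is not LangChain's `RecursiveCharacterTextSplitter`: the hypothetical `recursive_split` below drops separators at chunk boundaries and has no overlap support. It only shows the core technique of trying coarse separators first and falling back to finer ones when a piece is still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried from coarsest (paragraphs) to finest (characters).
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        # Greedily merge neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "alpha beta\n\ngamma delta epsilon\n\nzeta"
print(recursive_split(text, chunk_size=12))
# → ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

The first paragraph fits as is, while the second is too long and gets re-split on spaces, which is why paragraph boundaries are preserved wherever possible.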
Project Structure
chunkin/
├── chunkin/ # Document chunking module
│ └── chunker.py # DocumentChunker class
├── chunkin_indexer/ # Vector store indexing module
│ └── indexer.py # DocIndexer class
├── chunkin_processor/ # Unified module
│ └── doc_processor.py # DocProcessor class
├── docs/ # MkDocs documentation
├── pyproject.toml # Package configuration
└── README.md
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Build package
python -m build
# Serve docs locally
cd docs && pip install -r requirements.txt && mkdocs serve
License
MIT License