Document chunking and indexing library for vector stores

Chunkin

A Python library for document chunking and indexing into vector stores, built on LangChain.

Built on LangChain

Chunkin leverages LangChain for:

  • Document Loaders: Load PDF, DOCX, TXT, MD, CSV, XLSX, and PPTX files (plus legacy DOC, XLS, and PPT)
  • Text Splitters: 6 chunking strategies including semantic chunking
  • Vector Stores: 50+ vector store integrations (FAISS, Chroma, Pinecone, etc.)

Learn more about LangChain's document processing capabilities.

Modules

| Module | Description |
|---|---|
| chunkin | Document chunking using LangChain text splitters |
| chunkin_indexer | Index chunks to 50+ vector stores via LangChain integrations |
| chunkin_processor | Unified end-to-end processing |

Quick Start

from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

processor.process_file("document.pdf")
results = processor.search("your query", k=3)
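
For intuition about what a top-k search such as `search("your query", k=3)` does conceptually, here is a minimal brute-force sketch in plain Python. It is not Chunkin's or LangChain's implementation; real vector stores such as FAISS use optimized indexes, and the helper names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings" standing in for real embedding vectors.
chunks = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
best = top_k([1.0, 0.1], chunks, k=2)
print(best)
```

In practice the embedding model passed to `DocProcessor` produces the vectors, and the configured vector store performs the ranking.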

Installation

# Core only
pip install chunkin

# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern

# With OpenAI + FAISS (recommended)
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores (Chroma, Milvus, LanceDB, etc.)
pip install "chunkin[local]"

# Specific cloud providers
pip install "chunkin[aws]"     # Amazon AWS
pip install "chunkin[azure]"   # Microsoft Azure
pip install "chunkin[gcp]"     # Google Cloud

# All vector stores
pip install "chunkin[all]"
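
Because most integrations are optional extras, it can help to check which dependencies are importable before running the Quick Start. The following is a small sketch using only the standard library. The import names are assumptions: `langchain_openai` is taken from the Quick Start, and `faiss` is the usual import name for the FAISS Python bindings. The helper `missing_modules` is hypothetical and not part of Chunkin.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Assumed import names for the Quick Start dependencies (see the note above).
needed = ["langchain_openai", "faiss"]
missing = missing_modules(needed)

if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("All Quick Start dependencies are importable.")
```

If something is reported missing, installing the matching extra from the list above is the likely fix.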

Documentation

Supported Formats

Chunkin uses LangChain document loaders:

| Format | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx, .doc |
| Text | .txt |
| Markdown | .md |
| CSV | .csv |
| Excel | .xlsx, .xls |
| PowerPoint | .pptx, .ppt |
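
When processing a whole directory, it can be useful to filter out unsupported files up front. The sketch below encodes the table above as a lookup. The names `FORMATS`, `detect_format`, and `partition` are hypothetical helpers for illustration, not part of Chunkin's API.

```python
from pathlib import Path

# Extension-to-format mapping copied from the table above.
FORMATS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".txt": "Text",
    ".md": "Markdown",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return FORMATS.get(Path(path).suffix.lower())


def partition(paths):
    """Split paths into (supported, skipped) lists, preserving order."""
    supported = [p for p in paths if detect_format(p)]
    skipped = [p for p in paths if not detect_format(p)]
    return supported, skipped


supported, skipped = partition(["report.PDF", "notes.md", "setup.exe"])
print("supported:", supported)
print("skipped:", skipped)
```

The supported paths could then be passed one at a time to `processor.process_file(...)` from the Quick Start.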

Supported Vector Stores

Built on LangChain vector store integrations:

Local (No External Service)

FAISS, Chroma, Milvus, LanceDB, LambdaDB, Deep Lake, Annoy

Amazon AWS

OpenSearch, Valkey, DocumentDB

Microsoft Azure

Azure AI Search, Azure Cosmos DB, Azure Cosmos DB NoSQL

Google Cloud

Databricks Vector Search, Vertex AI Vector Search, BigQuery, AlloyDB

Other

Qdrant, Weaviate, Pinecone, MongoDB Atlas, PGVector, Astra DB, Elasticsearch, Oracle, Neo4j, SingleStore, Supabase, MyScale, Zilliz, Marqo, Vectara, Meilisearch, Typesense, and more...

See docs/indexer.md for full list.

Supported Chunking Strategies

Chunkin uses LangChain text splitters:

| Strategy | LangChain Class | Description |
|---|---|---|
| recursive | RecursiveCharacterTextSplitter | Recursively splits by paragraphs, sentences, words |
| character | CharacterTextSplitter | Simple character-based splitting |
| markdown | MarkdownTextSplitter | Markdown-aware splitting |
| markdown_headers | MarkdownHeaderTextSplitter | Split by markdown headers |
| html_headers | HTMLHeaderTextSplitter | Split by HTML header tags |
| semantic | SemanticChunker | Embedding-based semantic splitting |

See docs/strategies.md for details.
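
To illustrate the idea behind the recursive strategy (paragraphs first, then sentences, then words), here is a simplified pure-Python sketch. It is not LangChain's `RecursiveCharacterTextSplitter`: it has no chunk overlap, it drops the separator wherever a chunk boundary falls, and the function name `recursive_split` and its default separators are assumptions made for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    The coarsest separator is tried first; finer separators are used only
    for pieces that are still too long. Small neighbouring pieces are merged.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta. Gamma delta.\n\nEpsilon zeta eta theta."
for chunk in recursive_split(sample, chunk_size=20):
    print(repr(chunk))
```

The design point is that coarse boundaries are preferred, so chunks tend to follow paragraph and sentence structure, and character-level cuts happen only as a last resort.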

Project Structure

chunkin/
├── chunkin/                 # Document chunking module
│   └── chunker.py          # DocumentChunker class
├── chunkin_indexer/         # Vector store indexing module
│   └── indexer.py          # DocIndexer class
├── chunkin_processor/       # Unified module
│   └── doc_processor.py    # DocProcessor class
├── docs/                    # MkDocs documentation
├── pyproject.toml           # Package configuration
└── README.md

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Build package
python -m build

# Serve docs locally
cd docs && pip install -r requirements.txt && mkdocs serve

License

MIT License

