Document chunking and indexing library for vector stores

Chunkin

A Python library for document chunking and indexing into vector stores, built on LangChain.

Built on LangChain

Chunkin leverages LangChain for:

  • Document Loaders: Load PDF, Word, text, Markdown, CSV, Excel, and PowerPoint files
  • Text Splitters: 6 chunking strategies including semantic chunking
  • Vector Stores: 50+ vector store integrations (FAISS, Chroma, Pinecone, etc.)

Learn more about LangChain's document processing capabilities.

Modules

Module             Description
chunkin            Document chunking using LangChain text splitters
chunkin_indexer    Index chunks to 50+ vector stores via LangChain integrations
chunkin_processor  Unified end-to-end processing

Quick Start

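from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable.
processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

# Chunk and index the file, then query the index for the top 3 results.
processor.process_file("document.pdf")
results = processor.search("your query", k=3)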
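
To make the chunk, embed, index, and search flow behind the Quick Start more concrete, here is a minimal standard-library sketch. It is a conceptual illustration only and not how Chunkin or LangChain are implemented: `ToyIndex`, `split_fixed`, `embed`, and `cosine` are hypothetical names, the "embedding" is a plain word count standing in for a real embedding model, and the in-memory list stands in for a vector store.

```python
import math
import re
from collections import Counter


def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyIndex:
    """Hypothetical in-memory index: stores (chunk, vector) pairs."""

    def __init__(self, chunk_size: int = 500) -> None:
        self.chunk_size = chunk_size
        self.entries: list[tuple[str, Counter]] = []

    def add_text(self, text: str) -> int:
        """Chunk the text, embed each chunk, and store it. Returns the chunk count."""
        chunks = split_fixed(text, self.chunk_size)
        self.entries.extend((chunk, embed(chunk)) for chunk in chunks)
        return len(chunks)

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored chunks most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]
```

In this sketch, calling `add_text` on each document and then `search("embeddings search", k=1)` returns the stored chunk that shares the most vocabulary with the query. A real pipeline replaces the word counts with dense vectors from an embedding model and the list scan with a vector store lookup.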

Installation

# Core only
pip install chunkin

# With OpenAI + FAISS (recommended)
# Quote the extras so shells such as zsh do not expand the brackets.
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores (Chroma, Milvus, LanceDB, etc.)
pip install "chunkin[local]"

# Specific cloud providers
pip install "chunkin[aws]"     # Amazon AWS
pip install "chunkin[azure]"   # Microsoft Azure
pip install "chunkin[gcp]"     # Google Cloud

# All vector stores
pip install "chunkin[all]"
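
Because the extras pull in optional dependencies, it can help to check which backend packages are importable before choosing a vector store. The sketch below is an assumption-laden helper, not part of Chunkin: `BACKEND_MODULES`, `is_installed`, and `available_backends` are hypothetical names, and the mapping assumes the usual import names of the underlying client packages (`faiss`, `chromadb`, `pymilvus`, `lancedb`). Which packages each Chunkin extra actually installs is not stated here.

```python
from importlib.util import find_spec

# Assumed mapping from a vector store label to the module that backs it.
BACKEND_MODULES = {
    "faiss": "faiss",
    "chroma": "chromadb",
    "milvus": "pymilvus",
    "lancedb": "lancedb",
}


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found in this environment."""
    return find_spec(module_name) is not None


def available_backends(modules: dict[str, str] = BACKEND_MODULES) -> list[str]:
    """List the labels whose backing module is importable, sorted by name."""
    return sorted(label for label, module in modules.items() if is_installed(module))
```

Calling `available_backends()` after installation gives a quick view of which local stores are usable without importing the (sometimes heavy) packages themselves.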

Documentation

Supported Formats

Chunkin uses LangChain document loaders:

Format      Extensions
PDF         .pdf
Word        .docx, .doc
Text        .txt
Markdown    .md
CSV         .csv
Excel       .xlsx, .xls
PowerPoint  .pptx, .ppt
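
The table above amounts to a lookup from file extension to format. A minimal sketch of that lookup follows. `FORMAT_BY_EXTENSION` and `detect_format` are hypothetical names used for illustration, and this is not necessarily how Chunkin selects a LangChain loader internally.

```python
from pathlib import Path

# Extension-to-format mapping copied from the Supported Formats table.
FORMAT_BY_EXTENSION = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".doc": "Word",
    ".txt": "Text",
    ".md": "Markdown",
    ".csv": "CSV",
    ".xlsx": "Excel",
    ".xls": "Excel",
    ".pptx": "PowerPoint",
    ".ppt": "PowerPoint",
}


def detect_format(path: str) -> str:
    """Return the format name for a path, matching the extension case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None
```

A check like this lets a caller reject unsupported files up front instead of failing partway through processing.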

Supported Vector Stores

Built on LangChain vector store integrations:

Local (No External Service)

FAISS, Chroma, Milvus, LanceDB, LambdaDB, Deep Lake, Annoy

Amazon AWS

OpenSearch, Valkey, DocumentDB

Microsoft Azure

Azure AI Search, Azure Cosmos DB, Azure Cosmos DB NoSQL

Google Cloud

Databricks Vector Search, Vertex AI Vector Search, BigQuery, AlloyDB

Other

Qdrant, Weaviate, Pinecone, MongoDB Atlas, PGVector, Astra DB, Elasticsearch, Oracle, Neo4j, SingleStore, Supabase, MyScale, Zilliz, Marqo, Vectara, Meilisearch, Typesense, and more...

See docs/indexer.md for full list.

Supported Chunking Strategies

Uses LangChain text splitters:

Strategy          LangChain Class                 Description
recursive         RecursiveCharacterTextSplitter  Recursively splits by paragraphs, sentences, words
character         CharacterTextSplitter           Simple character-based splitting
markdown          MarkdownTextSplitter            Markdown-aware splitting
markdown_headers  MarkdownHeaderTextSplitter      Split by markdown headers
html_headers      HTMLHeaderTextSplitter          Split by HTML header tags
semantic          SemanticChunker                 Embedding-based semantic splitting

See docs/strategies.md for details.
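
To illustrate the idea behind the recursive strategy (try the coarsest separator first, and fall back to finer ones only for pieces that are still too long), here is a simplified standard-library sketch. It is not LangChain's `RecursiveCharacterTextSplitter`: the function name `recursive_split` and the separator list are chosen for illustration, and features such as chunk overlap are left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    Pieces are merged greedily on the first separator; any piece that is
    still too long is split again using the remaining, finer separators.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)  # close the chunk built so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

With a two-paragraph input and a small `chunk_size`, the first paragraph stays whole if it fits, while the longer second paragraph is broken at word boundaries. Every returned chunk respects the size limit.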

Project Structure

chunkin/
├── chunkin/                 # Document chunking module
│   └── chunker.py          # DocumentChunker class
├── chunkin_indexer/         # Vector store indexing module
│   └── indexer.py          # DocIndexer class
├── chunkin_processor/       # Unified module
│   └── doc_processor.py    # DocProcessor class
├── docs/                    # MkDocs documentation
├── pyproject.toml           # Package configuration
└── README.md

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Build package
python -m build

# Serve docs locally
cd docs && pip install -r requirements.txt && mkdocs serve

License

MIT License
