Document chunking and indexing library for vector stores

Chunkin

A Python library for document chunking and indexing into vector stores, built on LangChain.

Built on LangChain

Chunkin leverages LangChain for:

  • Document Loaders: Load PDF, DOCX, TXT, MD, CSV, XLSX, and PPTX files (legacy .doc, .xls, and .ppt are also listed under Supported Formats)
  • Text Splitters: 6 chunking strategies including semantic chunking
  • Vector Stores: 50+ vector store integrations (FAISS, Chroma, Pinecone, etc.)

Learn more about LangChain's document processing capabilities.

Modules

| Module | Description |
|---|---|
| chunkin | Document chunking using LangChain text splitters |
| chunkin_indexer | Index chunks to 50+ vector stores via LangChain integrations |
| chunkin_processor | Unified end-to-end processing |

Quick Start

from chunkin_processor import DocProcessor
from langchain_openai import OpenAIEmbeddings

processor = DocProcessor(
    embeddings=OpenAIEmbeddings(),
    vector_store_type="faiss",
    chunk_size=500,
)

processor.process_file("document.pdf")
results = processor.search("your query", k=3)
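
The Quick Start indexes a single file. As a minimal sketch of indexing several files with one processor, the helper below relies only on the process_file call shown above. process_many is a hypothetical name and not part of Chunkin, and since the exceptions process_file may raise are not documented here, the sketch catches Exception broadly.

```python
def process_many(processor, paths):
    """Call processor.process_file on each path and collect failures.

    Returns a dict mapping each failed path to the exception it raised,
    so one unreadable file does not stop the rest of the batch.
    """
    failed = {}
    for path in paths:
        try:
            processor.process_file(str(path))
        except Exception as exc:  # assumption: failure types are undocumented
            failed[str(path)] = exc
    return failed
```

With the processor from the Quick Start, this could be called as process_many(processor, ["a.pdf", "b.docx"]) before running processor.search.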

Installation

# Core only
pip install chunkin

# Extras are quoted so that shells such as zsh do not expand the brackets

# With OpenAI + FAISS (recommended)
pip install "chunkin[core]"

# With semantic chunking
pip install "chunkin[semantic]"

# Local vector stores (Chroma, Milvus, LanceDB, etc.)
pip install "chunkin[local]"

# Specific cloud providers
pip install "chunkin[aws]"     # Amazon AWS
pip install "chunkin[azure]"   # Microsoft Azure
pip install "chunkin[gcp]"     # Google Cloud

# All vector stores
pip install "chunkin[all]"
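
Because most integrations live behind optional extras, it can help to check which backend packages are importable before choosing a vector store. The sketch below uses only the standard library. missing_modules is a hypothetical helper, and the module names in the usage note (faiss, langchain_openai, chromadb) are assumptions about what the extras install, not something this README states.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]
```

For example, missing_modules(["faiss", "langchain_openai", "chromadb"]) returns whichever of those packages are absent from the current environment.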

Documentation

Supported Formats

Chunkin uses LangChain document loaders:

| Format | Extensions |
|---|---|
| PDF | .pdf |
| Word | .docx, .doc |
| Text | .txt |
| Markdown | .md |
| CSV | .csv |
| Excel | .xlsx, .xls |
| PowerPoint | .pptx, .ppt |
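
The extension list above can be used to filter input files before processing them. This is a minimal sketch: SUPPORTED_EXTENSIONS and is_supported are hypothetical names, not part of Chunkin, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions copied from the Supported Formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".md",
    ".csv", ".xlsx", ".xls", ".pptx", ".ppt",
}


def is_supported(path):
    """Return True if the file's final extension is in the supported list."""
    # Only the last suffix is checked, so "notes.tar.gz" is judged by ".gz".
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A directory could then be filtered with [p for p in Path("docs_in").iterdir() if is_supported(p)], where docs_in is a placeholder directory name.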

Supported Vector Stores

Built on LangChain vector store integrations:

Local (No External Service)

FAISS, Chroma, Milvus, LanceDB, LambdaDB, Deep Lake, Annoy

Amazon AWS

OpenSearch, Valkey, DocumentDB

Microsoft Azure

Azure AI Search, Azure Cosmos DB, Azure Cosmos DB NoSQL

Google Cloud

Databricks Vector Search, Vertex AI Vector Search, BigQuery, AlloyDB

Other

Qdrant, Weaviate, Pinecone, MongoDB Atlas, PGVector, Astra DB, Elasticsearch, Oracle, Neo4j, SingleStore, Supabase, MyScale, Zilliz, Marqo, Vectara, Meilisearch, Typesense, and more...

See docs/indexer.md for the full list.

Supported Chunking Strategies

Uses LangChain text splitters:

| Strategy | LangChain Class | Description |
|---|---|---|
| recursive | RecursiveCharacterTextSplitter | Recursively splits by paragraphs, sentences, words |
| character | CharacterTextSplitter | Simple character-based splitting |
| markdown | MarkdownTextSplitter | Markdown-aware splitting |
| markdown_headers | MarkdownHeaderTextSplitter | Split by markdown headers |
| html_headers | HTMLHeaderTextSplitter | Split by HTML header tags |
| semantic | SemanticChunker | Embedding-based semantic splitting |

See docs/strategies.md for details.
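
To illustrate the idea behind the recursive strategy, here is a simplified pure-Python sketch. It is not LangChain's RecursiveCharacterTextSplitter and not Chunkin's code: it has no chunk overlap, it drops the separator at each chunk boundary, and the separator list is an assumption. recursive_split is a hypothetical name.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first (paragraphs) and only falls back
    to finer ones (lines, sentences, words) for pieces that are too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing parts into this chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # The part alone is too long: retry with finer separators.
                chunks.extend(recursive_split(part, chunk_size, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to fixed-width cuts.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Short paragraphs stay whole, while a paragraph longer than chunk_size is broken at line, sentence, or word boundaries before any fixed-width cut is made.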

Project Structure

chunkin/
├── chunkin/                 # Document chunking module
│   └── chunker.py          # DocumentChunker class
├── chunkin_indexer/         # Vector store indexing module
│   └── indexer.py          # DocIndexer class
├── chunkin_processor/       # Unified module
│   └── doc_processor.py    # DocProcessor class
├── docs/                    # MkDocs documentation
├── pyproject.toml           # Package configuration
└── README.md

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Build package
python -m build

# Serve docs locally
cd docs && pip install -r requirements.txt && mkdocs serve

License

MIT License
