COSMIC: COncept-aware Semantic Meta-chunking with Intelligent Classification
A production-ready intelligent text chunking framework for Retrieval-Augmented Generation (RAG) systems.
Developed by Manceps Research Division
Research Objectives
COSMIC addresses fundamental limitations in existing text chunking approaches for RAG systems.
Problem Statement
Current chunking methods suffer from three critical issues:
- Semantic Fragmentation - Fixed-length chunkers split mid-concept, breaking coherent ideas
- Context Loss - Simple overlap strategies create redundancy without preserving meaning
- Domain Blindness - One-size-fits-all approaches ignore domain-specific structure
Our Approach
COSMIC introduces a 6-stage pipeline that combines:
- Discourse Coherence Scoring (DCS) - Multi-signal boundary detection using topical coherence, coreference density, and discourse markers
- MST-based Domain Clustering - Minimum spanning tree clustering for domain classification
- Adaptive Boundary Fusion - Weighted combination of structural and semantic signals
- LLM Verification - Optional verification of uncertain boundaries
- Zero-Overlap Architecture - Self-contained conceptual chunks without redundant overlap
Target Metrics
| Metric | Target | Description |
|---|---|---|
| Coherence Score | > 0.85 | Semantic unity within chunks |
| Cross-Concept Splits | < 5% | Chunks that break conceptual boundaries |
| Latency | < 150ms/page | Processing speed |
| Fallback Rate | < 15% | Graceful degradation frequency |
Installation
Prerequisites
- Python 3.10+
- CUDA-capable GPU (recommended) or CPU
- 8GB+ RAM
Install from Source
# Clone the repository
git clone https://github.com/manceps/cosmic.git
cd cosmic
# Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/macOS
# or: venv\Scripts\activate # Windows
# Install with all dependencies
pip install -e ".[all]"
# Install spaCy model for coreference resolution
python -m spacy download en_core_web_trf
Docker Installation
# Build container
docker build -t cosmic:latest .
# Run with GPU support
docker run --gpus all -v "$(pwd)":/workspace cosmic:latest
Configuration
Environment Variables
Create a .env file in the project root (see .env.example):
# LLM Provider: "openai", "ollama", or "auto"
COSMIC_LLM_PROVIDER=openai
# LLM endpoint for Stage 5 verification (OpenAI-compatible API)
COSMIC_LLM_URL=http://localhost:8000/v1
COSMIC_LLM_MODEL=default
# Ollama configuration (when using provider=ollama)
OLLAMA_HOST=http://localhost:11434
COSMIC_OLLAMA_MODEL=auto # "auto" or specific model
# Embedding computation device
COSMIC_EMBEDDING_DEVICE=cuda # Options: cuda, cpu, mps
Using Ollama for LLM Verification
COSMIC integrates with Ollama for local LLM verification. The CLI can automatically detect, start, and stop Ollama:
# Auto-detect and use the best available model
cosmic chunk document.txt --strategy full --ollama auto
# Use a specific model
cosmic chunk document.txt --strategy full --ollama gemma3:latest
# Check Ollama status and available models
cosmic ollama status
cosmic ollama list
When using --ollama:
- COSMIC checks if Ollama is installed and has models available
- If the server isn't running, it starts automatically
- After chunking completes, the server is stopped (if COSMIC started it)
Recommended models for verification (in order of preference):
- gemma3 - Fast, good quality (3.3 GB)
- qwen2.5-coder:7b - Good balance (4.7 GB)
- llama3.2 - Versatile (various sizes)
Configuration Files
Default configuration: configs/default.yaml
dcs:
  alpha: 0.4   # Topical coherence weight
  beta: 0.35   # Coreference density weight
  gamma: 0.25  # Discourse signal weight
structure:
  heading_weight: 0.4
  list_weight: 0.3
  table_weight: 0.3
fusion:
  structural_weight: 0.6
  semantic_weight: 0.4
  acceptance_threshold: 0.5
chunk_constraints:
  min_tokens: 100
  max_tokens: 512
  target_tokens: 350
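In the defaults above, each weight group adds up to 1.0 (0.4 + 0.35 + 0.25, 0.4 + 0.3 + 0.3, and 0.6 + 0.4). The source does not say whether COSMIC enforces this, so the following is only a sketch of a sanity check one might run before using a custom configuration. The helper name `check_weight_groups` and the plain-dictionary layout are assumptions for illustration, not part of the COSMIC API.

```python
import math

# Default weights copied from configs/default.yaml as listed above.
DEFAULT_WEIGHTS = {
    "dcs": {"alpha": 0.4, "beta": 0.35, "gamma": 0.25},
    "structure": {"heading_weight": 0.4, "list_weight": 0.3, "table_weight": 0.3},
    "fusion": {"structural_weight": 0.6, "semantic_weight": 0.4},
}


def check_weight_groups(groups: dict[str, dict[str, float]]) -> list[str]:
    """Return the names of weight groups whose values do not sum to 1.0.

    Hypothetical helper: it assumes (not confirmed by the source) that
    every group is meant to be a convex combination.
    """
    bad = []
    for name, weights in groups.items():
        if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
            bad.append(name)
    return bad


print(check_weight_groups(DEFAULT_WEIGHTS))
```

If the weights are changed so that a group no longer sums to 1.0, the helper reports that group by name, which makes a mistyped value easy to spot before a long chunking run.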
Domain taxonomy: configs/taxonomies/default.yaml
Defines domain-specific terminology and patterns for classification.
Usage
Basic Usage
from cosmic import COSMICChunker, Document
# Initialize chunker with default configuration
chunker = COSMICChunker()
# Create document from text
doc = Document.from_text("""
Your document text here. COSMIC will analyze the structure,
detect semantic boundaries, and create coherent chunks.
""")
# Chunk with automatic strategy selection
chunks = chunker.chunk_document(doc, strategy="auto")
# Access chunk data
for chunk in chunks:
    print(f"Domain: {chunk.domain}")
    print(f"Coherence: {chunk.coherence_score:.2f}")
    print(f"Text: {chunk.text[:100]}...")
    print("---")
Strategy Selection
# Full 6-stage pipeline (highest quality)
chunks = chunker.chunk_document(doc, strategy="full")
# Semantic-only (faster, DCS without structure analysis)
chunks = chunker.chunk_document(doc, strategy="semantic")
# Sliding window (basic similarity-based)
chunks = chunker.chunk_document(doc, strategy="sliding")
# Fixed-length (fastest, token-based splitting)
chunks = chunker.chunk_document(doc, strategy="fixed")
# Auto (recommended) - selects based on document structure
chunks = chunker.chunk_document(doc, strategy="auto")
Batch Processing
from cosmic import BatchProcessor, Document, COSMICConfig
# Initialize batch processor
processor = BatchProcessor(
    config=COSMICConfig(),
    max_workers=4,
)
# Process multiple documents (`texts` is your own list of strings)
documents = [Document.from_text(text) for text in texts]
result = processor.process(documents, strategy="auto", show_progress=True)
print(f"Processed: {result.documents_processed}")
print(f"Failed: {result.documents_failed}")
print(f"Total chunks: {result.total_chunks}")
for doc_id, chunks in result.chunks_by_document.items():
    print(f"Document {doc_id}: {len(chunks)} chunks")
Custom Configuration
from cosmic import COSMICChunker, COSMICConfig
from cosmic.core.config import DCSConfig, ChunkConstraints
# Create custom configuration
config = COSMICConfig(
    dcs=DCSConfig(
        alpha=0.5,  # Increase topical coherence weight
        beta=0.3,
        gamma=0.2,
    ),
    chunk_constraints=ChunkConstraints(
        min_tokens=50,
        max_tokens=1024,
        target_tokens=512,
    ),
)
chunker = COSMICChunker(config=config)
Loading from YAML
from cosmic import COSMICChunker, COSMICConfig
config = COSMICConfig.from_yaml("configs/custom.yaml")
chunker = COSMICChunker(config=config)
Architecture
6-Stage Pipeline
Document → Structure Analysis → Semantic Boundaries → Domain Classification
                                                              ↓
           Reference Linking ← LLM Verification ← Boundary Fusion
                   ↓
           COSMICChunks (with rich metadata)
Stage 1: Structure Analysis
- Detects headings, lists, tables, and other structural elements
- Computes structure score (0-1)
- Selects processing pathway based on document structure
Stage 2: Semantic Boundary Detection
- Computes Discourse Coherence Score (DCS) between sentences
- Identifies candidate boundaries where coherence drops
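One way to read "where coherence drops" is as a search for low points in the sequence of sentence-to-sentence scores. The sketch below takes that reading; the local-minimum rule, the `threshold` value of 0.5, and the function name `candidate_boundaries` are assumptions made for illustration, and the source does not describe the rule COSMIC actually applies.

```python
def candidate_boundaries(dcs_scores: list[float], threshold: float = 0.5) -> list[int]:
    """Return indices i where a boundary is proposed after sentence i.

    dcs_scores[i] is the coherence between sentence i and sentence i + 1.
    A gap is a candidate when its score is below `threshold` and is not
    higher than either neighbouring gap (a local minimum).
    """
    candidates = []
    for i, score in enumerate(dcs_scores):
        # Treat the missing neighbours at both ends as infinitely coherent.
        left = dcs_scores[i - 1] if i > 0 else float("inf")
        right = dcs_scores[i + 1] if i + 1 < len(dcs_scores) else float("inf")
        if score < threshold and score <= left and score <= right:
            candidates.append(i)
    return candidates


# Eight gaps between nine sentences; two clear dips in coherence.
scores = [0.82, 0.79, 0.31, 0.76, 0.88, 0.45, 0.40, 0.81]
print(candidate_boundaries(scores))  # → [2, 6]
```

Requiring a local minimum keeps a slow decline (0.45 followed by 0.40 here) from producing two adjacent boundaries.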
Stage 3: Domain Classification
- Uses MST-based clustering on chunk embeddings
- Matches clusters to domain taxonomy
- Assigns domain labels to chunks
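The source names MST-based clustering without giving details, so the following is a generic sketch of the technique rather than COSMIC's implementation. It runs Kruskal's algorithm over pairwise Euclidean distances and stops at a length cutoff, which leaves the same groups as building the full minimum spanning tree and deleting its long edges. The distance metric, the `max_edge` cutoff, and the toy 2-D points standing in for embeddings are all assumptions.

```python
import math
from itertools import combinations


def mst_clusters(points: list[tuple[float, ...]], max_edge: float) -> list[int]:
    """Label points by cutting minimum-spanning-tree edges longer than `max_edge`."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first (Kruskal's ordering).
    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    for length, a, b in edges:
        if length > max_edge:
            break  # every remaining edge is longer, so it would be cut anyway
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # this edge belongs to the MST

    # Relabel roots as consecutive cluster ids in order of first appearance.
    labels: dict[int, int] = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(mst_clusters(points, max_edge=1.0))  # → [0, 0, 0, 1, 1]
```

Matching each resulting cluster to a taxonomy entry is a separate step that this sketch does not attempt.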
Stage 4: Boundary Fusion
- Merges structural (weight: 0.6) and semantic (weight: 0.4) signals
- Applies acceptance threshold filtering
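A minimal sketch of the fusion step, assuming both signals are scores in [0, 1], that they are combined linearly, and that a boundary is kept when the fused score is at least the acceptance threshold. The weights and the 0.5 threshold are the documented defaults; the function name `fuse_boundary` and the `>=` comparison are assumptions.

```python
def fuse_boundary(
    structural: float,
    semantic: float,
    structural_weight: float = 0.6,
    semantic_weight: float = 0.4,
    acceptance_threshold: float = 0.5,
) -> tuple[float, bool]:
    """Return the fused score and whether the boundary passes the threshold."""
    fused = structural_weight * structural + semantic_weight * semantic
    return fused, fused >= acceptance_threshold


# A heading (strong structural signal) with only a weak semantic shift.
print(fuse_boundary(structural=1.0, semantic=0.2))
# A strong semantic shift with no structural marker at all.
print(fuse_boundary(structural=0.0, semantic=0.9))
```

Under these assumptions, a purely semantic boundary can score at most 0.4 with the default weights and so never clears the 0.5 threshold by itself, which is worth keeping in mind when tuning `semantic_weight` for documents with little structure.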
Stage 5: LLM Verification
- Verifies uncertain boundaries (confidence < 0.8) via external LLM
- Auto-accepts high-confidence boundaries
- Supports OpenAI-compatible APIs and Ollama
- Use the --ollama flag for automatic Ollama integration
- Skipped if no LLM endpoint is configured
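The routing described for this stage can be sketched as a simple partition on confidence. The 0.8 cutoff comes from the text above; everything else is an assumption, including the function name `verify_boundaries`, the `ask_llm` callback standing in for a real LLM client, and the choice to keep uncertain boundaries unchanged when no LLM is configured.

```python
from typing import Callable, Optional


def verify_boundaries(
    boundaries: list[tuple[int, float]],
    ask_llm: Optional[Callable[[int], bool]] = None,
    confidence_cutoff: float = 0.8,
) -> list[int]:
    """Return positions of the boundaries kept after optional verification.

    `boundaries` holds (position, confidence) pairs. `ask_llm` is a
    hypothetical callback that returns True when the LLM confirms a boundary.
    """
    kept = []
    for position, confidence in boundaries:
        if confidence >= confidence_cutoff or ask_llm is None:
            # High confidence is auto-accepted; with no LLM the stage is skipped.
            kept.append(position)
        elif ask_llm(position):
            kept.append(position)
    return kept


boundaries = [(3, 0.95), (7, 0.55), (12, 0.70)]
# Stub verifier that only confirms the boundary at position 12.
print(verify_boundaries(boundaries, ask_llm=lambda position: position == 12))  # → [3, 12]
```

Only the uncertain boundaries reach the callback, so the number of LLM calls grows with the number of low-confidence candidates rather than with document length.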
Stage 6: Reference Linking
- Detects explicit references (regex patterns)
- Resolves coreferences using spaCy
- Links related chunks for retrieval
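To illustrate what regex-based detection of explicit references might look like, here is a small sketch. The pattern is invented for this example and covers only a few English phrasings; the patterns COSMIC ships with are not given in the source, and `find_explicit_references` is a hypothetical helper.

```python
import re

# Illustrative pattern only: matches phrases such as "see Section 3.1"
# or "as shown in Figure 2". Not taken from the COSMIC source.
EXPLICIT_REF = re.compile(
    r"\b(?:see|as (?:shown|described|noted) in)\s+"
    r"(Section|Figure|Table)\s+(\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def find_explicit_references(text: str) -> list[tuple[str, str]]:
    """Return (kind, number) pairs for each explicit reference found."""
    return [(kind.title(), number) for kind, number in EXPLICIT_REF.findall(text)]


text = "As shown in Figure 2, latency drops. For the setup, see section 3.1."
print(find_explicit_references(text))  # → [('Figure', '2'), ('Section', '3.1')]
```

Pronouns and other implicit references are outside the reach of a pattern like this, which is why the stage pairs it with coreference resolution.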
DCS Formula
DCS = α × topical_coherence + β × coreference_density + γ × discourse_signal
Where:
- α = 0.4: Topical coherence from embedding similarity
- β = 0.35: Coreference density measuring entity continuity
- γ = 0.25: Discourse markers indicating transitions
Lower DCS → Higher boundary confidence
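The formula can be written directly as a weighted sum. In this sketch the three inputs are assumed to be already computed and to lie in [0, 1]; how COSMIC derives each signal is not covered here, and the function name `discourse_coherence_score` is illustrative.

```python
def discourse_coherence_score(
    topical_coherence: float,
    coreference_density: float,
    discourse_signal: float,
    alpha: float = 0.4,
    beta: float = 0.35,
    gamma: float = 0.25,
) -> float:
    """Weighted sum from the DCS formula, using the default weights."""
    return (
        alpha * topical_coherence
        + beta * coreference_density
        + gamma * discourse_signal
    )


# Adjacent sentences that share a topic and keep referring to the same entities.
same_topic = discourse_coherence_score(0.9, 0.8, 0.7)
# A pair where the topic shifts and no entities carry over.
topic_shift = discourse_coherence_score(0.2, 0.0, 0.1)
print(round(same_topic, 3), round(topic_shift, 3))
```

The second pair scores far lower than the first, so under the rule that a lower DCS means higher boundary confidence it is the stronger candidate for a chunk boundary.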
Fallback Chain
COSMIC implements graceful degradation:
Full COSMIC → Semantic-only → Sliding window     → Fixed-length
(structure)   (DCS only)      (basic similarity)   (token split)
Each fallback level reduces computational requirements while still producing usable chunks.
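A degradation chain of this kind is commonly built as a loop that tries each strategy in order and moves on when one fails. The sketch below shows that pattern with the four strategy names used in this README. The conditions under which COSMIC itself falls back are not specified in the source, so the "any exception" trigger, the `chunk_with_fallback` helper, and the stub chunker are all assumptions.

```python
from typing import Callable, Optional, Sequence

FALLBACK_ORDER = ("full", "semantic", "sliding", "fixed")


def chunk_with_fallback(
    chunk_fn: Callable[[str, str], list[str]],
    text: str,
    strategies: Sequence[str] = FALLBACK_ORDER,
) -> tuple[str, list[str]]:
    """Try each strategy in order; return the first one that succeeds."""
    last_error: Optional[Exception] = None
    for strategy in strategies:
        try:
            return strategy, chunk_fn(text, strategy)
        except Exception as error:  # broad on purpose: any failure triggers fallback
            last_error = error
    raise RuntimeError("all chunking strategies failed") from last_error


def flaky_chunker(text: str, strategy: str) -> list[str]:
    """Stub that pretends the two most expensive strategies are unavailable."""
    if strategy in ("full", "semantic"):
        raise RuntimeError(f"{strategy} pipeline unavailable")
    return text.split(". ")


used, chunks = chunk_with_fallback(flaky_chunker, "First idea. Second idea")
print(used, chunks)  # → sliding ['First idea', 'Second idea']
```

Returning the strategy name alongside the chunks makes it possible to track how often each level is reached, which is what a fallback-rate metric needs.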
Benchmarks
Running Benchmarks
# Run full benchmark suite
python -m benchmarks.runner
# Run with specific datasets
python -m benchmarks.runner --datasets arxiv pubmed
# Run with limited samples
python -m benchmarks.runner --limit 100
Available Baselines
- Fixed-length (512 tokens) - Standard token-based splitting
- LangChain Recursive - RecursiveCharacterTextSplitter
- Semantic Chunking - Embedding similarity-based splitting
- Percentile Semantic - Adaptive threshold semantic chunking
Metrics
- Coherence Score - Average intra-chunk semantic similarity
- Cross-Concept Splits - Percentage of boundaries breaking concepts
- Latency - Processing time per page (ms)
- Throughput - Documents per second
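As an illustration of the first metric, the sketch below computes average intra-chunk similarity as the mean cosine similarity over all pairs of sentence vectors in a chunk. This is one plausible reading of "average intra-chunk semantic similarity"; the benchmark suite's exact definition is not given in the source, and the toy vectors stand in for real sentence embeddings.

```python
import math
from itertools import combinations

Vector = tuple[float, ...]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def chunk_coherence(sentence_vectors: list[Vector]) -> float:
    """Mean pairwise cosine similarity between the sentences of one chunk."""
    pairs = list(combinations(sentence_vectors, 2))
    if not pairs:
        return 1.0  # assumption: a single-sentence chunk is trivially coherent
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


on_topic = [(1.0, 0.1), (0.9, 0.2), (1.0, 0.0)]
mixed = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.1)]
print(chunk_coherence(on_topic) > chunk_coherence(mixed))  # → True
```

Averaging this value over all chunks of a document gives a single number that can be compared against the coherence target.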
Project Structure
cosmic/
├── src/cosmic/
│ ├── core/ # Data structures
│ │ ├── chunk.py # COSMICChunk dataclass
│ │ ├── config.py # Configuration system
│ │ ├── document.py # Document representation
│ │ └── enums.py # Enumerations
│ │
│ ├── pipeline/ # 6 pipeline stages
│ │ ├── structure.py # Stage 1
│ │ ├── semantic.py # Stage 2
│ │ ├── domain.py # Stage 3
│ │ ├── fusion.py # Stage 4
│ │ ├── verification.py # Stage 5
│ │ └── reference.py # Stage 6
│ │
│ ├── scoring/ # Scoring algorithms
│ │ ├── dcs.py # Discourse Coherence Score
│ │ └── clustering.py # MST clustering
│ │
│ ├── models/ # ML model wrappers
│ │ ├── embeddings.py # Sentence-transformers
│ │ ├── llm.py # LLM client
│ │ ├── ollama.py # Ollama integration
│ │ └── coreference.py # spaCy coreference
│ │
│ ├── fallback/ # Degradation strategies
│ ├── chunker.py # Main entry point
│ ├── cli.py # Command-line interface
│ └── batch.py # Batch processing
│
├── benchmarks/
│ ├── runner.py # Benchmark orchestration
│ ├── metrics/ # Evaluation metrics
│ ├── baselines/ # Comparison methods
│ └── datasets/ # Data loaders
│
├── configs/
│ ├── default.yaml # Default configuration
│ └── taxonomies/ # Domain taxonomies
│
└── tests/ # Unit and integration tests
Development
Running Tests
# Run all tests
pytest tests/
# Run with coverage
pytest tests/ --cov=cosmic --cov-report=html
# Run specific test module
pytest tests/unit/test_dcs.py -v
Type Checking
mypy src/cosmic/
Code Style
# Format code
black src/ tests/
# Sort imports
isort src/ tests/
# Lint
ruff check src/ tests/
API Reference
COSMICChunker
class COSMICChunker:
    def __init__(
        self,
        config: Optional[COSMICConfig] = None,
        taxonomy_path: Optional[Path] = None,
    ) -> None: ...

    def chunk_document(
        self,
        document: Document,
        strategy: str = "auto",
    ) -> list[COSMICChunk]: ...
COSMICChunk
@dataclass(frozen=True)
class COSMICChunk:
    chunk_id: str
    text: str
    token_count: int
    char_start: int
    char_end: int
    sentence_indices: tuple[int, ...]
    domain: str
    coherence_score: float
    boundary_confidence: float
    cross_references: tuple[str, ...]
    intent: Intent
    metadata: dict
Document
class Document:
    @classmethod
    def from_text(
        cls,
        text: str,
        doc_id: Optional[str] = None,
        metadata: Optional[dict] = None,
    ) -> Document: ...

    @classmethod
    def from_file(cls, path: Path) -> Document: ...
Citation
If you use COSMIC in your research, please cite:
@article{cosmic2026,
  title={COSMIC: COncept-aware Semantic Meta-chunking with Intelligent Classification},
  author={Al Kari, Manceps Research Division},
  journal={arXiv preprint},
  year={2026}
}
License
Apache 2.0 License - see LICENSE file for details.
Contributing
Contributions are welcome! Please read our Contributing Guidelines before submitting pull requests.
Documentation
- QUICKSTART.md - Complete API reference and user guide
- CLI.md - Command-line interface reference
- CONTRIBUTING.md - Contributing guidelines
- SECURITY.md - Security policy
- paper/COSMIC_Research_Paper.md - Research background
Acknowledgments
COSMIC builds upon research in:
- Meta-Chunking (Yu et al., 2024)
- S² Chunking (Shi et al., 2024)
- Discourse Coherence Scoring (Ji et al., 2023)