COSMIC: Concept-aware Semantic Meta-chunking with Intelligent Classification


A production-ready intelligent text chunking framework for Retrieval-Augmented Generation (RAG) systems.

Developed by Manceps Research Division

Research Objectives

COSMIC addresses fundamental limitations in existing text chunking approaches for RAG systems.

Problem Statement

Current chunking methods suffer from three critical issues:

Semantic Fragmentation - Fixed-length chunkers split mid-concept, breaking coherent ideas
Context Loss - Simple overlap strategies create redundancy without preserving meaning
Domain Blindness - One-size-fits-all approaches ignore domain-specific structure

Our Approach

COSMIC introduces a 6-stage pipeline that combines:

Discourse Coherence Scoring (DCS) - Multi-signal boundary detection using topical coherence, coreference density, and discourse markers
MST-based Domain Clustering - Minimum spanning tree clustering for domain classification
Adaptive Boundary Fusion - Weighted combination of structural and semantic signals
LLM Verification - Optional verification of uncertain boundaries
Zero-Overlap Architecture - Self-contained conceptual chunks without redundant overlap

Target Metrics

Metric	Target	Description
Coherence Score	> 0.85	Semantic unity within chunks
Cross-Concept Splits	< 5%	Chunks that break conceptual boundaries
Latency	< 150ms/page	Processing speed
Fallback Rate	< 15%	Graceful degradation frequency

Installation

Prerequisites

Python 3.10+
CUDA-capable GPU (recommended) or CPU
8GB+ RAM

Install from Source

# Clone the repository
git clone https://github.com/manceps/cosmic.git
cd cosmic

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows

# Install with all dependencies
pip install -e ".[all]"

# Install spaCy model for coreference resolution
python -m spacy download en_core_web_trf

Docker Installation

# Build container
docker build -t cosmic:latest .

# Run with GPU support
docker run --gpus all -v "$(pwd)":/workspace cosmic:latest

Configuration

Environment Variables

Create a .env file in the project root (see .env.example):

# LLM Provider: "openai", "ollama", or "auto"
COSMIC_LLM_PROVIDER=openai

# LLM endpoint for Stage 5 verification (OpenAI-compatible API)
COSMIC_LLM_URL=http://localhost:8000/v1
COSMIC_LLM_MODEL=default

# Ollama configuration (when using provider=ollama)
OLLAMA_HOST=http://localhost:11434
# Model name: "auto" or a specific model
COSMIC_OLLAMA_MODEL=auto

# Embedding computation device (options: cuda, cpu, mps)
COSMIC_EMBEDDING_DEVICE=cuda
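
As a rough illustration of how these variables could be consumed, the sketch below reads them with the standard library only. The helper name `load_cosmic_env` and the fallback values are assumptions made for this example (they mirror the sample .env above) and are not COSMIC's documented loader or defaults.

```python
import os

# Allowed values as listed in the .env comments above.
PROVIDERS = {"openai", "ollama", "auto"}
DEVICES = {"cuda", "cpu", "mps"}


def load_cosmic_env(env=None):
    """Read COSMIC-related variables from a mapping (defaults to os.environ).

    Hypothetical helper: the fallback values are assumptions for this sketch.
    """
    env = os.environ if env is None else env
    settings = {
        "provider": env.get("COSMIC_LLM_PROVIDER", "auto"),
        "llm_url": env.get("COSMIC_LLM_URL", "http://localhost:8000/v1"),
        "llm_model": env.get("COSMIC_LLM_MODEL", "default"),
        "ollama_host": env.get("OLLAMA_HOST", "http://localhost:11434"),
        "ollama_model": env.get("COSMIC_OLLAMA_MODEL", "auto"),
        "embedding_device": env.get("COSMIC_EMBEDDING_DEVICE", "cpu"),
    }
    if settings["provider"] not in PROVIDERS:
        raise ValueError(f"unknown provider: {settings['provider']!r}")
    if settings["embedding_device"] not in DEVICES:
        raise ValueError(f"unknown device: {settings['embedding_device']!r}")
    return settings


# Use an explicit mapping here so the example does not depend on the
# real process environment.
settings = load_cosmic_env({"COSMIC_LLM_PROVIDER": "ollama"})
print(settings["provider"], settings["embedding_device"])
```

Validating the provider and device early gives a clear error at startup rather than a failure deep inside the pipeline.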

Using Ollama for LLM Verification

COSMIC integrates with Ollama for local LLM verification. The CLI can automatically detect, start, and stop Ollama:

# Auto-detect and use the best available model
cosmic chunk document.txt --strategy full --ollama auto

# Use a specific model
cosmic chunk document.txt --strategy full --ollama gemma3:latest

# Check Ollama status and available models
cosmic ollama status
cosmic ollama list

When using --ollama:

COSMIC checks if Ollama is installed and has models available
If the server isn't running, it starts automatically
After chunking completes, the server is stopped (if COSMIC started it)

Recommended models for verification (in order of preference):

gemma3 - Fast, good quality (3.3 GB)
qwen2.5-coder:7b - Good balance (4.7 GB)
llama3.2 - Versatile (various sizes)

Configuration Files

Default configuration: configs/default.yaml

dcs:
  alpha: 0.4    # Topical coherence weight
  beta: 0.35   # Coreference density weight
  gamma: 0.25  # Discourse signal weight

structure:
  heading_weight: 0.4
  list_weight: 0.3
  table_weight: 0.3

fusion:
  structural_weight: 0.6
  semantic_weight: 0.4
  acceptance_threshold: 0.5

chunk_constraints:
  min_tokens: 100
  max_tokens: 512
  target_tokens: 350
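
The default weights above sum to 1.0 within the `dcs` and `fusion` groups, and the token limits are ordered min ≤ target ≤ max. A minimal sanity check along those lines might look like the sketch below. Treating these properties as requirements is an assumption made here, and `validate_config` is a hypothetical helper, not part of COSMIC.

```python
import math

# Values copied from the default configuration shown above.
DEFAULT_CONFIG = {
    "dcs": {"alpha": 0.4, "beta": 0.35, "gamma": 0.25},
    "fusion": {
        "structural_weight": 0.6,
        "semantic_weight": 0.4,
        "acceptance_threshold": 0.5,
    },
    "chunk_constraints": {"min_tokens": 100, "max_tokens": 512, "target_tokens": 350},
}


def validate_config(cfg):
    """Return a list of problems found (empty if the config looks sane)."""
    problems = []

    dcs = cfg["dcs"]
    if not math.isclose(dcs["alpha"] + dcs["beta"] + dcs["gamma"], 1.0):
        problems.append("dcs: alpha + beta + gamma should equal 1.0")

    fusion = cfg["fusion"]
    if not math.isclose(fusion["structural_weight"] + fusion["semantic_weight"], 1.0):
        problems.append("fusion: structural + semantic weights should equal 1.0")
    if not 0.0 <= fusion["acceptance_threshold"] <= 1.0:
        problems.append("fusion: acceptance_threshold should be in [0, 1]")

    limits = cfg["chunk_constraints"]
    if not limits["min_tokens"] <= limits["target_tokens"] <= limits["max_tokens"]:
        problems.append("chunk_constraints: expected min <= target <= max")

    return problems


print(validate_config(DEFAULT_CONFIG))
```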

Domain taxonomy: configs/taxonomies/default.yaml

Defines domain-specific terminology and patterns for classification.

Usage

Basic Usage

from cosmic import COSMICChunker, Document

# Initialize chunker with default configuration
chunker = COSMICChunker()

# Create document from text
doc = Document.from_text("""
Your document text here. COSMIC will analyze the structure,
detect semantic boundaries, and create coherent chunks.
""")

# Chunk with automatic strategy selection
chunks = chunker.chunk_document(doc, strategy="auto")

# Access chunk data
for chunk in chunks:
    print(f"Domain: {chunk.domain}")
    print(f"Coherence: {chunk.coherence_score:.2f}")
    print(f"Text: {chunk.text[:100]}...")
    print("---")

Strategy Selection

# Full 6-stage pipeline (highest quality)
chunks = chunker.chunk_document(doc, strategy="full")

# Semantic-only (faster, DCS without structure analysis)
chunks = chunker.chunk_document(doc, strategy="semantic")

# Sliding window (basic similarity-based)
chunks = chunker.chunk_document(doc, strategy="sliding")

# Fixed-length (fastest, token-based splitting)
chunks = chunker.chunk_document(doc, strategy="fixed")

# Auto (recommended) - selects based on document structure
chunks = chunker.chunk_document(doc, strategy="auto")

Batch Processing

from cosmic import BatchProcessor, Document, COSMICConfig

# Initialize batch processor
processor = BatchProcessor(
    config=COSMICConfig(),
    max_workers=4,
)

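# Process multiple documents (replace these placeholder strings with your own texts)
texts = [
    "First document text.",
    "Second document text.",
]
documents = [Document.from_text(text) for text in texts]
result = processor.process(documents, strategy="auto", show_progress=True)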

print(f"Processed: {result.documents_processed}")
print(f"Failed: {result.documents_failed}")
print(f"Total chunks: {result.total_chunks}")

for doc_id, chunks in result.chunks_by_document.items():
    print(f"Document {doc_id}: {len(chunks)} chunks")

Custom Configuration

from cosmic import COSMICChunker, COSMICConfig
from cosmic.core.config import DCSConfig, ChunkConstraints

# Create custom configuration
config = COSMICConfig(
    dcs=DCSConfig(
        alpha=0.5,   # Increase topical coherence weight
        beta=0.3,
        gamma=0.2,
    ),
    chunk_constraints=ChunkConstraints(
        min_tokens=50,
        max_tokens=1024,
        target_tokens=512,
    ),
)

chunker = COSMICChunker(config=config)

Loading from YAML

from cosmic import COSMICChunker, COSMICConfig

config = COSMICConfig.from_yaml("configs/custom.yaml")
chunker = COSMICChunker(config=config)

Architecture

6-Stage Pipeline

Document → Structure Analysis → Semantic Boundaries → Domain Classification
                                                              ↓
              Reference Linking ← LLM Verification ← Boundary Fusion
                      ↓
               COSMICChunks (with rich metadata)

Stage 1: Structure Analysis

Detects headings, lists, tables, and other structural elements
Computes structure score (0-1)
Selects processing pathway based on document structure
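
To make the idea of a 0-1 structure score concrete, here is a simplified sketch that adds the configured weight for each element type found in a Markdown-like text. The regular expressions and the presence-based scoring rule are assumptions for illustration only. COSMIC's actual detector and scoring may differ.

```python
import re

# Weights taken from the `structure` section of the default configuration.
WEIGHTS = {"heading": 0.4, "list": 0.3, "table": 0.3}

# Hypothetical Markdown-style detectors, one per element type.
PATTERNS = {
    "heading": re.compile(r"^#{1,6}\s+\S", re.MULTILINE),
    "list": re.compile(r"^\s*(?:[-*+]|\d+\.)\s+\S", re.MULTILINE),
    "table": re.compile(r"^\s*\|.*\|\s*$", re.MULTILINE),
}


def structure_score(text):
    """Sum the weights of the element types that occur at least once."""
    score = 0.0
    for name, weight in WEIGHTS.items():
        if PATTERNS[name].search(text):
            score += weight
    return score


structured = "# Title\n\n- first item\n- second item\n"
plain = "Just a paragraph of prose without any markup."
print(round(structure_score(structured), 2), structure_score(plain))
```

A document scoring near 0 would be a natural candidate for the semantic-only pathway, while a high score suggests structural signals are worth using.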

Stage 2: Semantic Boundary Detection

Computes Discourse Coherence Score (DCS) between sentences
Identifies candidate boundaries where coherence drops
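
One simple way to find "where coherence drops" is to look for local minima in the sequence of sentence-to-sentence DCS values. The sketch below does exactly that. The function name and the 0.5 cut-off are assumptions for this example, not COSMIC's implementation.

```python
def candidate_boundaries(dcs_scores, threshold=0.5):
    """Return gap indices that are local minima below `threshold`.

    dcs_scores[i] is the coherence between sentence i and sentence i + 1,
    so index i marks a potential boundary after sentence i.
    """
    candidates = []
    for i, score in enumerate(dcs_scores):
        left = dcs_scores[i - 1] if i > 0 else float("inf")
        right = dcs_scores[i + 1] if i + 1 < len(dcs_scores) else float("inf")
        # Keep the gap only if it is low in absolute terms and no higher
        # than either neighbouring gap.
        if score < threshold and score <= left and score <= right:
            candidates.append(i)
    return candidates


scores = [0.9, 0.8, 0.3, 0.85, 0.9, 0.4, 0.45]
print(candidate_boundaries(scores))  # [2, 5]
```

Gap 6 (0.45) is below the threshold but is not reported, because its left neighbour (0.4) is lower.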

Stage 3: Domain Classification

Uses MST-based clustering on chunk embeddings
Matches clusters to domain taxonomy
Assigns domain labels to chunks
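
The clustering step can be illustrated with a small, dependency-free sketch: build a minimum spanning tree over pairwise cosine distances with Kruskal's algorithm, cut the edges longer than a threshold, and treat the remaining connected components as clusters. The 0.3 cut-off and all names below are assumptions for this example. It does not reproduce COSMIC's clustering code or the taxonomy matching step.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    # Assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        self.parent[rj] = ri
        return True


def mst_clusters(vectors, max_edge=0.3):
    """Return one cluster label per vector (labels are 0, 1, 2, ...)."""
    n = len(vectors)
    edges = sorted(
        (cosine_distance(vectors[i], vectors[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    # Kruskal: an edge joins the tree only if it connects two components.
    forest = DisjointSet(n)
    mst = [(w, i, j) for w, i, j in edges if forest.union(i, j)]

    # Cut long MST edges; what stays connected forms a cluster.
    clusters = DisjointSet(n)
    for w, i, j in mst:
        if w <= max_edge:
            clusters.union(i, j)

    labels = {}
    return [labels.setdefault(clusters.find(i), len(labels)) for i in range(n)]


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(mst_clusters(embeddings))  # [0, 0, 1, 1]
```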

Stage 4: Boundary Fusion

Merges structural (weight: 0.6) and semantic (weight: 0.4) signals
Applies acceptance threshold filtering
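
The fusion rule described above can be sketched as a weighted sum followed by a threshold test. The weights and threshold come from the default configuration. The data layout (dictionaries keyed by boundary position) and the function name are assumptions for this example.

```python
def fuse_boundaries(
    structural,
    semantic,
    structural_weight=0.6,
    semantic_weight=0.4,
    acceptance_threshold=0.5,
):
    """Combine per-position scores in [0, 1] and keep the accepted ones.

    `structural` and `semantic` map a boundary position to a score.
    A position missing from one signal contributes 0 for that signal.
    """
    fused = {}
    for position in sorted(set(structural) | set(semantic)):
        score = (
            structural_weight * structural.get(position, 0.0)
            + semantic_weight * semantic.get(position, 0.0)
        )
        if score >= acceptance_threshold:
            fused[position] = score
    return fused


structural = {3: 1.0, 10: 0.5}
semantic = {3: 0.2, 7: 0.9, 10: 0.6}
accepted = fuse_boundaries(structural, semantic)
print(sorted(accepted))  # [3, 10]
```

Position 7 has a strong semantic signal (0.9) but no structural support, so its fused score of 0.36 falls below the threshold.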

Stage 5: LLM Verification

Verifies uncertain boundaries (confidence < 0.8) via external LLM
Auto-accepts high-confidence boundaries
Supports OpenAI-compatible APIs and Ollama
Use --ollama flag for automatic Ollama integration
Skipped if no LLM endpoint configured
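
The routing logic for this stage can be sketched as a simple partition by confidence. The 0.8 threshold comes from the description above. Passing every boundary through unverified when no LLM is configured is an assumption about what "skipped" means, and `route_boundaries` is a hypothetical name.

```python
def route_boundaries(boundaries, confidence_threshold=0.8, llm_configured=True):
    """Split (position, confidence) pairs into auto-accepted and to-verify.

    Assumption: when no LLM endpoint is configured, every boundary is
    passed through as accepted without verification.
    """
    auto_accepted, needs_verification = [], []
    for position, confidence in boundaries:
        if confidence >= confidence_threshold or not llm_configured:
            auto_accepted.append(position)
        else:
            needs_verification.append(position)
    return auto_accepted, needs_verification


boundaries = [(12, 0.95), (30, 0.6), (47, 0.8)]
print(route_boundaries(boundaries))  # ([12, 47], [30])
```

Only the uncertain boundaries reach the LLM, which keeps the number of external calls proportional to the ambiguous cases rather than to document length.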

Stage 6: Reference Linking

Detects explicit references (regex patterns)
Resolves coreferences using spaCy
Links related chunks for retrieval
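
Explicit references of the kind mentioned above are often phrases such as "Figure 3" or "Section 2.1". The sketch below shows one possible regex for them. The pattern and the set of reference types are illustrative assumptions, not the patterns COSMIC ships with.

```python
import re

# Hypothetical pattern: a reference keyword followed by a dotted number.
REFERENCE_PATTERN = re.compile(
    r"\b(Section|Figure|Table|Equation)\s+(\d+(?:\.\d+)*)",
    re.IGNORECASE,
)


def find_explicit_references(text):
    """Return (kind, number) pairs in the order they appear."""
    return [
        (match.group(1).title(), match.group(2))
        for match in REFERENCE_PATTERN.finditer(text)
    ]


sample = "As shown in Figure 3 and discussed in section 2.1, see Table 4."
print(find_explicit_references(sample))
# [('Figure', '3'), ('Section', '2.1'), ('Table', '4')]
```

A chunk containing "Figure 3" can then be linked to the chunk that holds the figure's caption, so both are available at retrieval time.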

DCS Formula

DCS = α × topical_coherence + β × coreference_density + γ × discourse_signal

Where:

α = 0.4: Topical coherence from embedding similarity
β = 0.35: Coreference density measuring entity continuity
γ = 0.25: Discourse markers indicating transitions

Lower DCS → Higher boundary confidence
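
A small worked example of the formula with the default weights follows. The three input signals are made-up values in [0, 1] chosen for illustration, and the function name is an assumption.

```python
def discourse_coherence_score(
    topical_coherence,
    coreference_density,
    discourse_signal,
    alpha=0.4,
    beta=0.35,
    gamma=0.25,
):
    """DCS = alpha * topical + beta * coreference + gamma * discourse."""
    return (
        alpha * topical_coherence
        + beta * coreference_density
        + gamma * discourse_signal
    )


# Two sentences continuing the same idea: all signals are high.
same_topic = discourse_coherence_score(0.9, 0.8, 0.7)
# A topic shift: low similarity, few shared entities.
topic_shift = discourse_coherence_score(0.2, 0.1, 0.3)

print(f"{same_topic:.3f} {topic_shift:.3f}")  # 0.815 0.190
```

The second pair scores much lower (0.36 + 0.28 + 0.175 = 0.815 versus 0.08 + 0.035 + 0.075 = 0.19), so the gap between those sentences is the stronger boundary candidate.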

Fallback Chain

COSMIC implements graceful degradation:

Full COSMIC → Semantic-only → Sliding window → Fixed-length
(structure)   (DCS only)     (basic similarity) (token split)

Each fallback level trades some chunking quality for lower computational requirements, so a document is always chunked even when a more expensive stage is unavailable.
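
The chain above can be sketched as an ordered list of callables that are tried in turn until one succeeds. Everything here is a stand-in: `chunk_with_fallback`, the failing strategy, and the word-based fixed splitter are assumptions for illustration and do not mirror COSMIC's fallback module.

```python
def chunk_with_fallback(text, strategies):
    """Try each (name, function) pair in order; return the first success."""
    failures = []
    for name, chunk_fn in strategies:
        try:
            return name, chunk_fn(text)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(failures))


def unavailable(text):
    # Stand-in for a strategy whose model could not be loaded.
    raise RuntimeError("model not loaded")


def fixed_length(text, size=5):
    # Stand-in splitter: counts whitespace-separated words, not real tokens.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


used, chunks = chunk_with_fallback(
    "one two three four five six seven",
    [("full", unavailable), ("semantic", unavailable), ("fixed", fixed_length)],
)
print(used, chunks)  # fixed ['one two three four five', 'six seven']
```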

Benchmarks

Running Benchmarks

# Run full benchmark suite
python -m benchmarks.runner

# Run with specific datasets
python -m benchmarks.runner --datasets arxiv pubmed

# Run with limited samples
python -m benchmarks.runner --limit 100

Available Baselines

Fixed-length (512 tokens) - Standard token-based splitting
LangChain Recursive - RecursiveCharacterTextSplitter
Semantic Chunking - Embedding similarity-based splitting
Percentile Semantic - Adaptive threshold semantic chunking

Metrics

Coherence Score - Average intra-chunk semantic similarity
Cross-Concept Splits - Percentage of boundaries breaking concepts
Latency - Processing time per page (ms)
Throughput - Documents per second
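
As an illustration of the coherence metric, "average intra-chunk semantic similarity" can be read as the mean pairwise cosine similarity between the sentence embeddings of a chunk. The sketch below uses that reading. Whether the benchmark suite computes it exactly this way is an assumption, as is the convention that a single-sentence chunk scores 1.0.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    # Assumes non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def chunk_coherence(sentence_embeddings):
    """Mean pairwise cosine similarity of the sentences in one chunk."""
    pairs = list(combinations(sentence_embeddings, 2))
    if not pairs:
        return 1.0  # convention assumed here for chunks with < 2 sentences
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


focused = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
mixed = [[1.0, 0.0], [0.0, 1.0]]
print(f"{chunk_coherence(focused):.2f} {chunk_coherence(mixed):.2f}")
```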

Project Structure

cosmic/
├── src/cosmic/
│   ├── core/           # Data structures
│   │   ├── chunk.py    # COSMICChunk dataclass
│   │   ├── config.py   # Configuration system
│   │   ├── document.py # Document representation
│   │   └── enums.py    # Enumerations
│   │
│   ├── pipeline/       # 6 pipeline stages
│   │   ├── structure.py    # Stage 1
│   │   ├── semantic.py     # Stage 2
│   │   ├── domain.py       # Stage 3
│   │   ├── fusion.py       # Stage 4
│   │   ├── verification.py # Stage 5
│   │   └── reference.py    # Stage 6
│   │
│   ├── scoring/        # Scoring algorithms
│   │   ├── dcs.py      # Discourse Coherence Score
│   │   └── clustering.py # MST clustering
│   │
│   ├── models/         # ML model wrappers
│   │   ├── embeddings.py # Sentence-transformers
│   │   ├── llm.py        # LLM client
│   │   ├── ollama.py     # Ollama integration
│   │   └── coreference.py # spaCy coreference
│   │
│   ├── fallback/       # Degradation strategies
│   ├── chunker.py      # Main entry point
│   ├── cli.py          # Command-line interface
│   └── batch.py        # Batch processing
│
├── benchmarks/
│   ├── runner.py       # Benchmark orchestration
│   ├── metrics/        # Evaluation metrics
│   ├── baselines/      # Comparison methods
│   └── datasets/       # Data loaders
│
├── configs/
│   ├── default.yaml    # Default configuration
│   └── taxonomies/     # Domain taxonomies
│
└── tests/              # Unit and integration tests

Development

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=cosmic --cov-report=html

# Run specific test module
pytest tests/unit/test_dcs.py -v

Type Checking

mypy src/cosmic/

Code Style

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Lint
ruff check src/ tests/

API Reference

COSMICChunker

from pathlib import Path
from typing import Optional

class COSMICChunker:
    def __init__(
        self,
        config: Optional[COSMICConfig] = None,
        taxonomy_path: Optional[Path] = None,
    ) -> None: ...

    def chunk_document(
        self,
        document: Document,
        strategy: str = "auto",
    ) -> list[COSMICChunk]: ...

COSMICChunk

from dataclasses import dataclass

@dataclass(frozen=True)
class COSMICChunk:
    chunk_id: str
    text: str
    token_count: int
    char_start: int
    char_end: int
    sentence_indices: tuple[int, ...]
    domain: str
    coherence_score: float
    boundary_confidence: float
    cross_references: tuple[str, ...]
    intent: Intent
    metadata: dict
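
The `char_start` and `char_end` fields make it possible to check the zero-overlap property on a list of chunks. The sketch below uses a small stand-in type instead of the real `COSMICChunk`, and it assumes `char_end` is exclusive. Both are assumptions for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Stand-in carrying only the fields needed for the check."""
    chunk_id: str
    char_start: int
    char_end: int  # assumed to be exclusive


def find_overlaps(spans):
    """Return (earlier_id, later_id) for neighbouring spans that overlap.

    Spans are sorted by start offset and each one is compared with its
    successor. An empty result means no two spans overlap.
    """
    ordered = sorted(spans, key=lambda span: span.char_start)
    return [
        (current.chunk_id, following.chunk_id)
        for current, following in zip(ordered, ordered[1:])
        if following.char_start < current.char_end
    ]


spans = [Span("a", 0, 100), Span("b", 100, 250), Span("c", 240, 300)]
print(find_overlaps(spans))  # [('b', 'c')]
```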

Document

from pathlib import Path
from typing import Optional

class Document:
    @classmethod
    def from_text(
        cls,
        text: str,
        doc_id: Optional[str] = None,
        metadata: Optional[dict] = None,
    ) -> "Document": ...

    @classmethod
    def from_file(cls, path: Path) -> "Document": ...

Citation

If you use COSMIC in your research, please cite:

@article{cosmic2026,
  title={COSMIC: COncept-aware Semantic Meta-chunking with Intelligent Classification},
  author={Al Kari, Manceps Research Division},
  journal={arXiv preprint},
  year={2026}
}

License

Apache 2.0 License - see LICENSE file for details.

Contributing

Contributions are welcome! Please read our Contributing Guidelines before submitting pull requests.

Documentation

QUICKSTART.md - Complete API reference and user guide
CLI.md - Command-line interface reference
CONTRIBUTING.md - Contributing guidelines
SECURITY.md - Security policy
paper/COSMIC_Research_Paper.md - Research background

Acknowledgments

COSMIC builds upon research in:

Meta-Chunking (Yu et al., 2024)
S² Chunking (Shi et al., 2024)
Discourse Coherence Scoring (Ji et al., 2023)


This version

1.1.0

Jan 31, 2026

Download files

Source Distribution

cosmic_chunker-1.1.0.tar.gz (68.4 kB), uploaded Jan 31, 2026 via twine/6.2.0 CPython/3.11.2, without Trusted Publishing

Algorithm	Hash digest
SHA256	`8244296f2669692fd2bd7ab4ec7e6c73bdf63e816ee14b4005917b817227f839`
MD5	`d78bff2c31eae09354b3e8dcd4dcd4ef`
BLAKE2b-256	`5563fe6a4aeaaffd16a17cf4a9bf9a5fc08e2762673c9b9e8eaa9c58bc567b6d`

Built Distribution

cosmic_chunker-1.1.0-py3-none-any.whl (78.5 kB, Python 3), uploaded Jan 31, 2026 via twine/6.2.0 CPython/3.11.2, without Trusted Publishing

Algorithm	Hash digest
SHA256	`68a15f54a68c9a842d6b498b49ec354272dac3adac0a0b7c3bfe044ec8f03ce8`
MD5	`140853504e2937b8fd07a34d4c0426fb`
BLAKE2b-256	`cf149a3c4f6b3f879595f2c246ffef9c8f42169a4b7cc674f38635fb7f8de2e2`
