
Afterthoughts


Sentence-aware embeddings with document-level context. Embeds first, chunks second, extracting any number of chunks in one forward pass while preserving full document context. For related work on late chunking, see Günther et al., 2024.

Quick Start

pip install afterthoughts

from afterthoughts import LateEncoder

model = LateEncoder("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

# One document with three sentences
doc = """
The Amazon rainforest produces 20% of Earth's oxygen.
Deforestation threatens its biodiversity.
Scientists warn of a tipping point.
"""

df, X = model.encode([doc], max_chunk_sents=1)  # 1 sentence per chunk
>>> df
shape: (3, 6)
┌─────┬──────────────┬───────────┬─────────────────┬───────────┬───────────────────────────────────────────┐
│ idx │ document_idx │ chunk_idx │ max_chunk_sents │ num_sents │ chunk                                     │
│ --- │ ---          │ ---       │ ---             │ ---       │ ---                                       │
│ u32 │ i64          │ i64       │ i64             │ i64       │ str                                       │
╞═════╪══════════════╪═══════════╪═════════════════╪═══════════╪═══════════════════════════════════════════╡
│ 0   │ 0            │ 0         │ 1               │ 1         │ The Amazon rainforest produces 20% of Ea… │
│ 1   │ 0            │ 1         │ 1               │ 1         │ Deforestation threatens its biodiversity. │
│ 2   │ 0            │ 2         │ 1               │ 1         │ Scientists warn of a tipping point.       │
└─────┴──────────────┴───────────┴─────────────────┴───────────┴───────────────────────────────────────────┘

>>> X.shape
(3, 384)  # 3 sentence embeddings, each with full document context
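
The chunk embeddings can be queried directly. Here is a brief follow-on sketch; it assumes encode_queries returns a NumPy array aligned with X (as suggested by the query-embedding feature described below), and computes cosine similarity explicitly in case embeddings are not unit-normalized:

import numpy as np

q = model.encode_queries(["What endangers the Amazon?"])[0]
sims = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))
print(df["chunk"][int(np.argmax(sims))])  # likely the deforestation sentence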

What is Late Chunking?

Traditional RAG pipelines split documents into chunks before embedding, which loses contextual information. Consider a technical report that opens with "The new lithium-sulfur battery achieved 400 Wh/kg energy density" and later states "The technology could double EV range" or "Its cycle life remains a challenge." When these sentences are embedded separately, the later chunks lose their connection to lithium-sulfur batteries—a search for "lithium battery limitations" might miss the cycle life sentence entirely.

Late chunking inverts this process:

  1. Embed first: Pass the entire document through the transformer model to get contextually-enriched token embeddings
  2. Chunk second: Pool token embeddings into chunks after the model has established cross-chunk context

This approach ensures that pronouns, references, and contextual cues in each chunk are informed by the full document context.
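
To make the order of operations concrete, here is a minimal late-chunking sketch written directly against transformers. This is not Afterthoughts' internal implementation; it only illustrates embedding first and pooling second, with sentence spans located naively via str.index:

import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
tokenizer = AutoTokenizer.from_pretrained(name)
hf_model = AutoModel.from_pretrained(name)

doc = "The battery achieved 400 Wh/kg. The technology could double EV range."
sents = ["The battery achieved 400 Wh/kg.", "The technology could double EV range."]

# Embed first: one forward pass over the whole document
enc = tokenizer(doc, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
with torch.no_grad():
    token_embeds = hf_model(**enc).last_hidden_state[0]  # (seq_len, hidden)

# Chunk second: mean-pool token embeddings over each sentence's character span
chunk_embeds = []
pos = 0
for sent in sents:
    start = doc.index(sent, pos)
    end = pos = start + len(sent)
    # Select tokens inside the span; (0, 0) special tokens fail e > s
    mask = torch.tensor([s >= start and e <= end and e > s for s, e in offsets])
    chunk_embeds.append(token_embeds[mask].mean(dim=0))

Because both sentences attend to each other in the single forward pass, the pooled embedding of "The technology could double EV range." carries the battery context that naive pre-chunking would discard.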

Why Late Chunking?

The problem: Document-level embeddings are too coarse for long documents. Traditional chunking loses context—pronouns like "it" or "the technology" become meaningless when separated from their referents.

The solution: Late chunking embeds the full document first, then pools token embeddings into chunks. Each chunk retains full document context.

Performance: One forward pass for the entire document, regardless of chunk count.

Features

  • Late chunking implementation: Embed documents first, then pool into chunks for context-aware embeddings
  • Flexible chunk configuration: Sentence-based, token-based, or combined chunking strategies
  • Multi-configuration support: Test multiple chunk sizes in a single pass with aligned list parameters
  • Sentence boundary detection: Choice of BlingFire (default), NLTK, pysbd, or syntok for accurate sentence segmentation
  • Query embedding: Embed queries in the same space as chunks for semantic search
  • HuggingFace integration: Works with any transformer model from the HuggingFace Hub
  • Automatic mixed precision (AMP): Faster inference with reduced memory footprint
  • Dynamic batching: Batches by total token count (not sequence count) for optimal GPU utilization
  • Structured output: Returns chunks and metadata as Polars/pandas DataFrame for easy manipulation
  • Memory optimizations: Optional float16 embedding conversion and dimension truncation for reduced memory

Usage Guide

Basic Usage

  1. Install the package using pip:

    pip install afterthoughts
    
  2. Create a LateEncoder object and load a transformer model.

    from afterthoughts import LateEncoder
    
    # Choose a model which works well with mean-tokens pooling
    model = LateEncoder("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
    
  3. Prepare a list of documents docs (strings) from which to extract chunk embeddings.

    docs = [
        "I am a document. I have multiple sentences.",
        "I am another document. I also have sentences.",
        "I am yet another document. Sentences are great.",
        "I'm not like the others. I'm special.",
    ]
    
  4. Encode and extract chunk embeddings:

    df, X = model.encode(
        docs,
        max_chunk_sents=[1, 2],  # Extract 1-sentence and 2-sentence chunks
    )
    

    The encode method returns a tuple containing a Polars DataFrame and a NumPy array of chunk embeddings. Pass return_frame="pandas" for a pandas DataFrame instead.

    To use a different sentence tokenizer, pass the sent_tokenizer parameter:

    df, X = model.encode(
        docs,
        max_chunk_sents=2,
        sent_tokenizer="pysbd",  # Options: "blingfire" (default), "nltk", "pysbd", "syntok"
    )
    

    The DataFrame contains the following columns:

    • idx: Global chunk index (0-based), maps directly to embedding row
    • document_idx: The index of the document from which the chunk was extracted
    • chunk_idx: The chunk index within each document
    • max_chunk_sents: The requested maximum sentences per chunk (only present when specified)
    • max_chunk_tokens: The requested maximum tokens per chunk (only present when specified)
    • num_sents: The actual number of sentences in the chunk
    • chunk: The chunk text

    Additional columns are available when debug=True:

    • embed_idx: The original embedding index before re-sorting
    • sequence_idx: The index of the tokenized sequence (differs from document_idx when long documents are split)
    • batch_idx: The index of the batch in which the chunk was processed

    To access embeddings for specific chunks, use the idx column:

    import polars as pl

    # Get embeddings for all chunks from document 10
    doc_df = df.filter(pl.col("document_idx") == 10)
    doc_embeds = X[doc_df["idx"].to_numpy()]
    
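    The idx mapping also works for grouped operations. For example, here is a sketch that collapses each document into the mean of its chunk embeddings (it assumes a recent Polars version, where group_by iteration yields tuple keys):

    import numpy as np

    doc_vectors = {}
    for (doc_id,), group in df.group_by("document_idx", maintain_order=True):
        doc_vectors[doc_id] = X[group["idx"].to_numpy()].mean(axis=0)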

Advanced Chunking Strategies

Token-Based Chunking

In addition to sentence-based chunking, you can specify a maximum token count per chunk using max_chunk_tokens. This is useful when you need chunks that fit within a specific token budget:

# Token-based chunking: accumulate sentences until token limit
df, X = model.encode(
    docs,
    max_chunk_tokens=128,  # Maximum 128 tokens per chunk
    max_chunk_sents=None,  # No sentence limit
)

You can also combine both constraints—whichever limit is reached first determines the chunk boundary:

# Combined constraints: "at most 3 sentences AND at most 128 tokens"
df, X = model.encode(
    docs,
    max_chunk_sents=3,
    max_chunk_tokens=128,  # Whichever limit is hit first
)

Handling long sentences: By default, sentences exceeding max_chunk_tokens are split into multiple chunks at token boundaries (split_long_sents=True). Set split_long_sents=False to keep long sentences intact as their own chunks. Note that this parameter only affects chunking at the max_chunk_tokens boundary—sentences exceeding the model's max_length are handled automatically during tokenization through prechunking.
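
For example, a brief sketch keeping oversized sentences intact:

# Long sentences become single oversized chunks instead of being split
df, X = model.encode(
    docs,
    max_chunk_tokens=64,
    split_long_sents=False,
)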

Comparing Multiple Chunk Configurations

To experiment with different chunk sizes simultaneously, pass lists to max_chunk_sents and/or max_chunk_tokens. When both are lists, they must have the same length and will be processed as aligned pairs:

# Multiple sentence sizes
df, X = model.encode(
    docs,
    max_chunk_sents=[1, 2, 3],  # Creates 3 configurations
)

# Multiple token limits
df, X = model.encode(
    docs,
    max_chunk_tokens=[64, 128, 256],  # Creates 3 configurations
)

# Aligned pairs (NOT cartesian product)
df, X = model.encode(
    docs,
    max_chunk_sents=[1, 2, 3],
    max_chunk_tokens=[64, 128, 256],  # Same length required
    # Creates: (1, 64), (2, 128), (3, 256) - only 3 configs!
)

The resulting DataFrame includes max_chunk_sents and max_chunk_tokens columns to identify which configuration produced each chunk:

import polars as pl

# Filter to specific configuration (Polars)
# Cast to int since config columns can contain None
df_small_chunks = df.filter(
    (pl.col("max_chunk_sents").cast(pl.Int64, strict=False) == 1) &
    (pl.col("max_chunk_tokens").cast(pl.Int64, strict=False) == 64)
)

# For pandas DataFrames, direct comparison works:
# df_small_chunks = df[(df["max_chunk_sents"] == 1) & (df["max_chunk_tokens"] == 64)]

Note: Config columns may contain None values when using lists with mixed values (e.g., [1, None, 2]). Cast to Int64 for numeric comparisons in Polars.
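
Here is a sketch of such a mixed configuration (assuming, as in the single-configuration examples above, that None disables the corresponding limit for that pair):

# Config 0: up to 2 sentences, no token limit
# Config 1: no sentence limit, up to 128 tokens
df, X = model.encode(
    docs,
    max_chunk_sents=[2, None],
    max_chunk_tokens=[None, 128],
)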

Use Case: Retrieval Performance Tuning

Generate multiple chunk configurations in a single embedding pass to systematically optimize your retrieval system. This is particularly valuable for production RAG systems where you want to find the chunk strategy that maximizes retrieval quality for your specific domain and queries.

Workflow:

  1. Benchmark multiple configurations in one pass - no re-embedding overhead:

    df, X = model.encode(
        docs,
        max_chunk_sents=[1, 2, 3],
        max_chunk_tokens=[64, 128, 256],  # Aligned pairs: (1,64), (2,128), (3,256)
    )
    
  2. Store each configuration in a separate namespace to compare retrieval quality (e.g. with Pinecone):

    import polars as pl
    from pinecone import Pinecone
    
    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("your-index-name")
    
    max_chunk_sents = [1, 2, 3]
    max_chunk_tokens = [64, 128, 256]
    
    for config_sents, config_tokens in zip(max_chunk_sents, max_chunk_tokens):
        config_df = df.filter(
            (pl.col("max_chunk_sents") == config_sents) &
            (pl.col("max_chunk_tokens") == config_tokens)
        )
        config_embeds = X[config_df["idx"]]
    
        # Prepare vectors: (id, values, metadata)
        vectors = [
            (str(row["idx"]), embedding.tolist(), row)
            for row, embedding in zip(config_df.to_dicts(), config_embeds)
        ]
    
        index.upsert(
            vectors=vectors,
            namespace=f"chunks_{config_sents}s_{config_tokens}t",
        )
    

    Other vector databases (Qdrant, Weaviate, Milvus) support similar namespace/collection-based segmentation.

  3. Run your retrieval evaluation against each namespace to measure which configuration performs best on your metrics (MRR, NDCG, recall@k, etc.).
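
    A minimal sketch of this evaluation loop, assuming a small labeled query set (the relevant chunk IDs, namespaces, and recall@5 metric below are illustrative):

    eval_set = [("lithium battery limitations", {"12", "47"})]  # (query, relevant chunk ids)
    q_embeds = model.encode_queries([query for query, _ in eval_set])

    for ns in ["chunks_1s_64t", "chunks_2s_128t", "chunks_3s_256t"]:
        hits = 0
        for (query, relevant), q_vec in zip(eval_set, q_embeds):
            res = index.query(vector=q_vec.tolist(), top_k=5, namespace=ns)
            retrieved = {match.id for match in res.matches}
            hits += bool(retrieved & relevant)
        print(ns, "recall@5:", hits / len(eval_set))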

This approach allows you to optimize chunk strategy for your specific use case without the computational cost of multiple embedding passes.

Using Pandas Instead of Polars

Afterthoughts uses Polars by default for its speed and memory efficiency, but pandas is fully supported for users who prefer it or need compatibility with existing code. Simply set return_frame="pandas":

df, X = model.encode(
    docs,
    max_chunk_sents=2,
    return_frame="pandas",  # Return a pandas DataFrame
)

# Use familiar pandas operations
df.groupby("document_idx").size()
df[df["max_chunk_sents"] == 2]

The pandas integration requires pandas to be installed (pip install pandas). The DataFrame schema and all functionality remain identical—only the return type changes.

Memory Optimizations

The LateEncoder class supports two memory optimization parameters:

Dimension Truncation (truncate_dims)

For models trained with Matryoshka Representation Learning (MRL), you can truncate embeddings to smaller dimensions with minimal quality loss. No retraining required—just slice the first N dimensions.

from afterthoughts import LateEncoder

# This model was trained with MRL at dimensions [768, 512, 256, 128, 64]
model = LateEncoder(
    "tomaarsen/mpnet-base-nli-matryoshka",
    truncate_dims=256,  # Truncate to 256 dimensions
)

Truncation is applied to token embeddings before pooling, which saves both memory and compute during inference.

Note: Truncation also works on non-MRL models, but may degrade embedding quality since they weren't trained to preserve information in leading dimensions.

Float16 Embeddings (half_embeds)

Convert chunk embeddings to float16 for 2x memory reduction:

from afterthoughts import LateEncoder

model = LateEncoder(
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    half_embeds=True,  # Convert embeddings to float16
)

These options can be combined for additional savings.
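
For example, a sketch combining both with an MRL-trained model:

model = LateEncoder(
    "tomaarsen/mpnet-base-nli-matryoshka",
    truncate_dims=256,   # keep the first 256 dimensions
    half_embeds=True,    # store chunk embeddings as float16
)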

Performance Optimizations

Using Automatic Mixed Precision (AMP)

To enable automatic mixed precision, set the amp parameter to True during initialization. AMP automatically runs numerically safe operations in a lower-precision data type while keeping precision-sensitive operations in float32, which reduces the model's memory footprint and increases inference speed. Using AMP generally lets you increase the batch size.

import torch
from afterthoughts import LateEncoder

model = LateEncoder(
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    amp=True,
    amp_dtype=torch.float16,  # Choose the lower-precision data type
)

Running the Model in 16-Bit Precision

To run the model in 16-bit precision, set the torch_dtype parameter to torch.float16 or torch.bfloat16 during initialization. This reduces the model's memory footprint and increases inference speed, and generally lets you increase the batch size. It is similar to using AMP, but cruder and more aggressive, since every operation runs in reduced precision.

import torch
from afterthoughts import LateEncoder

model = LateEncoder(
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    torch_dtype=torch.float16,  # Run the model in 16-bit precision
)

Alternatively, you can convert the model to 16-bit precision after it has been loaded:

from afterthoughts import LateEncoder

model = LateEncoder("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
model.half()  # Convert the model to 16-bit precision

Logging

Afterthoughts uses Python's standard logging module for diagnostic output. By default, logging is silent. To enable logging:

import afterthoughts

# Quick setup with configure_logging
afterthoughts.configure_logging(level="INFO")  # INFO, DEBUG, WARNING, etc.

Or use Python's logging module directly for more control:

import logging

# Enable debug output from Afterthoughts
logging.getLogger("afterthoughts").setLevel(logging.DEBUG)
logging.basicConfig()

Log levels:

  • INFO: Model loading, compilation, preprocessing time
  • DEBUG: Batch sizes, token counts, and other diagnostic details

Instruct-Style Embedding Models

Many modern embedding models require instruction prefixes to achieve optimal performance. Afterthoughts supports these models through query_prompt and document_prompt parameters.

E5-Instruct Models

E5-instruct models (e5-mistral-7b-instruct, multilingual-e5-large-instruct) require a task instruction for queries but not for documents:

from afterthoughts import LateEncoder

model = LateEncoder(
    "intfloat/multilingual-e5-large-instruct",
    query_prompt="Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ",
)

# Queries use the prompt automatically
query_embeds = model.encode_queries(["how much protein should a female eat"])

# Documents are encoded without any prompt
df, X = model.encode(docs, max_chunk_sents=2)

BGE Models

BGE models use a simpler prefix for queries:

model = LateEncoder(
    "BAAI/bge-large-en-v1.5",
    query_prompt="Represent this sentence for searching relevant passages: ",
)

Nomic Embed

Nomic requires task prefixes for both queries and documents:

model = LateEncoder(
    "nomic-ai/nomic-embed-text-v1.5",
    query_prompt="search_query: ",
    document_prompt="search_document: ",
)

Instructor Models

Instructor models use domain-specific instructions for both queries and documents:

model = LateEncoder(
    "hkunlp/instructor-large",
    query_prompt="Represent the Wikipedia question for retrieving supporting documents: ",
    document_prompt="Represent the Wikipedia document for retrieval: ",
)

Per-Call Prompt Override

You can override the default prompt for specific calls:

# Use a different task for this specific query
query_embeds = model.encode_queries(
    queries,
    prompt="Represent the sentence for clustering: ",
)

# Override document prompt for a specific encoding
df, X = model.encode(
    docs,
    prompt="Represent the scientific abstract: ",
)

How Prompts Work with Late Chunking

When a document prompt is provided:

  1. The prompt is prepended to each document before tokenization
  2. Sentence boundaries are detected on the original text (without prompt)
  3. Prompt tokens are included in the model input for attention context
  4. Prompt tokens are excluded from chunk mean-pooling (they get sentence_id=-1)

This ensures that document token embeddings benefit from attending to the prompt during the forward pass, while the final chunk embeddings represent only the actual document content.
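
A small sketch of this behavior (the prompt here is arbitrary and chosen purely for illustration; the printed chunk text is the expected result, not verified output):

model = LateEncoder(
    "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    document_prompt="search_document: ",
)
df, X = model.encode(["Cats sleep a lot. Dogs bark."], max_chunk_sents=1)
# The prompt shaped the token embeddings, but the chunks exclude it:
print(df["chunk"].to_list())  # expected: ['Cats sleep a lot.', 'Dogs bark.']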

Differences from the Late Chunking Paper

Afterthoughts implements the core late chunking approach from Günther et al., 2024 with some implementation choices that differ from the paper's recommendations. For details on special token handling, deduplication strategy, and chunk definitions, see docs/gunther-et-al-2024-differences.md.

Known Limitations

Memory Requirements

Since each document can contain many chunks, the memory requirements for this approach can be high. Use half_embeds=True and truncate_dims to reduce the memory footprint.

Sequence Length

Late chunking's contextual benefits are bounded by the model's maximum sequence length. Documents exceeding this limit are split into overlapping sequences at sentence boundaries, which can reduce cross-chunk context at the boundaries. For best results, use long-context embedding models (e.g., models supporting 8K+ tokens) with documents that fit within the context window.

Future Work

  • Add paragraph segmentation
  • Support for additional chunking strategies (e.g., semantic chunking)
  • Support task-specific LoRA adapters (e.g., jina-embeddings-v3)

References

Late chunking technique:

Günther, M., Milliken, I., Geuter, J., Mastrapas, G., Wang, B., & Xiao, H. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv:2409.04701. https://arxiv.org/abs/2409.04701

License

This project is licensed under the Apache License 2.0.

Copyright 2024-2026 Nicholas Gigliotti.

You may use, distribute, and modify this project under the terms of the Apache License 2.0. For detailed information, see the LICENSE file included in this repository or visit the official Apache License website.
