LASR (Least Action Semantic Router): globally optimal text chunking for RAG
LASR — Least Action Semantic Router
Pronounced "laser"
Globally optimal text chunking for RAG pipelines. LASR treats chunking as a physics-inspired optimization problem: among all partitions of a document that satisfy the sentence-count limits, it selects the one that minimizes a global objective balancing semantic cohesion against boundary cost.
No greedy local decisions. Dynamic programming finds the partition that is exactly optimal for this objective.
Install
pip install lasr
python -m spacy download en_core_web_sm
Quick Start
from lasr import chunk
chunks = chunk(open("document.txt").read())
for c in chunks:
    print(f"[{c.start_char}:{c.end_char}] ({c.num_sentences} sentences)")
    print(c.text)
    print("---")
Control Granularity
from lasr import chunk
# Fewer, larger chunks (higher alpha = more expensive boundaries)
chunks = chunk(document, alpha=3.0)
# More, smaller chunks
chunks = chunk(document, alpha=1.5)
# Adjust sentence constraints
chunks = chunk(document, min_sentences=3, max_sentences=20)
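To choose alpha for a particular corpus, it can help to see how the chunk count responds to it. The helper below is a minimal sketch: `sweep_alpha` is a hypothetical name, not part of LASR, and it only assumes that the chunker accepts a document plus an `alpha` keyword, as `chunk` does above.

```python
def sweep_alpha(chunk_fn, document, alphas):
    """Return {alpha: number of chunks} for each candidate alpha.

    chunk_fn is any callable with the shape chunk_fn(document, alpha=...),
    for example lasr.chunk. This helper is illustrative, not part of LASR.
    """
    return {alpha: len(chunk_fn(document, alpha=alpha)) for alpha in alphas}


# Usage (assumes lasr is installed and `document` holds the text):
# from lasr import chunk
# counts = sweep_alpha(chunk, document, [1.5, 2.5, 3.0])
# for alpha, n in counts.items():
#     print(f"alpha={alpha}: {n} chunks")
```

Since a higher alpha makes boundaries more expensive, the counts should generally go down as alpha goes up.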
Power-User API
For full control, use LaserPipeline and LaserConfig directly:
from lasr import LaserPipeline, LaserConfig
config = LaserConfig(
    alpha_base=2.5,  # boundary cost
    rho=1.0,         # tension coefficient
    l_min=5,         # min sentences per chunk
    l_max=30,        # max sentences per chunk
    model_name="all-MiniLM-L6-v2",
)
pipeline = LaserPipeline(config)
chunks = pipeline.chunk(text)
# Each chunk has context bleed for richer retrieval
for c in chunks:
    print(c.text)               # core DP-optimal text
    print(c.text_with_context)  # with 1-sentence bleed from neighbors
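To make the idea of context bleed concrete, here is a minimal sketch of how a one-sentence bleed could be derived from sentence spans. It is an illustration only: the function `with_context`, its arguments, and the space-joined output are assumptions, and LASR's own construction of `text_with_context` may differ.

```python
def with_context(sentences, spans, bleed=1):
    """Return (core_text, text_with_context) for each (start, end) span.

    sentences: list of sentence strings for the whole document.
    spans: half-open (start, end) sentence index ranges, one per chunk.
    bleed: number of neighboring sentences borrowed on each side (assumed).
    """
    out = []
    for start, end in spans:
        core = " ".join(sentences[start:end])
        lo = max(0, start - bleed)               # clamp at document start
        hi = min(len(sentences), end + bleed)    # clamp at document end
        out.append((core, " ".join(sentences[lo:hi])))
    return out


sentences = ["A.", "B.", "C.", "D."]
for core, ctx in with_context(sentences, [(0, 2), (2, 4)]):
    print(repr(core), "->", repr(ctx))
# 'A. B.' -> 'A. B. C.'
# 'C. D.' -> 'B. C. D.'
```

The core text stays exactly as the optimizer chose it; only the retrieval-facing variant borrows from its neighbors.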
CLI
lasr chunk document.txt --alpha 2.5 --format json
lasr chunk document.txt --format text --output chunks.txt
lasr chunk document.txt --encoder openai --model text-embedding-3-large
Parameters
| Parameter | Default | Effect |
|---|---|---|
| alpha / alpha_base | 2.5 | Boundary cost. Higher = fewer, larger chunks. |
| rho | 1.0 | Tension coefficient (anchor parameter). |
| min_sentences / l_min | 5 | Minimum sentences per chunk. |
| max_sentences / l_max | 30 | Maximum sentences per chunk. |
| w_struct | 0.25 | Structural discount (headers, double newlines). |
| w_bind | 1.0 | Coreference binding penalty (pronouns). |
| w_disc | 0.3 | Discourse connective penalty. |
Benchmark Highlights
All results use all-MiniLM-L6-v2 (22M parameters, 384 dimensions) with alpha=2.5.
| Dataset | Domain | LASR Recall@5 | Next Best | Margin |
|---|---|---|---|---|
| MSMARCO | Web passages | 0.999 | 0.985 | +0.014 |
| HotpotQA | Multi-hop QA | 0.974 | 0.972 | +0.002 |
| FinanceBench | SEC filings | 0.930 | 0.629 | +0.301 |
| CUAD | Legal contracts | 0.826 | 0.775 | +0.051 |
LASR places first on every retrieval benchmark tested. On FinanceBench, the margin over the next best method is 30 percentage points.
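For readers who want to reproduce a comparable metric on their own data, the sketch below computes Recall@k under one common convention: a query counts as a hit if at least one relevant chunk appears in the top k results. This README does not state which convention the benchmark uses, so the definition, the function name `recall_at_k`, and the input layout are all assumptions.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of queries with at least one relevant chunk in the top k.

    ranked_ids: one ranked list of retrieved chunk ids per query.
    relevant_ids: one collection of relevant chunk ids per query.
    Hit-rate convention; assumed here, not confirmed by the benchmark.
    """
    if not ranked_ids:
        return 0.0
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        if set(ranked[:k]) & set(relevant):
            hits += 1
    return hits / len(ranked_ids)


ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"c"}, {"q"}]
print(recall_at_k(ranked, relevant, k=3))  # 0.5
```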
How It Works
LASR models each document as a chain of semantic units (sentences) and finds the partition that minimizes:
Action = Tension + Boundary Cost
- Tension measures semantic dispersion inside each chunk (cosine distance to centroid via prefix sums)
- Boundary Cost (alpha) penalizes each split, preventing over-fragmentation
The optimization is solved exactly via dynamic programming in O(T * L_max) time, where T is the number of sentences and L_max is the maximum number of sentences per chunk (l_max). There is no approximation or sampling, so the same input always produces the same output.
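The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration, not LASR's implementation. It assumes unit-normalized sentence embeddings, takes tension to be rho times the summed cosine distance of each sentence to its chunk centroid, and charges alpha once per chunk, which differs from charging per boundary only by a constant. The name `optimal_partition` and its signature are hypothetical, and the structural, binding, and discourse weights are left out.

```python
import math


def optimal_partition(embeddings, alpha=2.5, rho=1.0, l_min=1, l_max=30):
    """Return half-open (start, end) sentence spans minimizing total action.

    embeddings: list of unit-length sentence vectors (lists of floats).
    """
    T = len(embeddings)
    dim = len(embeddings[0]) if T else 0

    # prefix[i] is the element-wise sum of embeddings[0:i].
    prefix = [[0.0] * dim]
    for vec in embeddings:
        prefix.append([p + v for p, v in zip(prefix[-1], vec)])

    def tension(i, j):
        # For unit vectors, the summed cosine distance to the centroid
        # direction equals (j - i) minus the norm of the segment sum.
        seg = [b - a for a, b in zip(prefix[i], prefix[j])]
        return (j - i) - math.sqrt(sum(x * x for x in seg))

    INF = float("inf")
    best = [INF] * (T + 1)  # best[j]: minimal action for sentences [0, j)
    back = [-1] * (T + 1)   # back[j]: start index of the last chunk
    best[0] = 0.0
    for j in range(1, T + 1):
        # Only chunk lengths between l_min and l_max are considered.
        for i in range(max(0, j - l_max), j - l_min + 1):
            if best[i] == INF:
                continue
            cost = best[i] + rho * tension(i, j) + alpha
            if cost < best[j]:
                best[j], back[j] = cost, i

    if T and best[T] == INF:
        raise ValueError("no partition satisfies the length constraints")

    # Follow the back-pointers from the end to recover the chunks.
    spans, j = [], T
    while j > 0:
        spans.append((back[j], j))
        j = back[j]
    return spans[::-1]


# Two tight topics of three sentences each (toy 2-D embeddings).
toy = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
print(optimal_partition(toy, alpha=0.5, l_max=6))  # [(0, 3), (3, 6)]
print(optimal_partition(toy, alpha=5.0, l_max=6))  # [(0, 6)]
```

The prefix sums let each candidate chunk be scored from two stored vectors instead of re-reading its sentences, so the double loop visits at most T * L_max candidates. In this sketch each candidate still costs time proportional to the embedding dimension. The toy run also shows the role of alpha: a cheap boundary splits the two topics, while an expensive one keeps them together.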
Development
git clone https://github.com/lasr-chunker/lasr
cd lasr
pip install -e ".[dev]"
python -m spacy download en_core_web_sm
pytest
License
MIT