
LASR (Least Action Semantic Router): globally optimal text chunking for RAG


LASR — Least Action Semantic Router

Pronounced "laser"

Globally optimal text chunking for RAG pipelines. LASR treats chunking as a physics-inspired optimization problem — it considers every possible way to partition a document and selects the one that minimizes a global objective balancing semantic cohesion against boundary cost.

No heuristics. No greedy local decisions. Just dynamic programming that finds the mathematically optimal partition under that objective.

Install

pip install lasr
python -m spacy download en_core_web_sm

Quick Start

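from pathlib import Path

from lasr import chunk

# Read the whole document as a single string (the file is closed automatically)
document = Path("document.txt").read_text(encoding="utf-8")
chunks = chunk(document)

for c in chunks:
    # Character offsets into the original document, plus the sentence count
    print(f"[{c.start_char}:{c.end_char}] ({c.num_sentences} sentences)")
    print(c.text)
    print("---")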
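In a RAG pipeline, chunks are usually serialized together with their offsets before embedding or indexing. A minimal sketch, assuming only the chunk attributes shown in the Quick Start (text, start_char, end_char); chunks_to_jsonl, the record field names, and the SimpleNamespace stand-in are illustrative and not part of lasr:

```python
import json
from types import SimpleNamespace


def chunks_to_jsonl(chunks, source):
    """Serialize chunk-like objects to JSON Lines, one record per chunk.

    Hypothetical helper: it only relies on the .text, .start_char and
    .end_char attributes used in the Quick Start.
    """
    lines = []
    for i, c in enumerate(chunks):
        record = {
            "id": f"{source}#{i}",      # illustrative ID scheme
            "source": source,
            "start_char": c.start_char,
            "end_char": c.end_char,
            "text": c.text,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in object so the sketch runs without lasr installed
demo_chunks = [SimpleNamespace(start_char=0, end_char=5, num_sentences=1, text="Hello")]
jsonl = chunks_to_jsonl(demo_chunks, "document.txt")
```

Keeping start_char and end_char in each record makes it possible to map a retrieved chunk back to its position in the source document.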

Control Granularity

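from pathlib import Path

from lasr import chunk

document = Path("document.txt").read_text(encoding="utf-8")

# Fewer, larger chunks (higher alpha = more expensive boundaries)
chunks = chunk(document, alpha=3.0)

# More, smaller chunks (lower alpha = cheaper boundaries)
chunks = chunk(document, alpha=1.5)

# Adjust sentence constraints (defaults: min_sentences=5, max_sentences=30)
chunks = chunk(document, min_sentences=3, max_sentences=20)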

Power-User API

For full control, use LaserPipeline and LaserConfig directly:

from pathlib import Path

from lasr import LaserPipeline, LaserConfig

config = LaserConfig(
    alpha_base=2.5,     # boundary cost (higher = fewer, larger chunks)
    rho=1.0,            # tension coefficient
    l_min=5,            # min sentences per chunk
    l_max=30,           # max sentences per chunk
    model_name="all-MiniLM-L6-v2",  # sentence-embedding model
)

pipeline = LaserPipeline(config)
text = Path("document.txt").read_text(encoding="utf-8")
chunks = pipeline.chunk(text)

# Each chunk also carries context bleed for richer retrieval
for c in chunks:
    print(c.text)               # core DP-optimal text
    print(c.text_with_context)  # with 1-sentence bleed from neighbors
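To make the idea of context bleed concrete, here is a small standalone sketch of how a one-sentence bleed could be built from sentence spans. It is an illustration under stated assumptions, not the library's code: with_context, the (start, end) span format, and joining sentences with a single space are all hypothetical.

```python
def with_context(sentences, spans, bleed=1):
    """Return (core_text, text_with_context) for each half-open sentence span.

    Hypothetical helper: pads each chunk with up to `bleed` sentences from
    the previous and next chunk, clipped at the document edges.
    """
    results = []
    for start, end in spans:
        lo = max(0, start - bleed)                 # do not run past the first sentence
        hi = min(len(sentences), end + bleed)      # do not run past the last sentence
        core = " ".join(sentences[start:end])
        padded = " ".join(sentences[lo:hi])
        results.append((core, padded))
    return results


sentences = ["A.", "B.", "C.", "D."]
pairs = with_context(sentences, [(0, 2), (2, 4)])
# pairs[0] == ("A. B.", "A. B. C.")
# pairs[1] == ("C. D.", "B. C. D.")
```

The core text is what the optimizer selected; the padded text only adds neighboring sentences, so chunk boundaries themselves do not move.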

CLI

lasr chunk document.txt --alpha 2.5 --format json
lasr chunk document.txt --format text --output chunks.txt
lasr chunk document.txt --encoder openai --model text-embedding-3-large

Parameters

| Parameter | Default | Effect |
|---|---|---|
| alpha / alpha_base | 2.5 | Boundary cost. Higher = fewer, larger chunks. |
| rho | 1.0 | Tension coefficient (anchor parameter). |
| min_sentences / l_min | 5 | Minimum sentences per chunk. |
| max_sentences / l_max | 30 | Maximum sentences per chunk. |
| w_struct | 0.25 | Structural discount (headers, double newlines). |
| w_bind | 1.0 | Coreference binding penalty (pronouns). |
| w_disc | 0.3 | Discourse connective penalty. |

Benchmark Highlights

All results use all-MiniLM-L6-v2 (22M parameters, 384 dimensions) with alpha=2.5.

| Dataset | Domain | LASR Recall@5 | Next Best | Margin |
|---|---|---|---|---|
| MSMARCO | Web passages | 0.999 | 0.985 | +0.014 |
| HotpotQA | Multi-hop QA | 0.974 | 0.972 | +0.002 |
| FinanceBench | SEC filings | 0.930 | 0.629 | +0.301 |
| CUAD | Legal contracts | 0.826 | 0.775 | +0.051 |

LASR places first on every retrieval benchmark tested. On FinanceBench, the margin over the next best method is 30.1 percentage points (0.930 vs. 0.629); on HotpotQA it is 0.2 percentage points.
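Recall@k is commonly computed as the fraction of queries for which at least one relevant chunk appears among the top k retrieved results. The sketch below assumes that definition; the benchmarks here may use a different variant, and recall_at_k is a hypothetical helper, not part of lasr.

```python
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k=5):
    """Fraction of queries with at least one relevant ID in the top k results.

    Assumed definition (hit-rate style). Each element of
    ranked_ids_per_query is a list of retrieved chunk IDs, best first.
    """
    if not ranked_ids_per_query:
        raise ValueError("need at least one query")
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        if set(ranked[:k]) & set(relevant):   # any overlap within the top k
            hits += 1
    return hits / len(ranked_ids_per_query)


ranked = [["c1", "c7", "c3"], ["c9", "c2", "c4"]]
relevant = [{"c3"}, {"c8"}]
score = recall_at_k(ranked, relevant, k=5)   # first query hits, second misses
```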

How It Works

LASR models each document as a chain of semantic units (sentences) and finds the partition that minimizes:

Action = Tension + Boundary Cost

  • Tension measures semantic dispersion inside each chunk (cosine distance to centroid via prefix sums)
  • Boundary Cost (alpha) penalizes each split, preventing over-fragmentation
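One way to write this objective down, as a sketch rather than the library's exact formula: assume unit-normalized sentence embeddings $e_t$, contiguous chunks $C_1, \dots, C_K$, and that rho simply scales the tension term. The structural, binding, and discourse weights (w_struct, w_bind, w_disc) are left out here.

```latex
\[
\mathcal{A}(C_1,\dots,C_K)
  = \rho \sum_{k=1}^{K} \sum_{t \in C_k} \bigl(1 - \cos(e_t, \mu_k)\bigr) + \alpha K,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{t \in C_k} e_t .
\]

% With \|e_t\| = 1 and S_k = \sum_{t \in C_k} e_t, we have
% \cos(e_t, \mu_k) = e_t \cdot S_k / \|S_k\|, so summing over t gives
\[
\sum_{t \in C_k} \bigl(1 - \cos(e_t, \mu_k)\bigr) = |C_k| - \|S_k\| .
\]

% For a chunk covering sentences i+1..j, S_k = P_j - P_i with prefix sums
% P_t = e_1 + \dots + e_t, so each chunk's tension needs one vector difference.
```

Under these assumptions a perfectly coherent chunk (all embeddings identical) has zero tension, and charging $\alpha$ per chunk instead of per split only shifts the total by a constant, so the optimal partition is the same.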

The optimization is solved exactly via dynamic programming in O(T * L_max) time, where T is the number of sentences and L_max is the maximum chunk length in sentences (max_sentences / l_max). No approximations and no sampling: the same input always produces the same output.
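The dynamic program can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not LASR's implementation: least_action_partition is a hypothetical name, embeddings are unit-normalized so a chunk's tension equals its length minus the norm of its embedding sum, rho scales tension, alpha is charged once per chunk, and the w_struct, w_bind, and w_disc terms are omitted.

```python
import math


def least_action_partition(embeddings, alpha=2.5, rho=1.0, l_min=5, l_max=30):
    """Return half-open sentence spans [(start, end), ...] minimizing
    rho * tension + alpha per chunk, with l_min <= chunk length <= l_max."""
    if not 1 <= l_min <= l_max:
        raise ValueError("require 1 <= l_min <= l_max")
    T = len(embeddings)
    if T == 0:
        return []
    dim = len(embeddings[0])

    # Prefix sums of unit-normalized embeddings: prefix[t] = e_0 + ... + e_{t-1}
    prefix = [[0.0] * dim]
    for v in embeddings:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        prefix.append([p + x / norm for p, x in zip(prefix[-1], v)])

    def tension(i, j):
        # Sum of cosine distances to the centroid for sentences i..j-1,
        # which for unit vectors equals (j - i) - ||prefix[j] - prefix[i]||
        s = [b - a for a, b in zip(prefix[i], prefix[j])]
        return (j - i) - math.sqrt(sum(x * x for x in s))

    best = [math.inf] * (T + 1)   # best[j] = minimal action for sentences 0..j-1
    back = [-1] * (T + 1)         # back[j] = start of the last chunk in that optimum
    best[0] = 0.0
    for j in range(1, T + 1):
        for i in range(max(0, j - l_max), j - l_min + 1):
            if best[i] == math.inf:
                continue
            cost = best[i] + rho * tension(i, j) + alpha
            if cost < best[j]:
                best[j], back[j] = cost, i
    if best[T] == math.inf:
        raise ValueError("no partition satisfies the length constraints")

    # Walk the back-pointers from the end to recover the chunk spans
    spans, j = [], T
    while j > 0:
        spans.append((back[j], j))
        j = back[j]
    return spans[::-1]


# Toy document: three sentences on one topic, then three on another
two_topics = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
low_alpha = least_action_partition(two_topics, alpha=0.5, l_min=1)   # [(0, 3), (3, 6)]
high_alpha = least_action_partition(two_topics, alpha=3.0, l_min=1)  # [(0, 6)]
```

On this toy input a cheap boundary (alpha=0.5) splits exactly at the topic change, while an expensive one (alpha=3.0) keeps everything in a single chunk. Each state looks back over at most l_max candidate starts, which gives the O(T * L_max) number of chunk evaluations; each evaluation here additionally costs time proportional to the embedding dimension.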

Development

git clone https://github.com/lasr-chunker/lasr
cd lasr
pip install -e ".[dev]"
python -m spacy download en_core_web_sm
pytest

License

MIT

