A semantic linebreaker powered by transformers

⚡️ Semantic Line Breaker (SemBr)

> When writing text
> with a compatible markup language,
> add a line break
> after each substantial unit of thought.

What is SemBr?

SemBr is a command-line tool powered by Transformer models that breaks lines in a text file at semantic boundaries. It supports multiple file types including LaTeX, Markdown, and plain text, with automatic file type detection.

[20 Jun 2026] 🚀 SemBr now supports MLX + NVFP4 on macOS, which is very fast: it takes under 6 seconds to process 100k words on an old M2 MacBook Pro.

Installation

SemBr is available as a Python package on PyPI.

macOS on Apple Silicon with MLX

On Apple Silicon Macs, SemBr can use the MLX backend with NVFP4 quantization, which is ~30x faster than torch+MPS! Install the MLX extra:

uv tool install "sembr[mlx]"

Linux/Windows with CUDA support

For CUDA on Linux, install the CUDA extra:

uv tool install "sembr[cuda]"

CPU (Linux/Windows) or MPS (macOS) only

Install with uv:

uv tool install "sembr[cpu]"

From GitHub (Latest Development Version)

To install the latest development version directly from GitHub:

# Install from GitHub main branch
uv tool install git+https://github.com/admk/sembr.git

# Run directly without installing
uvx --from git+https://github.com/admk/sembr.git sembr

Note that the development version may include experimental features and could be less stable than the PyPI release.

Development

To develop this project, clone and install in development mode:

git clone https://github.com/admk/sembr.git
cd sembr
SEMBR_VERSION_SUFFIX=.dev0 \
  uv tool install --editable . --force --refresh-package sembr

Supported Platforms

SemBr is supported on Linux, macOS and Windows (well-tested on macOS). On machines with CUDA devices, or on Apple Silicon Macs, SemBr will use the GPU (via CUDA, MPS or MLX) to accelerate inference.

Usage

Command Line Interface

To use SemBr, run the following command in your terminal:

sembr -i <input_file> -o <output_file>

where <input_file> and <output_file> are the paths to the input and output files respectively.

On the first run, it will download the SemBr model and cache it in ~/.cache/huggingface. Subsequent runs will check for updates and use the cached model if it is up-to-date.

Alternatively, you can pipe the input into sembr, and the output can also be printed to the terminal:

cat <input_file> | sembr

This is especially useful if you want to use SemBr with clipboard managers, for instance, on a Mac:

pbpaste | sembr | pbcopy

Or on Linux:

xclip -o | sembr | xclip -i
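
The same stdin/stdout pipe can be driven from a script. The following is a minimal sketch, assuming only that `sembr` is on `PATH` and reads from standard input as shown above; the helper name `run_sembr` and its `command` parameter are illustrative and not part of SemBr.

```python
import subprocess
from typing import Sequence


def run_sembr(text: str, command: Sequence[str] = ("sembr",)) -> str:
    """Pipe `text` into a command's stdin and return what it writes to stdout.

    `command` defaults to the `sembr` executable; it is a parameter only so
    the helper can be tried with a stand-in command (hypothetical design).
    """
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Usage (requires sembr to be installed):
#     wrapped = run_sembr(open("paper.tex").read())
```

Because the first run downloads the model, the first call may take noticeably longer than later ones.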

You can also specify the following command-line options:

  • -l, --listen: Serves the SemBr API on a local server.
    • Each sembr invocation checks whether the API is accessible; if it is not, it runs the model itself.
    • This option is useful to avoid the time taken to initialize the model by keeping it in memory in a separate process.
  • --file-type <type>: File type (plaintext, latex, markdown, etc.). Auto-detected using Magika if not provided.
  • --mcp: Start MCP server mode instead of processing text.
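
The fallback described for --listen depends on knowing whether a server is reachable. How SemBr performs this check is not documented here; the sketch below only shows a plain TCP probe against the default host and port (127.0.0.1 and 8384, as listed under the config keys), and the function name `api_is_reachable` is an assumption.

```python
import socket


def api_is_reachable(host: str = "127.0.0.1", port: int = 8384,
                     timeout: float = 0.2) -> bool:
    """Return True if something accepts TCP connections at host:port.

    This only tests that a listener exists; it does not verify that the
    listener is actually a SemBr API server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False
```

A caller could use the result to decide between sending text to the server and loading the model locally.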

Configurations

Additionally, you can configure SemBr by creating $XDG_CONFIG_HOME/sembr/config.toml. If XDG_CONFIG_HOME is not set, SemBr reads ~/.config/sembr/config.toml. The complete commented defaults are stored in sembr/default.toml. Copy that file to your config path and edit only the values you want to change.
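
The lookup rule above can be sketched as follows. This is an illustration, not SemBr's own code: the function name `sembr_config_path` is hypothetical, and treating an empty `XDG_CONFIG_HOME` like an unset one follows the XDG Base Directory convention rather than anything stated in this README.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def sembr_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the config file path from XDG_CONFIG_HOME, else ~/.config."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CONFIG_HOME")
    # Unset or empty: fall back to ~/.config
    base = Path(xdg) if xdg else Path.home() / ".config"
    return base / "sembr" / "config.toml"


print(sembr_config_path({"XDG_CONFIG_HOME": "/tmp/cfg"}))
```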

To use it offline, you can download the model from Hugging Face and set model.name to the model directory, or prepend TRANSFORMERS_OFFLINE=1 to the command to use the cached model.

You can override config values for a single run with -c or --config:

sembr \
  -c model.name=/path/to/model \
  -c optimize.algorithm=balanced_linebreaks \
  -c optimize.preferred_min_tokens_per_line=8 \
  -c optimize.preferred_max_tokens_per_line=10 \
  -c optimize.line_length_penalty_weight=0.05
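
To make the `section.key=value` form concrete, here is one way such overrides could be folded into a nested configuration. It is a sketch under assumptions: `apply_overrides` and `_coerce` are hypothetical names, and the int/float/string coercion is a guess at reasonable behaviour, not SemBr's actual parsing.

```python
import copy
from typing import Any, Dict, Iterable


def _coerce(raw: str) -> Any:
    """Try int, then float, and otherwise keep the raw string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_overrides(config: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Apply dotted `key=value` strings to a copy of a nested dict."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted, sep, raw = item.partition("=")
        if not sep or not dotted:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = dotted.split(".")
        node = result
        for name in parents:  # walk or create intermediate tables
            node = node.setdefault(name, {})
        node[leaf] = _coerce(raw)
    return result


print(apply_overrides({}, ["optimize.preferred_min_tokens_per_line=8"]))
```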

The supported config keys are:

  • model.name: The name of the Hugging Face model to use.
  • model.backend: Inference backend to use. torch is the default. cuda uses the torch backend and requires a CUDA-capable torch install. Choose mlx on Apple Silicon.
  • model.bits: Quantization bits for model weights (4 or 8). Requires CUDA. Not supported on MPS.
  • model.dtype: Data type for model weights (e.g. float16, bfloat16). Default is float32.
  • model.quantization: MLX weight quantization mode. Set model.backend=mlx and model.quantization=nvfp4 to use MLX NVFP4 quantized linear layers. The default is none.
  • inference.batch_size: The number of lines to process in a batch. Default is 8.
  • inference.overlap_divisor: The overlap divisor for tiled inference. Default is 8.
  • optimize.algorithm: The prediction function to use. Options are argmax, logit_adjustment, greedy_linebreaks, and balanced_linebreaks. Default is balanced_linebreaks.
  • optimize.preferred_min_tokens_per_line: Preferred lower line length target. Default is 8.
  • optimize.preferred_max_tokens_per_line: Preferred upper line length target. Default is 10.
  • optimize.line_length_penalty_weight: Penalty weight for line lengths outside the preferred range. The default is 0.05.
  • format.num_spaces: Number of spaces represented by one indentation level, or auto to detect 2, 4, or 8 from the input. The default is auto.
  • format.indent_type: Indentation unit to emit. Options are space, tab, and auto. auto detects space or tab indentation from the input. The default is space.
  • listen.host: The host address of the SemBr API server. The default is 127.0.0.1.
  • listen.port: The port for the SemBr API server. The default is 8384.

Balanced line breaks

The balanced_linebreaks algorithm optimizes line breaks with dynamic programming over each parsed paragraph.

It precomputes token costs from the model log probabilities. A no-break token costs -log P(off), and a break token costs -log P(breaks). After choosing break positions, it uses the highest-scoring indent level at each chosen position to recover the break type.

The objective also adds a quadratic penalty when the token count falls outside optimize.preferred_min_tokens_per_line and optimize.preferred_max_tokens_per_line. Larger optimize.line_length_penalty_weight values make the algorithm favor the preferred range more strongly.

For a paragraph with n tokens, the implementation uses prefix sums, a monotonic queue for the no-penalty range, and a Li Chao tree for long-line penalties. The optimization complexity is O(n * l + n log n), where l is optimize.preferred_min_tokens_per_line. Memory usage is O(n) per paragraph.
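
The objective can be illustrated with a deliberately naive dynamic program. This sketch is not the implementation described above: it runs in O(n²) instead of using a monotonic queue or a Li Chao tree, it ignores indent levels, and the exact penalty form (weight times the squared distance outside the preferred range) is an assumption based on the description. The name `balanced_breaks` and its parameters are illustrative.

```python
import math
from typing import List, Sequence


def balanced_breaks(p_break: Sequence[float], min_tokens: int = 8,
                    max_tokens: int = 10, weight: float = 0.05) -> List[int]:
    """Return indices i such that a line break follows token i.

    p_break[i] is the probability of a break after token i. The end of the
    paragraph is not counted as a break and carries no cost.
    """
    n = len(p_break)
    if n == 0:
        return []
    eps = 1e-12  # guards against log(0)
    off_cost = [-math.log(max(1.0 - p, eps)) for p in p_break]
    on_cost = [-math.log(max(p, eps)) for p in p_break]
    prefix = [0.0]  # prefix[k] = total no-break cost of tokens 0..k-1
    for cost in off_cost:
        prefix.append(prefix[-1] + cost)

    def penalty(length: int) -> float:
        excess = max(min_tokens - length, length - max_tokens, 0)
        return weight * excess * excess

    best = [math.inf] * (n + 1)  # best[j]: min cost with a line ending after token j-1
    prev = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        closing = on_cost[end - 1] if end < n else 0.0
        for start in range(end):  # candidate line covers tokens start..end-1
            cost = (best[start] + prefix[end - 1] - prefix[start]
                    + closing + penalty(end - start))
            if cost < best[end]:
                best[end], prev[end] = cost, start

    breaks = []
    end = n
    while end > 0:  # walk back through the chosen line starts
        start = prev[end]
        if start > 0:
            breaks.append(start - 1)
        end = start
    return breaks[::-1]


# One confident break after token 2, preferred line length of exactly 3 tokens.
print(balanced_breaks([0.01, 0.01, 0.99, 0.01, 0.01, 0.01], 3, 3))  # [2]
```

With uniform probabilities the model term is the same for every segmentation, so the line-length penalty alone decides where the breaks go.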

MCP Server

Alternatively, you can run sembr as an MCP server. Simply add the following configuration to your MCP server configuration:

"mcpServers": {
  "sembr": {
    "type": "stdio",
    "command": "uvx",
    "args": [
      "sembr",
      "--mcp"
    ]
  }
}

The server also supports the formatting options described above. It will expose a wrap_text tool for the MCP client to use.
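
If the client configuration is managed by a script, the entry can be merged programmatically. This is a minimal sketch: `add_sembr_server` is a hypothetical helper, and where the resulting JSON is stored depends on the MCP client, which this README does not specify.

```python
import json
from typing import Any, Dict


def add_sembr_server(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an MCP client config with the sembr server added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep existing servers
    servers["sembr"] = {
        "type": "stdio",
        "command": "uvx",
        "args": ["sembr", "--mcp"],
    }
    merged["mcpServers"] = servers
    return merged


print(json.dumps(add_sembr_server({}), indent=2))
```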

What are Semantic Line Breaks?

Semantic Line Breaks or Semantic Linefeeds describe a set of conventions for using insensitive vertical whitespace to structure prose along semantic boundaries.

Why use Semantic Line Breaks?

Semantic Line Breaks have the following advantages:

  • Breaking lines by splitting clauses reflects the logical, grammatical and semantic structure of the text.

  • It enhances the ease of editing and version control for a text file. Merge conflicts are less likely to occur when small changes are made, and the changes are easier to identify.

  • Documents written with semantic line breaks are easier to navigate and edit with Vim and other text editors that use Vim keybindings.

  • Semantic line breaks are invisible to readers. The final rendered output shows no changes to the source text.

Why SemBr?

Converting existing text that was not written with semantic line breaks takes a long time to do manually, and it is surprisingly difficult to do automatically with rule-based methods.

Challenges of rule-based methods

Rule-based heuristics do not work well with the actual semantic structure of the text, often leading to incorrect semantic boundaries. Moreover, these boundaries are hierarchical and nested, and a rule-based approach cannot capture this structure. A semantic line break may occur after a dependent clause, but where to break clauses into lines is challenging to determine without syntactic and semantic reasoning capabilities. For example:

  • A rule that breaks lines at punctuation marks will not work well with sentences that contain periods in abbreviations or mathematical expressions.

  • Syntactic or semantic structures are not always easy to determine. "I like to eat apples and oranges because they are healthy." should be broken into lines as follows:

    > I like to eat apples and oranges
    > because they are healthy.
    

    rather than:

    > I like to eat apples
    > and oranges because they are healthy.
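
Both failure modes can be reproduced with a small rule-based splitter. The rule below (break after any punctuation mark followed by whitespace) is a strawman written for this illustration, not a method used by SemBr or any particular tool.

```python
import re
from typing import List


def break_at_punctuation(text: str) -> List[str]:
    """Naive rule: start a new line after punctuation followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?;:,])\s+", text) if part]


# Abbreviations trigger spurious breaks.
print(break_at_punctuation("The value is approx. 3.14 in most cases, e.g. for circles."))
# ['The value is approx.', '3.14 in most cases,', 'e.g.', 'for circles.']

# No punctuation marks the clause boundary before "because", so no break is found.
print(break_at_punctuation("I like to eat apples and oranges because they are healthy."))
# ['I like to eat apples and oranges because they are healthy.']
```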
    

For this reason, I have created SemBr, which uses finetuned Transformer models to predict line breaks at semantic boundaries.

How does SemBr work?

SemBr uses a Transformer model to predict line breaks at semantic boundaries.

A small dataset of text with semantic line breaks was created from my existing LaTeX documents. The dataset was split into training (46,295 lines, 170,681 words and 1,492,952 characters) and test (2,187 lines, 7,564 words and 72,231 characters) datasets.

The data was prepared by extracting line breaks and indent levels from the files, and then converting the result into strings of paragraphs with line breaks removed. The data can then be tokenized using the tokenizer and converted into a dataset of tokens, where each token has a label denoting whether there is a line break before it, and the indent level of the token.
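
The labelling step can be sketched on whitespace-separated words. This is a simplification under stated assumptions: the real pipeline labels tokenizer tokens rather than words, and the function name `paragraph_to_labels`, the two-space `indent_unit`, and the `(break_before, indent_level)` label shape are all illustrative.

```python
from typing import List, Tuple

Label = Tuple[bool, int]  # (line break before this word?, indent level)


def paragraph_to_labels(lines: List[str], indent_unit: str = "  ") -> Tuple[str, List[Label]]:
    """Flatten a semantically broken paragraph into text plus per-word labels."""
    words: List[str] = []
    labels: List[Label] = []
    for line_no, line in enumerate(lines):
        body = line.lstrip(" ")
        level = (len(line) - len(body)) // len(indent_unit)
        for word_no, word in enumerate(body.split()):
            # Only the first word of a non-initial line carries a break.
            starts_line = line_no > 0 and word_no == 0
            words.append(word)
            labels.append((starts_line, level))
    return " ".join(words), labels


text, labels = paragraph_to_labels([
    "I like to eat apples and oranges",
    "  because they are healthy.",
])
print(text)
print(labels[7])  # label of "because": (True, 1)
```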

For LaTeX documents, there are two types of line breaks: one with a normal line break that adds implicit spacing (e.g. line a⏎line b) and one with no spacing (e.g. line a%⏎line b). The data processor also tries to preserve the LaTeX syntax of the text by adding and removing comment symbols (%), if necessary.
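
The difference between the two break types shows up when lines are joined back into a paragraph. The sketch below handles only a trailing, unescaped `%`; it ignores mid-line comments and other TeX subtleties, and `join_latex_lines` is a hypothetical name rather than SemBr's data processor.

```python
from typing import List


def join_latex_lines(lines: List[str]) -> str:
    """Join source lines the way TeX reads them.

    A normal line break acts as a space; a line ending in an unescaped `%`
    joins directly onto the next line with no space.
    """
    out = ""
    for index, line in enumerate(lines):
        stripped = line.strip()
        is_last = index == len(lines) - 1
        if stripped.endswith("%") and not stripped.endswith("\\%"):
            out += stripped[:-1]  # drop the comment marker, add no space
        else:
            out += stripped + ("" if is_last else " ")
    return out


print(join_latex_lines(["line a", "line b"]))   # line a line b
print(join_latex_lines(["line a%", "line b"]))  # line aline b
```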

The pretrained masked language model is then finetuned as a token classifier on the training dataset to predict the labels of the tokens. We save the model with the best F1 score for predicting the existence of a line break on the test set. The finetuning logs for the models are available in a WandB report.

Performance

We now ship an MLX NVFP4 variant that processes about 26k words per second, with a much faster model load time (4 seconds) and only about 130 MB of memory usage! For comparison, the old torch+MPS backend on an M2 MacBook Pro processes about 850 words per second on bert-small with the default options, with about 1.70 GB of memory usage.

The line-breaking accuracy is difficult to measure, and the locations of line breaks can also be subjective. On the test set, the per-token line break accuracy of the models is >95%, with ~80% F1 scores. Because line breaks are sparse, accuracy is not a good metric for the performance of the model, and I used the F1 score instead to select the best models.
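
A small worked example shows why accuracy is misleading when breaks are sparse. The data below is made up for illustration (100 tokens, 5 true breaks) and is not drawn from the SemBr test set.

```python
from typing import List, Tuple


def accuracy_and_f1(truth: List[bool], pred: List[bool]) -> Tuple[float, float]:
    """Per-token accuracy and F1 for the positive (line break) class."""
    tp = sum(t and p for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    correct = sum(t == p for t, p in zip(truth, pred))
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(truth), f1


truth = [i % 20 == 19 for i in range(100)]  # a break after every 20th token

never_break = [False] * 100
print(accuracy_and_f1(truth, never_break))  # (0.95, 0.0)

decent = list(truth)
decent[19] = False  # one missed break
decent[0] = True    # one spurious break
print(accuracy_and_f1(truth, decent))       # (0.98, 0.8)
```

A model that never predicts a break already reaches 95% accuracy, while its F1 score is zero.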

Improvements and TODOs

  • Features:
    • Natural language support:
      • Support natural languages other than English.
    • Typesetting languages support:
      • Markdown.
      • Typst.
      • LaTeX.
    • Usability:
      • Inference queue.
      • Daemon with model unloading.
    • Editor integration:
      • NeoVim plugin.
      • VSCode extension.
      • MCP server.
    • Use the Hugging Face API for inference.
  • Accuracy:
    • Some lines are too short or too long:
      • Long lines can be penalized greedily by breaking lines with token counts more than optimize.preferred_max_tokens_per_line.
      • Support optimize.preferred_(min|max)_words_per_line.
      • Improve the algorithm to penalize short and long lines with a more sophisticated method.
    • Improve indent level prediction.
    • Performance and accuracy benchmarking, and comparisons with related works.
  • Performance:
    • Improve inference speed.
    • Reduce memory usage.
