MiniSBD

Free and open source Python library for fast sentence boundary detection. It runs inference on 8-bit quantized ONNX models, which keeps it fast and lightweight.

The only dependency is onnxruntime, or onnxruntime-gpu for GPU inference.

Installation

pip install -U minisbd

Usage

from minisbd import SBDetect

text = """
La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle. Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII). En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
"""

detector = SBDetect("fr", use_gpu=True)
for sent in detector.sentences(text):
    print(f"--> {sent}")

# --> La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle.
# --> Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII).
# --> En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
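
For longer documents, one option is to split the text into paragraphs first and run the detector on each paragraph, reusing a single detector instance. The sketch below relies only on the `sentences()` method shown above. `iter_paragraph_sentences` and `naive_split` are hypothetical helpers introduced for illustration and are not part of MiniSBD; `naive_split` is a stand-in so the example runs without the library installed.

```python
import re
from typing import Callable, Iterable, Iterator, List, Tuple


def iter_paragraph_sentences(
    text: str,
    split_sentences: Callable[[str], Iterable[str]],
) -> Iterator[Tuple[int, str]]:
    """Yield (paragraph_index, sentence) pairs.

    `split_sentences` is any callable mapping a string to sentences,
    for example `detector.sentences` from the usage example.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        for sentence in split_sentences(paragraph):
            yield index, sentence


def naive_split(paragraph: str) -> List[str]:
    """Stand-in splitter (hypothetical); replace with detector.sentences."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


sample = "First sentence. Second one.\n\nNew paragraph here."
pairs = list(iter_paragraph_sentences(sample, naive_split))
print(pairs)
```

With MiniSBD installed, passing `detector.sentences` in place of `naive_split` should work, assuming it accepts a single string as in the usage example.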

By default, models are downloaded from GitHub and stored in the user's ~/.cache/minisbd folder. You can change this location at runtime:

from minisbd import models
models.cache_dir = '/path/to/cache'
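
If the cache location should be configurable per deployment, one possible approach is to resolve it from an environment variable and fall back to the default folder. This is a minimal sketch: `resolve_cache_dir` and the `MINISBD_CACHE_DIR` variable name are assumptions made for this example, and MiniSBD is not documented to read that variable itself. Only the `models.cache_dir` assignment comes from the snippet above.

```python
import os
from pathlib import Path


def resolve_cache_dir(env_var: str = "MINISBD_CACHE_DIR", create: bool = False) -> Path:
    """Return the cache directory to use.

    MINISBD_CACHE_DIR is a hypothetical variable name chosen for this
    sketch. Falls back to ~/.cache/minisbd when it is unset or empty.
    """
    override = os.environ.get(env_var)
    if override:
        path = Path(override).expanduser()
    else:
        path = Path.home() / ".cache" / "minisbd"
    if create:
        path.mkdir(parents=True, exist_ok=True)
    return path


cache_dir = resolve_cache_dir()

try:
    from minisbd import models
except ImportError:
    models = None  # MiniSBD not installed; the helper above still works.

if models is not None:
    # Assumption: this should happen before the first model download.
    models.cache_dir = str(cache_dir)
```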

You can optionally specify a path to an ONNX model instead of having MiniSBD download the model for you:

from minisbd import SBDetect
detector = SBDetect("/path/to/model.onnx")
# ...

Language Support

from minisbd.models import list_models
print(list_models())

Language Code
Afrikaans af
Ancient Greek grc
Ancient Hebrew hbo
Arabic ar
Armenian hy
Basque eu
Belarusian be
Bulgarian bg
Buryat bxr
Catalan ca
Chinese (Simplified) zh-hans
Chinese (Traditional) zh-hant
Classical Chinese lzh
Coptic cop
Croatian hr
Czech cs
Danish da
Dutch nl
English en
Erzya myv
Estonian et
Faroese fo
Finnish fi
French fr
Galician gl
German de
Gothic got
Greek el
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Indonesian id
Irish ga
Italian it
Japanese ja
Kazakh kk
Korean ko
Kurmanji kmr
Kyrgyz ky
Latin la
Latvian lv
Ligurian lij
Lithuanian lt
Maghrebi Arabic French qaf
Maltese mt
Manx gv
Marathi mr
Naija pcm
North Sami sme
Norwegian nb
Norwegian Nynorsk nn
Old Church Slavonic cu
Old East Slavic orv
Old French fro
Persian fa
Polish pl
Pomak qpm
Portuguese pt
Romanian ro
Russian ru
Sanskrit sa
Scottish Gaelic gd
Serbian sr
Slovak sk
Slovenian sl
Spanish es
Swedish sv
Tamil ta
Telugu te
Thai th
Turkish tr
Turkish German qtd
Ukrainian uk
Upper Sorbian hsb
Urdu ur
Uyghur ug
Vietnamese vi
Welsh cy
Western Armenian hyw
Wolof wo
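
Applications often receive locale-style tags such as "fr_FR" or "pt-BR" rather than the bare codes in the table. A minimal sketch of mapping such a tag to a supported code follows. `match_language` is a hypothetical helper, not part of MiniSBD, and the hard-coded list is a small excerpt from the table; in practice it would presumably come from `list_models()`, assuming that function yields code strings like these.

```python
from typing import Iterable, Optional


def match_language(tag: str, available: Iterable[str]) -> Optional[str]:
    """Map a locale-style tag to a supported code, or None.

    Tries the full lowercased tag first, then progressively shorter
    prefixes ("pt-br" -> "pt").
    """
    codes = {code.lower() for code in available}
    parts = tag.replace("_", "-").lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in codes:
            return candidate
        parts.pop()
    return None


# A few codes copied from the table above (excerpt only).
SUPPORTED = ["en", "fr", "pt", "zh-hans", "zh-hant", "nb", "nn"]

print(match_language("fr_FR", SUPPORTED))    # fr
print(match_language("zh-Hans", SUPPORTED))  # zh-hans
print(match_language("pt-BR", SUPPORTED))    # pt
print(match_language("xx", SUPPORTED))       # None
```

The returned code can then be passed to `SBDetect(...)` as in the usage example. Falling back to a shorter prefix is a design choice of this sketch: it suits regional variants, but a tag like "zh" alone will not match either Chinese model.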

Converting Stanza Models

The extract.py script extracts existing Stanza models and converts them to ONNX. See the script's source code for details.

Credits

MiniSBD is a port of Stanza's tokenizer models to ONNX. The models are the same as those from Stanza, but have been converted to ONNX and quantized for faster inference and smaller size.

License

AGPLv3

Some code was adapted from Stanza.
