Free and open source library for fast sentence boundary detection

MiniSBD

Free and open source Python library for fast sentence boundary detection. It runs inference on 8-bit quantized ONNX models, which keeps it fast and lightweight.

The only dependency is onnxruntime (or onnxruntime-gpu for GPU inference).

Installation

pip install -U minisbd

Usage

from minisbd import SBDetect

text = """
La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle. Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII). En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
"""

detector = SBDetect("fr", use_gpu=True)
for sent in detector.sentences(text):
    print(f"--> {sent}")

# --> La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle.
# --> Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII).
# --> En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
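
When several texts need to be split, the loop above can be wrapped in a small generator. The following is only a sketch: `split_documents` and `SentenceSplitter` are hypothetical names that are not part of MiniSBD, and the code assumes that `sentences()` yields plain strings, as the printed output above suggests.

```python
from typing import Iterable, Iterator, Protocol, Tuple


class SentenceSplitter(Protocol):
    """Any object with a sentences(text) method, e.g. an SBDetect instance."""

    def sentences(self, text: str) -> Iterable[str]: ...


def split_documents(
    detector: SentenceSplitter, documents: Iterable[str]
) -> Iterator[Tuple[int, str]]:
    """Yield (document_index, sentence) pairs, skipping blank sentences."""
    for index, document in enumerate(documents):
        for sentence in detector.sentences(document):
            sentence = sentence.strip()  # assumes sentences are str
            if sentence:
                yield index, sentence
```

With a detector created as shown above, `for i, sent in split_documents(detector, texts): ...` would iterate over all sentences while keeping track of the text each one came from.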

By default, models are downloaded from GitHub and stored in the user's ~/.cache/minisbd folder. You can change the cache location at runtime:

from minisbd import models
models.cache_dir = '/path/to/cache'
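
In deployments it can be handy to take the cache location from the environment. The sketch below computes a path that could then be assigned to `models.cache_dir`. Note that `pick_cache_dir` and the `MINISBD_CACHE_DIR` variable are hypothetical and are not read by MiniSBD itself; whether the assignment must happen before a detector is created is also an assumption.

```python
import os
from pathlib import Path


def pick_cache_dir(env=None, variable="MINISBD_CACHE_DIR") -> str:
    """Return the directory named by `variable`, or the documented default.

    `variable` is a hypothetical name chosen for this example only.
    """
    env = os.environ if env is None else env
    configured = env.get(variable, "").strip()
    if configured:
        path = Path(configured).expanduser()
    else:
        # Default location stated in the README: ~/.cache/minisbd
        path = Path.home() / ".cache" / "minisbd"
    return str(path)


# Assumed usage, following the snippet above:
# from minisbd import models
# models.cache_dir = pick_cache_dir()
```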

You can optionally specify a path to an ONNX model instead of having MiniSBD download the model for you:

from minisbd import SBDetect
detector = SBDetect("/path/to/model.onnx")
# ...
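
Since `SBDetect` accepts either a language code or a model path, one possible pattern is to prefer a local model when it is present and fall back to the language code otherwise. This is a minimal sketch; `model_spec` is a hypothetical helper, not a MiniSBD function.

```python
from pathlib import Path
from typing import Optional


def model_spec(language: str, local_model: Optional[str] = None) -> str:
    """Return the local .onnx path if that file exists, else the language code."""
    if local_model is not None:
        candidate = Path(local_model).expanduser()
        if candidate.is_file() and candidate.suffix == ".onnx":
            return str(candidate)
    # No usable local file: let MiniSBD download the model for this language.
    return language


# Assumed usage:
# from minisbd import SBDetect
# detector = SBDetect(model_spec("fr", "/path/to/model.onnx"))
```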

Language Support

To list the available models at runtime:

from minisbd.models import list_models
print(list_models())

Supported languages and their codes:

Language Code
Afrikaans af
Ancient Greek grc
Ancient Hebrew hbo
Arabic ar
Armenian hy
Basque eu
Belarusian be
Bulgarian bg
Buryat bxr
Catalan ca
Chinese (Simplified) zh-hans
Chinese (Traditional) zh-hant
Classical Chinese lzh
Coptic cop
Croatian hr
Czech cs
Danish da
Dutch nl
English en
Erzya myv
Estonian et
Faroese fo
Finnish fi
French fr
Galician gl
German de
Gothic got
Greek el
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Indonesian id
Irish ga
Italian it
Japanese ja
Kazakh kk
Korean ko
Kurmanji kmr
Kyrgyz ky
Latin la
Latvian lv
Ligurian lij
Lithuanian lt
Maghrebi Arabic French qaf
Maltese mt
Manx gv
Marathi mr
Naija pcm
North Sami sme
Norwegian nb
Norwegian Nynorsk nn
Old Church Slavonic cu
Old East Slavic orv
Old French fro
Persian fa
Polish pl
Pomak qpm
Portuguese pt
Romanian ro
Russian ru
Sanskrit sa
Scottish Gaelic gd
Serbian sr
Slovak sk
Slovenian sl
Spanish es
Swedish sv
Tamil ta
Telugu te
Turkish tr
Turkish German qtd
Ukrainian uk
Upper Sorbian hsb
Urdu ur
Uyghur ug
Vietnamese vi
Welsh cy
Western Armenian hyw
Wolof wo
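
Applications often hold locale tags such as `pt_BR` rather than the bare codes in the table. The sketch below maps a tag to the closest supported code by trying progressively shorter prefixes. `pick_language_code` is a hypothetical helper, and the set of codes is passed in explicitly because the return type of `list_models()` is not documented here; the sample set is copied by hand from the table above.

```python
from typing import Iterable, Optional


def pick_language_code(tag: str, supported: Iterable[str]) -> Optional[str]:
    """Return the longest prefix of `tag` that is a supported code, or None."""
    codes = {code.lower() for code in supported}
    parts = tag.strip().lower().replace("_", "-").split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in codes:
            return candidate
        parts.pop()  # drop the last subtag and try again
    return None


# A hand-copied subset of the codes from the table above.
SUPPORTED_SAMPLE = {"en", "fr", "pt", "nb", "nn", "zh-hans", "zh-hant"}
```

For example, `pick_language_code("pt_BR", SUPPORTED_SAMPLE)` resolves to `pt`, while a bare `zh` resolves to nothing because the table only lists `zh-hans` and `zh-hant`.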

Converting Stanza Models

The extract.py script can be used to extract existing Stanza models and convert them to ONNX. See the script's source code for details.

Credits

MiniSBD is a port of Stanza's tokenizer models to ONNX. The models are the same as those from Stanza, but have been converted to ONNX and quantized for faster inference and smaller size.

License

AGPLv3

Some code was adapted from Stanza.
