Free and open source library for fast sentence boundary detection

MiniSBD

Free and open source Python library for fast sentence boundary detection. It runs inference on 8-bit quantized ONNX models, which keeps it fast and lightweight.

The only dependency is onnxruntime (or onnxruntime-gpu for GPU inference).

Installation

pip install -U minisbd

Usage

from minisbd import SBDetect

text = """
La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle. Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII). En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
"""

detector = SBDetect("fr", use_gpu=True)
for sent in detector.sentences(text):
    print(f"--> {sent}")

# --> La Révolution française (1789-1799) est une période de bouleversements politiques et sociaux en France et dans ses colonies, ainsi qu'en Europe à la fin du XVIIIe siècle.
# --> Traditionnellement, on la fait commencer à l'ouverture des États généraux le 5 mai 1789 et finir au coup d'État de Napoléon Bonaparte le 9 novembre 1799 (18 brumaire de l'an VIII).
# --> En ce qui concerne l'histoire de France, elle met fin à l'Ancien Régime, notamment à la monarchie absolue, remplacée par la monarchie constitutionnelle (1789-1792) puis par la Première République.
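When several documents need splitting, it can help to create the detector once and reuse it. The sketch below is only an illustration: split_documents, SentenceSplitter and demo are hypothetical names introduced here, and the code assumes that sentences() yields plain strings, as the printed output above suggests. The helper accepts any object with a sentences(text) method, so it can be tried without downloading a model.

```python
from typing import Iterable, Iterator, Protocol, Tuple


class SentenceSplitter(Protocol):
    """Anything exposing sentences(text), as SBDetect does in the usage above."""

    def sentences(self, text: str) -> Iterable[str]: ...


def split_documents(
    detector: SentenceSplitter, docs: Iterable[str]
) -> Iterator[Tuple[int, str]]:
    """Yield (document_index, sentence) pairs, skipping empty sentences."""
    for idx, doc in enumerate(docs):
        for sent in detector.sentences(doc):
            sent = sent.strip()  # assumes sentences are str objects
            if sent:
                yield idx, sent


def demo() -> None:
    # Not called automatically: needs minisbd installed and a model download.
    from minisbd import SBDetect

    detector = SBDetect("fr")  # one detector reused for every document
    docs = ["Bonjour. Comment ça va ?", "Il pleut. Je reste ici."]
    for idx, sent in split_documents(detector, docs):
        print(idx, sent)
```

Keeping the document index next to each sentence makes it easy to map results back to their source text.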

By default, models are downloaded from GitHub and stored in the user's ~/.cache/minisbd folder. You can change this location at runtime:

from minisbd import models
models.cache_dir = '/path/to/cache'
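One way to make the cache location configurable per deployment is to read it from an environment variable and assign it to models.cache_dir. This is a sketch under assumptions: the variable name MINISBD_CACHE_DIR and the helpers resolve_cache_dir and configure_minisbd_cache are invented for this example, MiniSBD is not documented to read that variable itself, and the override presumably has to happen before a detector is created.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical variable name chosen for this example only.
ENV_VAR = "MINISBD_CACHE_DIR"


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the configured cache directory, or None to keep the default."""
    env = os.environ if env is None else env
    value = env.get(ENV_VAR, "").strip()
    if not value:
        return None
    return str(Path(value).expanduser())


def configure_minisbd_cache() -> None:
    # Not called automatically: needs minisbd installed.
    from minisbd import models

    cache_dir = resolve_cache_dir()
    if cache_dir is not None:
        os.makedirs(cache_dir, exist_ok=True)
        models.cache_dir = cache_dir  # attribute shown in the snippet above
```

When the variable is unset or blank, the default ~/.cache/minisbd location is left untouched.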

You can optionally pass the path to an ONNX model instead of having MiniSBD download one for you:

from minisbd import SBDetect
detector = SBDetect("/path/to/model.onnx")
# ...
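Since SBDetect accepts either a language code or a model path, a small helper can prefer a local file when it exists and fall back to a downloadable model otherwise. The names model_spec and build_detector are hypothetical, and the ".onnx" suffix check is a choice made for this sketch rather than documented MiniSBD behaviour.

```python
import os


def model_spec(model_path: str, fallback_lang: str) -> str:
    """Return model_path if it is an existing .onnx file, else fallback_lang."""
    if (
        model_path
        and model_path.lower().endswith(".onnx")
        and os.path.isfile(model_path)
    ):
        return model_path
    return fallback_lang


def build_detector(model_path: str, fallback_lang: str):
    # Not called automatically: needs minisbd installed.
    from minisbd import SBDetect

    return SBDetect(model_spec(model_path, fallback_lang))
```

For example, build_detector("/path/to/model.onnx", "fr") would use the local file when present and the French model otherwise.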

Language Support

from minisbd.models import list_models
print(list_models())

Language Code
Afrikaans af
Ancient Greek grc
Ancient Hebrew hbo
Arabic ar
Armenian hy
Basque eu
Belarusian be
Bulgarian bg
Buryat bxr
Catalan ca
Chinese (Simplified) zh-hans
Chinese (Traditional) zh-hant
Classical Chinese lzh
Coptic cop
Croatian hr
Czech cs
Danish da
Dutch nl
English en
Erzya myv
Estonian et
Faroese fo
Finnish fi
French fr
Galician gl
German de
Gothic got
Greek el
Hebrew he
Hindi hi
Hungarian hu
Icelandic is
Indonesian id
Irish ga
Italian it
Japanese ja
Kazakh kk
Korean ko
Kurmanji kmr
Kyrgyz ky
Latin la
Latvian lv
Ligurian lij
Lithuanian lt
Maghrebi Arabic French qaf
Maltese mt
Manx gv
Marathi mr
Naija pcm
North Sami sme
Norwegian nb
Norwegian Nynorsk nn
Old Church Slavonic cu
Old East Slavic orv
Old French fro
Persian fa
Polish pl
Pomak qpm
Portuguese pt
Romanian ro
Russian ru
Sanskrit sa
Scottish Gaelic gd
Serbian sr
Slovak sk
Slovenian sl
Spanish es
Swedish sv
Tamil ta
Telugu te
Thai th
Turkish tr
Turkish German qtd
Ukrainian uk
Upper Sorbian hsb
Urdu ur
Uyghur ug
Vietnamese vi
Welsh cy
Western Armenian hyw
Wolof wo
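Input text often arrives tagged with locale-style identifiers such as fr_CA or zh-Hans-CN rather than the codes in the table. The sketch below maps such a tag to the longest matching supported code. match_language is a hypothetical helper, and passing list_models() to it assumes that function returns an iterable of code strings, which the source does not state explicitly.

```python
from typing import Iterable, Optional


def match_language(tag: str, supported: Iterable[str]) -> Optional[str]:
    """Map a locale-style tag to a supported code, or None if nothing fits."""
    codes = {c.lower() for c in supported}
    parts = tag.replace("_", "-").lower().split("-")
    # Try the full tag first, then progressively shorter prefixes.
    while parts:
        candidate = "-".join(parts)
        if candidate in codes:
            return candidate
        parts.pop()
    return None


# Possible usage (assumes list_models() yields code strings):
#   from minisbd.models import list_models
#   lang = match_language("fr_CA", list_models())
```

A bare "zh" returns None because the table lists only zh-hans and zh-hant, so the caller has to choose a script explicitly.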

Converting Stanza Models

The extract.py script can be used to extract existing Stanza models and convert them to ONNX. See the script's source code for details.

Credits

MiniSBD is a port of Stanza's tokenizer models to ONNX. The models are the same as those from Stanza, but have been converted to ONNX and quantized for faster inference and smaller size.

License

AGPLv3

Some code was adapted from Stanza.
